This repository contains the official implementation of our multimodal fusion framework for polymer property prediction under data scarcity. The approach combines several representation learning methods with contrastive alignment to improve property prediction from limited training samples.
- Brijesh LNU
- Viet Thanh Duy Nguyen
- Dr. Chengyi Xu (PI / Corresponding Author)
- Dr. Truong-Son Hy (PI / Corresponding Author)
Predicting polymer properties from molecular structures is challenging when training data is limited. This work addresses data scarcity through:
- Multimodal Representations: Combining complementary molecular representations (sequence-based, graph-based, and fingerprint-based)
- Contrastive Alignment: Property-guided contrastive learning to align heterogeneous embedding spaces
- Fusion Strategies: Systematic comparison of early, late, and latent-space fusion approaches
- Multiple Encoders: TransPolymer, PolyBERT, GIN (Graph Isomorphism Network), Morgan Fingerprints
- Contrastive Alignment: Property-guided alignment of different embedding spaces
- GPR Prediction: Gaussian Process Regression with optimized hyperparameters
- Comprehensive Evaluation: Leave-One-Out Cross-Validation (LOOCV) for robust small-data assessment
- Multi-Property: Simultaneous prediction of Dielectric Constant and Young's Modulus
.
├── README.md                             # This file
├── requirements.txt                      # Python dependencies
├── DE Data Collection.csv                # Dataset with polymer properties
├── artifacts/                            # Pre-computed embeddings
│   ├── transPolymer_embeddings.pkl       # TransPolymer embeddings
│   ├── gin_embeddings.pkl                # GIN embeddings
│   └── Polybert_Embeddings.pkl           # PolyBERT embeddings
├── GIN_checkpoint/                       # GIN model checkpoint
├── TransPolymer_checkpoint/              # TransPolymer model checkpoint
├── GIN_Encoder.py                        # GIN-based property prediction
├── Sequence_TransPolymer.py              # TransPolymer-based prediction
├── Sequence_Polybert.py                  # PolyBERT-based prediction
├── Sequence_Morgan_Fingerprint_GRP.py    # Morgan fingerprint baseline
└── Multi_fusion.py                       # Multimodal fusion pipeline
- Python 3.8 or higher
- CUDA (optional, for GPU acceleration)
- Clone this repository:
git clone https://github.com/yourusername/multimodal-polymer-prediction.git
cd multimodal-polymer-prediction
- Create a virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
The dataset (DE Data Collection.csv) contains:
- 35 polymer samples
- SMILES representations of polymer structures
- Target properties:
  - Dielectric Constant (k)
  - Young's Modulus (MPa)
Due to the small sample size, we use Leave-One-Out Cross-Validation (LOOCV) for evaluation.
python Sequence_Morgan_Fingerprint_GRP.py
python Sequence_TransPolymer.py
python Sequence_Polybert.py
python GIN_Encoder.py
Each script will:
- Load pre-computed embeddings or generate them
- Perform hyperparameter tuning via 5-fold CV
- Evaluate using LOOCV
- Report R² and RMSE with uncertainty estimates
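The evaluation steps listed above can be sketched roughly as follows. This is a minimal illustration and not the code from the baseline scripts: the data is synthetic, the kernel choice (`RBF + WhiteKernel`) and the `make_model` helper are assumptions made for the example, and the 5-fold hyperparameter search is left out to keep it short.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one embedding matrix (35 samples) and one property.
rng = np.random.default_rng(42)
X = rng.normal(size=(35, 64))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=35)


def make_model():
    """Hypothetical pipeline: scale, reduce with PCA, then fit a GPR."""
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)  # assumed kernel
    return make_pipeline(
        StandardScaler(),
        PCA(n_components=20),
        GaussianProcessRegressor(
            kernel=kernel,
            normalize_y=True,
            n_restarts_optimizer=2,  # kept small here so the sketch runs quickly
            random_state=42,
        ),
    )


# LOOCV: each sample is predicted by a model trained on the other 34.
y_pred = cross_val_predict(make_model(), X, y, cv=LeaveOneOut())

r2 = r2_score(y, y_pred)
rmse = float(np.sqrt(mean_squared_error(y, y_pred)))
print(f"LOOCV R2 = {r2:.3f}, RMSE = {rmse:.3f}")
```

Because scaling and PCA sit inside the pipeline, they are refit on the 34 training samples of every fold, so the held-out sample never influences the preprocessing.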
Run the complete fusion experiment:
python Multi_fusion.py
This will:
- Load TransPolymer and GIN embeddings
- Train contrastive alignment models (10 runs with different seeds)
- Evaluate multiple fusion strategies:
  - Early Fusion: Concatenation and averaging of raw embeddings
  - True Late Fusion: Prediction-level fusion from separate models
  - Latent-Space Aligned: Contrastive-aligned embeddings with various fusion methods
- Generate a comprehensive results table (Table 2 in paper)
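To make the difference between feature-level and prediction-level fusion concrete, here is a small sketch. The function names `early_fusion_concat` and `late_fusion_average` are hypothetical, the inputs are random placeholders with the embedding shapes used in this repository, and the equal weighting in the late-fusion helper is an assumption; `Multi_fusion.py` may combine modalities differently.

```python
import numpy as np


def early_fusion_concat(emb_a, emb_b):
    """Early fusion: join two embedding matrices along the feature axis."""
    if emb_a.shape[0] != emb_b.shape[0]:
        raise ValueError("both modalities must describe the same samples")
    return np.concatenate([emb_a, emb_b], axis=1)


def late_fusion_average(pred_a, pred_b, weight_a=0.5):
    """Late fusion: weighted average of predictions from two separate models."""
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    return weight_a * pred_a + (1.0 - weight_a) * pred_b


# Random placeholders with the shapes of the TransPolymer and GIN embeddings.
rng = np.random.default_rng(42)
tp_emb = rng.normal(size=(35, 768))
gin_emb = rng.normal(size=(35, 256))

fused = early_fusion_concat(tp_emb, gin_emb)  # one model is trained on this
print(fused.shape)

# Late fusion operates on per-sample predictions rather than on embeddings.
combined = late_fusion_average([1.0, 2.0], [3.0, 4.0])
print(combined)
```

Early fusion hands a single regressor all features at once, while late fusion keeps one regressor per modality and merges only their outputs.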
All experiments use fixed random seeds for reproducibility:
- MASTER_SEED = 42 for the main pipeline
- RANDOM_SEED = 42 for individual baselines
- BOOTSTRAP_SEED = 42 for uncertainty estimation
Pre-computed embeddings are provided in artifacts/ to ensure exact reproducibility:
- transPolymer_embeddings.pkl: 35 × 768 dimensional
- gin_embeddings.pkl: 35 × 256 dimensional
- Polybert_Embeddings.pkl: 35 × 768 dimensional
To regenerate embeddings from scratch:
- TransPolymer: Use checkpoint in TransPolymer_checkpoint/
- GIN: Use checkpoint in GIN_checkpoint/
- PolyBERT: Use the publicly available pretrained PolyBERT model hosted on Hugging Face: https://huggingface.co/kuelumbus/polyBERT
The alignment loss encourages embeddings from different modalities to be similar when their property values are similar:
L = -log( Σ exp(sim(z_tp, z_gnn) / τ) × I[dist(y_i, y_j) < threshold] /
          Σ exp(sim(z_tp, z_gnn) / τ) )
Where:
- z_tp, z_gnn: TransPolymer and GIN embeddings
- τ: Temperature parameter (0.10)
- threshold: Property distance percentile (30th)
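One possible reading of this loss, written in NumPy for illustration only, is sketched below. The following are assumptions of the example rather than details confirmed by the repository: cosine similarity for `sim`, absolute difference for `dist`, a threshold taken as a percentile of the off-diagonal pairwise property distances, each sample counted as its own positive, and a per-anchor average. The function name `property_guided_contrastive_loss` is hypothetical.

```python
import numpy as np


def property_guided_contrastive_loss(z_tp, z_gnn, y, temperature=0.10, percentile=30):
    """Mean over anchors i of -log(sum over positives j / sum over all j)."""
    # Cosine similarity between every TransPolymer and every GIN embedding.
    z_tp = z_tp / np.linalg.norm(z_tp, axis=1, keepdims=True)
    z_gnn = z_gnn / np.linalg.norm(z_gnn, axis=1, keepdims=True)
    logits = (z_tp @ z_gnn.T) / temperature  # shape (n, n)

    # Pairs whose property values are closer than the percentile threshold
    # are treated as positives.
    dist = np.abs(y[:, None] - y[None, :])
    off_diagonal = dist[~np.eye(len(y), dtype=bool)]
    threshold = np.percentile(off_diagonal, percentile)
    positive = dist < threshold
    np.fill_diagonal(positive, True)  # assumption: a sample matches itself

    # Subtract the row maximum before exponentiating for numerical stability;
    # the shift cancels in the ratio below.
    exp_logits = np.exp(logits - logits.max(axis=1, keepdims=True))
    numerator = (exp_logits * positive).sum(axis=1)
    denominator = exp_logits.sum(axis=1)
    return float(np.mean(-np.log(numerator / denominator)))


# Random projected embeddings and property values, for demonstration only.
rng = np.random.default_rng(42)
z_a = rng.normal(size=(35, 128))
z_b = rng.normal(size=(35, 128))
props = rng.normal(size=35)

loss = property_guided_contrastive_loss(z_a, z_b, props)
print(f"loss = {loss:.4f}")
```

In this formulation the loss is never negative, and raising the percentile enlarges the positive set, which can only lower the loss for fixed embeddings.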
- Cross-Validation: LOOCV for all experiments (critical for n=35)
- Metrics:
  - R² (coefficient of determination)
  - RMSE (root mean squared error)
- Uncertainty Quantification:
  - R²: Jackknife standard deviation
  - RMSE: Bootstrap standard deviation (5000 samples)
- Statistical Testing: Paired t-tests between methods
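The two uncertainty estimates can be sketched as follows, starting from a vector of LOOCV predictions. This is an illustrative version under stated assumptions, not the repository's code: the helper names are hypothetical, the jackknife uses the textbook leave-one-out scaling factor (n - 1) / n, and the bootstrap resamples (true, predicted) pairs with replacement.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def jackknife_r2_std(y_true, y_pred):
    """Jackknife spread of R2: recompute it n times, dropping one pair each time."""
    n = len(y_true)
    keep = ~np.eye(n, dtype=bool)  # row i masks out sample i
    estimates = np.array([r2(y_true[keep[i]], y_pred[keep[i]]) for i in range(n)])
    return float(np.sqrt((n - 1) / n * np.sum((estimates - estimates.mean()) ** 2)))


def bootstrap_rmse_std(y_true, y_pred, n_boot=5000, seed=42):
    """Bootstrap spread of RMSE over resampled (true, predicted) pairs."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    idx = rng.integers(0, n, size=(n_boot, n))  # one row of indices per resample
    squared_error = (y_true - y_pred) ** 2
    rmse_samples = np.sqrt(squared_error[idx].mean(axis=1))
    return float(rmse_samples.std(ddof=1))


# Synthetic targets and predictions standing in for LOOCV output.
rng = np.random.default_rng(0)
y_true = rng.normal(size=35)
y_pred = y_true + 0.3 * rng.normal(size=35)

r2_std = jackknife_r2_std(y_true, y_pred)
rmse_std = bootstrap_rmse_std(y_true, y_pred)
print(f"R2 = {r2(y_true, y_pred):.3f} +/- {r2_std:.3f}")
print(f"RMSE std (bootstrap) = {rmse_std:.3f}")
```

Fixing `seed` makes the bootstrap estimate repeatable across runs, in the spirit of the fixed `BOOTSTRAP_SEED` used in this repository.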
Key hyperparameters in Multi_fusion.py:
# Contrastive Learning
TEMPERATURE = 0.10 # Contrastive loss temperature
PROPERTY_PERCENTILE = 30 # Property similarity threshold
EPOCHS = 400 # Training epochs
LEARNING_RATE = 5e-4 # AdamW learning rate
WEIGHT_DECAY = 1e-3 # L2 regularization
# Architecture
PROJECTION_DIM = 128 # Aligned embedding dimension
TP_HIDDEN_DIM = 128 # TransPolymer projection hidden size
GNN_HIDDEN_DIM = 256 # GIN projection hidden size
TP_DROPOUT = 0.3 # TransPolymer dropout
GNN_DROPOUT = 0.15 # GIN dropout
# Gaussian Process Regression
PCA_COMPONENTS = 20 # PCA dimensionality
GPR_RESTARTS = 10 # Optimizer restarts
# Experiment
NUM_RUNS = 10 # Number of independent runs
MASTER_SEED = 42 # Random seed
