
Multimodal Machine Learning for Soft High-k Elastomers under Data Scarcity


This repository contains the official implementation of our multimodal fusion framework for polymer property prediction under data scarcity conditions. Our approach leverages multiple representation learning methods and contrastive alignment to improve property prediction with limited training samples.

Contributors

  • Brijesh LNU
  • Viet Thanh Duy Nguyen
  • Dr. Chengyi Xu (PI / Corresponding Author)
  • Dr. Truong-Son Hy (PI / Corresponding Author)

Overview

Predicting polymer properties from molecular structures is challenging when training data is limited. This work addresses data scarcity through:

  1. Multimodal Representations: Combining complementary molecular representations (sequence-based, graph-based, and fingerprint-based)
  2. Contrastive Alignment: Property-guided contrastive learning to align heterogeneous embedding spaces
  3. Fusion Strategies: Systematic comparison of early, late, and latent-space fusion approaches

Key Features

  • 🧬 Multiple Encoders: TransPolymer, PolyBERT, GIN (Graph Isomorphism Network), Morgan Fingerprints
  • 🔗 Contrastive Alignment: Property-guided alignment of different embedding spaces
  • 🔬 GPR Prediction: Gaussian Process Regression with optimized hyperparameters
  • 📊 Comprehensive Evaluation: Leave-One-Out Cross-Validation (LOOCV) for robust small-data assessment
  • 🎯 Multi-Property: Simultaneous prediction of Dielectric Constant and Young's Modulus

Framework Overview

Repository Structure

.
├── README.md                              # This file
├── requirements.txt                       # Python dependencies
├── DE Data Collection.csv                 # Dataset with polymer properties
├── artifacts/                             # Pre-computed embeddings
│   ├── transPolymer_embeddings.pkl        # TransPolymer embeddings
│   ├── gin_embeddings.pkl                 # GIN embeddings
│   └── Polybert_Embeddings.pkl            # PolyBERT embeddings
├── GIN_checkpoint/                        # GIN model checkpoint
├── TransPolymer_checkpoint/               # TransPolymer model checkpoint
├── GIN_Encoder.py                         # GIN-based property prediction
├── Sequence_TransPolymer.py               # TransPolymer-based prediction
├── Sequence_Polybert.py                   # PolyBERT-based prediction
├── Sequence_Morgan_Fingerprint_GRP.py     # Morgan fingerprint baseline
└── Multi_fusion.py                        # Multimodal fusion pipeline

Installation

Prerequisites

  • Python 3.8 or higher
  • CUDA (optional, for GPU acceleration)

Setup

  1. Clone this repository:
git clone https://github.com/HySonLab/Polymers.git
cd Polymers
  2. Create a virtual environment:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt

Dataset

The dataset (DE Data Collection.csv) contains:

  • 35 polymer samples
  • SMILES representations of polymer structures
  • Target properties:
    • Dielectric Constant (k)
    • Young's Modulus (MPa)

Due to the small sample size, we use Leave-One-Out Cross-Validation (LOOCV) for evaluation.
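To make the protocol concrete, here is a minimal LOOCV sketch: with only 35 samples, each model is trained on 34 and tested on the single held-out sample. The data below is synthetic and the kernel choice is an assumption, not the repository's exact configuration.

```python
"""Minimal LOOCV sketch for GPR on a 35-sample dataset (synthetic data)."""
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(42)
X = rng.normal(size=(35, 8))                   # stand-in for embedding vectors
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=35)  # synthetic target property

preds = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    # Refit from scratch on the 34 training samples for every fold.
    gpr = GaussianProcessRegressor(
        kernel=ConstantKernel(1.0) * RBF(1.0), normalize_y=True, random_state=42
    )
    gpr.fit(X[train_idx], y[train_idx])
    preds[test_idx] = gpr.predict(X[test_idx])

rmse = float(np.sqrt(np.mean((y - preds) ** 2)))
r2 = 1.0 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)
```

LOOCV uses every sample as a test point exactly once, which is why it is the standard choice at this dataset size.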

Dataset Distributions

Usage

1. Single-Modality Baselines

Morgan Fingerprint + GPR

python Sequence_Morgan_Fingerprint_GRP.py

TransPolymer Embeddings + GPR

python Sequence_TransPolymer.py

PolyBERT Embeddings + GPR

python Sequence_Polybert.py

GIN Embeddings + GPR

python GIN_Encoder.py

Each script will:

  1. Load pre-computed embeddings or generate them
  2. Perform hyperparameter tuning via 5-fold CV
  3. Evaluate using LOOCV
  4. Report R² and RMSE with uncertainty estimates
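Step 2 above can be sketched as a grid search with 5-fold CV. The embedding dimensionality, the searched grid, and the kernels here are illustrative assumptions, not the scripts' actual settings.

```python
"""Sketch of the 5-fold hyperparameter-tuning step run before final LOOCV."""
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X = rng.normal(size=(35, 16))              # stand-in embedding matrix
y = X[:, 0] + 0.1 * rng.normal(size=35)    # synthetic target property

param_grid = {
    "alpha": [1e-10, 1e-5, 1e-2],          # observation noise level
    "kernel": [RBF(1.0), Matern(1.0, nu=2.5)],
}
search = GridSearchCV(
    GaussianProcessRegressor(normalize_y=True, random_state=42),
    param_grid, cv=5, scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
best_gpr = search.best_estimator_          # handed on to the LOOCV evaluation
```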

2. Multimodal Fusion Pipeline

Run the complete fusion experiment:

python Multi_fusion.py

This will:

  1. Load TransPolymer and GIN embeddings
  2. Train contrastive alignment models (10 runs with different seeds)
  3. Evaluate multiple fusion strategies:
    • Early Fusion: Concatenation and averaging of raw embeddings
    • True Late Fusion: Prediction-level fusion from separate models
    • Latent-Space Aligned: Contrastive-aligned embeddings with various fusion methods
  4. Generate a comprehensive results table (Table 2 in paper)
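The early- and late-fusion families above differ only in where the modalities are combined. A minimal sketch, using ridge regression as a stand-in predictor (the repository uses GPR on PCA-reduced embeddings) and random arrays in place of the real embeddings:

```python
"""Early vs. late fusion of two embedding modalities (illustrative data)."""
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
z_tp = rng.normal(size=(35, 768))    # TransPolymer-like embeddings
z_gin = rng.normal(size=(35, 256))   # GIN-like embeddings
y = rng.normal(size=35)              # synthetic target property
loo = LeaveOneOut()

# Early fusion: concatenate raw embeddings, then fit a single model.
z_early = np.concatenate([z_tp, z_gin], axis=1)
pred_early = cross_val_predict(Ridge(alpha=1.0), z_early, y, cv=loo)

# Late fusion: fit one model per modality, then average the predictions.
pred_tp = cross_val_predict(Ridge(alpha=1.0), z_tp, y, cv=loo)
pred_gin = cross_val_predict(Ridge(alpha=1.0), z_gin, y, cv=loo)
pred_late = 0.5 * (pred_tp + pred_gin)
```

Latent-space fusion applies the same combination operators, but only after the contrastive alignment has projected both modalities into a shared space.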

Reproducibility

Random Seeds

All experiments use fixed random seeds for reproducibility:

  • MASTER_SEED = 42 for the main pipeline
  • RANDOM_SEED = 42 for individual baselines
  • BOOTSTRAP_SEED = 42 for uncertainty estimation
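A global seed can be applied along these lines. The `set_seed` helper is our own name, not from the repository, and the PyTorch branch is guarded because the deep-learning dependency is assumed optional here.

```python
"""Sketch of global seeding for reproducible runs."""
import random

import numpy as np

MASTER_SEED = 42

def set_seed(seed: int) -> None:
    """Seed Python, NumPy, and (if installed) PyTorch RNGs."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # torch not required for the sklearn-only baselines

set_seed(MASTER_SEED)
```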

Pre-computed Embeddings

Pre-computed embeddings are provided in artifacts/ to ensure exact reproducibility:

  • transPolymer_embeddings.pkl: 35 × 768 dimensional
  • gin_embeddings.pkl: 35 × 256 dimensional
  • Polybert_Embeddings.pkl: 35 × 768 dimensional
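The embeddings can be loaded along these lines. The file names match the repository layout, but the assumption that each pickle holds a plain array-like of shape (35, d) is ours.

```python
"""Sketch of loading pre-computed embedding pickles."""
import pickle

import numpy as np

def load_embeddings(path: str, expected_rows: int = 35) -> np.ndarray:
    """Load a pickled embedding matrix and sanity-check the sample count."""
    with open(path, "rb") as f:
        emb = np.asarray(pickle.load(f), dtype=np.float64)
    assert emb.shape[0] == expected_rows, f"unexpected row count in {path}"
    return emb

# Usage (paths relative to the repository root):
# z_tp = load_embeddings("artifacts/transPolymer_embeddings.pkl")   # (35, 768)
# z_gin = load_embeddings("artifacts/gin_embeddings.pkl")           # (35, 256)
```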

Regenerating Embeddings

To regenerate embeddings from scratch:

  1. TransPolymer: Use checkpoint in TransPolymer_checkpoint/
  2. GIN: Use checkpoint in GIN_checkpoint/
  3. PolyBERT: Use the publicly available pretrained PolyBERT model hosted on Hugging Face: https://huggingface.co/kuelumbus/polyBERT

Methodology

Property-Guided Contrastive Learning

The alignment loss encourages embeddings from different modalities to be similar when their property values are similar:

L_i = -log( Σ_j 1[dist(y_i, y_j) < threshold] · exp(sim(z_tp^(i), z_gnn^(j)) / τ) /
            Σ_j exp(sim(z_tp^(i), z_gnn^(j)) / τ) )

Where:

  • z_tp^(i), z_gnn^(j): TransPolymer and GIN embeddings of samples i and j
  • τ: Temperature parameter (0.10)
  • threshold: 30th percentile of pairwise property distances
  • 1[·]: Indicator function selecting property-similar (positive) pairs
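A batched NumPy sketch of this loss is below. The cosine-similarity choice, the numerical stabilization, and the handling of anchors without positive pairs are our assumptions, consistent with the formula above.

```python
"""Property-guided contrastive (InfoNCE-style) loss, NumPy sketch."""
import numpy as np

def property_guided_nce(z_tp, z_gnn, y, tau=0.10, percentile=30):
    # Cosine similarity logits between all cross-modal pairs.
    a = z_tp / np.linalg.norm(z_tp, axis=1, keepdims=True)
    b = z_gnn / np.linalg.norm(z_gnn, axis=1, keepdims=True)
    sim = (a @ b.T) / tau                               # (n, n)

    # Positive mask: property distance below the chosen percentile.
    dist = np.abs(y[:, None] - y[None, :])
    threshold = np.percentile(dist, percentile)
    pos = dist < threshold

    # -log of the positive-pair probability mass, per anchor.
    exp_sim = np.exp(sim - sim.max(axis=1, keepdims=True))  # stabilized
    denom = exp_sim.sum(axis=1)
    num = (exp_sim * pos).sum(axis=1)
    valid = num > 0                                     # anchors with a positive
    return float(-np.log(num[valid] / denom[valid]).mean())
```

Because the positives are a subset of all pairs, the per-anchor ratio is at most 1 and the loss is non-negative; minimizing it pulls property-similar cross-modal pairs together.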

Evaluation Protocol

  1. Cross-Validation: LOOCV for all experiments (critical for n=35)
  2. Metrics:
    • R² (coefficient of determination)
    • RMSE (root mean squared error)
  3. Uncertainty Quantification:
    • R²: Jackknife standard deviation
    • RMSE: Bootstrap standard deviation (5,000 resamples)
  4. Statistical Testing: Paired t-tests between methods
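The two uncertainty estimates can be sketched as follows; the resampling details are standard conventions and are assumed rather than taken from the repository's code.

```python
"""Jackknife std for R^2 and bootstrap std for RMSE (sketch)."""
import numpy as np

def r2(y, p):
    return 1.0 - np.sum((y - p) ** 2) / np.sum((y - np.mean(y)) ** 2)

def jackknife_r2_std(y, p):
    """Leave-one-out jackknife standard deviation of R^2."""
    n = len(y)
    stats = np.array([r2(np.delete(y, i), np.delete(p, i)) for i in range(n)])
    return float(np.sqrt((n - 1) / n * np.sum((stats - stats.mean()) ** 2)))

def bootstrap_rmse_std(y, p, n_boot=5000, seed=42):
    """Standard deviation of RMSE over bootstrap resamples of (y, p) pairs."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(y), size=(n_boot, len(y)))
    rmses = np.sqrt(np.mean((y[idx] - p[idx]) ** 2, axis=1))
    return float(rmses.std())
```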

Configuration

Key hyperparameters in Multi_fusion.py:

# Contrastive Learning
TEMPERATURE = 0.10              # Contrastive loss temperature
PROPERTY_PERCENTILE = 30        # Property similarity threshold
EPOCHS = 400                    # Training epochs
LEARNING_RATE = 5e-4            # AdamW learning rate
WEIGHT_DECAY = 1e-3             # L2 regularization

# Architecture
PROJECTION_DIM = 128            # Aligned embedding dimension
TP_HIDDEN_DIM = 128             # TransPolymer projection hidden size
GNN_HIDDEN_DIM = 256            # GIN projection hidden size
TP_DROPOUT = 0.3                # TransPolymer dropout
GNN_DROPOUT = 0.15              # GIN dropout

# Gaussian Process Regression
PCA_COMPONENTS = 20             # PCA dimensionality
GPR_RESTARTS = 10               # Optimizer restarts

# Experiment
NUM_RUNS = 10                   # Number of independent runs
MASTER_SEED = 42                # Random seed
