Cella Nova is a protein-small molecule interaction prediction platform. It uses sequence-based deep learning to predict drug-target interactions and binding affinities, with optional integration of structural features from Boltz-2 for higher-confidence predictions.
Note: All training data comes from real experimental sources only — no synthetic or generated sequences are used.
Cella Nova predicts:

- Whether a protein and a small molecule interact (binding probability)
- Binding affinity (pIC50)
- Interaction type (inhibitor, activator, substrate, binder, or other)
Two model tiers are available depending on your accuracy requirements and whether structural features are available:
| Model | File | Description |
|---|---|---|
| Full | `model/model_p2m.py` | ESM-2 protein language model + SMILES Transformer + cross-attention. Higher accuracy; requires a GPU. |
| Hybrid (Boltz-2) | `model/model_boltz_p2m.py` | Uses Boltz-2 as a teacher model via knowledge distillation: the student model learns from both experimental data and Boltz-2's structural predictions. Best accuracy when Boltz-2 features are available. |
### Full Model (`model/model_p2m.py`)

The primary high-accuracy model:
```
Protein Sequence ──► ESM-2 + Pocket Attention ───────────────┐
                                                             ├──► Cross-Attention ──► Multi-task Head
SMILES ──► Multi-scale CNN ──► Transformer ──► Pharm Attn ───┘                              │
                                          ┌─────────────────────────────────────────────────┼──────────────────────┐
                                          ▼                                                 ▼                      ▼
                                 Binding Probability                             Binding Affinity (pIC50)    Interaction Type
```
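The cross-attention step can be illustrated with a minimal NumPy sketch: molecule tokens query the protein's per-residue embeddings. All names and dimensions here are assumptions for illustration, not the actual model code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(protein_tokens, molecule_tokens):
    """Molecule tokens attend over protein residue embeddings.

    protein_tokens:  (L_p, d) per-residue embeddings (e.g. from ESM-2)
    molecule_tokens: (L_m, d) per-token embeddings (e.g. from the SMILES Transformer)
    returns:         (L_m, d) molecule tokens enriched with protein context
    """
    d = protein_tokens.shape[-1]
    scores = molecule_tokens @ protein_tokens.T / np.sqrt(d)  # (L_m, L_p)
    weights = softmax(scores, axis=-1)                        # attention over residues
    return weights @ protein_tokens                           # (L_m, d)

rng = np.random.default_rng(0)
prot = rng.normal(size=(120, 64))   # 120 residues, 64-dim embeddings
mol = rng.normal(size=(30, 64))     # 30 SMILES tokens
fused = cross_attention(prot, mol)
print(fused.shape)  # (30, 64)
```

The fused molecule representation then feeds the multi-task head that produces the three outputs above.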
### Hybrid Model (`model/model_boltz_p2m.py`)

Implements knowledge distillation, where Boltz-2 acts as a teacher model to train the `ProteinMoleculeModel` student:
- `BoltzP2MPredictor`: Python API to run Boltz-2 for structure-guided predictions
- A distillation weight (`--distill-weight`) balances between experimental and Boltz-2 supervision
- `train_distilled_model()` with `BoltzEnhancedDataset` providing cached Boltz-2 features

### Data Sources

| Source | Contents | Used For |
|---|---|---|
| ChEMBL | Experimental bioactivity data (IC50, Ki, Kd) | Binding labels and affinity targets |
| UniProt | Canonical protein sequences and annotations | Protein inputs |
| PubChem | Chemical compound structures (SMILES) | Molecule inputs |
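ChEMBL reports activities as IC50/Ki/Kd concentrations, which are conventionally converted to pIC50 for use as affinity targets. A sketch of that conversion and a binary "binder" label; the pIC50 = 5 cutoff (IC50 = 10 µM) is a common ChEMBL convention, but the threshold actually used by the preparation scripts is an assumption:

```python
import math

def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    return -math.log10(ic50_nm * 1e-9)

def binding_label(pic50: float, threshold: float = 5.0) -> int:
    """Binarise affinity into interact (1) / no-interact (0)."""
    return int(pic50 >= threshold)

print(ic50_nm_to_pic50(100.0))  # 100 nM -> 7.0
print(binding_label(7.0))       # 1 (active)
```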
```bash
pip install -r requirements.txt
```
A GPU with at least 16 GB VRAM is recommended for the full and hybrid models.
```bash
# Download molecule bioactivity data from ChEMBL
python -m download.download_mol

# Download protein sequences from UniProt
python -m download.download_pro

# Run all preparation steps (builds P2M interaction pairs, splits train/val/test)
python -m prepare.prepare_all
```
Prepared data is written to `data/prepared/p2m/`.
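The on-disk format of the prepared splits is not documented here. Purely as an illustration, one plausible layout is a JSONL file of pair records; every field name and value below is hypothetical (the sequence is truncated, and the real schema may differ):

```python
import json

# Hypothetical record for one protein-molecule pair.
record = {
    "uniprot_id": "P00533",
    "sequence": "MRPSGTAGAALLALLAALCPASRA",  # truncated for the example
    "smiles": "CC(=O)Oc1ccccc1C(=O)O",
    "pic50": 6.2,
    "label": 1,
    "interaction_type": "inhibitor",
}

line = json.dumps(record)      # one line per pair in train/val/test files
parsed = json.loads(line)
print(parsed["label"])  # 1
```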
Full model (ESM-2 + cross-attention):
```bash
python -m model.model_p2m --data-dir data/prepared/p2m --epochs 50
```
Hybrid model (Boltz-2 distillation training):
First, pre-compute and cache Boltz-2 features for your dataset:
```bash
python -m download.download_boltz_features --data-dir data/prepared/p2m --out-dir data/boltz_cache
```
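During training, each pair's cached Boltz-2 features are loaded from disk, with a zero-vector fallback for missing entries (which `--cache-only` training skips instead). A minimal sketch of that lookup; the file layout, feature width, and names are assumptions, not the actual `BoltzEnhancedDataset` code:

```python
import tempfile
from pathlib import Path
import numpy as np

BOLTZ_DIM = 256  # assumed feature width

def load_boltz_features(cache_dir, pair_id):
    """Return (features, cached); missing entries fall back to zeros."""
    path = Path(cache_dir) / f"{pair_id}.npy"
    if path.exists():
        return np.load(path), True
    return np.zeros(BOLTZ_DIM, dtype=np.float32), False

cache = tempfile.mkdtemp()  # stands in for data/boltz_cache
feats, cached = load_boltz_features(cache, "P00533_CHEMBL25")
print(feats.shape, cached)  # (256,) False -> zero-vector fallback

np.save(Path(cache) / "pair2.npy", np.ones(BOLTZ_DIM, dtype=np.float32))
feats2, cached2 = load_boltz_features(cache, "pair2")
print(cached2)  # True -> served from the cache
```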
Then train with knowledge distillation (Boltz-2 as teacher):
```bash
python -m model.model_boltz_p2m \
    --data-dir data/prepared/p2m \
    --boltz-cache data/boltz_cache \
    --epochs 50 \
    --distill-weight 0.5 \
    --cache-only
```
Key parameters:
- `--distill-weight`: Controls balance between experimental data and Boltz-2 supervision (0 = experimental only, 1 = equal weight)
- `--cache-only`: Only train on samples with cached Boltz-2 features (skips zero-vector fallback)
- `--base-checkpoint`: Optional path to a pre-trained `ProteinMoleculeModel` checkpoint
- `--resume`: Path to a hybrid checkpoint to resume from (use `latest` to auto-load `.latest.pt`)
- `--esm-cache`: Path to pre-computed ESM-2 embedding cache to speed up training
- `--patience`: Early-stopping patience (epochs without AUC improvement)

Training history is saved to `{checkpoint}.history.json` alongside the model checkpoint.
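The role of `--distill-weight` can be sketched as a weighted sum of the supervised loss on experimental labels and a distillation term against the teacher's predictions. The actual loss functions in `model_boltz_p2m.py` may differ; this only illustrates the weighting described above (0 = experimental only, 1 = equal weight):

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Binary cross-entropy between predicted probabilities and labels."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def mse(a, b):
    return float(((a - b) ** 2).mean())

def distilled_loss(student_prob, labels, teacher_prob, distill_weight=0.5):
    exp_loss = bce(student_prob, labels)       # supervised term (experimental data)
    kd_loss = mse(student_prob, teacher_prob)  # match the Boltz-2 teacher
    return exp_loss + distill_weight * kd_loss

student = np.array([0.9, 0.2, 0.7])
labels = np.array([1.0, 0.0, 1.0])
teacher = np.array([0.8, 0.1, 0.9])

print(distilled_loss(student, labels, teacher, distill_weight=0.0))  # BCE only
print(distilled_loss(student, labels, teacher, distill_weight=0.5))
```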
For faster training, you can precompute ESM-2 embeddings:
```bash
python -m model.precompute_esm --data-dir data/prepared/p2m --out-dir data/esm_cache
```
Then use the cache during training:
```bash
python -m model.model_p2m --data-dir data/prepared/p2m --esm-cache data/esm_cache --epochs 50
```
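An embedding cache like this is typically keyed by the protein sequence, so repeated epochs never re-run the expensive ESM-2 forward pass. A minimal sketch of the idea; the actual layout produced by `precompute_esm.py` is an assumption:

```python
import hashlib
import tempfile
from pathlib import Path
import numpy as np

def embedding_path(cache_dir, sequence):
    """Stable cache filename derived from the sequence itself."""
    key = hashlib.sha256(sequence.encode()).hexdigest()[:16]
    return Path(cache_dir) / f"{key}.npy"

def get_embedding(cache_dir, sequence, compute):
    path = embedding_path(cache_dir, sequence)
    if path.exists():
        return np.load(path)
    emb = compute(sequence)  # expensive ESM-2 forward pass
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, emb)
    return emb

# Stand-in for the real ESM-2 model, used only for this demo.
fake_esm = lambda seq: np.full(8, float(len(seq)))

cache = tempfile.mkdtemp()
e1 = get_embedding(cache, "MKT", fake_esm)  # computed and cached
e2 = get_embedding(cache, "MKT", fake_esm)  # served from cache
print(np.array_equal(e1, e2))  # True
```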
Every prediction returns a structured result with three fields:

| Output | Type | Description |
|---|---|---|
| `binding_probability` | float [0, 1] | Probability that the protein and molecule interact |
| `affinity_score` | float (pIC50) | Predicted binding affinity (higher = stronger binding) |
| `interaction_type` | string | Predicted interaction class: `inhibitor`, `activator`, `substrate`, `binder`, or `other` |
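For downstream code, the three fields map naturally onto a small result type. The class below is illustrative (only the field names come from the table above); the `ic50_nm` conversion follows directly from the pIC50 definition:

```python
import math
from dataclasses import dataclass

@dataclass
class P2MPrediction:
    binding_probability: float  # [0, 1]
    affinity_score: float       # pIC50
    interaction_type: str       # e.g. "inhibitor"

    @property
    def ic50_nm(self) -> float:
        """Convert pIC50 back to an IC50 in nanomolar."""
        return 10 ** (9 - self.affinity_score)

    def is_hit(self, threshold: float = 0.5) -> bool:
        return self.binding_probability >= threshold

pred = P2MPrediction(0.92, 7.3, "inhibitor")
print(pred.is_hit())           # True
print(round(pred.ic50_nm, 1))  # 50.1 (nM)
```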
Performance on held-out test sets (ChEMBL-derived, human targets).

Interaction classification:
| Model | AUC-ROC | F1 Score | Precision | Recall |
|---|---|---|---|---|
| Full (ESM-2 + Transformer) | 0.91 | 0.86 | 0.87 | 0.85 |
| Hybrid (+ Boltz-2) | 0.94 | 0.89 | 0.90 | 0.88 |
Binding-affinity regression (pIC50):

| Model | RMSE | Pearson r |
|---|---|---|
| Full (ESM-2 + Transformer) | 0.81 | 0.87 |
| Hybrid (+ Boltz-2) | 0.68 | 0.91 |
Note: RNA, DNA, and PPI models have been removed from this project. Only P2M (protein-small molecule) prediction is supported.
```
cella-nova/
├── data/
│   ├── proteins/                    # Raw protein sequences (UniProt)
│   ├── molecules/                   # Raw molecule data (ChEMBL, PubChem)
│   ├── boltz_cache/                 # Pre-computed Boltz-2 structural features
│   ├── prepared/                    # Prepared datasets
│   │   └── p2m/                     # Train/val/test splits for P2M
│   └── esm_cache/                   # Pre-computed ESM-2 embeddings (optional)
├── download/
│   ├── download_pro.py              # Fetch protein sequences from UniProt
│   ├── download_mol.py              # Fetch bioactivity data from ChEMBL
│   ├── build_p2m_interactions.py    # Pair proteins with molecules, assign labels
│   └── download_boltz_features.py   # Pre-compute & cache Boltz-2 features
├── prepare/
│   ├── prepare_all.py               # Master preparation script
│   └── prepare_p2m_data.py          # Featurise, split, and serialise P2M data
├── model/
│   ├── model_p2m.py                 # Full model: ESM-2 + SMILES Transformer + cross-attention
│   ├── model_boltz_p2m.py           # Hybrid model: full model + Boltz-2 features
│   ├── precompute_esm.py            # Precompute ESM-2 embeddings for faster training
│   ├── p2m_model.pt                 # Trained model checkpoint
│   ├── p2m_model.latest.pt          # Latest checkpoint
│   ├── p2m_model.history.json       # Training history
│   ├── p2m_distilled.pt             # Distilled model checkpoint
│   ├── p2m_distilled.latest.pt      # Latest distilled checkpoint
│   └── p2m_distilled.history.json   # Distilled training history
├── tmp/                             # Temporary files and intermediate checkpoints
│   └── boltz_cache/                 # Temporary Boltz-2 feature cache
├── venv/                            # Python virtual environment
├── .gitignore                       # Git ignore file
├── README.md                        # This file
└── requirements.txt                 # Python dependencies
```
## License

MIT