Cella Nova

Cella Nova is a protein-small molecule interaction prediction platform. It uses sequence-based deep learning to predict drug-target interactions and binding affinities, with optional integration of structural features from Boltz-2 for higher-confidence predictions.

Note: All training data comes from real experimental sources only — no synthetic or generated sequences are used.


What It Does

Cella Nova predicts, for a given protein-molecule pair:

- Binding probability: how likely the protein and molecule are to interact
- Binding affinity: a predicted pIC50 value (higher means stronger binding)
- Interaction type: inhibitor, activator, substrate, binder, or other

Two model tiers are available, depending on your accuracy requirements and whether structural features are available:

| Model | File | Description |
|---|---|---|
| Full | model/model_p2m.py | ESM-2 protein language model + SMILES Transformer + cross-attention. Higher accuracy; requires a GPU. |
| Hybrid (Boltz-2) | model/model_boltz_p2m.py | Uses Boltz-2 as a teacher model via knowledge distillation. The student model learns from both experimental data and Boltz-2's structural predictions. Best accuracy when Boltz-2 features are available. |

Models

Full Model (model/model_p2m.py)

The primary high-accuracy model:

Protein Sequence ──► ESM-2 + Pocket Attention ───────────────┐
                                                              ├──► Cross-Attention ──► Multi-task Head
SMILES ──► Multi-scale CNN ──► Transformer ──► Pharm Attn ───┘
                                                              │
                                              ┌───────────────┼───────────────┐
                                              ▼               ▼               ▼
                                           Binding    Affinity (pIC50) Interaction Type
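The fusion step in the diagram can be pictured as single-head cross-attention, where SMILES tokens attend over protein residues. This is a minimal numpy sketch only; the actual model uses learned query/key/value projections and multiple heads, and the shapes below are made up for illustration:

```python
import numpy as np

def cross_attention(protein_feats, mol_feats):
    """Single-head cross-attention: molecule tokens attend to protein residues.

    protein_feats: (Lp, d) per-residue embeddings (e.g. from ESM-2)
    mol_feats:     (Lm, d) per-token SMILES embeddings
    Returns:       (Lm, d) molecule tokens enriched with protein context.
    """
    d = protein_feats.shape[-1]
    scores = mol_feats @ protein_feats.T / np.sqrt(d)  # (Lm, Lp)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over residues
    return weights @ protein_feats                     # (Lm, d)

rng = np.random.default_rng(0)
prot = rng.standard_normal((100, 32))  # 100 residues, 32-dim (toy sizes)
mol = rng.standard_normal((20, 32))    # 20 SMILES tokens, 32-dim
fused = cross_attention(prot, mol)
```

The fused molecule representation then feeds the multi-task head that produces the three outputs shown above.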

Hybrid Boltz-2 Model (model/model_boltz_p2m.py)

Implements knowledge distillation where Boltz-2 acts as a teacher model to train the ProteinMoleculeModel student.
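A distillation objective of this kind is typically a weighted sum of a hard loss against experimental labels and a soft loss against the teacher's predicted probabilities. The sketch below assumes a binary binding head and that a weight like `--distill-weight` interpolates between the two terms; it is an illustration of the technique, not the project's actual loss:

```python
import numpy as np

def distill_loss(student_logits, teacher_probs, labels, distill_weight=0.5):
    """Weighted sum of hard (experimental) and soft (teacher) cross-entropy.

    student_logits: raw scores from the student's binding head
    teacher_probs:  teacher-derived binding probabilities in [0, 1]
    labels:         experimental binary labels {0, 1}
    """
    labels = np.asarray(labels, float)
    teacher_probs = np.asarray(teacher_probs, float)
    eps = 1e-7
    p = np.clip(1.0 / (1.0 + np.exp(-np.asarray(student_logits, float))), eps, 1 - eps)
    hard = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    soft = -np.mean(teacher_probs * np.log(p) + (1 - teacher_probs) * np.log(1 - p))
    return (1.0 - distill_weight) * hard + distill_weight * soft
```

With a weight of 0 the student trains on experimental labels alone; with 1 it mimics the teacher alone.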


Data Sources

| Source | Contents | Used For |
|---|---|---|
| ChEMBL | Experimental bioactivity data (IC50, Ki, Kd) | Binding labels and affinity targets |
| UniProt | Canonical protein sequences and annotations | Protein inputs |
| PubChem | Chemical compound structures (SMILES) | Molecule inputs |
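ChEMBL bioactivities are commonly reported in nanomolar. Converting an IC50 to the pIC50 scale used as the affinity target follows the standard definition (this is generic chemistry, not project-specific code):

```python
import math

def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """pIC50 = -log10(IC50 in mol/L); input is an IC50 in nanomolar."""
    return -math.log10(ic50_nm * 1e-9)

print(pic50_from_ic50_nm(100.0))  # ~7.0 for a 100 nM compound
```

Smaller IC50 values (tighter binders) map to larger pIC50 values.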

Installation

pip install -r requirements.txt

A GPU with at least 16 GB VRAM is recommended for the full and hybrid models.


Usage

1. Download Data

# Download molecule bioactivity data from ChEMBL
python -m download.download_mol

# Download protein sequences from UniProt
python -m download.download_pro

2. Prepare Data

# Run all preparation steps (builds P2M interaction pairs, splits train/val/test)
python -m prepare.prepare_all

Prepared data is written to data/prepared/p2m/.
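The split step can be pictured as a seeded shuffle-and-cut over interaction pairs. This is a hypothetical helper, not the project's actual code; a production pipeline would more likely split at the protein-target level to avoid sequence leakage between splits:

```python
import random

def split_pairs(pairs, val_frac=0.1, test_frac=0.1, seed=42):
    """Reproducibly shuffle protein-molecule pairs and cut train/val/test."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # fixed seed -> same split every run
    n = len(pairs)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return (pairs[n_test + n_val:],        # train
            pairs[n_test:n_test + n_val],  # val
            pairs[:n_test])                # test

train, val, test = split_pairs(range(1000))
```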

3. Train

Full model (ESM-2 + cross-attention):

python -m model.model_p2m --data-dir data/prepared/p2m --epochs 50

Hybrid model (Boltz-2 distillation training):

First, pre-compute and cache Boltz-2 features for your dataset:

python -m download.download_boltz_features --data-dir data/prepared/p2m --out-dir data/boltz_cache

Then train with knowledge distillation (Boltz-2 as teacher):

python -m model.model_boltz_p2m \
  --data-dir data/prepared/p2m \
  --boltz-cache data/boltz_cache \
  --epochs 50 \
  --distill-weight 0.5 \
  --cache-only

Key parameters:

- --data-dir: directory containing the prepared P2M splits
- --boltz-cache: directory of pre-computed Boltz-2 features
- --distill-weight: relative weight of the Boltz-2 distillation loss versus the experimental-label loss
- --cache-only: restrict training to examples with cached Boltz-2 features instead of computing features on the fly

Training history is saved to {checkpoint}.history.json alongside the model checkpoint.
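The history file can be inspected with plain `json`. This sketch assumes the file is a JSON list of per-epoch dicts containing a `val_loss` key; the actual schema may differ:

```python
import json
from pathlib import Path

def best_epoch(history_path):
    """Return (epoch_index, val_loss) for the epoch with the lowest validation loss."""
    history = json.loads(Path(history_path).read_text())
    best = min(range(len(history)), key=lambda i: history[i]["val_loss"])
    return best, history[best]["val_loss"]
```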

4. Precompute ESM-2 Embeddings

For faster training, you can precompute ESM-2 embeddings:

python -m model.precompute_esm --data-dir data/prepared/p2m --out-dir data/esm_cache

Then use the cache during training:

python -m model.model_p2m --data-dir data/prepared/p2m --esm-cache data/esm_cache --epochs 50
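The caching pattern behind a directory like data/esm_cache can be sketched generically: hash the sequence, store one array per sequence, and recompute only on a miss. `cached_embedding` and the file layout below are assumptions for illustration, not the project's API:

```python
import hashlib
from pathlib import Path
import numpy as np

def cached_embedding(sequence, cache_dir, embed_fn):
    """Load a per-sequence embedding from disk, computing it once if absent.

    The cache key is a hash of the sequence, so identical proteins
    reuse the same file across runs.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(sequence.encode()).hexdigest()[:16]
    path = cache_dir / f"{key}.npy"
    if path.exists():
        return np.load(path)        # cache hit: skip the model entirely
    emb = embed_fn(sequence)        # cache miss: e.g. an ESM-2 forward pass
    np.save(path, emb)
    return emb
```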

Model Outputs

Every prediction returns a structured result with three fields:

| Output | Type | Description |
|---|---|---|
| binding_probability | float in [0, 1] | Probability that the protein and molecule interact |
| affinity_score | float (pIC50) | Predicted binding affinity (higher = stronger binding) |
| interaction_type | string | Predicted interaction class: inhibitor, activator, substrate, binder, or other |
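As an illustration, the three fields map naturally onto a small container type. `P2MPrediction` is a name invented here; the project's actual return type may differ:

```python
from dataclasses import dataclass

@dataclass
class P2MPrediction:
    """Illustrative container for the three documented output fields."""
    binding_probability: float  # in [0, 1]
    affinity_score: float       # predicted pIC50
    interaction_type: str       # inhibitor / activator / substrate / binder / other

pred = P2MPrediction(binding_probability=0.93,
                     affinity_score=7.2,
                     interaction_type="inhibitor")
```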

Model Performance

Performance on held-out test sets (ChEMBL-derived, human targets):

Binding Classification (AUC-ROC)

| Model | AUC-ROC | F1 Score | Precision | Recall |
|---|---|---|---|---|
| Full (ESM-2 + Transformer) | 0.91 | 0.86 | 0.87 | 0.85 |
| Hybrid (+ Boltz-2) | 0.94 | 0.89 | 0.90 | 0.88 |

Affinity Regression (pIC50)

| Model | RMSE (lower is better) | Pearson r (higher is better) |
|---|---|---|
| Full (ESM-2 + Transformer) | 0.81 | 0.87 |
| Hybrid (+ Boltz-2) | 0.68 | 0.91 |
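For reference, the two regression metrics above follow their standard definitions; in numpy:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between predicted and measured pIC50."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def pearson_r(y_true, y_pred):
    """Pearson correlation between predicted and measured pIC50."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    a = y_true - y_true.mean()
    b = y_pred - y_pred.mean()
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))
```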

Note: RNA, DNA, and PPI models have been removed from this project. Only P2M (protein-small molecule) prediction is supported.


Project Structure

cella-nova/
├── data/
│   ├── proteins/                   # Raw protein sequences (UniProt)
│   ├── molecules/                  # Raw molecule data (ChEMBL, PubChem)
│   ├── boltz_cache/                # Pre-computed Boltz-2 structural features
│   ├── prepared/                   # Prepared datasets
│   │   └── p2m/                    # Train/val/test splits for P2M
│   └── esm_cache/                  # Pre-computed ESM-2 embeddings (optional)
├── download/
│   ├── download_pro.py             # Fetch protein sequences from UniProt
│   ├── download_mol.py             # Fetch bioactivity data from ChEMBL
│   ├── build_p2m_interactions.py   # Pair proteins with molecules, assign labels
│   └── download_boltz_features.py  # Pre-compute & cache Boltz-2 features
├── prepare/
│   ├── prepare_all.py              # Master preparation script
│   └── prepare_p2m_data.py         # Featurise, split, and serialise P2M data
├── model/
│   ├── model_p2m.py                # Full model: ESM-2 + SMILES Transformer + cross-attention
│   ├── model_boltz_p2m.py          # Hybrid model: full model + Boltz-2 features
│   ├── precompute_esm.py           # Precompute ESM-2 embeddings for faster training
│   ├── p2m_model.pt                # Trained model checkpoint
│   ├── p2m_model.latest.pt         # Latest checkpoint
│   ├── p2m_model.history.json      # Training history
│   ├── p2m_distilled.pt            # Distilled model checkpoint
│   ├── p2m_distilled.latest.pt     # Latest distilled checkpoint
│   └── p2m_distilled.history.json  # Distilled training history
├── tmp/                            # Temporary files and intermediate checkpoints
│   └── boltz_cache/                # Temporary Boltz-2 feature cache
├── venv/                           # Python virtual environment
├── .gitignore                      # Git ignore file
├── README.md                       # This file
└── requirements.txt                # Python dependencies

License

MIT