Traditional molecular docking tools can take hours to dock a single ligand – and still get it wrong most of the time. DiffDock-L, an AI model from MIT CSAIL, changed that in February 2024. 38% top-1 accuracy on blind tests. Nearly double what AutoDock Vina achieves.
Installation? Three paths.
What You’re Installing
DiffDock-L predicts how a small molecule binds to a protein. The “L” marks the updated model from the 2024 ICLR paper. It uses a diffusion generative model over the ligand’s translational, rotational, and torsional degrees of freedom – iteratively refining random poses until they look chemically plausible.
February 2024 brought generalization improvements. Clone the repo now? You get this version.
System Requirements
Minimum:
- OS: Linux (Ubuntu 20.04+), macOS (limited), Windows via WSL2
- CPU: Any modern x86_64 processor
- RAM: 16GB (8GB might work with aggressive swapping)
- Disk: ~10GB for environment + models
- GPU: Optional – CPU inference runs 10-20x slower
Recommended: NVIDIA GPU with 8GB+ VRAM (RTX 3060, A4000, or better), CUDA 11.7+, driver 525+, 16-32GB RAM, SSD with 15GB free.
No GPU for ESMFold protein folding? You can still run DiffDock if you already have protein structures. ESMFold only matters if you’re starting from sequences.
Think of docking like parallel parking. AutoDock Vina tries every angle in a grid – exhaustive but slow. DiffDock starts with a random orientation and refines it through noise removal, the way diffusion models generate images. Faster. More accurate on complexes the model has never seen.
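Here’s that refinement loop as a toy sketch – plain Python, the pose reduced to a single translational coordinate, and a hand-made energy well standing in for the learned score network. This is not DiffDock’s code, just the shape of the idea: start random, take score-guided steps, shrink the noise as you go.

```python
# Toy sketch of diffusion-style pose refinement (NOT DiffDock's actual code).
# A "pose" is one translation coordinate; the learned score is replaced by
# the gradient of a hand-made energy well with its minimum at x = 2.0.
import random

def score(x, target=2.0):
    """Stand-in for the learned score: points toward the energy minimum."""
    return target - x

def refine(x0, steps=20, step_size=0.3, noise=0.5):
    """Start from a random pose and denoise it over a fixed schedule."""
    x = x0
    for t in range(steps):
        temperature = noise * (1 - t / steps)  # noise shrinks each step
        x += step_size * score(x) + random.gauss(0, temperature)
    return x

random.seed(0)
pose = refine(x0=random.uniform(-10, 10))
print(f"refined pose: {pose:.2f}")  # lands near the minimum at 2.0
```

Vina’s grid search would have evaluated every candidate position; the diffusion view only ever touches the poses along one refinement trajectory, which is where the speed comes from.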
Docker Install
Avoids dependency hell.
docker pull rbgcsail/diffdock:latest
Run it interactively:
docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock:latest
# Inside:
micromamba activate diffdock
No GPU? Drop the --gpus all flag. Official docs say DiffDock runs on CPU – just expect minutes instead of seconds per docking run.
Test with the bundled example:
python -m inference \
  --protein_path data/1a0q_protein_processed.pdb \
  --ligand "COc1ccc(cc1)C#N" \
  --out_dir results/test_run
That SMILES string is 4-methoxybenzonitrile – a simple test ligand. Output lands in results/test_run/.
The Cache Permission Trap
PyTorch tries to write to /home/appuser/.cache. Most container setups don’t allow it. Errors about kernel cache or checkpoint directories? This.
Fix before running inference:
export TORCH_HOME=$(pwd)
export PYTORCH_KERNEL_CACHE_PATH=$(pwd)
Redirects cache writes to your current directory. Community implementations confirm this workaround.
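If you drive DiffDock from a Python wrapper instead of the shell, the same fix looks like this – set the variables before anything imports torch, and fail fast if the directory isn’t actually writable. A small sketch, not part of DiffDock itself:

```python
# Point PyTorch's caches at the current directory, mirroring the two
# `export` lines above. Must run BEFORE torch is imported, or TORCH_HOME
# is read too late.
import os
import tempfile

cache_dir = os.getcwd()
os.environ["TORCH_HOME"] = cache_dir
os.environ["PYTORCH_KERNEL_CACHE_PATH"] = cache_dir

# Fail fast with a clear message instead of a mid-run PermissionError.
try:
    with tempfile.NamedTemporaryFile(dir=cache_dir):
        pass
except PermissionError:
    raise SystemExit(f"Cache dir {cache_dir} is not writable; pick another path")
```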
Conda Install
git clone https://github.com/gcorso/DiffDock.git
cd DiffDock
conda env create --file environment.yml
conda activate diffdock
Installs PyTorch, PyTorch Geometric, e3nn (equivariant neural networks), RDKit, Biopython, ESM. YAML fails? CUDA version mismatches are common. Install manually:
conda create --name diffdock python=3.9
conda activate diffdock
conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric
pip install PyYAML scipy networkx biopython rdkit-pypi e3nn spyrmsd pandas biopandas
Then ESM (for protein embeddings):
pip install "fair-esm[esmfold]"
pip install 'dllogger @ git+https://github.com/NVIDIA/dllogger.git'
pip install 'openfold @ git+https://github.com/aqlaboratory/openfold.git@4b41059694619831a7db195b7e0988fc4ff3a307'
Why that OpenFold commit? Latest main breaks compatibility sometimes. Stability.
First Run: The SO(3) Cache Delay
python -m inference \
  --protein_path protein.pdb \
  --ligand "CCO" \
  --out_dir results/ethanol_test
First run on any new machine: 2-3 minutes longer. DiffDock precomputes lookup tables for SO(2) and SO(3) distributions (rotation groups) and caches them. Happens once per device. Don’t kill the process.
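The pattern behind that delay is ordinary compute-once caching. A toy version with the stdlib – a cheap stand-in table instead of the real SO(2)/SO(3) precomputation – shows why killing the process mid-write is the thing to avoid: a partial cache file would poison every later run.

```python
# Compute-once, reuse-forever: the same pattern behind DiffDock's
# SO(2)/SO(3) tables, shrunk to a toy lookup table cached on disk.
import math
import pickle
from pathlib import Path

CACHE = Path(".so3_toy_cache.pkl")

def build_table(n=1000):
    """Stand-in for the expensive precomputation (minutes in DiffDock)."""
    return [math.sin(math.pi * i / n) for i in range(n)]

def load_table():
    if CACHE.exists():                      # later runs: instant
        return pickle.loads(CACHE.read_bytes())
    table = build_table()                   # first run: slow, then cached
    CACHE.write_bytes(pickle.dumps(table))  # interrupt here = corrupt cache
    return table

table = load_table()
```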
Verify It Works
ls results/ethanol_test/
You should see rank1.sdf, rank2.sdf, etc. – predicted ligand poses ranked by confidence. Open them in PyMOL or ChimeraX.
Check the confidence score in the output logs. Most tutorials skip this: DiffDock’s confidence score ≠ binding affinity. The official FAQ calls it a measure of pose quality – how confident the model is in the predicted structure. Some collaborators have seen it correlate with affinity, but it is not a direct measure. Need affinity? Pipe the output to GNINA or run MM/GBSA.
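Some DiffDock versions also embed the confidence in the output filenames (e.g. rank2_confidence-0.87.sdf) alongside the logs. Assuming that naming convention – check your version’s output if it differs – a few lines of stdlib Python collect the scores:

```python
# Pull confidence scores out of DiffDock's output filenames. Assumes the
# rankN_confidenceC.sdf naming convention (e.g. rank2_confidence-0.87.sdf),
# which may vary by version.
import re
from pathlib import Path

POSE_RE = re.compile(r"rank(\d+)_confidence(-?\d+\.?\d*)\.sdf")

def pose_confidences(out_dir):
    """Return [(rank, confidence), ...] sorted best-first."""
    poses = []
    for f in Path(out_dir).glob("*.sdf"):
        m = POSE_RE.fullmatch(f.name)
        if m:
            poses.append((int(m.group(1)), float(m.group(2))))
    return sorted(poses)

# Demo on a sample filename:
m = POSE_RE.fullmatch("rank2_confidence-0.87.sdf")
print(int(m.group(1)), float(m.group(2)))  # 2 -0.87
```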
Common Errors and Fixes
| Error | Cause | Fix |
|---|---|---|
| PermissionError: [Errno 13] /home/appuser/.cache | PyTorch can’t write its cache in Docker | Export TORCH_HOME and PYTORCH_KERNEL_CACHE_PATH to a writable dir |
| RuntimeError: CUDA out of memory | GPU VRAM exhausted | Reduce --batch_size or --samples_per_complex |
| Bio.pairwise2 BiopythonDeprecationWarning | Deprecated Biopython module still in use | Ignore – it’s a warning; the code runs |
| Model download stalls on first run | Checkpoint pulled from Hugging Face at runtime | Wait, or use a pre-cached Docker image (externelly/diffdock) |
| “Inference samples failed” warnings | Model failed to generate valid poses for some samples | Normal for challenging complexes – check whether any poses succeeded |
NVIDIA NIM (Enterprise)
Deploying at scale? NVIDIA packages DiffDock as a NIM (inference microservice). Version 2.1.0 requires driver 535.104.05+ and an NGC API key.
docker pull nvcr.io/nim/mit/diffdock:2.1.0
Run:
export NGC_API_KEY=<your_key>
docker run --rm -it --name diffdock-nim
--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0
--shm-size=2G
-e NGC_API_KEY=$NGC_API_KEY
-p 8000:8000
nvcr.io/nim/mit/diffdock:2.1.0
Exposes a REST API on port 8000. POST docking requests to http://localhost:8000/molecular-docking/diffdock/generate. NIM bundles model weights – no runtime downloads.
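A minimal client needs nothing beyond the stdlib. The sketch below builds (without sending) a POST to that endpoint; the JSON field names (protein, ligand, ligand_file_type, num_poses) are assumptions modeled on NVIDIA’s published examples – confirm them against the NIM docs for your version.

```python
# Build a docking request for the NIM endpoint with only the stdlib.
# Payload field names are assumptions from NVIDIA's examples, not verified
# against every NIM release.
import json
import urllib.request

def build_dock_request(protein_pdb_text, smiles, num_poses=10,
                       url="http://localhost:8000/molecular-docking/diffdock/generate"):
    payload = {
        "protein": protein_pdb_text,     # raw PDB file contents, not a path
        "ligand": smiles,
        "ligand_file_type": "smiles",    # assumed value; check the NIM docs
        "num_poses": num_poses,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_dock_request("ATOM      1  N   MET A   1 ...", "CCO")
# urllib.request.urlopen(req)  # send it once the NIM container is up
```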
Why NIM vs open-source Docker? Easier horizontal scaling. Built-in telemetry. Support contracts. Running hundreds of docking jobs daily? Worth it.
Interpreting Confidence Scores
DiffDock outputs a confidence value per pose. Official guidance assumes your complex resembles training data: drug-like molecule (not huge) binding to medium-sized protein (1-2 chains), conformation close to bound state.
Large protein complex? Apo (unbound) structure? Massive ligand? Shift confidence thresholds down. Docs don’t quantify this. Community reports suggest a 10-20% drop for apo structures.
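In code, that shift is one parameter. The bands below (c > 0 high, -1.5 < c < 0 moderate, c < -1.5 low) are the rule-of-thumb cutoffs circulated in the DiffDock repo’s guidance; the ood_shift value for apo structures is a placeholder you’d tune, not an official number.

```python
# Rule-of-thumb confidence bands for DiffDock poses, with an optional
# relaxation for out-of-distribution inputs (apo structures, huge ligands).
def classify_confidence(c, ood_shift=0.0):
    """ood_shift > 0 relaxes the bands, e.g. 0.5 for an apo structure
    (an illustrative guess, not an official number)."""
    c += ood_shift  # judge the shifted score against the standard bands
    if c > 0:
        return "high"
    if c > -1.5:
        return "moderate"
    return "low"

print(classify_confidence(0.4))                  # high
print(classify_confidence(-0.8))                 # moderate
print(classify_confidence(-0.8, ood_shift=1.0))  # high, if relaxed for apo
```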
Upgrading from Original DiffDock
Installed before February 2024? You’re on the old model. Repo now defaults to DiffDock-L.
cd DiffDock
git pull origin main
conda env update --file environment.yml
Old model weights are in commit history if you need to reproduce 2023 results – check out commit a6c5675.
Uninstalling
Conda:
conda deactivate
conda env remove --name diffdock
Docker:
docker rmi rbgcsail/diffdock:latest
docker system prune
Can I run DiffDock without a GPU?
Yes. CPU works. 10-20x slower. Fine for testing a handful of complexes.
Does DiffDock calculate binding affinity?
No. Predicts 3D binding pose + confidence score. Score correlates with affinity in some cases but isn’t a ΔG or Kd value. Combine DiffDock with GNINA, Vina scoring, or free energy methods for affinity. Docs say some collaborators saw correlation – not a direct measure. If you’re screening candidates for synthesis, you’ll need affinity estimation downstream.
Why does the first run take forever?
Two reasons. SO(2)/SO(3) lookup table generation: 2-3 minutes, one-time per device. Model checkpoint downloads from Hugging Face if you’re using the default Docker image. Use a pre-cached image (externelly/diffdock built 04.15.24 with DiffDock-L v1.1) or NVIDIA NIM to skip downloads.