The #1 mistake people make installing biomolecular structure AI like Boltz-2 isn’t picking the wrong hardware. It’s running pip install boltz[cuda] on whatever Python and CUDA combo they already have – and then spending two days debugging a wrap_triton ImportError that no tutorial mentions.
Reverse-engineer the right approach: pin your Python version, match your CUDA driver to a torch wheel that actually exists, and only then touch Boltz. The order matters. Skip it and you’ll be reading GitHub issue threads at 2am.
Boltz-2 is the open-source model from MIT CSAIL that jointly predicts 3D complex structure and protein-ligand binding affinity. According to the paper, Boltz-2 matches or exceeds state-of-the-art structure accuracy across most modalities, and is the first AI model to approach free-energy-perturbation accuracy while being ~1000× faster in typical affinity calculations. All code and weights are MIT-licensed for academic and commercial use.
What version you’re actually installing
The PyPI package is boltz (no “2” – the repo houses both Boltz-1 and Boltz-2 weights). Community reports quoting pip show output put the current version at 2.2.1 as of early 2026. By default the CLI loads the latest Boltz-2 weights, so version pinning matters when reproducing a paper.
The NVIDIA NIM container is on its own track – the current image is nvcr.io/nim/mit/boltz2:1.6.0. Different version numbers because the NIM wraps an inference server around the model, not the model itself.
System requirements (the honest version)
Official docs are vague on hardware. Here’s what the GitHub issues actually reveal works:
| Component | Minimum | Recommended |
|---|---|---|
| OS | Linux (Ubuntu 22.04 tested) | Linux, native – WSL2 works but has known kernel issues |
| Python | 3.10 | 3.12 |
| GPU | None (CPU works, very slow) | NVIDIA Ampere or newer (A100, H100, RTX 30/40 series) |
| CUDA | 12.1-12.8 with stock install | 12.8 (avoid 12.9+ for now) |
| VRAM | ~16 GB for small complexes | 40-80 GB for sequences > 1000 residues |
| Disk | ~15 GB for weights + cache | 30+ GB if using NIM container |
If you’re on CPU-only or non-CUDA GPU hardware, you drop the [cuda] extra – but the CPU version is dramatically slower. For affinity screening this isn’t a workable mode. For one-off predictions of a small protein, it’ll finish overnight.
Install Boltz-2 with pip (the recommended path)
Fresh virtualenv. Always. Boltz pulls a long list of pinned dependencies (pytorch-lightning 2.5.0, hydra-core 1.3.2, trifast 0.1.13, rdkit, numba), and a single conflict in your global environment will silently break inference.
# 1. Create isolated environment
python3.12 -m venv boltz-env
source boltz-env/bin/activate
# 2. Install (GPU)
pip install boltz[cuda] -U
# CPU only? Drop the extra:
# pip install boltz -U
# 3. Verify
boltz --help
pip show boltz
The first boltz predict call will download model weights to ~/.boltz/. Budget about 10 minutes for the cold-start download on a decent connection.
The trifast/torch trap (this is where most installs die)
If pip install boltz[cuda] succeeds but boltz predict fails with an ImportError mentioning wrap_triton, you’ve hit a real dependency conflict. Boltz 2.0.3 pins trifast==0.1.13, which declares torch>=2.6.0 and uses APIs (torch.library.wrap_triton) that only exist in torch ≥ 2.6. But the official cu121 wheels only go up to torch==2.5.1+cu121 – there is no torch>=2.6.0+cu121 wheel published. Installing torch==2.5.1+cu121 satisfies CUDA but breaks trifast at runtime with the wrap_triton ImportError.
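You can catch this conflict before running a single prediction. Here is a minimal sketch, assuming the torch 2.6 cutoff described above. The helper names are hypothetical (they are not part of Boltz or trifast), and the check only inspects a version string rather than importing torch:

```python
import re

# Assumption taken from the text above: torch.library.wrap_triton
# is only available from torch 2.6 onward.
MIN_TORCH = (2, 6)


def parse_torch_version(version: str) -> tuple[int, int]:
    """Extract (major, minor) from strings like '2.5.1+cu121' or '2.6.0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised torch version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def supports_wrap_triton(version: str) -> bool:
    """Return True if this torch version is new enough for trifast 0.1.13."""
    return parse_torch_version(version) >= MIN_TORCH
```

In a live environment you would pass torch.__version__ to supports_wrap_triton; checking hasattr(torch.library, "wrap_triton") directly is the more literal test.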
Workarounds, in order of preference:
- Move to CUDA 12.4 or 12.6 wheels where torch ≥ 2.6 binaries exist. This is the cleanest fix on a fresh machine.
- CPU-only install – works because the PyPI CPU wheels of torch already ship 2.6+. Slow, but it runs.
- Bleeding-edge CUDA 12.9 or 13.x? Install nightly torch from the matching cu129/cu130 index, comment out the torch>=2.2 line in pyproject.toml, and run with --no_kernels because Triton isn’t yet compatible with PyTorch nightly for these CUDA versions.
Pro tip: run nvidia-smi first and write down your driver’s CUDA version. Then look up the highest torch+cuXX wheel that exists on the PyTorch wheel index. If no wheel matches your CUDA at torch ≥ 2.6, you have two real choices and a lot of bad ones.
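The lookup can be sketched in a few lines. The table below is illustrative only: it encodes the claims made in this article (no torch ≥ 2.6 wheel for cu121, wheels available for cu124 and cu126), not the live PyTorch wheel index, so check the index before relying on it. The function names are hypothetical.

```python
from typing import Optional

# Illustrative, per this article: does a torch >= 2.6 wheel exist for the tag?
# Verify against the real PyTorch wheel index before trusting these entries.
TORCH_26_AVAILABLE = {"cu121": False, "cu124": True, "cu126": True}


def tag_to_version(tag: str) -> tuple[int, int]:
    """Turn a wheel tag such as 'cu124' into a (major, minor) CUDA version."""
    digits = tag[2:]  # strip the 'cu' prefix
    return int(digits[:-1]), int(digits[-1])


def pick_wheel_tag(driver_cuda: str) -> Optional[str]:
    """Pick the newest known tag that the driver's CUDA version can run.

    driver_cuda is the 'CUDA Version' value printed by nvidia-smi, e.g. '12.8'.
    Returns None when no listed tag both fits the driver and has torch >= 2.6.
    """
    major, minor = (int(part) for part in driver_cuda.split(".")[:2])
    usable = [
        tag
        for tag, has_torch_26 in TORCH_26_AVAILABLE.items()
        if has_torch_26 and tag_to_version(tag) <= (major, minor)
    ]
    return max(usable, key=tag_to_version) if usable else None
```

A None result is the "two real choices" case: upgrade the driver or fall back to CPU.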
Alternative: Docker via NVIDIA NIM
If you don’t want to fight Python dependencies, NVIDIA ships Boltz-2 as an inference microservice. You get an HTTP endpoint instead of a CLI – different ergonomics, same model.
export LOCAL_NIM_CACHE=~/.cache/nim
export NGC_API_KEY=<your_ngc_key>
mkdir -p "$LOCAL_NIM_CACHE"
docker run --rm --name boltz2 --gpus all \
  --shm-size=16G \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE":/opt/nim/.cache \
  -p 8000:8000 \
  nvcr.io/nim/mit/boltz2:1.6.0
First launch localizes up to 30 GB of data to disk, so plan storage accordingly. Once running, confirm with curl -X GET http://localhost:8000/v1/health/ready. The catch is you need an NGC account and key, which is free but adds a step.
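Because the first launch can take a while, a polling loop is handier than re-running curl by hand. This is a rough sketch using only the standard library. It assumes the readiness endpoint answers with HTTP 200 once the service is up; the function names, retry count, and delay are arbitrary choices for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

READY_URL = "http://localhost:8000/v1/health/ready"  # endpoint from the text above


def http_status(url: str) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError):
        return 0


def wait_until_ready(
    url: str = READY_URL,
    attempts: int = 60,
    delay: float = 10.0,
    fetch: Callable[[str], int] = http_status,
) -> bool:
    """Poll the endpoint until it answers 200 (assumed to mean ready)."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling wait_until_ready() with the defaults waits up to roughly ten minutes; the fetch parameter exists so the loop can be exercised without a running container.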
First prediction (verifying the install works)
Save this as test.yaml:
version: 1
sequences:
  - protein:
      id: A
      sequence: MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGL
  - ligand:
      id: B
      smiles: 'N[C@@H](Cc1ccc(O)cc1)C(=O)O'
properties:
  - affinity:
      binder: B
Then run:
boltz predict test.yaml --use_msa_server --out_dir results
YAML is the only input format that supports binding-affinity prediction – FASTA workflows are quick for structure-only runs but won’t accept affinity properties, so any previous FASTA-based workflow needs converting. Look in the output JSON for two fields: affinity_pred_value and affinity_probability_binary. They’re trained on different datasets – use the binary probability for hit-discovery binder-vs-decoy ranking.
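For a screening run with many ligands, the ranking step might look like the sketch below. It assumes each affinity JSON is a flat object containing the two fields named above; where those files sit inside your output directory is not covered here, so the path is left for you to supply. Both function names are hypothetical.

```python
import json
from pathlib import Path


def load_affinity(path: Path) -> dict:
    """Read one affinity JSON file written by a prediction run."""
    with path.open() as handle:
        return json.load(handle)


def rank_binders(results: dict[str, dict]) -> list[tuple[str, float]]:
    """Sort ligand names by affinity_probability_binary, likeliest binder first.

    results maps a ligand name (your own label) to its parsed affinity JSON.
    """
    scored = [
        (name, float(record["affinity_probability_binary"]))
        for name, record in results.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Load each file with load_affinity, collect the results in a dict keyed by ligand name, and pass that dict to rank_binders.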
The sequence-length ceiling nobody mentions
Here’s something every tutorial skips: the cuEquivariance CUDA kernel fails on sequences greater than ~2000 residues, even on A100 80GB and H200 141GB cards. It’s not a VRAM problem – community reports show the same error when recycling and diffusion samples are pushed up on smaller sequences too.
The workaround is --no_kernels, which falls back to pure-PyTorch attention. Slower, but it completes. If you’re modeling antibody-antigen complexes or large multimers, plan for this from the start.
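If you launch predictions from a script, the fallback can be applied automatically. A small sketch, assuming the approximate 2000-residue ceiling quoted above and using only flags that appear in this article; the helper names and the exact threshold are illustrative.

```python
KERNEL_RESIDUE_LIMIT = 2000  # approximate ceiling reported above, not an exact bound


def total_residues(sequences: list[str]) -> int:
    """Sum the lengths of all polymer chains in the complex."""
    return sum(len(sequence) for sequence in sequences)


def build_predict_command(
    yaml_path: str, residues: int, out_dir: str = "results"
) -> list[str]:
    """Build a boltz predict argument list, adding --no_kernels for large inputs."""
    command = ["boltz", "predict", yaml_path, "--use_msa_server", "--out_dir", out_dir]
    if residues > KERNEL_RESIDUE_LIMIT:
        command.append("--no_kernels")
    return command
```

The returned list can be handed to subprocess.run as-is.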
Common errors and what to actually do
- ImportError: cannot import name 'wrap_triton' – the trifast/torch conflict above. Move to a CUDA version with torch≥2.6 wheels, or go CPU.
- “Current GPU compute capability: sm_121… Match found: False” – your card (e.g. NVIDIA GB10) is newer than what your installed PyTorch was compiled for. Reinstall PyTorch from the nightly index matching your CUDA, e.g. cu130.
- CUDA out of memory – drop --diffusion_samples, drop --recycling_steps, or split batch runs. For long sequences add --no_kernels before increasing card size – kernel memory profile is different from baseline attention.
- MSA server timeouts – the default ColabFold endpoint is rate-limited. For batch screening, run boltz msa once to download UniRef30 locally and skip the network entirely.
Upgrading and uninstalling
Upgrades are trivial: pip install boltz -U inside the same venv. Weights download separately on first run after upgrade if the model schema changed. Pin a version (boltz==2.2.1) in production scripts so reruns stay reproducible.
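A production script can also refuse to run when the installed version drifts from the pin. This is a minimal sketch built on the standard library's importlib.metadata; the function names are hypothetical and the pin value is whatever your project has settled on.

```python
from importlib import metadata
from typing import Callable, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(
    package: str,
    pinned: str,
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> bool:
    """Return True only when the installed version equals the pinned one."""
    return lookup(package) == pinned
```

For example, a pipeline could call matches_pin("boltz", "2.2.1") at startup and exit early when it returns False.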
To uninstall fully:
pip uninstall boltz trifast cuequivariance-torch
rm -rf ~/.boltz/ # cached weights and MSA dbs
rm -rf ~/.cache/nim/ # if you used the NIM container
The MSA database directory can be hundreds of GB if you ran boltz msa --db all. Check it before deleting your venv and forgetting where the disk went.
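To see how much space the cache occupies before you delete anything, a short directory walk is enough. This sketch uses only the standard library; the function names are made up for illustration, and a missing directory simply counts as zero bytes.

```python
import os
from pathlib import Path


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = Path(dirpath) / name
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total


def format_gib(size_bytes: int) -> str:
    """Render a byte count as gibibytes with one decimal place."""
    return f"{size_bytes / 1024**3:.1f} GiB"
```

Running format_gib(directory_size_bytes(Path.home() / ".boltz")) reports the size of the cache directory named above.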
FAQ
Does Boltz-2 run on Apple Silicon or AMD GPUs?
No CUDA, no acceleration. You can install the CPU build on macOS and it’ll predict, but expect minutes-to-hours per structure. There’s a Tenstorrent fork by Moritz Thüning for that hardware family – interesting if you have access, irrelevant for most labs.
Is the NIM container really worth it over pip?
If you’re deploying for a team that wants an HTTP API and you already have NGC access, yes – it sidesteps every Python-version headache in this article. If you’re a single researcher iterating on YAML inputs locally, pip is simpler and the feedback loop is faster. The NIM also locks you to whatever model version NVIDIA ships in the image, while pip tracks PyPI more closely.
Can I retrain Boltz-2 on my own data?
Not yet, officially. The repo says updated training code for Boltz-2 is coming, and current training instructions cover Boltz-1. If you’re planning a fine-tune, watch the release notes – or use Boltz-1 training as a structural template until the v2 pipeline lands.
Next step: spin up a fresh venv right now, run nvidia-smi, match it to a torch wheel that exists, and get the test YAML above producing a CIF file. Once that loop works, swap in a real protein from your project and start a single-sample run with --diffusion_samples 1 to measure wall-clock time on your hardware. That number tells you whether to invest in local install or move straight to NIM or a managed platform.