If you’ve tried predicting a protein-RNA complex with AlphaFold 3 and watched it stumble on a target with no joint MSA, you already know the gap. AI prediction of protein-RNA complexes is a small niche – most structure predictors lean hard on co-evolution signals that simply don’t exist for many RNA-binding proteins (RBPs). ProRNA3D-single from the Bhattacharya Lab takes a different route: feed it a single protein sequence and a single RNA sequence, and it returns a 3D complex.
The published results show it outperforming RoseTTAFold2NA, RoseTTAFold All-Atom, and AlphaFold 3, especially when evolutionary information is limited. That’s the reason to deploy it locally instead of waiting for a web server queue. This guide walks through the actual install, not the abstract.
What you’re actually installing
Three moving parts. A PyTorch GPU stack. A pretrained checkpoint sitting on Zenodo, not GitHub. And – the catch nobody flags in most write-ups – pre-computed ESM2 and RNA-FM embeddings as .npy files that you have to generate yourself before ProRNA3D-single will even start.
The architecture: protein and RNA language model embeddings flow into a structure-aware graph, then through symmetry-aware graph convolutions and a ResNet-Inception plus geometric attention module that predicts an interaction map, then geometry optimization produces the final PDB. That pipeline is why the repo has no embedding generator – it expects you to bring those files already made.
System requirements
| Component | Minimum | Recommended |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | Ubuntu 22.04 |
| GPU | NVIDIA, 12 GB VRAM | 24 GB VRAM (A5000/A6000/A100) |
| CUDA | 11.7 (community-reported estimate) | 11.8 or 12.1 |
| Python | 3.9 (community-reported estimate) | 3.10+ (required by latest stable PyTorch as of writing) |
| RAM | 16 GB (estimated) | 32 GB (estimated) |
| Disk | ~10 GB (estimated) | 30+ GB with ESM2/RNA-FM weights |
| Conda | Miniconda 23.x | Mambaforge (faster solver) |
24 GB VRAM stops being optional fast. ESM2 650M embeddings alone are heavy; ribosomal complexes or long RNAs push you past 12 GB before the geometric attention module even runs.
Step 1 – Clone the repo and create the environment
The implementation is GPLv3 licensed and lives at github.com/Bhattacharya-Lab/ProRNA3D-single.
git clone https://github.com/Bhattacharya-Lab/ProRNA3D-single.git
cd ProRNA3D-single
conda env create -f ProRNA3D-single_environment.yml
conda activate ProRNA3D-single
Per the README, the recommended path is a conda environment from the provided .yml. Solver hanging? Switch to mamba: mamba env create -f ProRNA3D-single_environment.yml. Most “hung install” reports trace back to conda’s classic solver choking on PyTorch + CUDA version resolution – mamba handles it noticeably faster.
Step 2 – Download the model weights from Zenodo
The trained checkpoint isn’t in the GitHub repo. It lives on Zenodo, and you fetch it manually:
mkdir -p ProRNA3D_model
curl --output ProRNA3D_model/model.pt "https://zenodo.org/records/11477127/files/model.pt?download=1"
That URL points to Zenodo record 11477127. Skipping this step is the #1 reason first runs crash with a missing-file error – the conda install completes silently without weights.
Add --fail --location to the curl command. --fail makes curl exit with an error on an HTTP failure instead of saving the error page as a 4 KB “model.pt” full of HTML, and --location follows redirects. It does not catch an HTML page served with HTTP 200, which Zenodo occasionally returns, so also check the result with: file ProRNA3D_model/model.pt. It should report a Zip archive (PyTorch checkpoint format), not ASCII text.
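The same check can be scripted, which is handy in CI. A minimal sketch, assuming the checkpoint is a zip-format PyTorch file as described above; the function name checkpoint_problem and the 1 MB size floor are illustrative choices, not anything taken from the repo:

```python
import zipfile
from pathlib import Path
from typing import Optional


def checkpoint_problem(path, min_bytes: int = 1_000_000) -> Optional[str]:
    """Return a short description of what looks wrong, or None if the file passes.

    min_bytes is an arbitrary floor chosen for this sketch; it only exists to
    catch tiny error pages saved under the checkpoint's name.
    """
    p = Path(path)
    if not p.is_file():
        return f"{p} does not exist"

    # An HTML error page saved as model.pt starts with a doctype or <html> tag.
    with p.open("rb") as fh:
        head = fh.read(512).lstrip().lower()
    if head.startswith((b"<!doctype", b"<html")):
        return f"{p} is an HTML page, not a checkpoint"

    # Zip-format checkpoints are what the `file` command reports as a Zip archive.
    if not zipfile.is_zipfile(p):
        return f"{p} is not a zip archive"

    size = p.stat().st_size
    if size < min_bytes:
        return f"{p} is only {size} bytes"
    return None
```

Calling checkpoint_problem("ProRNA3D_model/model.pt") right after the download, and stopping if it returns anything other than None, turns a confusing first-run crash into an explicit message.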
It’s worth pausing here: hosting model weights on Zenodo instead of GitHub LFS is increasingly common in academic ML because Zenodo provides DOI-backed archival with no file-size limits. The tradeoff is that cloning the repo gives you zero working state – the weights are a completely separate fetch step. If your CI pipeline auto-clones and runs tests, it will fail silently until you add the curl step explicitly.
Step 3 – Generate the embeddings
ProRNA3D-single expects ESM2 embeddings at inputs/7ZLQB.rep_1280.npy and RNA-FM embeddings at inputs/7ZLQC_RNA.npy (for the bundled example). Neither file is generated by ProRNA3D-single itself – that gap is real and the README doesn’t flag it prominently.
Two side installs you need:
- ESM2 – use esm2_t33_650M_UR50D (as of writing; verify against the ESM2 repo for current model names) and extract the rep_1280 per-residue representation – the 1280-dim hidden state. Output as .npy with the basename matching your PDB ID plus chain letter: 7ZLQB.rep_1280.npy. The 150M and 35M variants will load without error but produce garbage interaction maps because the pipeline expects exactly 1280 dims.
- RNA-FM – from ml4bio/RNA-FM, trained on 23M+ non-coding RNA sequences. Save the per-nucleotide embedding as <PDB_ID><CHAIN>_RNA.npy.
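For the ESM2 half, a sketch using the fair-esm package (pip install fair-esm) might look like the following. The function names, the example output path, and the choice to drop the BOS/EOS tokens so the array comes out one row per residue are assumptions; compare the result against the bundled 7ZLQB.rep_1280.npy before trusting it. The heavy imports sit inside the function so the module can be loaded without torch installed.

```python
import numpy as np

ESM2_DIM = 1280   # hidden size of esm2_t33_650M_UR50D
ESM2_LAYER = 33   # final layer of the 33-layer model


def per_residue_slice(token_reprs: np.ndarray, seq_len: int) -> np.ndarray:
    """Keep one row per residue: drop BOS (row 0) and everything after the
    last residue (EOS and any padding)."""
    residues = token_reprs[1 : seq_len + 1]
    if residues.shape != (seq_len, ESM2_DIM):
        raise ValueError(
            f"expected ({seq_len}, {ESM2_DIM}), got {residues.shape}"
        )
    return residues


def embed_protein(sequence: str) -> np.ndarray:
    """Run ESM2 650M on one sequence (CPU by default) and return an
    array with one 1280-dim row per residue."""
    import torch  # deferred: only needed when embeddings are computed
    import esm    # provided by the fair-esm package

    model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
    model.eval()
    batch_converter = alphabet.get_batch_converter()
    _, _, tokens = batch_converter([("query", sequence)])

    with torch.no_grad():
        out = model(tokens, repr_layers=[ESM2_LAYER], return_contacts=False)

    reprs = out["representations"][ESM2_LAYER][0].cpu().numpy()
    return per_residue_slice(reprs, len(sequence))


# Hypothetical usage, with the chain sequence supplied by you:
#   np.save("inputs/7ZLQB.rep_1280.npy", embed_protein(chain_b_sequence))
```

RNA-FM follows a similar load-tokenize-extract pattern; check the ml4bio/RNA-FM README for its exact entry points rather than adapting this block by guesswork.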
You also need distance maps: protein Cα-Cα maps go in prot_dist/, RNA C4'-C4' maps in rna_dist/ (per the README). No helper script is bundled – a short numpy/biopython script run on your monomer PDBs is the standard path.
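Since no helper is bundled, here is one possible shape for that script, using plain PDB column parsing and numpy rather than Biopython. The function names are made up for this example, it keeps only the first alternate location per residue, and it deliberately writes no file: the file name and array format expected inside prot_dist/ and rna_dist/ should be taken from the README.

```python
import numpy as np


def atom_coords(pdb_path, atom_name, chain_id=None):
    """Return an (N, 3) array with one coordinate per residue for the named atom.

    Uses the fixed-column PDB ATOM layout: atom name in columns 13-16,
    altLoc 17, chain 22, residue number + insertion code 23-27, x/y/z 31-54.
    """
    coords, seen = [], set()
    with open(pdb_path) as fh:
        for line in fh:
            if not line.startswith("ATOM"):
                continue
            if line[12:16].strip() != atom_name:
                continue
            if chain_id is not None and line[21] != chain_id:
                continue
            if line[16] not in (" ", "A"):   # skip secondary alternate locations
                continue
            key = (line[21], line[22:27])    # chain + residue number + iCode
            if key in seen:
                continue
            seen.add(key)
            coords.append(
                [float(line[30:38]), float(line[38:46]), float(line[46:54])]
            )
    return np.asarray(coords, dtype=np.float32).reshape(-1, 3)


def distance_map(coords):
    """Pairwise Euclidean distances between all rows of an (N, 3) array."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Hypothetical usage:
#   prot_map = distance_map(atom_coords("protein.pdb", "CA", chain_id="B"))
#   rna_map = distance_map(atom_coords("rna.pdb", "C4'", chain_id="C"))
# C4' is the current PDB atom name; some older files spell it C4*.
```

The map length should equal the first dimension of the matching embedding; a mismatch usually means missing residues in the PDB relative to the sequence you embedded.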
Step 4 – First run and verification
Populate inputs/inputs.list with one target ID per line, then:
python run_predictions.py
Use 7ZLQ first. Valid model.pt plus valid embeddings → predicted PDB in the output folder within a few minutes on a single GPU. No PDB? Three things to check, in order: file naming (case-sensitive), .npy shape (ESM2 must be [L, 1280] – not [1280, L], not [L, 640]), and trailing whitespace in inputs.list.
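Two of those three checks are easy to automate. The sketch below is an illustration, not part of the repo: both function names are hypothetical, and it takes explicit file paths rather than guessing how a target ID maps to file names.

```python
import numpy as np


def embedding_shape_problem(npy_path, expected_dim=1280):
    """Return None if the array is [L, expected_dim], else a description."""
    arr = np.load(npy_path, mmap_mode="r")  # memory-map: reads the header only
    if arr.ndim != 2:
        return f"expected 2-D [L, {expected_dim}], got shape {arr.shape}"
    if arr.shape[1] != expected_dim:
        hint = " (looks transposed)" if arr.shape[0] == expected_dim else ""
        return f"expected [L, {expected_dim}], got {list(arr.shape)}{hint}"
    return None


def list_file_problems(list_path):
    """Return (line_number, message) pairs for blank lines and stray whitespace.

    newline="" keeps carriage returns visible, so Windows line endings are
    reported as trailing whitespace too.
    """
    problems = []
    with open(list_path, newline="") as fh:
        for lineno, raw in enumerate(fh, start=1):
            line = raw.rstrip("\n")
            if not line.strip():
                problems.append((lineno, "blank line"))
            elif line != line.strip():
                problems.append((lineno, f"stray whitespace in {line!r}"))
    return problems
```

Running embedding_shape_problem on inputs/7ZLQB.rep_1280.npy and list_file_problems on inputs/inputs.list before launching run_predictions.py catches the shape and whitespace cases without waiting for model load.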
What does it mean for a structure prediction to “work” here, anyway? The 7ZLQ run tells you the pipeline executes cleanly. Whether the predicted complex is biologically meaningful is a separate question – one the geometric attention module handles better than you might expect when MSA depth is low, but not one a quick local test can answer definitively. That’s worth keeping in mind before you publish a result from a single run.
Common errors and the fixes that actually work
- CUDA out of memory on long RNAs. Embedding tensors scale with sequence length squared once paired. Split long RNAs at obvious domain boundaries, or move to a 40 GB+ GPU. No chunking flag exists in run_predictions.py.
- FileNotFoundError: ProRNA3D_model/model.pt. The Zenodo download silently failed or saved to the wrong directory. Re-run the curl with --fail.
- Embedding shape mismatch. Only the 650M ESM2 model produces the 1280-dim representation the pipeline expects. Smaller variants load without complaint but output is meaningless.
- conda env solve never completes. Switch to mamba: mamba env create -f ProRNA3D-single_environment.yml.
- PyRosetta licensing. If you extend the pipeline along the lines of the CASP16 setup – where ESMFold and E2EFold-3D feed component structures and PyRosetta runs restrained optimization – note that PyRosetta is free for academics but requires a separate license. It cannot be installed silently in an automated CI environment without credentials.
- No official Docker image. A third-party Dockerfile appears on Bohrium’s SciencePedia entry, but it’s not from the Bhattacharya Lab. Treat it as a community reference, not a maintained image.
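To put rough numbers on the out-of-memory item: a paired feature map has one entry per protein-residue/RNA-nucleotide pair, so its size grows with the product of the two lengths. The estimator below is back-of-the-envelope arithmetic only; the channel count is a placeholder you supply, not a value taken from the ProRNA3D-single code, and real peak usage also includes weights and intermediate activations.

```python
def paired_tensor_gib(protein_len, rna_len, channels, bytes_per_value=4):
    """Size in GiB of one dense (protein_len, rna_len, channels) tensor.

    bytes_per_value=4 corresponds to float32. `channels` is a hypothetical
    parameter for illustration.
    """
    return protein_len * rna_len * channels * bytes_per_value / 2**30


# Doubling both sequence lengths quadruples the footprint of a single tensor:
small = paired_tensor_gib(1024, 1024, channels=256)
large = paired_tensor_gib(2048, 2048, channels=256)
print(f"{small:.1f} GiB -> {large:.1f} GiB")
```

That quadratic growth is why splitting a long RNA at a domain boundary helps more than a modest VRAM bump.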
How it compares to the alternatives
AlphaFold 3 is still the default starting point for most teams. Reasonable – until you hit an orphan RBP with no usable MSA. The paper’s finding: ProRNA3D-single attains better accuracy than MSA-dependent methods when MSA information is limited. For dense, well-studied complexes with rich MSAs, AlphaFold 3 remains competitive. For designed RNAs or anything outside Rfam’s well-mapped families, ProRNA3D-single is the better local install.
“Limited MSA” is doing a lot of work in that sentence, though. How limited? The paper doesn’t give a hard cutoff – no “below N sequences, switch tools” rule. That’s an open question worth testing on your own targets rather than trusting the benchmark alone.
Upgrading and uninstalling
As of writing, there’s no version tag system on the repo – “upgrade” means git pull in your clone, then re-run the curl in case the Zenodo record changed (record ID would change; verify in the README). To uninstall:
conda deactivate
conda env remove -n ProRNA3D-single
rm -rf ~/path/to/ProRNA3D-single
Don’t forget to delete ProRNA3D_model/ as well if you created it outside the clone – the checkpoint file is not small.
FAQ
Do I really need a GPU?
Yes. CPU inference is technically possible but the geometric attention module on even a moderate complex pushes runtime from minutes to hours. Not practical.
Can I use ESMFold-predicted monomer structures instead of experimental PDBs?
Yes – turns out that’s exactly what the CASP16 submissions did. The lab’s own pipeline fed ESMFold for the protein component and E2EFold-3D for the RNA component, then ProRNA3D-single handled the interactions (per the CASP16 abstracts). The distance maps just need to be self-consistent; their absolute accuracy matters less than you’d expect because the geometric attention module is tolerant of component-structure noise. One caveat: if both monomer predictions are poor, errors compound – garbage in, garbage out still applies.
Is there a Docker image I can pull?
Not an officially maintained one. Third-party Dockerfiles exist (one on Bohrium’s SciencePedia) but they’re community references, not lab-maintained. For real reproducibility, pin your own image from the conda env file, commit the Zenodo record ID alongside it, and add the curl step to your Dockerfile explicitly – that way you control exactly what model version you’re running.
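One way to make "exactly what model version" checkable is to commit a SHA-256 of the checkpoint next to the Zenodo record ID and compare it at build time. A minimal sketch using only the standard library; the function names and the idea of storing the digest in your own repo are assumptions of this example, not something the lab publishes:

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a large checkpoint never sits fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pin(path, pinned_hex):
    """True if the file's digest equals the pinned value (case-insensitive)."""
    return sha256_of(path) == pinned_hex.strip().lower()


# Hypothetical usage in a build script:
#   if not matches_pin("ProRNA3D_model/model.pt", PINNED_SHA256):
#       raise SystemExit("model.pt differs from the pinned checkpoint")
```

Record the digest once from a download you have verified works, then treat any mismatch as a signal to re-check the Zenodo record before running predictions.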
Run 7ZLQ first. Once that works end-to-end, swap in your own target’s embeddings. The bundled example confirms the install; your real target confirms the science.