The most common question I see in deployment threads: can I actually run NVIDIA’s open source LLM locally, or is this another “open weights” model that secretly needs an 8x H100 node? Short answer: yes, you can run Nemotron 3 Nano on a single H100, or on a beefy consumer card with a community quantization if you’re patient. Long answer is what this guide is for.
NVIDIA launched the Nemotron 3 family on December 15, 2025, with Nano shipping immediately and Super expected in H1 2026. This guide walks through deploying Nemotron 3 Nano 30B-A3B FP8 – the version most people will actually serve – using vLLM. It’s the cheapest entry point to NVIDIA’s open source LLM tools, and it has one configuration gotcha that wrecks model quality if you miss it.
System Requirements (Real Ones, Not the Marketing Sheet)
Nemotron 3 Nano is a hybrid Mamba2-Transformer Mixture-of-Experts model. Per NVIDIA’s technical report, it totals 31.6B parameters with 3.2B active per forward pass – the MoE router activates 6 out of 128 experts. That “3B active” number sets your per-token compute cost, not your memory footprint: the full ~30B of weights still has to fit in memory.
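A quick back-of-the-envelope check makes the compute-versus-memory split concrete. This is a rough sketch that counts weights only: it ignores the KV cache, the Mamba state cache, activations, and any layers kept at higher precision, so treat the results as a floor, not a sizing guarantee. The function name is mine and not part of any NVIDIA or vLLM tooling.

```python
def weight_footprint_gb(total_params_billion: float, bytes_per_param: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes.

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return total_params_billion * bytes_per_param


TOTAL_B = 31.6   # total parameters in billions (technical report figure)
ACTIVE_B = 3.2   # parameters active per forward pass, in billions

fp8_gb = weight_footprint_gb(TOTAL_B, 1.0)    # FP8: 1 byte per parameter
bf16_gb = weight_footprint_gb(TOTAL_B, 2.0)   # BF16: 2 bytes per parameter
active_fraction = ACTIVE_B / TOTAL_B          # share of weights used per token

print(f"FP8 weights:  ~{fp8_gb:.1f} GB")
print(f"BF16 weights: ~{bf16_gb:.1f} GB")
print(f"Active per token: {active_fraction:.1%} of the weights")
```

Memory scales with the 31.6B total, while per-token compute scales with the roughly 10% that is active.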
| Spec | Minimum (FP8) | Recommended |
|---|---|---|
| GPU | 1× H100 80GB or H200 | 1× H200 (tested config) |
| System RAM | 64 GB | 128 GB |
| Disk (model weights) | ~35 GB FP8 / ~65 GB BF16 | 200 GB SSD for caching |
| CUDA | 12.4+ (check current release notes) | latest stable |
| Python | 3.10+ | 3.11 |
| vLLM | 0.10.1+ | latest |
NVIDIA’s own benchmark numbers use a single H200 at 8K input / 16K output as the reference. On that setup, Nemotron 3 Nano delivers throughput 3.3x higher than Qwen3-30B-A3B and 2.2x higher than GPT-OSS-20B. Below H100-class hardware you’ll either need the FP8 build or an aggressive quantization from the community.
Pull the Weights From Hugging Face
The model lives at nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8. Per the NVIDIA Nemotron Open Model License, you’ll click through a gate on the HF page before downloads work.
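# Install the HF CLI
pip install -U "huggingface_hub[hf_xet]"
# Pick a disk with at least 70 GB free for FP8.
# Set HF_HOME before logging in: the token is stored under HF_HOME,
# so changing it after login hides the token from later commands.
export HF_HOME=/mnt/models/hf-cache
# Authenticate and confirm the token works
hf auth login
hf auth whoami
hf download nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8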
The hf_xet extra switches downloads to chunked transfer, which on a 35 GB model noticeably speeds things up on a fast connection. Stock huggingface_hub without it will be slower – worth the extra second to install.
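If you would rather script the download than use the CLI, here is a minimal sketch using `snapshot_download` from `huggingface_hub`, with a free-space check in front of it. The 70 GB threshold mirrors the shell comment above. The helper names (`free_gb`, `has_room`, `download`) are hypothetical, and the cache path is an assumption: with `HF_HOME=/mnt/models/hf-cache`, the hub cache normally sits in its `hub` subdirectory.

```python
import shutil
from pathlib import Path

REPO_ID = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8"
REQUIRED_FREE_GB = 70.0  # same headroom as the shell snippet above


def free_gb(path: str) -> float:
    """Free space on the filesystem holding `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path: str, required_gb: float = REQUIRED_FREE_GB) -> bool:
    """True if the target filesystem has at least `required_gb` free."""
    return free_gb(path) >= required_gb


def download(cache_dir: str) -> str:
    """Download the gated repo into `cache_dir` and return the snapshot path.

    Requires a prior `hf auth login` (or HF_TOKEN) and license acceptance.
    """
    # Imported lazily so the disk check works before huggingface_hub is installed.
    from huggingface_hub import snapshot_download

    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    if not has_room(cache_dir):
        raise RuntimeError(
            f"Need {REQUIRED_FREE_GB:.0f} GB free in {cache_dir}, "
            f"found {free_gb(cache_dir):.0f} GB"
        )
    return snapshot_download(repo_id=REPO_ID, cache_dir=cache_dir)


# Example (assumed path; not run automatically):
# download("/mnt/models/hf-cache/hub")
```

Failing early on disk space is cheaper than discovering a full disk 30 GB into the transfer.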
Install vLLM and Serve the Model
Use Docker with vLLM’s prebuilt image, and skip the source build unless you have a reason. The command below follows the pattern on the official model card, with two additions: a container name (used later for cleanup) and a cache mount that follows HF_HOME so the weights you already downloaded are reused:
export TP_SIZE=1 # increase if you have multiple GPUs
export HF_TOKEN=hf_xxxxxxxxxxxxx
docker run --name nemotron --runtime nvidia --gpus all \
  -v "${HF_HOME:-$HOME/.cache/huggingface}:/root/.cache/huggingface" \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.10.1 \
  --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size ${TP_SIZE} \
  --max-num-seqs 64 \
  --max-model-len 131072 \
  --trust-remote-code \
  --mamba_ssm_cache_dtype float32
The flag that matters:
--mamba_ssm_cache_dtype float32 is not optional. The model card is direct: “Without this option, the model’s accuracy may degrade.” No error message, no warning – just quietly worse outputs. If you’ve tested Nemotron and thought “meh, it’s fine but nothing special,” check this flag first.
Why does this happen? The hybrid Mamba layers maintain a rolling state cache – separate from the attention KV cache that most LLM tooling is tuned for. At default precision (bf16), that state accumulates rounding errors across long sequences. Float32 costs a small slice of VRAM and stops the drift.
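To get a feel for that drift, here is a toy illustration in plain Python. It is not the Mamba2 kernel and it deliberately exaggerates the effect. It runs a simple decaying recurrence `h <- decay * h + x_t` and rounds the stored state to float32 or to bfloat16 after every step. The recurrence, its constants, and the helper names are all assumptions chosen for illustration.

```python
import math
import struct


def round_to_f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_bf16(x: float) -> float:
    """Round to bfloat16: keep the top 16 bits of the float32 encoding,
    using round-to-nearest-even on the discarded bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def run_recurrence(cast, steps: int = 4096, decay: float = 0.999) -> float:
    """Toy state update; the state is stored through `cast` at every step."""
    h = 0.0
    for t in range(steps):
        x_t = 0.001 * (1.0 + math.sin(0.01 * t))  # small, slowly varying input
        h = cast(decay * h + x_t)
    return h


reference = run_recurrence(lambda v: v)  # float64 baseline
err_f32 = abs(run_recurrence(round_to_f32) - reference)
err_bf16 = abs(run_recurrence(round_to_bf16) - reference)

print(f"float32 state error:  {err_f32:.2e}")
print(f"bfloat16 state error: {err_bf16:.2e}")
```

With only 8 bits of significand, the bfloat16 state stops absorbing small updates once it grows large enough. The float32 state keeps tracking the reference closely. The real model's behavior will differ in detail, but the failure mode is the same kind of thing.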
Verify the Deployment
Once the container logs Application startup complete, vLLM is serving an OpenAI-compatible endpoint on port 8000. Test it:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    "messages": [{"role": "user", "content": "Write a Python function that returns the nth Fibonacci number iteratively."}],
    "max_tokens": 256
  }'
Watch the response structure, not just the content. Per the model card, Nano generates a reasoning trace before its final answer – that’s by design, and the chat template has a flag to control it. If you see a wall of tokens before the answer, that’s the reasoning mode working, not a bug. Streaming performance varies by hardware; the benchmark reference config is a single H200 at 8K/16K.
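Here is a standard-library sketch of the same request from Python, plus a small helper that separates the reasoning trace from the final answer. One loud assumption: the helper expects the trace to end with a `</think>` tag inside the message content. Depending on how the server is configured, the trace may come back in a separate response field instead, so check a raw response before relying on this. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"
MODEL = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8"


def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Same JSON body as the curl call above."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def split_reasoning(text: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Return (reasoning, answer).

    ASSUMPTION: the reasoning trace is terminated by `close_tag`.
    If the tag is absent, the whole text is treated as the answer.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    return head.replace("<think>", "").strip(), tail.strip()


def chat(prompt: str) -> tuple[str, str]:
    """POST one chat request to the local endpoint and split the reply."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return split_reasoning(body["choices"][0]["message"]["content"] or "")


# Example (needs the server from the previous section to be running):
# reasoning, answer = chat("Write an iterative Fibonacci function in Python.")
```

If the answer comes back empty, the reasoning trace may have used up the whole `max_tokens` budget; raising it is the first thing to try.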
Open question worth sitting with: NVIDIA publishes 1M-token context support, but almost no public deployment guide actually tests what happens at 500K+ tokens with the Mamba state cache. If your workload needs long context, that’s an area where community data is still thin as of mid-2026.
Common Install Errors and Fixes
Pulled from the model cards and community threads – the ones that actually show up:
- CUDA OOM during startup – the model card recommends --max-num-seqs 64; drop it further if the error persists. Don’t touch --max-model-len first; reduce concurrency before you reduce context.
- “Unsupported model architecture: NemotronHForCausalLM” – you’re on a vLLM version older than 0.10.1. The hybrid Mamba-Transformer architecture needs explicit support; --trust-remote-code alone won’t save you on stale builds.
- HF 401 / gated repo error – the NVIDIA license acceptance is per-account, per-model. Click through on the web for the FP8 repo specifically, not just the BF16 one.
- OOM on DGX Spark despite “128 GB” – Spark uses unified LPDDR5X memory (~128 GB shared between CPU and GPU), not separate system + VRAM pools. If you serve with SGLang, drop --mem-fraction-static from 0.80 to 0.70 to leave room for the OS; with vLLM, the corresponding knob is --gpu-memory-utilization.
- Garbled multilingual output – Nano is optimized for English and code. For other languages you’ll get better results from Super once it fits your hardware.
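The "concurrency before context" advice for startup OOMs can be written down as a fallback ladder. This is a minimal sketch of one possible policy, not something vLLM does for you: it halves --max-num-seqs down to a floor, and only then starts halving --max-model-len. The function name and the floor values are assumptions.

```python
from typing import Iterator


def oom_fallbacks(
    max_num_seqs: int = 64,
    max_model_len: int = 131072,
    min_seqs: int = 1,
    min_len: int = 8192,
) -> Iterator[tuple[int, int]]:
    """Yield (max_num_seqs, max_model_len) pairs to try after an OOM, in order.

    Concurrency is reduced first; context length is touched only once
    concurrency has reached its floor.
    """
    seqs, length = max_num_seqs, max_model_len
    while seqs > min_seqs:
        seqs = max(min_seqs, seqs // 2)
        yield seqs, length
    while length > min_len:
        length = max(min_len, length // 2)
        yield seqs, length


for seqs, length in oom_fallbacks():
    print(f"--max-num-seqs {seqs} --max-model-len {length}")
```

Restart the container with each pair in turn and stop at the first one that starts cleanly.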
Scaling Up to Nemotron 3 Super
Super is the obvious next step when Nano hits its ceiling – but the hardware math changes. Only 12 billion of its 120 billion parameters are active at inference, which keeps compute cheap, but the full 120B still has to live in memory.
Think of it like RAM vs. CPU cores. The 120B is what loads into memory (RAM); the 12B active is what runs each token (cores). You can’t skip loading the inactive experts just because they’re idle.
Watch which variant you pull. The BF16 Super build needs a multi-GPU node. For single-card targets, the Super model card is explicit: for a single B200 or DGX Spark, use the NVFP4 variant (NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4). NVIDIA’s Blackwell quantization format (NVFP4) is purpose-built for that hardware tier – it’s not a community quant.
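The variant choice follows from simple arithmetic. The sketch below checks weights only against the ~128 GB of unified memory on a DGX Spark. It assumes NVFP4 costs a flat 4 bits per weight and ignores the format's scale-factor overhead, along with KV cache, state cache, and OS headroom, so a variant that "fits" here still needs real margin. The names are illustrative.

```python
TOTAL_PARAMS_B = 120.0   # Super: total parameters in billions
CAPACITY_GB = 128.0      # DGX Spark unified memory (shared with the OS)

# Bits per stored weight. NVFP4 is modeled as a flat 4 bits (assumption:
# scale-factor overhead is ignored).
BITS_PER_WEIGHT = {"BF16": 16, "NVFP4": 4}


def weights_gb(params_billion: float, bits: int) -> float:
    """Weights-only size in decimal GB."""
    return params_billion * bits / 8


def fits(variant: str, capacity_gb: float = CAPACITY_GB) -> bool:
    """True if the weights alone are smaller than the memory capacity."""
    return weights_gb(TOTAL_PARAMS_B, BITS_PER_WEIGHT[variant]) < capacity_gb


for variant, bits in BITS_PER_WEIGHT.items():
    size = weights_gb(TOTAL_PARAMS_B, bits)
    verdict = "fits" if fits(variant) else "does not fit"
    print(f"{variant}: ~{size:.0f} GB of weights, {verdict} in {CAPACITY_GB:.0f} GB")
```

The BF16 weights alone are nearly double the Spark's memory, which is why the model card points single-card users at NVFP4.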
The Ultra size – larger still, expected later in 2026 per the original NVIDIA announcement – is a question mark for single-node deployment. No deployment documentation exists yet as of mid-2026.
Upgrade and Uninstall
Upgrading from Nemotron 2 Nano to Nemotron 3 Nano isn’t a drop-in. The chat template changed (reasoning is now controlled by a flag, not a separate model variant), and the architecture is different – your old vLLM args won’t carry over. Pull fresh weights, update vLLM to 0.10.1+, and re-test prompts that depend on system-prompt behavior.
To uninstall:
# Stop and remove the container (assumes it was started with --name nemotron)
docker stop nemotron && docker rm nemotron
# Remove the image
docker rmi vllm/vllm-openai:v0.10.1
# Delete cached weights (this is the big one - ~35 GB); honors HF_HOME if you set it
rm -rf "${HF_HOME:-$HOME/.cache/huggingface}/hub/models--nvidia--NVIDIA-Nemotron-3-Nano-30B-A3B-FP8"
FAQ
Can I run Nemotron 3 Nano on a single RTX 4090?
Not the official FP8 build – 24 GB VRAM won’t cover it. Your path is a community GGUF quantization (Q4_K_M fits) with llama.cpp. Slower than the benchmark numbers, and no native MoE routing optimizations.
What’s the licensing situation for commercial use?
The NVIDIA Nemotron Open Model License is permissive for commercial use, including fine-tuning and redistribution of derivatives. The catch most people miss: the license is separate from the NIM container license, which has its own terms if you use NVIDIA’s prebuilt inference microservice instead of vLLM. Read both if your legal team is involved.
How fresh is the training data?
Pre-training cutoff June 2025, post-training February 2026 for Super. Nano’s cutoff is similar. Anything after that needs RAG or tool use.
Next step: Spin up the Docker command above with your own HF token and run the curl verification call. If the response streams cleanly and the reasoning trace looks coherent, you’re done – point your existing OpenAI SDK at http://localhost:8000/v1 and your code just works.