If you’ve ever asked “why is my Hugging Face transformers server stuck at 8 requests per second on an A100?” – vLLM is the answer most people land on, and most install guides skip the parts that actually break. This is a deployment walkthrough for vLLM v0.20.2, focused on high-throughput LLM inference and the specific traps in the 0.20.x line.
Quick honest take: vLLM is the fastest way to serve open-weight models on a GPU you already own, but the install surface is brittle. CUDA versions, torch pins, and a default that grabs 90% of your VRAM regardless of model size – all of it bites. Get it running first. Then we’ll talk about why the defaults lie.
What you’re actually installing
KV cache memory is the bottleneck for LLM throughput – every concurrent request needs its own slice of it, and most inference engines handle that badly. vLLM’s answer is PagedAttention (Kwon et al., SOSP 2023): it treats KV cache blocks like OS memory pages, tokens like bytes, and requests like processes. The result, per the paper, is 2-4× throughput over FasterTransformer and Orca at matched latency.
One nuance most tutorials skip: paging isn’t free. The vAttention paper (arXiv:2405.04437) clocked PagedAttention at roughly 20-26% slower than the original FasterTransformer kernel at the per-kernel level – the overhead comes from block-table lookups and extra branching. The throughput win comes from fitting more concurrent requests in memory, not a faster math kernel. Worth knowing when you benchmark.
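To make the memory pressure concrete, here’s the back-of-the-envelope KV cache math for a Llama-2-7B-style dense model – an assumed config for illustration (32 layers, 32 KV heads, head dim 128, fp16, no GQA), not Qwen3’s actual shape:
# Per-token KV cache: K and V each store one vector per layer per head.
# 2 (K+V) * 32 layers * 32 heads * 128 head_dim * 2 bytes (fp16)
echo $(( 2 * 32 * 32 * 128 * 2 ))   # 524288 bytes = 0.5 MiB per token
At 0.5 MiB per token, one 4096-token sequence needs 2 GiB of KV cache, and ten concurrent ones need 20 GiB before weights even enter the picture. That allocation problem is what PagedAttention’s block table exists to solve.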
System requirements (the real ones)
vLLM runs on Linux with NVIDIA GPU compute capability 7.0+ (V100, T4, RTX 20xx series onward, A100, L4, H100, RTX 30/40 series). Mac and Windows: no runtime support. macOS can build vLLM with VLLM_TARGET_DEVICE=empty for dev imports only – you can’t actually serve on it. WSL2 technically works but adds OOM weirdness on small VRAM cards.
| Component | Minimum | Recommended |
|---|---|---|
| OS | Linux, glibc ≥ 2.35 | Ubuntu 22.04 / 24.04 |
| Python | 3.9 | 3.12 (matches wheels) |
| GPU | Compute capability 7.0, 16 GB VRAM | A100/H100 or 24 GB+ for 7B-13B |
| CUDA driver | Compatible with CUDA 13.0 (v0.20.x) | Driver 580+ for cu130 wheels (as of v0.20.0) |
| Disk | ~50 GB free (model weights cache) | NVMe, 200 GB+ |
The CUDA story matters more than usual right now. Starting with v0.20.0, the default PyPI wheel and the vllm/vllm-openai:v0.20.0 Docker image switched to CUDA 13.0 – bumped to 13.0.2 to match PyTorch 2.11.0. The official quickstart says to use uv with --torch-backend=auto, which picks the right wheel for your driver automatically. Naively running pip install vllm on a CUDA 12.9 system hands you a wheel that may not load.
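Before installing, check what your driver actually supports – a quick sanity check against the table above (the compute_cap query field needs a reasonably recent driver; the banner fallback works everywhere):
# Name, driver, compute capability, VRAM - compare against the table
nvidia-smi --query-gpu=name,driver_version,compute_cap,memory.total --format=csv
# The banner also shows the highest CUDA version the driver supports
nvidia-smi | head -n 4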
Install vLLM v0.20.2 with uv
The official quickstart now leads with uv, not conda or plain pip. --torch-backend=auto is the reason: it picks the matching PyTorch index for your driver, removing the “wrong wheel” failure mode that burned a lot of people on the 0.20.0 release.
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create a clean Python 3.12 env
uv venv --python 3.12 --seed
source .venv/bin/activate
# Install vLLM - auto-picks cu130 / cu129 / cu128 from your driver
uv pip install vllm --torch-backend=auto
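Worth a quick sanity check before moving on – confirm vLLM and torch landed in the same env, with the CUDA build --torch-backend=auto selected:
# Versions should match what the resolver printed during install
python -c "import vllm, torch; print(vllm.__version__, torch.__version__, torch.version.cuda)"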
Want Docker instead? Same image the vLLM team ships, one command:
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-0.6B
The image is vllm/vllm-openai on Docker Hub (per the official GPU install docs). Pin a tag in production – :latest floats, and the 0.20.x line has at least one release you do not want.
The v0.20.1 trap – skip this version
This is the section other tutorials don’t have yet.
vllm==0.20.1 hard-pins torch==2.11.0. On a 3× RTX 4090 Ubuntu server, torch 2.11.0+cu130 consumes ~22-23 GiB VRAM per GPU at initialization – before any model weights load. GitHub issue #42049 documents this: the same machine on torch 2.10.0+cu130 shows ~0 GiB overhead. Every model OOMs. The fix isn’t a config flag; it’s a version number.
The ABI incompatibility makes downgrading torch inside the 0.20.1 environment impossible – you can’t pin torch 2.10.0 and keep vllm 0.20.1 working. Your options: upgrade to vllm==0.20.2 (the version this guide targets, current as of mid-May 2026) or fall back to vllm==0.19.1 with the cu129 backend.
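A quick way to check whether you’re sitting on the bad combination, and to watch the symptom live during startup (plain nvidia-smi works too; watch is just a convenience):
# The bad pairing: vllm 0.20.1 alongside torch 2.11.0
uv pip list | grep -Ei '^(vllm|torch) '
# ~22 GiB used per GPU before any weights load is the issue #42049 signature
watch -n 1 nvidia-smi --query-gpu=index,memory.used --format=csv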
One thing the version cadence makes worse: vLLM’s own RELEASE.md documents a right-shifted versioning scheme where patch releases ship roughly every two weeks and include features and bug fixes – not just backwards-compatible patches. “Latest” is not always stable.
First-time configuration and verification
Minimum viable command for a 24 GB card:
vllm serve Qwen/Qwen3-0.6B \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --host 0.0.0.0 --port 8000
vLLM starts an OpenAI-compatible server at http://localhost:8000 by default. Two verification steps:
- Health check: `curl http://localhost:8000/v1/models` – should return the loaded model name.
- Real inference:
  curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"Say hi"}]}'
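If you’d rather not eyeball raw JSON, pipe the health check through a pretty-printer (assumes python3 on the host; jq works just as well):
# A non-empty "data" array means the model registered correctly
curl -s http://localhost:8000/v1/models | python3 -m json.tool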
On the 90% VRAM thing:
`nvidia-smi` showing 90% used right after startup is not a memory leak. vLLM reserves that fraction of VRAM for KV cache blocks by design – a 0.5B model on a 24 GB card will still push ~22 GB usage at default settings (documented in GitHub issue #18582). If you’re sharing the GPU, pass `--gpu-memory-utilization 0.5` or lower.
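For a shared card, a lower reservation plus a shorter context both shrink the KV cache pool – these numbers are starting points, not tuned values:
# Reserve half the VRAM and cap context length
vllm serve Qwen/Qwen3-0.6B \
  --gpu-memory-utilization 0.5 \
  --max-model-len 2048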
Common install errors and fixes
Four failure modes cover most of the first-deploy pain.
- “ValueError: No available memory for the cache blocks” – Hits hardest on RTX 3060 12 GB. GitHub issue #27934 documents V1 engine failures on 12 GB Ampere cards even with `--cpu-offload-gb 8 --gpu-memory-utilization 0.5` set for 7B models. Drop `--max-model-len` to 2048 or 1024 first – KV cache scales linearly with that value, so halving it halves the footprint. Still failing? Use a quantized model (AWQ 4-bit) or switch to llama.cpp for that hardware.
- “CUDA out of memory” at startup on a 4090 – Almost always the 0.20.1 torch 2.11 issue. Upgrade to 0.20.2.
- “RuntimeError: CUDA error: the provided PTX was compiled with an unsupported toolchain” – Driver/toolkit mismatch. Inside Docker, add `-e VLLM_ENABLE_CUDA_COMPATIBILITY=1` to enable the pre-installed CUDA forward compatibility libraries (datacenter and professional GPUs only, per the vLLM troubleshooting docs). Outside Docker, install the `cuda-compat` package.
- Memory fragmentation on small VRAM – If reserved-but-unallocated PyTorch memory is large, `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` helps (GitHub issue #7655; see the sketch after this list).
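The fragmentation workaround from the last item, applied to the serve process itself – a minimal sketch:
# Allocator setting from GitHub issue #7655; the env var goes on the serve process
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  vllm serve Qwen/Qwen3-0.6B --max-model-len 2048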
Upgrade and uninstall
uv pip install --upgrade vllm==0.20.2
# or for Docker:
docker pull vllm/vllm-openai:v0.20.2
Migrating from 0.6.x or 0.11.x? The PyTorch 2.11 bump in 0.20 is a breaking environment change. The V1 engine is now default – set VLLM_USE_V1=0 only if you’re hitting V1-specific bugs. Clear the HF cache between major upgrades if you run into weight-loading errors.
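On the cache clearing: rather than deleting everything, huggingface-cli (it ships with the huggingface_hub package vLLM already depends on) can inspect and prune selectively:
# See what's cached and how big each repo is
huggingface-cli scan-cache
# Interactive pruning - may need: pip install 'huggingface_hub[cli]'
huggingface-cli delete-cache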
Uninstall:
uv pip uninstall vllm
rm -rf ~/.cache/vllm
# Optional: also clear downloaded weights
rm -rf ~/.cache/huggingface/hub
For Docker: docker rm -f <container> && docker rmi vllm/vllm-openai:<tag>. One community-reported gotcha: GPU memory from crashed vLLM processes sometimes persists across container restarts. A host reboot clears it reliably; other approaches vary by driver version.
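Before rebooting, check whether an orphaned worker is what’s holding the memory – per the caveat above this doesn’t pan out on every driver, but when it does, a kill is cheaper than a reboot:
# List processes still holding GPU memory
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# If a dead vLLM worker shows up, kill it and re-check
kill -9 <pid>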
FAQ
Can I run vLLM on a single RTX 3060 12 GB for a 7B model?
Technically yes; in practice, use llama.cpp instead. The V1 engine has documented memory accounting bugs on 12 GB Ampere cards (issue #27934) that push you toward quantized models or very short context windows regardless of config. vLLM’s batching advantage only pays off with concurrent requests; on a single-user 12 GB setup, llama.cpp is simpler and more predictable.
Should I use the pip wheel or the Docker image in production?
Docker, pinned tag. The image bundles compatible CUDA libraries, which removes the driver mismatch failures that hit pip installs. Pin the version – vllm/vllm-openai:v0.20.2, not :latest. The two-week patch cadence means “latest” can silently change torch versions, CUDA defaults, and engine behavior overnight, and as 0.20.1 showed, that matters.
Does vLLM actually deliver the 2-4× throughput claim from the paper?
The claim is real but conditional. It holds on batched workloads with many concurrent requests – that’s where PagedAttention’s memory efficiency translates directly to more in-flight sequences. On single-stream, low-concurrency workloads the per-kernel overhead (20-26% slower than FasterTransformer at the kernel level, per arXiv:2405.04437) can make vLLM slightly slower than a tuned TensorRT-LLM setup. There’s also a subtler question the paper doesn’t answer: what’s your actual prompt length distribution? A workload with uniformly short prompts gets much less benefit from paged memory than one with variable-length contexts. Benchmark with your real traffic before committing.
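A crude way to see the batching effect on your own box – a smoke test, not a benchmark (assumes xargs and the server from earlier; /v1/completions is part of vLLM’s OpenAI-compatible surface):
# Fire 32 concurrent requests; with continuous batching engaged, wall time
# should land far closer to 1x a single request than to 32x
seq 32 | xargs -P 32 -I{} curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen/Qwen3-0.6B","prompt":"Say hi","max_tokens":16}' \
  -o /dev/null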
Next step: Once your server returns a valid /v1/chat/completions response, run GuideLLM (vLLM’s own benchmarking tool) against it with your actual prompt distribution. The default vLLM benchmark numbers are synthetic – your traffic shape determines whether continuous batching is helping or just hiding a config problem.