Two ways to store the KV cache for an LLM serving workload: pre-allocate one contiguous slab per request, or chop it into small fixed-size pages mapped through a block table. The first is what HuggingFace Transformers and early TGI versions did. The second is PagedAttention, and it’s why vLLM exists.
The contiguous approach is simpler to implement. Also catastrophically wasteful – traditional systems waste 60-80% of KV memory on padding and reserved-but-unused slots, while the PagedAttention paper (Kwon et al., SOSP 2023) measures the paged approach at under 4% waste. That gap is why every production-grade engine – TGI, TensorRT-LLM, LightLLM – has since adopted some variant of the paging trick. This guide covers deploying vLLM v0.21.0 (as of mid-2025, the current stable PyPI release) on a CUDA box with PagedAttention tuned correctly, plus the install traps that aren’t in the official quickstart.
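To make the waste gap concrete, here is a minimal sketch of the arithmetic. The request lengths and the 2048-token reservation are made up for illustration, and the helper names are hypothetical, not anything from vLLM.

```python
import math

def contiguous_waste(seq_lens, reserved_len):
    """Fraction of KV slots left unused when every request gets one
    contiguous slab of reserved_len token slots."""
    reserved = reserved_len * len(seq_lens)
    return 1 - sum(seq_lens) / reserved

def paged_waste(seq_lens, block_size=16):
    """Fraction of KV slots left unused when each request holds only
    ceil(len / block_size) fixed-size blocks."""
    allocated = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return 1 - sum(seq_lens) / allocated

# Hypothetical final lengths (prompt + output) for six requests.
lengths = [310, 95, 1480, 620, 42, 2001]
print(f"contiguous: {contiguous_waste(lengths, 2048):.1%}")
print(f"paged:      {paged_waste(lengths):.1%}")
```

With these particular numbers the contiguous scheme leaves roughly 63% of its reservation unused, while paging wastes about 1%, all of it in the last partial block of each sequence.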
What you’re actually installing
vLLM is an Apache 2.0 inference engine originally built in UC Berkeley’s Sky Computing Lab, now backed by over 2,000 contributors (as of mid-2025). PagedAttention is the memory algorithm at its core: each sequence’s KV cache maps through a logical block table to non-contiguous physical blocks in GPU memory – exactly like an OS page table maps virtual pages to physical frames.
The default block holds 16 tokens. When a sequence generates its 17th token, vLLM claims a fresh block from the free pool – anywhere in VRAM – and updates the table. No pre-allocation, no padding waste, and identical prefixes across requests can share physical blocks via copy-on-write.
Think of it like a city that stopped reserving entire parking garages for each driver and switched to a valet system that slots cars wherever space exists. The garages fill up the same way – but nothing sits empty while someone else circles the block.
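The block table idea fits in a few dozen lines. What follows is a toy model, not vLLM’s implementation: the class names, the reference-counting scheme, and the fork method are all assumptions made for illustration.

```python
class BlockAllocator:
    """Toy pool of physical KV blocks with reference counts."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        block = self.free.pop()  # any free block will do; no contiguity needed
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    def __init__(self, allocator, block_size=16):
        self.allocator = allocator
        self.block_size = block_size
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def fork(self):
        """Create a sequence that shares every physical block with this one."""
        child = Sequence(self.allocator, self.block_size)
        child.block_table = [self.allocator.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child

    def append_token(self):
        if self.num_tokens % self.block_size == 0:
            # Last block is full (or none exists yet): claim a fresh one.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.refcount[last] > 1:
                # Copy-on-write: the partial block is shared, so switch to
                # a private block before writing into it.
                self.allocator.release(last)
                self.block_table[-1] = self.allocator.allocate()
        self.num_tokens += 1


pool = BlockAllocator(num_blocks=8)
a = Sequence(pool)
for _ in range(17):
    a.append_token()        # the 17th token forces a second block

b = a.fork()                # b shares both of a's blocks
b.append_token()            # writing into the shared partial block triggers the copy
print(a.block_table, b.block_table, len(pool.free))
```

After the fork, the two sequences still share the full first block and diverge only on the second one. A real engine would also copy the KV data into the new block; the sketch tracks only the bookkeeping.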
System requirements
Per the vLLM releases page, v0.21.0 has the following baseline:
| Component | Minimum | Recommended |
|---|---|---|
| OS | Linux, glibc ≥ 2.35 (Ubuntu 22.04+) | Ubuntu 24.04 |
| Python | 3.12 | 3.12 in a fresh venv |
| GPU | NVIDIA Ampere (A10, A100) or newer | H100 / H200 for FP8 |
| CUDA | 12.x | 13.0 – matches the default v0.21.0 wheel |
| VRAM | 16 GB for 7B FP16 | 80 GB for 70B with tensor parallel |
| Disk | 30 GB for one 13B model | 200 GB if you collect models |
ROCm builds are available from v0.14.0 onwards. For the current ROCm target version, check the AMD ROCm vLLM optimization page directly – it tracks the latest tested combination.
Install vLLM v0.21.0 (the uv way)
The official quickstart recommends uv over plain pip. There’s a real reason for that, which we’ll get to.
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create a Python 3.12 venv
uv venv --python 3.12 --seed
source .venv/bin/activate
# Install vLLM - auto-detects your CUDA driver
uv pip install vllm --torch-backend=auto
--torch-backend=auto inspects your installed CUDA driver and picks the matching PyTorch index. Want to lock in CUDA 13.0 explicitly? Use --torch-backend=cu130. Plain pip works for stable releases:
pip install vllm
But pip cannot install vLLM nightly wheels correctly. Turns out pip merges packages from --extra-index-url with the default PyPI index and picks whichever has the highest version number – so nightlies get silently overridden by the latest stable. uv gives the extra index higher priority, which is the only reason the nightly install actually works. This is documented in vLLM’s own install docs but easy to miss.
First-time config: the PagedAttention knobs that matter
Start the OpenAI-compatible server with a sane baseline. Using Llama 3.1 8B here – the Llama-2-7B example in every other tutorial is three years old:
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --block-size 16 \
  --enable-prefix-caching \
  --port 8000
What each flag actually does:
--gpu-memory-utilization 0.85 – default is 0.9, which leaves almost no headroom for CUDA driver memory. Community testing on A10G hardware showed headroom collapsing under real load. Drop to 0.85 for stability; push higher only after measuring your specific workload.
--max-model-len 8192 – caps the context window. Lower = more concurrent sequences fit in the KV cache pool. Don’t blindly set this to the model’s max.
--block-size 16 – the PagedAttention block size in tokens. 16 is the sweet spot for most models. Bump to 32 for long-context workloads where gather overhead amortizes better.
--enable-prefix-caching – turns on cross-request KV sharing for identical prompt prefixes. Free throughput if your traffic has repeated system prompts.
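The link between --max-model-len and concurrency is easiest to see as worst-case arithmetic. A rough sketch, assuming a hypothetical pool of 4096 blocks and a helper name invented for this example:

```python
def worst_case_concurrency(num_gpu_blocks, block_size, max_model_len):
    """Number of sequences that fit if every one grows to max_model_len."""
    blocks_per_seq = -(-max_model_len // block_size)  # ceiling division
    return num_gpu_blocks // blocks_per_seq

# Hypothetical pool: 4096 blocks of 16 tokens = 65,536 token slots.
for max_len in (4096, 8192, 32768):
    print(max_len, worst_case_concurrency(4096, 16, max_len))
```

This is the pessimistic bound. Real traffic rarely fills the whole window, so more sequences usually fit in practice, but the bound shows why a needlessly large context cap shrinks the guaranteed batch.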
Special case worth knowing: if you’re running DeepSeek-V3 or R1 on AMD MI300X, the AITER MLA backend requires --block-size 1. vLLM raises an error rather than silently defaulting if you forget. That’s in the ROCm vLLM documentation but absent from most generic install tutorials.
Verify it works
curl -i http://localhost:8000/health
# Expect: a 200 OK status line (-i prints the response headers; the body is empty)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hi in 5 words."}]
  }'
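If you would rather script the check, the same request can be sent from Python with only the standard library. This is a minimal sketch: the function names are made up here, and it assumes the server is listening on localhost:8000 and returns the usual OpenAI-style response shape.

```python
import json
import urllib.request

def build_chat_request(prompt, model, base_url="http://localhost:8000"):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt, model, timeout=60):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, ask("Say hi in 5 words.", "meta-llama/Meta-Llama-3.1-8B-Instruct") should return the model’s reply as a string.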
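Watch nvidia-smi in another terminal during inference. The KV cache pool size is logged at startup as # GPU blocks: N. Multiply N × block_size × num_layers × num_kv_heads × head_dim × 2 (K and V) × bytes_per_dtype and you get the actual KV memory footprint – useful when sizing batches. Use the KV head count, not the attention head count: on grouped-query models such as Llama 3.1 8B (8 KV heads versus 32 attention heads) the two differ.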
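Here is that footprint arithmetic as a small sketch, counting every layer and using the KV head count. The layer and head values are Llama 3.1 8B’s published config; the block count of 4096 is a made-up example, not a measured value, and the function name is hypothetical.

```python
def kv_cache_bytes(num_gpu_blocks, block_size, num_layers,
                   num_kv_heads, head_dim, dtype_bytes=2):
    """Total bytes held by the KV cache pool."""
    # Bytes per token: K and V (factor 2) for every layer and KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return num_gpu_blocks * block_size * per_token

# Llama 3.1 8B: 32 layers, 8 KV heads (grouped-query attention), head_dim 128.
# num_gpu_blocks=4096 is hypothetical; read the real value from the startup log.
total = kv_cache_bytes(4096, 16, num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{total / 2**30:.1f} GiB")
```

At FP16 this works out to 128 KiB per token and 2 MiB per 16-token block, which is a handy number to keep in your head when reading the startup log.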
The batch-size-1 latency trap nobody warns you about
PagedAttention isn’t free. The custom CUDA kernel has to gather non-contiguous blocks via the block table at every attention step. At batch size 1 – a single user, one stream – that gather overhead is real, and FlashAttention-2’s contiguous access pattern actually wins on raw latency. The Runpod technical write-up on vLLM internals confirms this edge case explicitly.
Before committing: if your workload is genuinely single-stream (a developer using one chatbot, a batch script that never parallelizes), benchmark vLLM against a plain transformers + FlashAttention-2 setup first. PagedAttention wins decisively at concurrency > ~4. Below that, the math is closer than the marketing suggests.
At higher concurrency the memory savings dominate. The vLLM team’s launch benchmarks reported up to 24x higher throughput versus HuggingFace Transformers, while the SOSP 2023 paper itself reports 2-4x over FasterTransformer and Orca at the same latency. Both numbers come from high-concurrency scenarios – it’s a throughput algorithm, not a latency one.
Common errors and the actual fixes
CUDA out of memory at startup
Almost always the default --gpu-memory-utilization 0.9 colliding with CUDA driver reservations. Drop to 0.85, or lower --max-model-len to free KV cache budget.
OOM mid-inference with free VRAM showing in nvidia-smi
This one’s nastier and most install guides miss it entirely. vLLM records a static CUDA graph for the decode phase to speed up repeated kernel launches – that graph reserves a permanent memory chunk. When request sequence lengths vary wildly, the graph’s allocation gets fragmented and you OOM despite nvidia-smi showing free memory. --enforce-eager disables graph capture and fixes it, at a measurable throughput cost. Tested and documented against A10G 24 GB hardware.
PyTorch CUDA fragmentation errors
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
Set this before launching. It stops PyTorch’s caching allocator from splitting cached blocks larger than 512 MB, which reduces fragmentation when the KV cache pool grows and shrinks aggressively.
uv installs CPU torch on a GPU machine
Reported on GitHub: --torch-backend=auto can resolve to the CPU wheel even when CUDA is present on some uv versions. Workaround: explicitly pin with --torch-backend=cu130 (or whatever matches your driver).
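A quick way to confirm which wheel you actually got is to ask PyTorch directly. A small sketch with a hypothetical helper name; it returns a report even when torch is missing, so it is safe to run in any environment.

```python
import importlib.util

def torch_cuda_report():
    """Summarize whether the installed torch build can use CUDA."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False, "cuda_build": None, "cuda_available": False}
    import torch
    return {
        "installed": True,
        "cuda_build": torch.version.cuda,  # None on CPU-only wheels
        "cuda_available": torch.cuda.is_available(),
    }

print(torch_cuda_report())
```

If cuda_build comes back as None on a machine with an NVIDIA GPU, the resolver picked the CPU wheel and the explicit pin above is the fix.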
Upgrade and uninstall
Upgrade is one command:
uv pip install -U vllm --torch-backend=auto
Read the changelog before jumping major versions – the v0.20 → v0.21 move shifted the default wheel to CUDA 13.0, which breaks setups pinned to CUDA 12.x. Uninstall:
uv pip uninstall vllm
rm -rf ~/.cache/huggingface # if you want the model cache gone too
The HuggingFace cache is the bigger disk hog – model weights stay there until you clear it.
FAQ
Is PagedAttention the same thing as FlashAttention?
No – different layers entirely. FlashAttention is a fused kernel that cuts HBM reads during the attention computation itself. PagedAttention manages where the KV cache lives in memory. vLLM runs both simultaneously: FlashAttention-2 or -3 handles the compute, PagedAttention handles the layout.
Can I run vLLM without PagedAttention?
Not really – it’s baked into the engine. If you specifically want dynamic KV allocation without the non-contiguous virtual memory layout that PagedAttention creates, the vAttention paper (arXiv 2405.04437) argues for using CUDA’s low-level virtual memory APIs to keep the layout contiguous in virtual space while still allocating physical memory on demand. Research direction, not a drop-in alternative today.
What block size should I actually use?
Stick with 16 unless your profiler says otherwise. Here’s the tradeoff: larger blocks (32) mean fewer table lookups per token, which helps long-context workloads – but the last partial block per sequence wastes more memory. Smaller blocks (8) cut that internal fragmentation but pay it back in lookup cost. For most Llama- and Qwen-family models on production traffic with mixed sequence lengths, 16 is what the vLLM team tuned the kernels around. One exception: DeepSeek V3/R1 on ROCm AITER – hard-required at 1, as covered above.
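The tradeoff is easy to put numbers on. A sketch with made-up sequence lengths and a hypothetical helper: it counts block table entries (a proxy for lookup cost) and unused slots in the last partial blocks (internal fragmentation) for each block size.

```python
import math

def block_size_tradeoff(seq_lens, block_size):
    """Count block table entries and wasted token slots for one block size."""
    blocks = [math.ceil(n / block_size) for n in seq_lens]
    wasted = sum(b * block_size - n for b, n in zip(blocks, seq_lens))
    return {"table_entries": sum(blocks), "wasted_slots": wasted}

# Hypothetical final lengths (prompt + output) for six requests.
lengths = [310, 95, 1480, 620, 42, 2001]
for size in (8, 16, 32):
    print(size, block_size_tradeoff(lengths, size))
```

On this sample, doubling the block size roughly halves the number of table entries and roughly doubles the wasted slots. Whether the lookup savings are worth it depends on the kernel and the traffic, which is why the profiler gets the last word.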
Next step: spin up the install, then run benchmarks/benchmark_throughput.py from the vLLM repo against your actual prompt distribution. The defaults are good. Your traffic isn’t default.