You’ve got Ollama installed. You ran ollama run llama3 and it worked. Then you walked away for ten minutes, came back, and your next query took eight seconds to even start responding.
That’s not a bug. That’s the default config.
Most Llama tutorials stop at “it runs.” This one starts where that frustration begins – the performance traps, the VRAM miscalculations, and the settings nobody tells you to change. If you want Llama running locally without the constant friction, here’s what actually matters.
Why Your Hardware Specs Lie to You
The Ollama docs say a 7B model needs 8GB of RAM. Technically true. Practically useless.
That 8GB accounts for the model weights in Q4_K_M quantization (around 4.7GB for Llama 3 8B). What it doesn’t mention: the KV cache. At a 32K context window, an 8B model burns through roughly 4.5GB just for the attention cache. You’re already at 9.2GB before the OS, Ollama’s overhead, or your browser tabs get a vote.
Run that setup on an 8GB machine and Ollama starts offloading layers to system RAM. Performance drops from 40 tokens/sec to maybe 8. Tests show models with RAM spillover run 5-30x slower than models that fit entirely in VRAM. The bottleneck isn’t your CPU – it’s PCIe bandwidth shuttling data between RAM and GPU.
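If you want to sanity-check that KV cache number yourself, the back-of-envelope formula is 2 (K and V) x layers x KV heads x head dimension x context length x bytes per value. A rough sketch for Llama 3 8B (32 layers, 8 KV heads, head dim 128) with an fp16 cache at 32K tokens:
# rough KV cache size in bytes: 2 * layers * kv_heads * head_dim * context * 2 bytes (fp16)
echo $((2 * 32 * 8 * 128 * 32768 * 2))
# 4294967296 -> roughly 4.3GB, the same ballpark as the figure above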
The Real Minimums (As of March 2026)
| Model Size | Minimum RAM | Comfortable VRAM (GPU) | Expected Speed |
|---|---|---|---|
| 1B-3B (Llama 3.2) | 4GB | 2-4GB | 50+ tokens/sec |
| 7B-8B (Llama 3/3.1) | 16GB | 8-12GB | 30-60 tokens/sec |
| 13B-14B | 24GB | 16GB+ | 20-40 tokens/sec |
| 70B (Llama 3.3) | 64GB | 48GB+ (or dual GPUs) | 10-15 tokens/sec |
These numbers assume Q4_K_M quantization and a 32K context window. Go higher on context, add those gigabytes to the VRAM column.
Install Ollama (The Part Everyone Gets Right)
One command on Linux and macOS. Windows gets an installer.
# Linux/macOS
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from ollama.com/download
# Run OllamaSetup.exe
Verify it’s running:
ollama --version
# Should show v0.6.x or later (as of April 2026)
Ollama starts as a background service automatically. It listens on localhost:11434. If that port’s already taken, you’ll know immediately – the service won’t start. Kill whatever’s sitting on 11434 or configure Ollama to use a different port.
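On Linux or macOS, a quick way to find the conflict and move Ollama elsewhere (11500 below is just an arbitrary free port; OLLAMA_HOST controls where the server binds):
# See what's holding the default port
lsof -i :11434
# Bind Ollama to a different port for this session
OLLAMA_HOST=127.0.0.1:11500 ollama serve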
Pull a Model (And Actually Understand What You’re Downloading)
Here’s where version naming gets messy.
ollama pull llama3 # 8B model, 4.7GB
ollama pull llama3.1 # 8B model, 4.9GB, 128K context
ollama pull llama3.2 # 3B model, 2.0GB (smaller!)
ollama pull llama3.1:70b # 70B model, 43GB
llama3:latest gives you the 8B variant. llama3.2:latest gives you 3B. The naming doesn’t follow a pattern – you have to check the library page to see sizes before pulling.
For most local setups, start here:
ollama pull llama3.1:8b
That’s Llama 3.1 at 8 billion parameters with Q4_K_M quantization baked in. Q4_K_M is the sweet spot – smaller quantizations like Q2 or Q3 trash output quality, while Q8 eats VRAM for marginal gains.
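If you do want a different quantization, most models in the library publish explicit tags for it. Tag names vary per model, so check the model’s library page before pulling – something along these lines:
ollama pull llama3.1:8b-instruct-q8_0 # higher fidelity, roughly 8.5GB
ollama pull llama3.1:8b-instruct-fp16 # unquantized, roughly 16GB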
Pro tip: Run ollama list after pulling to confirm the model name and size. The download might succeed but use a different tag than you expected. Verify before you start scripting against it.
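Two commands cover that check:
ollama list # installed models with tags and sizes
ollama show llama3.1:8b # architecture, parameter count, context length, quantization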
Run It (Three Ways That Matter)
1. Interactive Chat
ollama run llama3.1:8b
This drops you into a REPL. Type your prompt, hit Enter, watch tokens stream. Type /bye to exit.
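A few other slash commands are worth knowing while you’re in there:
/? # list all REPL commands
/show info # parameters, context length, quantization of the loaded model
/set parameter num_ctx 8192 # shrink the context window for this session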
Fast to test, annoying for real work.
2. Single-Shot Command
ollama run llama3.1:8b "Explain VRAM allocation in 50 words."
Runs the prompt, prints output, exits. Good for shell scripts or one-off questions.
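It also composes with ordinary shell tricks – for example, summarizing a file (meeting-notes.txt here is a stand-in for whatever you actually have):
ollama run llama3.1:8b "Summarize this file: $(cat meeting-notes.txt)"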
3. API Call (The One You’ll Actually Use)
Ollama’s REST API runs automatically when the service is up. Two endpoints: native and OpenAI-compatible.
# Native API
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"prompt": "Why does KV cache grow with context?",
"stream": false
}'
# OpenAI-compatible (easier for most tools)
curl http://localhost:11434/v1/chat/completions -d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "Explain KV cache."}]
}'
The OpenAI format works with tools expecting GPT API structure. Point your client library at localhost:11434/v1 instead of OpenAI’s endpoint and it just works.
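Most official OpenAI SDKs also read the standard environment variables, so you often don’t have to touch code at all. Ollama ignores the API key, but the SDKs insist on a non-empty value:
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=ollama # any non-empty string; Ollama doesn't check it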
The Performance Killers Nobody Warns You About
Trap #1: The 5-Minute Unload
By default, Ollama unloads models after 5 minutes of inactivity. Next request triggers a full reload from disk – 3 to 8 seconds of dead time before tokens start flowing.
Fix it system-wide:
# Linux with systemd
sudo systemctl edit ollama
# Add:
[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"
sudo systemctl restart ollama
-1 means “never unload.” The model stays resident in memory until you explicitly stop Ollama. If you’re tight on RAM, set it to 24h instead of infinity.
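If you’d rather not touch the service config, keep_alive is also accepted per request – and a request with no prompt doubles as a way to preload the model before you need it:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"keep_alive": -1
}'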
Trap #2: VRAM Overflow You Can’t See
Run ollama ps while a model is loaded:
NAME SIZE PROCESSOR CONTEXT
llama3.1:8b 4.9GB 100% GPU 4096
That 100% GPU is a lie if your total VRAM usage exceeds available memory. Ollama will silently offload layers to system RAM. Check actual utilization:
# NVIDIA
nvidia-smi
# AMD
rocm-smi
If VRAM usage is maxed out but ollama ps says 100% GPU, you’re in spillover territory. Either reduce the context length or grab a smaller model.
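Context length is set inside the session or per request (recent builds also honor an OLLAMA_CONTEXT_LENGTH environment variable for the server-wide default):
# In the interactive session
/set parameter num_ctx 8192
# Or per request via the API
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"prompt": "Why does KV cache grow with context?",
"options": {"num_ctx": 8192}
}'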
Trap #3: CPU Mode When You Have a GPU
Ollama auto-detects GPUs. Sometimes it fails quietly.
OLLAMA_DEBUG=1 ollama serve
Look for “discovering available GPUs” in the log output. If it reports zero devices but you have a GPU installed, your drivers are wrong. Ollama needs NVIDIA driver 531+ for CUDA, ROCm v7 for AMD. Older drivers won’t initialize.
On AMD + Linux, also check group permissions. Ollama needs access to /dev/kfd. If the ollama user isn’t in the render group, GPU detection fails.
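The usual fix is adding that user to the render and video groups and restarting the service (group names can vary by distro):
sudo usermod -a -G render,video ollama
sudo systemctl restart ollama
ls -l /dev/kfd /dev/dri/render* # the render group should own these device nodes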
When Ollama Is the Wrong Tool
Ollama’s built for simplicity, not maximum throughput. A few scenarios where you should look elsewhere:
- AMD Vulkan setups where you need bleeding-edge performance: Ollama’s vendored llama.cpp lags upstream by weeks or months. As of April 2026, there’s a documented 56% throughput gap between Ollama and standalone llama.cpp on AMD GPUs using Vulkan. If you’re on a 7900 XTX or Strix Halo APU, build llama.cpp from source.
- Batch workloads processing hundreds of prompts in parallel: Ollama’s scheduler handles concurrency (OLLAMA_NUM_PARALLEL defaults to 4), but tools like vLLM or TGI are purpose-built for high-throughput batch inference.
- Production deployments with strict SLAs: Ollama’s simplicity means less control over memory pools, request queueing, and failover. If uptime matters, containerize llama.cpp or use a managed inference service.
For prototyping, development, or personal AI workflows? Ollama’s unbeatable. For production scale, it’s a starting point, not the finish line.
What’s Actually Fast Enough
Real numbers from consumer hardware (as of early 2026):
- RTX 4060 Ti (16GB): Llama 3.1 8B at ~40 tokens/sec, context up to 64K without spillover. Comfortable for interactive use.
- Apple M3 Max (128GB unified): Llama 3.3 70B at ~15 tokens/sec. No discrete GPU, no VRAM limit – unified memory lets you run models other setups can’t touch.
- AMD Ryzen 9 + 64GB RAM (CPU-only): Llama 3.1 8B at ~8 tokens/sec. Usable, but you’ll feel every token. Memory bandwidth matters more than core count here.
If you’re sitting above 30 tokens/sec on a 7B-8B model, you’re in the productive zone. Below 10, you’ll spend more time waiting than working.
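To see where you land, run a prompt with --verbose and read the eval rate line Ollama prints at the end:
ollama run llama3.1:8b --verbose "Explain VRAM allocation in 50 words."
# ...
# eval rate: 41.3 tokens/s (illustrative; yours will differ)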
The Version Trap
As of March 2026, Ollama v0.6.2 supports Llama 4 (the new MoE models with 16x17B and 128x17B variants). If you’re on an older build, don’t count on auto-update to get you there – manually re-download from ollama.com/download or re-run the install script. This is a one-time issue, but if your ollama pull llama4 command throws a 412 error, you’re on an old version.
Check your version:
ollama --version
Anything below v0.6.2 won’t recognize newer model architectures like Llama 4 or Qwen 3.5.
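Upgrading is the same motion as installing – on Linux, re-running the script swaps the binary in place; on macOS and Windows, grab the new installer from ollama.com/download:
curl -fsSL https://ollama.com/install.sh | sh
ollama --version # confirm the new version took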
FAQ
Can I run multiple models at once?
Yes, but each model consumes VRAM. Ollama’s scheduler loads models on-demand and can keep multiple models resident if you have the memory. Set OLLAMA_MAX_LOADED_MODELS to control how many stay in memory simultaneously – the documented default is 3x the number of detected GPUs (or 3 for CPU-only inference). Running two 8B models at once needs ~20GB VRAM after accounting for KV cache.
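It lives in the same systemd override (or the equivalent environment setup on macOS/Windows) as OLLAMA_KEEP_ALIVE:
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_NUM_PARALLEL=2"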
Why does the same model run faster in LM Studio than Ollama?
Could be quantization mismatch (LM Studio might default to Q5 where Ollama uses Q4), or Ollama’s vendored llama.cpp version lagging behind. On AMD Vulkan specifically, standalone llama.cpp built from recent upstream code runs measurably faster – sometimes 50%+ – because Ollama’s bundled version doesn’t include the latest GPU optimizations. Also check that both are actually using GPU, not CPU fallback.
What’s the difference between llama3, llama3.1, and llama3.2?
- Llama 3: Original 8B and 70B models with 8K context.
- Llama 3.1: Same sizes plus a 405B flagship, context bumped to 128K, better at tool use and multilingual tasks.
- Llama 3.2: Small models (1B and 3B) optimized for on-device and edge use – lower memory, faster inference, less capable than the 8B variants.

Meta’s Llama 3 technical report confirms 3.1’s 405B model rivals GPT-4 on many benchmarks. For local use, 3.1’s 8B is the best balance unless you’re on severely constrained hardware – then grab 3.2’s 3B.