
How to Use Llama 3 Locally: An Engineer’s Guide

Run Llama 3 locally without the silent slowdowns. Covers VRAM math, quantization tradeoffs, KV cache traps, and which runner actually fits your hardware.

8 min read · Advanced

Picture the end state: a terminal window with Llama 3.1 8B answering you at 40 tokens per second, a REST API running quietly on localhost:11434, and a quant choice that actually fits your GPU instead of silently spilling into RAM at one-tenth the speed. That’s where this goes. The path there is mostly about not making the four mistakes that every default tutorial leads you into.

We’ll skip the privacy pitch – you already know why you want this – and focus on the engineering decisions: which model size, which quantization, which runner, and how to catch the silent slowdowns before they make you think Llama 3 is just slow.

The problem with the default tutorial

Type ollama run llama3 and it works. That’s the appeal. It’s also the trap.

The default tag pulls an 8B model at Q4_0, drops you into a chat, and tells you nothing about whether your GPU is actually doing the work. The failure mode: Ollama silently offloads layers to system RAM when the model exceeds VRAM – and LocalLLM.in benchmarks show inference throughput dropping 5-20x when that happens. Your responses just get slower and you assume Llama 3 is “like that.”

It isn’t. You picked the wrong quant.

Think of VRAM like a workbench. The model is your tools. If the bench is too small, half the tools end up on the floor – you can still work, but every reach to the floor costs you. Quantization is how you decide which tools to keep on the bench and which to leave in the drawer. This guide is about making that choice deliberately instead of letting Ollama make it silently.

Ollama, LM Studio, llama.cpp – what actually differs

Most guides treat these as competing products. They’re not, really. Turns out all three use llama.cpp as the underlying inference engine (as of early 2026), so raw compute speed is nearly identical. What varies is the wrapper: Ollama adds a daemon, a CLI, and an HTTP API. LM Studio adds a desktop GUI with Apple Silicon MLX optimizations. llama.cpp is the bare engine – fastest ceiling, most friction.

The “which tool is fastest” debate is mostly noise. Where your speed actually lives or dies: how each one handles KV cache, layer offloading, and model storage. Those three things. Not the GUI.

Pick the model size for your hardware (not the other way around)

Real VRAM numbers for the Llama 3 family, before any KV cache or system overhead (figures from InsiderLLM’s size guide, verified February 2026 – these change as new quants are released, so check current listings):

| Model | VRAM (Q4_K_M) | VRAM (FP16) | Realistic hardware |
|---|---|---|---|
| Llama 3.2 1B | ~1-2 GB | ~3 GB | Anything, including Raspberry Pi 5 |
| Llama 3.2 3B | ~2-3 GB | ~6-7 GB | Integrated GPU or 16 GB RAM laptop |
| Llama 3.1 8B | ~5-6 GB | ~16 GB | RTX 3060 12GB / M-series Mac with 16 GB |
| Llama 3.3 70B | ~43 GB | ~140 GB | Dual 3090s, A6000, or 64+ GB unified memory Mac |

Add 1-2 GB for KV cache at default context, more if you push it. The 8B at Q4_K_M fits in 8 GB VRAM with room to spare – that’s where most people land.
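
A quick way to check your actual headroom before picking a row (Nvidia cards; these are standard nvidia-smi query flags):

# Free VRAM is the number that matters - the desktop and other processes
# already claim part of the total.
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader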

The KV cache trap

KV cache grows linearly with context length. With FP16 precision, an 8B model at 32K context burns approximately 4.5 GB for KV cache alone – separate from the model weights (per LocalLLM.in’s VRAM analysis, February 2026). So that 8B model you sized for an 8 GB GPU? At 32K context it now needs ~10 GB total. At 128K, you’re in trouble on anything short of a workstation card.

This is why conversations get slower as they grow. You’re not imagining it – you’re hitting the KV cache ceiling.
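
If you want to sanity-check that figure, the back-of-envelope math is short. The numbers below come from Llama 3.1 8B’s published architecture (32 transformer layers, 8 KV heads via grouped-query attention, head dimension 128); treat the result as a ballpark, since runners add their own bookkeeping on top:

# Per token: 2 (K and V) x 32 layers x 8 KV heads x 128 dims x 2 bytes (FP16)
#   = 131,072 bytes, about 128 KB per token.
# At 32K context: 32,768 tokens x 128 KB is roughly 4 GiB - in line with the ~4.5 GB above.
echo "$(( 2 * 32 * 8 * 128 * 2 * 32768 / 1024 / 1024 / 1024 )) GiB"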

The fix most tutorials skip: Set OLLAMA_KV_CACHE_TYPE=q8_0 before starting Ollama. It halves KV cache memory with minimal quality loss – matters most for the 70B, where it’s the difference between running and not running at all.
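
If you start the server yourself, that looks like the line below. Note that in recent Ollama versions, KV cache quantization only takes effect when flash attention is also enabled – check the docs for your version – and if Ollama runs as a systemd or launchd service, set these variables in the service environment instead:

OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve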

There’s a related trap with context limits. Meta advertises 128K context for the Llama 3.1 and 3.2 family. The fine print: quantized versions in Ollama typically cap at 8K despite the full-precision model supporting 128K (noted by InsiderLLM as of 2026 – verify with ollama show --modelinfo on your specific tag). The default Ollama tag will silently clamp your context. If you need real long-context work locally, verify the limit before you trust it.
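
Concretely, before relying on a long context window (the exact output layout of ollama show varies by version):

# "context length" under Model is the architecture's maximum; a num_ctx entry
# under Parameters (if present) is what this tag will actually use.
ollama show llama3.1:8b-instruct-q4_K_M | grep -iE "context|num_ctx"

You can raise the working limit per model via PARAMETER num_ctx in a Modelfile (as in the example further down) or per request via the API’s options.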

The actual setup, walked backwards

Goal state: http://localhost:11434/api/chat responds from a model that’s entirely GPU-resident and inference is GPU-bound.
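
Once the steps below are done, this is the smoke test. The model tag matches what Step 2 pulls; stream is turned off so you get a single JSON response instead of a token stream:

curl -s http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b-instruct-q4_K_M",
  "messages": [{ "role": "user", "content": "Say hello in five words." }],
  "stream": false
}'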

Step 1: Install the right runner for your job

Don’t pick based on which has the nicest GUI. Pick based on what you’re doing:

  • Building an app or pipeline → Ollama. It exposes a REST API out of the box, manages model versions like Docker manages containers, and runs as a background daemon you can forget about.
  • Apple Silicon and you want maximum tokens/sec → LM Studio. On a Mac Studio M3 Ultra, LM Studio hit 237 t/s vs Ollama’s 149 t/s on Gemma 3 1B – the gap comes from Apple’s MLX engine, which Ollama doesn’t use (Arsturn benchmark, 2025).
  • Want every last token/sec and don’t mind compiling → llama.cpp directly. Unlike LM Studio, which only gives access to pre-quantized models, llama.cpp lets you quantize on-device and customize memory usage to match your exact hardware.

For most readers: install Ollama from the official download page.

Step 2: Pull the right tag, not the default one

# Don't do this:
ollama run llama3

# Do this - explicit size and quant:
ollama pull llama3.1:8b-instruct-q4_K_M

# Verify what got pulled:
ollama show llama3.1:8b-instruct-q4_K_M --modelinfo

Q4_K_M is the recommended starting point – best balance of quality, VRAM, and speed. Q5_K_M gives slightly better quality at 15-20% more VRAM. Q6_K and Q8_0 get close to full-precision quality but need a lot more VRAM for shrinking gains. Q3 and Q2? Avoid them unless you have no choice – quality degradation is severe at those levels (per LocalLLM.in analysis, February 2026).
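
If you’d rather compare quants empirically than take that on faith, pull a second one and look at the on-disk size difference – it roughly tracks the extra VRAM the weights will need (tags shown follow the current naming scheme; check the library listing for what’s actually published):

ollama pull llama3.1:8b-instruct-q5_K_M
ollama list | grep llama3.1    # compare the SIZE column across the two tags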

Step 3: Force-check that GPU is actually being used

OLLAMA_DEBUG=1 ollama run llama3.1:8b-instruct-q4_K_M "test" 2>&1 | head -20
# Note: OLLAMA_DEBUG configures the server. If Ollama is already running as a
# background daemon, set it on the serve process instead and read the offload
# details from the server log.

# On Nvidia, watch live VRAM use:
nvidia-smi -l 1

If nvidia-smi shows the ollama process using close to your model size in VRAM – you’re GPU-bound. If it’s using a fraction and tokens/sec feel slow, layers have spilled into RAM. Fix: smaller quant or smaller model.
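
Ollama also reports the split directly, which works on any GPU vendor, not just Nvidia (column layout varies slightly by version):

# PROCESSOR shows "100% GPU" when fully resident, or a CPU/GPU split when layers spilled.
ollama ps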

A real example: 8B on an RTX 3060 12GB with 16K context

Llama 3.1 8B at Q4_K_M: ~5.5 GB for weights. At 16K context with FP16 KV cache: add ~2.3 GB. Plus ~1 GB for runner overhead (approximate). Total: ~8.8 GB – comfortably inside 12 GB VRAM.

# Modelfile
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER num_ctx 16384
PARAMETER temperature 0.4
# Offload every layer to the GPU:
PARAMETER num_gpu 999
SYSTEM "You are a senior backend engineer. Be concise."

# Build and run it:
ollama create dev-assistant -f Modelfile
ollama run dev-assistant

The num_gpu 999 bit is intentional – it forces all layers to GPU. If Ollama can’t fit them all, it fails loudly instead of silently spilling. That’s what you want. Ollama’s automatic detection is usually the right call, but num_gpu 999 works as a tripwire: pass or fail, no quiet degradation.

Things that didn’t fit in the main flow

  • Storage layout matters if you switch tools. Ollama stores blobs by content-hash in ~/.ollama/models; LM Studio and llama.cpp read plain .gguf files (per D-Central’s comparison, 2025). Sharing a model between Ollama and llama.cpp without re-downloading 40 GB requires an explicit import via Modelfile – symlinks won’t work (see the sketch after this list).
  • Llama 3.3 70B usually beats Llama 3.1 405B at a sixth of the hardware cost. The 70B matches or beats the 405B on several benchmarks (InsiderLLM, 2026). The 405B’s edge shows only on narrow reasoning edge cases. If you’re considering 405B: don’t, unless you have a specific benchmark showing it outperforms 70B on your actual task.
  • Don’t run Llama 3.2 1B and expect reasoning. Short-text summarization, simple instruction following, basic classification – yes. Holding a real conversation or anything requiring multi-step reasoning – no. It’s for embedded use, not chat.
  • Mac with 32+ GB unified memory? LM Studio’s MLX engine will outrun Ollama on single-user inference. Use Ollama only when the REST API is specifically what you need for app integration.
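
A minimal sketch of that import, assuming you already have a GGUF on disk – the path and model name here are examples, not anything Ollama ships:

# Point FROM at the existing .gguf instead of a library tag, then register it.
cat > Modelfile.import <<'EOF'
FROM /models/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
EOF
ollama create llama3.1-70b-imported -f Modelfile.import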

FAQ

Can I run Llama 3 70B on a single consumer GPU?

Not comfortably. At Q4_K_M it needs ~43 GB VRAM. No single consumer card ships with that. Your realistic options: workstation card (RTX A6000, 48 GB), dual 3090s, or CPU offload at 2-5 tokens per second – fine for batch jobs, painful for interactive chat.

Why is my local Llama 3 slower than ChatGPT even though I have a decent GPU?

Run nvidia-smi while a prompt is processing. Check what VRAM the ollama process is using. If it’s well under your model size, layers have spilled to CPU – the most common cause. The second cause: your context window has grown and KV cache is eating into headroom you thought you had. Cloud ChatGPT runs on H100 clusters with memory bandwidth you can’t match on a consumer card. But on an 8B model with everything GPU-resident, 30-50 tokens/sec is realistic. Under 10 tokens/sec on an 8B? Something is spilling.

Is Ollama safe to use commercially?

Ollama is MIT-licensed – the runner isn’t the issue. The model license is. Llama 3 permits commercial use under a usage threshold (check Meta’s current license terms directly – these conditions have been updated before and may change again).

Next step: pick the row from the table above that matches your VRAM minus 2 GB, pull it with the explicit quant tag, and run nvidia-smi during your first real prompt. VRAM use matches expectations? You’re done. It doesn’t? Drop one quant level and try again.