You’re paying $20/month for ChatGPT Plus or burning through API credits. Your data’s bouncing through OpenAI’s servers. And every time your internet hiccups, your workflow stops.
Running AI models locally solves all three problems. Ollama makes it possible without a PhD in machine learning or a server rack in your closet. Install it, pull a model, and you’ve got a private LLM running on your own hardware.
But here’s the part most tutorials skip: the #1 reason first-time setups fail isn’t installation – it’s VRAM. Pull a 7B model on an 8GB GPU, watch it load, then wonder why responses crawl at 2 tokens per second instead of 40. The model loaded. It’s just running on your CPU instead of your GPU, and you’d never know unless you checked.
Why Your GPU Matters More Than the Tutorial Told You
Every Ollama guide says “8GB RAM for 7B models.” Technically true. Practically useless.
Here’s what actually happens: a 7B model at Q4_K_M quantization needs about 4.7GB for the model weights. But at a 32K context window – standard for most real work – the KV cache alone eats another 4.5GB. That’s 9.2GB total. Your 8GB GPU can’t fit it. Ollama doesn’t error out. It just silently splits layers between your GPU and system RAM, and your performance drops off a cliff.
Benchmarks show the damage: an RTX 4060 running Llama 3.1 8B fully on GPU hits 40 tokens/s. Force it to use system RAM for overflow and you’re down to 8 tokens/s. That’s 5x slower because the PCIe bus is the bottleneck, not your CPU.
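You can sanity-check that VRAM arithmetic yourself. The sketch below assumes a Llama-3-style 8B architecture – 32 layers, 8 KV heads (grouped-query attention), head dimension 128, fp16 KV cache – so treat the constants as illustrative and check your model's card for the real numbers. Ollama also adds a few hundred MB of runtime overhead on top of this estimate.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """One K and one V tensor per layer, fp16 (2 bytes) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Llama 3.1 8B: 32 layers, 8 KV heads, head dim 128 (assumed architecture)
weights_gb = 4.7  # Q4_K_M weights, from the model page
kv_gb = kv_cache_bytes(32, 8, 128, 32_768) / 1e9

print(f"KV cache at 32K context: {kv_gb:.1f} GB")   # ~4.3 GB
print(f"Weights + KV cache: {weights_gb + kv_gb:.1f} GB")
```

Drop the context to 4K in that formula and the KV cache shrinks to about 0.5GB – which is exactly why the context-window trick later in this guide works.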
Check if you’re hitting this: run ollama ps after loading a model. Look at the Processor column. If it says CPU when you have a GPU, your model didn’t fit.
Install Ollama Without the Usual Hassles
Installation is the one part Ollama actually makes simple. Pick your OS, run one command, you’re done.
macOS
Download the .dmg installer from ollama.com or use Homebrew:
brew install ollama
On Mac, Ollama auto-starts as a background service. GPU acceleration via Metal is automatic on Apple Silicon (any M-series chip). The unified memory architecture is actually an advantage here – your 32GB M1 Max can run models that would need a $2000 NVIDIA card on PC.
Linux
One-line install script for most distros:
curl -fsSL https://ollama.com/install.sh | sh
The script auto-detects your GPU. If you have NVIDIA, make sure the drivers are already installed – Ollama won’t install them for you. Verify with nvidia-smi. For AMD, you’ll need the ROCm drivers.
Start the server manually:
ollama serve
It runs on http://localhost:11434 by default. Leave that terminal open or set it up as a systemd service if you want it to survive reboots.
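On most systemd distros the install script registers a service for you. If yours didn’t, a minimal unit looks something like this – the binary path is an assumption, so match it to where your install put ollama:

```ini
# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
Restart=always

[Install]
WantedBy=multi-user.target
```

Enable it with sudo systemctl enable --now ollama and the server survives reboots.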
Windows
Download the installer from ollama.com/download, or install it with winget from a terminal:
winget install Ollama.Ollama
The desktop app handles the background service. GPU support works for NVIDIA and AMD on x64; ARM Windows is CPU-only as of early 2026.
Pro tip: Think twice before running Ollama in Docker. GPU passthrough works fine for Linux containers (and on Windows through WSL2, with extra setup), but Docker Desktop on Mac has zero GPU support – you’d be stuck on CPU. Native install is faster and simpler on every platform.
Pull Your First Model (and Size It Correctly)
Ollama’s model library has hundreds of options. Start with something that’ll actually run on your hardware.
List available models at ollama.com/library. Each model page shows sizes and quantization tags. Tags like :7b-q4_0 or :8b-instruct-q4_K_M indicate parameter count and quantization level.
Pull a model:
ollama pull llama3.2:3b
This downloads Llama 3.2 3B, about 2GB. Good for testing. For real work, 7-8B models are the sweet spot – llama3.1:8b or qwen2.5:7b are solid general-purpose choices.
If you have serious hardware, try llama3.1:70b – the jump from 8B to 70B is substantial for reasoning tasks. But a Q4 70B needs roughly 40GB for the weights alone, so think 48GB+ of VRAM or a high-memory Apple Silicon Mac, not a 16GB card.
Quantization quick reference:
- Q4_K_M – Best balance for most users. ~50% size reduction, minimal quality loss.
- Q6_K – Higher quality, bigger size. Needs more VRAM.
- Q2/Q3 – Too aggressive. Quality degrades, output gets unpredictable.
Models download to ~/.ollama/models/ on Linux/Mac or %USERPROFILE%\.ollama\models on Windows. Each model is 2-50GB depending on parameter count and quantization.
Run the Model and Verify GPU Usage
Start an interactive session:
ollama run llama3.2:3b
You’ll get a prompt. Type anything. If the response is instant, you’re on GPU. If it stutters out word by word like you’re on dial-up, you’re on CPU.
Verify explicitly:
ollama ps
Output shows loaded models, memory usage, and processor type. 100% GPU means you’re golden. 100% CPU – or a split like 40%/60% CPU/GPU – means your model overflowed VRAM and fell back to system RAM.
Exit the session with /bye or Ctrl+D.
The Suspend/Resume GPU Bug
On Linux with NVIDIA, if your machine goes to sleep, Ollama sometimes loses GPU access when it wakes up. The official fix: reload the NVIDIA UVM driver.
sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm
Restart Ollama and the GPU comes back. This is a known driver bug, not an Ollama issue, but it’ll bite you at 2am if you don’t know the workaround.
API Access for Real Applications
Interactive CLI is fine for testing. For building anything, you’ll use the API.
Ollama exposes a REST API at localhost:11434. Test it:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Explain TCP handshake in one sentence."
}'
Response streams back as JSON. Set "stream": false if you want the full response at once.
Python Integration
Install the official library:
pip install ollama
Basic usage:
import ollama

response = ollama.chat(
    model='llama3.2:3b',
    messages=[{
        'role': 'user',
        'content': 'Write a Python function to reverse a string.'
    }]
)
print(response['message']['content'])
Stream responses for real-time output:
for chunk in ollama.chat(
    model='llama3.2:3b',
    messages=[{'role': 'user', 'content': 'Explain recursion.'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)
The API is OpenAI-compatible, so you can point existing OpenAI client code at http://localhost:11434/v1 and it’ll work with minimal changes.
Optimize VRAM Usage When You’re Tight on Memory
If your model barely fits or you’re getting CPU fallback, try these before buying a new GPU.
Enable Flash Attention. This cuts memory usage for long contexts but it’s off by default. On Linux/Mac:
export OLLAMA_FLASH_ATTENTION=1
ollama serve
On Windows, set it as a system environment variable before starting Ollama. Flash Attention reduces the KV cache footprint as your context window grows – critical for 16K+ context work.
Lower the context window. Many models default to 8K-32K tokens of context. If you’re doing short Q&A, you don’t need that much. Create a Modelfile:
FROM llama3.1:8b
PARAMETER num_ctx 4096
Save as Modelfile, then:
ollama create llama-short -f Modelfile
Now ollama run llama-short uses the 4K context version, saving ~2GB VRAM.
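If you’d rather not maintain a Modelfile, the same parameter can be passed per request through the API’s options field. A stdlib-only sketch – model name and endpoint are the defaults used throughout this guide:

```python
import json
import urllib.request

def build_payload(prompt, model="llama3.1:8b", num_ctx=4096):
    """Request body for /api/generate with a per-request context override."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # one JSON blob instead of a stream
        "options": {"num_ctx": num_ctx},  # applies to this request only
    }

def generate(prompt, host="http://localhost:11434", **kwargs):
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(prompt, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:  # needs the server running
        return json.loads(r.read())["response"]
```

One caveat: changing num_ctx between requests makes Ollama reload the model with a differently sized cache, so don’t bounce between values in a hot loop.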
Use a smaller quantization if you must. Switching from Q4 to Q3 frees VRAM but output quality suffers – responses get more repetitive, occasionally nonsensical. Q4_K_M is the floor for production use.
What Ollama Actually Does Under the Hood
Ollama isn’t magic. It’s a wrapper around llama.cpp with a model registry and an HTTP server.
When you ollama pull, it downloads GGUF model files – quantized weights optimized for inference. When you ollama run, it loads those weights into memory (VRAM if available, RAM otherwise) and starts a local inference server. Your prompts hit that server, llama.cpp runs the forward pass, tokens stream back.
The automatic GPU detection happens at startup. Ollama checks for CUDA (NVIDIA) or ROCm (AMD) libraries, measures available VRAM, and decides how many model layers to offload to GPU. If the full model fits, 100% goes to GPU. If not, it splits layers between GPU and CPU, prioritizing the GPU for the heaviest layers.
That split is why performance tanks when you overflow VRAM. The data transfer between system RAM and VRAM over PCIe becomes the bottleneck. Your GPU’s still doing work – it’s just waiting on data most of the time.
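The split decision itself is conceptually simple. This toy function captures the shape of it – the constants are illustrative, not Ollama’s actual accounting, which also budgets per-layer KV cache and compute buffers:

```python
def layers_on_gpu(n_layers, layer_bytes, free_vram, reserve=1e9):
    """Fit as many whole transformer layers in VRAM as the budget allows,
    keeping a reserve for the KV cache and scratch buffers."""
    budget = free_vram - reserve
    return max(0, min(n_layers, int(budget // layer_bytes)))

# A 32-layer 8B model with ~147MB of Q4 weights per layer:
layers_on_gpu(32, 147e6, 16e9)  # 16GB card: all 32 layers on GPU
layers_on_gpu(32, 147e6, 4e9)   # 4GB card: 20 layers fit, the rest run on CPU
```

Every layer that lands on the CPU side of that split is a layer whose activations have to cross the PCIe bus each token – that’s where the 5x slowdown comes from.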
When Not to Use Ollama
Local LLMs aren’t always the answer.
If you need GPT-4-level reasoning, Ollama’s open models aren’t there yet. A 70B model on a $5000 GPU setup will still lose to GPT-4 on complex multi-step tasks. For one-off queries, paying $0.01 per API call beats buying hardware.
If you’re serving models to a team, Ollama’s single-server design doesn’t scale well. You’d want something like vLLM or TGI for production multi-user inference with request batching and model parallelism.
If your hardware is old – pre-2020 CPU, no GPU, 8GB RAM – you’ll get 3-5 tokens/s at best. That’s borderline unusable. Cloud APIs are faster.
But for private data, offline work, or high-volume usage where API costs add up, running local wins. You pay once for the GPU, then inference is free.
Frequently Asked Questions
Can I run Ollama without a GPU?
Yes, but it’s slow. CPU-only setups get 5-15 tokens/s on 7B models depending on your CPU and RAM speed. Usable for light testing, frustrating for real work. Memory bandwidth matters more than core count – a high-end desktop CPU with dual-channel DDR5 will outperform a laptop CPU with single-channel RAM.
How do I know if my GPU is actually being used?
Run ollama ps while a model is loaded. Check the Processor column – it’ll say GPU or CPU. You can also use nvidia-smi (NVIDIA) or radeontop (AMD) to watch GPU utilization in real-time. If Ollama is using the GPU, you’ll see memory usage spike and GPU utilization fluctuate during inference. If GPU utilization stays at 0%, you’re on CPU.
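If you want that check scriptable, the same information is exposed over HTTP at /api/ps, which reports each loaded model’s total size and how much of it is resident in VRAM. A stdlib-only sketch, using the field names from the REST API’s ps response:

```python
import json
import urllib.request

def gpu_fraction(ps_payload):
    """Share of each loaded model resident in VRAM:
    1.0 = fully on GPU, 0.0 = pure CPU fallback, in between = split."""
    return {m["name"]: m["size_vram"] / m["size"]
            for m in ps_payload.get("models", []) if m.get("size")}

def check(host="http://localhost:11434"):
    with urllib.request.urlopen(f"{host}/api/ps") as r:  # needs server running
        return gpu_fraction(json.loads(r.read()))
```

Anything below 1.0 means part of the model spilled into system RAM – useful as a health check in scripts that assume GPU-speed inference.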
Why does my 8GB GPU struggle with a 5GB model?
The model size listed is just the weights. KV cache (which grows with context length), intermediate activations, and overhead add 30-100% more memory usage depending on your context window. A 5GB Q4 model at 16K context might actually need 7-8GB total. The rule: model VRAM should be about 2/3 of your total GPU memory to leave headroom. For an 8GB card, stick to models ≤5GB if you want long context work.
Next: Pull a Model That Matches Your Hardware
Installation’s done. Now run ollama pull llama3.2:3b to test your setup, then check ollama ps to confirm GPU usage. If you see CPU fallback, drop to a smaller model or enable Flash Attention before assuming you need new hardware.