
Run an LLM Locally with Ollama v0.22 – Real Setup Guide

Install Ollama v0.22 on Mac, Linux, or Windows, then run Qwen3 or Gemma3 offline with the right VRAM math, GPU detection, and config from day one.


End state: a terminal where you type ollama run qwen3:8b, the model loads onto your GPU in seconds, and you’re chatting with a fully local LLM at 40+ tokens per second – no API key, no telemetry, no cloud round-trip. This guide gets you there with Ollama v0.22.0 (released April 28, 2026) and spends most of its time on the parts other tutorials skip: the VRAM math, the silent CPU-fallback trap, and the post-sleep GPU disappearance that wastes hours.

Ollama is a wrapper around llama.cpp that handles model downloads, quantization variants, GPU offload, and a local REST API. As of April 2026, the project sits at over 170,000 GitHub stars and ships fast – a new release roughly every couple of days. That cadence is great for features, less great for stability, which is why version pinning matters (covered later).

What you actually need to run an LLM locally

Forget vague “8 GB minimum” advice. The real number depends on model size × quantization × context length. Here’s the formula that works, per the LocalAIMaster master table:

Min VRAM = model_file_size + 1.0 GB (KV cache @ 2K context) + 0.5 GB per extra 2K context

Q4_K_M size ≈ FP16 size × 0.30
FP16 size (GB) ≈ params(B) × 2

So a 7B model at Q4_K_M is roughly 4 GB on disk and needs about 5 GB of VRAM at default context – versus ~14 GB at full 16-bit precision. A 14B Q4 model wants ~9 GB. A 32B Q4 model wants 20+ GB, and that’s where local quality starts feeling close to cloud APIs.
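If you'd rather not do the arithmetic by hand, the same formula fits in a one-liner. A minimal sketch with awk, using the rough ratios above – swap in your own parameter count and planned context:

# 14B model at Q4_K_M with an 8K context
awk -v params=14 -v ctx_k=8 'BEGIN {
  fp16 = params * 2                              # FP16 size in GB
  q4   = fp16 * 0.30                             # Q4_K_M file size in GB
  vram = q4 + 1.0 + 0.5 * ((ctx_k - 2) / 2)      # file + KV cache @ 2K + 0.5 GB per extra 2K
  printf "Q4_K_M file ~ %.1f GB, min VRAM ~ %.1f GB\n", q4, vram
}'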

Tier        Hardware                              What runs well
Minimum     16 GB RAM, no GPU                     3B models; slow 7B (CPU only)
Sweet spot  RTX 3060 12GB or M-series Mac 16GB    7B-14B Q4; RTX 3060 hits ~35-45 tok/s on 7B (as of early 2026)
Serious     RTX 4090 / 24GB+ VRAM                 32B interactive; 70B Q4 with offload; ~75-85 tok/s on 7B

For NVIDIA, Ollama needs compute capability 5.0+ and driver version 531+ (per the official GPU guide, valid as of v0.22.0). For AMD on Linux, Ollama bundles ROCm 7 libraries – older ROCm 6.x drivers cause GPU discovery to hang and time out, silently falling back to CPU (more on that in a moment). On Apple Silicon, Metal works automatically.
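On the NVIDIA side you can confirm all three numbers before installing anything. The compute_cap query field needs a reasonably recent driver; if it errors, look up your card's compute capability manually:

# driver_version should be 531+, compute_cap 5.0+, memory.total is your VRAM budget
nvidia-smi --query-gpu=name,driver_version,compute_cap,memory.total --format=csv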

Think of VRAM like a workbench. The model weights are the tools you spread out – and context length is how much active work you have laid out at once. Quantization is choosing compact tool handles instead of full-size ones: slightly less grip, but you fit twice as many on the same surface. The formula above is just measuring whether your workbench is big enough before you start unpacking.

Install Ollama v0.22 (per OS)

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# macOS - Homebrew or the .dmg from ollama.com/download
brew install ollama

# Windows - PowerShell
irm https://ollama.com/install.ps1 | iex

# Docker (Linux host with NVIDIA)
docker run -d --gpus all -v ollama:/root/.ollama \
  -p 11434:11434 --name ollama ollama/ollama

On Linux with an NVIDIA GPU, driver order matters. Install the driver first, confirm it with nvidia-smi, then run the install script. If Ollama gets installed before the NVIDIA driver exists, the binary ends up without CUDA libraries bundled – the symptom is the silent CPU fallback covered below. Reinstalling Ollama after the driver fixes it.
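If you hit that ordering problem, the recovery is the same two steps, in the right order this time:

nvidia-smi                                      # must list your GPU and a 531+ driver first
curl -fsSL https://ollama.com/install.sh | sh   # rerun the installer so the CUDA libraries get bundled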

One Docker detail that trips people up: GPU acceleration doesn’t work in Docker Desktop on macOS, because Docker Desktop on Mac has no GPU passthrough (per official Ollama FAQ). Run the native app on a Mac.

First-time configuration – only what matters

The defaults are sane. Four environment variables are worth knowing on day one:

  • OLLAMA_HOST=0.0.0.0 – binds to all interfaces so other machines or containers can reach the API. Default is 127.0.0.1:11434, localhost only.
  • OLLAMA_MODELS=/path/to/big/disk – moves model storage off the system drive. The ollama system user needs read/write on that path; run sudo chown -R ollama:ollama <directory> after creating it.
  • OLLAMA_FLASH_ATTENTION=1 – cuts memory use as context length grows (per official FAQ). Enable this before touching context settings.
  • OLLAMA_KV_CACHE_TYPE=q8_0 – 8-bit KV cache uses roughly half the memory of f16, with no noticeable quality impact in most workloads (per official FAQ). Requires Flash Attention enabled.

On Linux with systemd: add these under [Service] in /etc/systemd/system/ollama.service, or in a drop-in as shown below. Then reload and restart: sudo systemctl daemon-reload && sudo systemctl restart ollama. On the macOS desktop app, use launchctl setenv and restart Ollama from the menu bar.
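A minimal drop-in, created with sudo systemctl edit ollama – the model path is an example, and you only need the variables you actually use:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_MODELS=/data/ollama-models"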

Verify it actually works (not just “installed”)

Three checks. The third is the one tutorials skip:

# 1. Binary exists and version matches
ollama --version
# Should print: ollama version is 0.22.0

# 2. Server responds
curl http://localhost:11434/api/tags
# Returns JSON list of installed models (empty array on fresh install)

# 3. Pull a small model and check the PROCESSOR column
ollama run qwen3:8b "hi"
# In another terminal while the model is loaded:
ollama ps
# Look for: 100% GPU ← good
# If you see: 100% CPU ← silent fallback (see next section)

That third check is non-negotiable. Ollama falls back to CPU and doesn’t tell you – no error, no warning, no log line. It just quietly runs at a fraction of the speed. Tutorials that stop at “the model responded, you’re done” miss this entirely (confirmed by InferenceRig and InsiderLLM community guides).

The silent CPU-fallback problem (and three causes you’ll actually hit)

If ollama ps shows CPU when you expected GPU, debug in this order:

1. Driver was installed after Ollama. Covered above – reinstall Ollama. One command. People skip it because the binary responds, so it feels like it’s “working.”

2. The post-sleep UVM bug (Linux desktop). Working fine, laptop sleeps, you wake it – suddenly 4 tok/s. The NVIDIA UVM kernel module doesn’t survive a suspend cycle. Fix: sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm, then restart Ollama (per InsiderLLM). Make it permanent with a systemd resume hook (a sketch follows this list), or it happens after every single suspend.

3. WSL2 with the wrong driver setup. This one looks like you’re doing it right. Don’t install NVIDIA Linux drivers inside WSL2 – the Linux CUDA libraries are exposed automatically via /usr/lib/wsl/lib/, and adding a separate Linux NVIDIA driver breaks that passthrough (per InsiderLLM). Update your Windows driver, leave the WSL2 Linux side alone.
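For cause 2, a resume hook could look like the sketch below. The unit name is made up, the paths assume a merged-/usr distro, and rmmod will fail if something else still holds the module – treat it as a starting point, not a drop-in fix:

# /etc/systemd/system/nvidia-uvm-resume.service (hypothetical name)
[Unit]
Description=Reload nvidia_uvm after suspend so Ollama sees the GPU again
After=suspend.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/rmmod nvidia_uvm
ExecStart=/usr/sbin/modprobe nvidia_uvm
ExecStartPost=/usr/bin/systemctl restart ollama

[Install]
WantedBy=suspend.target

Enable it once with sudo systemctl enable nvidia-uvm-resume.service and it runs on every resume.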

One quick check before any of that: run echo $CUDA_VISIBLE_DEVICES. If some old script in your shell rc has set it to -1, Ollama sees zero GPUs regardless of hardware. Five seconds to check, potentially hours saved.
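To find where that variable is being set, assuming a bash or zsh shell:

grep -n "CUDA_VISIBLE_DEVICES" ~/.bashrc ~/.zshrc ~/.profile 2>/dev/null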

The configuration trap nobody warns you about

You read a tutorial: “increase parallelism for throughput!” So you set OLLAMA_NUM_PARALLEL=4. Suddenly your 7B model that fit comfortably is spilling layers to CPU.

The catch: OLLAMA_NUM_PARALLEL is the max number of parallel requests each model handles simultaneously, and required RAM scales by OLLAMA_NUM_PARALLEL × OLLAMA_CONTEXT_LENGTH (per official Ollama FAQ). Four parallel slots at 8K context allocates KV cache for 32K worth of tokens. A model that used 6 GB now wants 11 GB.

Same trap with context length alone. Doubling context doesn’t double VRAM, but on a card with only 2 GB of headroom it adds enough to push you into partial-offload territory, where half the layers run on CPU and tok/s collapses.
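Before baking a higher OLLAMA_NUM_PARALLEL or context length into the service, it’s worth measuring the effect on a throwaway server first. A rough sketch, assuming a systemd install you can pause for a minute and qwen3:8b already pulled:

sudo systemctl stop ollama
OLLAMA_NUM_PARALLEL=4 OLLAMA_CONTEXT_LENGTH=8192 ollama serve &   # temporary server with the new settings
ollama run qwen3:8b "hi" > /dev/null                              # force a model load
ollama ps                                                         # compare SIZE and the GPU/CPU split to your baseline
kill %1 && sudo systemctl start ollama                            # back to the normal service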

Common errors – actual fixes

Pulled from the official troubleshooting docs and community reports:

  • “failed to finish discovery before timeout” on AMD Linux → ROCm driver too old. Upgrade to the ROCm v7 driver, reboot, then restart Ollama. (Ollama bundles ROCm 7 libraries; v6.x drivers cause discovery timeout.)
  • Runner fails to start, /tmp is noexec → Set OLLAMA_TMPDIR to a writable location, e.g., /usr/share/ollama/ (per official troubleshooting docs). This affects any system where /tmp is mounted with the noexec flag.
  • Pull hangs behind a corporate proxy → Set HTTPS_PROXY in the systemd unit, or pass -e HTTPS_PROXY= to Docker (see the snippet after this list).
  • 404 model not found → Tag casing matters. ollama pull qwen3:8b and ollama pull qwen3:8B are not equivalent; use lowercase tags.
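For the proxy case, the systemd form is one more Environment= line in the same drop-in style as earlier; the proxy URL here is a placeholder:

[Service]
Environment="HTTPS_PROXY=https://proxy.example.com:3128"

# Docker equivalent (placeholder proxy again)
docker run -d --gpus all -e HTTPS_PROXY=https://proxy.example.com:3128 \
  -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama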

Upgrade and uninstall

Upgrading is rerunning the install script – it overwrites the binary in place. Models survive: they live in ~/.ollama/models (Linux/macOS) or %HOMEPATH%\.ollama (Windows), and the installer doesn’t touch those directories.

Pin to a specific version when a release breaks things: curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.22.0 sh – substitute your target. Given the release pace, worth knowing (per official troubleshooting docs).

Full Linux uninstall:

sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm /etc/systemd/system/ollama.service
sudo rm /usr/local/bin/ollama
sudo userdel ollama
rm -rf ~/.ollama # irreversible - deletes all models

macOS: drag the app to Trash, then rm -rf ~/.ollama. Windows: uninstall via Settings, then delete %HOMEPATH%\.ollama.

Where to go next

Pull qwen3:8b or gemma3 from ollama.com/library, point an OpenAI-compatible client at http://localhost:11434/v1, and you’ve swapped an API call for a local one. From there: custom Modelfiles, RAG with a local vector store, or wiring Ollama into Continue.dev for IDE-level coding help.
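As a smoke test of that OpenAI-compatible endpoint, one curl is enough (assuming qwen3:8b is already pulled):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Say hello in five words."}]
  }'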

One thing that’s genuinely unclear in the current docs: how Ollama will handle multi-GPU setups as models push past 70B parameters. The --gpus all Docker flag works, but layer-splitting behavior across mismatched cards isn’t well documented. If you’re running a dual-GPU workstation, it’s worth testing ollama ps carefully and watching which device handles which layers – the answer might surprise you.

FAQ

Can I run Ollama without a GPU?

Yes – expect 5-10 tokens per second on a 7B Q4 model. Usable for testing, painful for anything longer than a quick query.

How do I know if a model will fit on my GPU before downloading 8 GB?

Check the file size on the model’s Ollama library page, add roughly 1 GB for KV cache at default 2K context, plus another 0.5 GB per additional 2K of context you plan to use. If the total sits under your VRAM with at least 1 GB of headroom for the OS and display, you’re fine. If it barely fits, you’ll either get tiny context windows or get pushed into partial CPU offload – which is where tok/s falls off a cliff. Enable OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q8_0 before assuming a model won’t fit; together they can recover 30-50% of KV cache memory.

Is Ollama actually private?

Inference runs locally – your prompts never leave the machine during a normal ollama run. The only outbound traffic is model pulls from ollama.com. Full air-gap: pull models on a connected machine, copy ~/.ollama/models over, block outbound 443 to ollama.com.