Install llama.cpp: Run LLMs Locally [2026 Build b8862]

llama.cpp lets you run large language models on your own hardware without cloud APIs. Here's how to build the latest version from source and fix the most common install errors.

7 min read · Intermediate

Can you run a 7B language model on your laptop without sending data to OpenAI? Yes – if you know how to build llama.cpp correctly.

The latest build as of April 20, 2026 is b8862 (released 10 hours ago). If you’re reading this in May, that number is already outdated – llama.cpp uses a rolling release cycle with multiple daily builds and no version tags.

What You Actually Need

Check your specs. Not suggestions – requirements.

Minimum (7B models, Q4 quantization):
  • 8GB RAM – about 5GB goes to the model, the rest to your OS
  • 4GB free disk
  • Modern CPU with AVX2 – Intel since 2013, AMD since 2015
  • Any OS: Windows 10+, macOS 10.15+, Linux kernel 4.x+

Recommended (13B models or GPU acceleration):
  • 16GB+ RAM
  • 20GB free disk
  • GPU: NVIDIA (8GB+ VRAM), AMD (ROCm-compatible), or Apple Silicon (M1/M2/M3/M4)
  • Fast SSD – NVMe if you have it; model loading is I/O bound

Pro tip: 8GB RAM + Linux? You’ll need to tweak vm.swappiness later. Default settings thrash your swap. We’ll cover this in troubleshooting.
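Not sure whether your CPU has AVX2? A quick check – the Linux line reads /proc/cpuinfo, the macOS line greps the kernel's CPU feature list (Apple Silicon doesn't need AVX2; it uses NEON instead):

grep -o -m1 avx2 /proc/cpuinfo    # Linux: prints "avx2" if supported, nothing otherwise
sysctl -a | grep -i avx           # macOS (Intel): look for AVX2 in the output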

Build from Source (The Right Way)

The official repo: https://github.com/ggml-org/llama.cpp. Don’t use package managers – they lag behind by weeks.

Linux / macOS

Open a terminal:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build -j$(nproc)

The -j$(nproc) flag uses all CPU cores (nproc is a GNU tool; on macOS use -j$(sysctl -n hw.ncpu) instead). Takes 2-5 minutes. Binaries land in build/bin/.

GPU acceleration? Add a flag before the first cmake:

  • NVIDIA (CUDA): -DGGML_CUDA=on
  • Apple Silicon (Metal): -DGGML_METAL=on – first-class support via ARM NEON, Accelerate, and Metal. Built the x86 version by mistake? 10x slower.
  • AMD (ROCm): -DGGML_HIP=on – recent builds renamed this from -DGGML_HIPBLAS=on, which older guides still reference.
  • Any GPU (Vulkan): -DGGML_VULKAN=on – works on modern GPUs regardless of vendor.

Example with CUDA:

cmake -B build -DGGML_CUDA=on
cmake --build build -j$(nproc)

Windows

Windows requires Visual Studio or MinGW. We’ll use MinGW – less painful.

1. Install w64devkit (not MinGW-w64 – different thing). Extract to C:\w64devkit.

2. Add to PATH: C:\w64devkit\bin

3. Open a new terminal and run:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
set CMAKE_GENERATOR=MinGW Makefiles
cmake -B build -DCMAKE_C_COMPILER=C:/w64devkit/bin/gcc.exe -DCMAKE_CXX_COMPILER=C:/w64devkit/bin/g++.exe
cmake --build build

Note that the compilers are passed as -D flags on the cmake command line – CMake honors the CMAKE_GENERATOR environment variable, but it does not pick up CMAKE_C_COMPILER from the environment.

Skip the -DCMAKE_C_COMPILER flag? You’ll hit “CMAKE_C_COMPILER not set” even with Visual Studio installed. This tripped up half the GitHub issues we reviewed (as of April 2026, still the #1 Windows build problem according to llama-cpp-python PyPI troubleshooting docs).

Grab a Model

llama.cpp runs GGUF files. Not PyTorch. Not safetensors.

Simplest option: Hugging Face. Search GGUF, sort by downloads. Popular picks:

  • Gemma 3 1B (GGUF) – tiny, fast, good for testing
  • Llama 3.2 11B – best quality/speed trade-off for 16GB systems
  • Qwen3.5 35B – if you have 24GB VRAM and want something smart

Download the Q4_K_M variant. 4-bit quantization, good quality, fits in reasonable memory.

Place it anywhere. We’ll use ~/models/gemma-3-1b.gguf in examples.
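If you prefer the command line over the website, the Hugging Face CLI can fetch a single GGUF file – this assumes Python and pip are available, and the repo and filename below are placeholders for whichever model you picked:

pip install -U "huggingface_hub[cli]"
huggingface-cli download SOME_USER/SOME_MODEL-GGUF some-model-Q4_K_M.gguf --local-dir ~/models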

First Run

From the llama.cpp directory:

./build/bin/llama-cli -m ~/models/gemma-3-1b.gguf -p "Explain GGUF format in one sentence." -n 50

-m: model path. -p: your prompt. -n: max tokens (50 here, but bump it to 200-500 for real queries).

Tokens streaming? Good. Speed: 10-50 t/s on CPU, 50-200 on GPU.

For interactive chat:

./build/bin/llama-cli -m ~/models/gemma-3-1b.gguf -cnv

Models with built-in chat templates auto-activate conversation mode. If not, add --chat-template NAME.
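For example, if the model’s metadata lacks a template, you can pass one of the built-in names explicitly – chatml is a common one, but run --help on your build to see the list it accepts:

./build/bin/llama-cli -m ~/models/gemma-3-1b.gguf -cnv --chat-template chatml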

Verify It Works

Run the help command:

./build/bin/llama-cli --help

List of flags? You’re good. “Command not found”? Your build failed silently. Check build/bin/ for the binary.

GPU acceleration active?

./build/bin/llama-cli -m ~/models/gemma-3-1b.gguf -p "test" -n 1 -ngl 99

-ngl 99 offloads all layers to GPU. Watch your GPU monitor (Task Manager on Windows, nvidia-smi on Linux). VRAM usage spike = GPU offload works.
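On Linux with an NVIDIA card you can watch VRAM fill in real time while that command runs (assumes the driver’s nvidia-smi utility is installed):

watch -n 1 nvidia-smi    # refresh GPU memory usage once per second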

When Things Break

Three errors cause 80% of install failures:

1. “The Makefile build is deprecated”

You ran make directly. As of 2026, llama.cpp requires CMake (per Hugging Face forum April 2025 report). Old tutorials still reference the Makefile. Ignore them. Use cmake -B build.

2. Slow performance on 8GB RAM (Linux)

Your system is swapping constantly.

sudo sysctl vm.swappiness=10
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf

This tells the kernel to prefer dropping filesystem cache over swapping memory-mapped model pages. One user on a 2-core CPU with 8GB DDR2 went from unusable to 2 t/s (documented in GitHub discussion #21136).
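To confirm swapping is actually the problem before changing anything, watch swap activity while the model runs – the si/so columns (pages swapped in/out per second) should sit near zero on a healthy system:

vmstat 1    # sustained non-zero si/so values mean the model pages are thrashing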

3. “CMAKE_C_COMPILER not set” (Windows)

CMake can’t find your compiler even with Visual Studio installed. Two fixes:

First: Use w64devkit (shown earlier). Set CMAKE_C_COMPILER manually.

Second: Open “x64 Native Tools Command Prompt for VS 2022” (not regular cmd) and build from there. Visual Studio ships this shortcut – pre-configures paths.
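From that prompt a plain CMake invocation should pick up MSVC on its own. A minimal sketch, assuming VS 2022 (the generator name differs for other versions):

cmake -B build -G "Visual Studio 17 2022"
cmake --build build --config Release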

Error | Cause | Fix
“Makefile deprecated” | Used make directly | Switch to cmake -B build
Swap thrashing on 8GB RAM | High vm.swappiness | sudo sysctl vm.swappiness=10
“CMAKE_C_COMPILER not set” | CMake can’t find gcc/MSVC | Pass the compiler via -D flags or use the VS command prompt
“10x slower on M1 Mac” | Built the x86 version on ARM | Rebuild natively for arm64 with -DGGML_METAL=on

The Part Nobody Mentions

Memory mapping: llama.cpp’s superpower and its weak point. The tool loads models directly from disk without copying them into RAM. A 5GB model doesn’t require 5GB free RAM – the OS streams chunks as needed.

Low-memory systems? The OS might swap those chunks out. Performance collapses. The vm.swappiness fix above prevents this: “I’d rather drop cached files than swap my model.”

Windows has no swappiness knob. Hitting disk constantly? Close background apps or buy more RAM.
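If you’d rather trade startup time for steadier performance, llama.cpp exposes flags that change the mmap behavior: --no-mmap reads the whole model into RAM up front, and --mlock asks the OS to keep the mapped pages resident (needs enough free RAM and, on Linux, a sufficient memlock limit). A quick example:

./build/bin/llama-cli -m ~/models/gemma-3-1b.gguf -p "test" -n 20 --mlock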

Upgrade, Uninstall, or Switch Builds

To update to the latest build:

cd llama.cpp
git pull
cmake --build build --clean-first

To uninstall: delete the llama.cpp folder. No system-wide install unless you explicitly ran cmake --install build (you probably didn’t).

To switch between CPU and GPU builds: re-run cmake -B build with different flags. CMake caches old config – use --clean-first or delete build/ entirely if builds behave strangely.
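A sketch of a full CPU-to-CUDA switch that wipes the cache instead of fighting it:

rm -rf build
cmake -B build -DGGML_CUDA=on
cmake --build build -j$(nproc)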

FAQ

Can I run llama.cpp without a GPU?

Yes. CPU-optimized. 1-10 t/s depending on model size and CPU. A 7B Q4 model on a modern 8-core CPU: ~5-8 t/s. Usable.
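CPU throughput also depends on thread count. llama.cpp picks a default, but you can set it explicitly with -t; matching your physical core count is a reasonable starting point:

./build/bin/llama-cli -m ~/models/gemma-3-1b.gguf -p "test" -n 50 -t 8    # adjust 8 to your core count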

Why does my 24GB GPU still run out of memory on a 70B model?

A 70B model in Q4 quantization is ~35GB. Your 24GB GPU can’t hold it entirely. llama.cpp offloads some layers to system RAM. Works, but slows inference – data crosses the PCIe bus. Control how many layers go to GPU with -ngl N (e.g., -ngl 40). Lower N = less VRAM, slower speed. One user reported 24GB VRAM maxes out at ~40 layers for a 70B model before system RAM takes over. Trade-off.
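A concrete example of partial offload – the model filename here is a placeholder for whatever 70B GGUF you downloaded:

./build/bin/llama-cli -m ~/models/llama-70b-q4_k_m.gguf -p "test" -n 50 -ngl 40    # 40 layers on GPU, rest in system RAM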

What’s the difference between Q4_K_M and Q8_0?

Quantization level. Q4_K_M: 4 bits per weight (smaller, faster, slight quality loss). Q8_0: 8 bits (larger, slower, better quality). For most use cases, Q4_K_M is the sweet spot. You won’t notice the quality difference unless you’re doing precision-critical work like math or code generation. Even then it’s minor. If you have the VRAM and don’t mind 2x slower inference, Q8_0 is technically more accurate. But Q4_K_M handles daily chat/writing/research just fine.
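If you already have a higher-precision GGUF (F16/F32) and want a smaller one, the llama-quantize tool built alongside llama-cli can convert between quantization levels. Paths below are placeholders:

./build/bin/llama-quantize ~/models/model-f16.gguf ~/models/model-q4_k_m.gguf Q4_K_M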

Now go download a GGUF and see what your hardware can actually do. The Hugging Face GGUF collection is a good place to start model shopping.