The end state you’re aiming for: a process listening on localhost:11434 that answers a curl with Mixtral-generated text in a few seconds, sitting on roughly 48 GB of RAM and 26 GB of disk. That’s it. Everything below is the path backwards from there.
Mixtral 8x7B is still one of the most interesting open source MoE models to self-host – released under Apache 2.0, with 8 experts per MLP layer, 45B total parameters, but compute roughly equivalent to a 14B dense model because each token only routes through two experts. So you pay 14B-class latency and 70B-class RAM. That trade is the whole point.
Mistral has since shipped newer open models – Mistral Large 3 and the Ministral 3 family launched under Apache 2 in late 2025 with upstream vLLM compatibility – but Mixtral 8x7B remains the sweet spot for a first MoE deployment because the tooling has had two years to stabilize. If you’ve never run a sparse MoE locally, start here, not on Large 3.
System requirements (the honest version)
The numbers everyone quotes assume Q4_K_M quantization. Below that line, things get weird (more on that later).
| Spec | Minimum (Q4_K_M) | Comfortable |
|---|---|---|
| RAM | 48 GB | 64 GB |
| Disk | 30 GB free | 50 GB free |
| GPU VRAM | None (CPU-only works, slowly) | 24+ GB (RTX 3090/4090, A6000, A100) |
| OS | Linux, macOS (Apple Silicon), Windows via WSL2 | Linux for production |
The 48 GB RAM floor isn’t arbitrary. During the first run, Ollama downloads a 26 GB Mixtral model and the system needs at least 48 GB of RAM to efficiently run it. On Apple Silicon, unified memory means a 64 GB M-series chip works surprisingly well; on x86, you want VRAM + system RAM combined to clear 48 GB.
For dedicated GPUs, A100 40GB or A6000 48GB will run Mixtral 8x7B; for the bigger Mixtral 8x22B, you need A100 80GB or H100 class. A single RTX 3090 with 24 GB VRAM can run Q4 by offloading ~23 of 33 layers to GPU and the rest to CPU – expect around 15 tokens/sec.
Pick your install: Ollama vs vLLM
There’s no universally right answer here. Ollama wins on setup time and laptop-friendliness. vLLM wins on throughput and OpenAI-compatible API for production. Pick based on what you actually need:
- Ollama – one shell command, GGUF quantization built in, runs on your laptop, OK throughput.
- vLLM – production-grade serving, PagedAttention, batching, but needs a real GPU and HF authentication.
- llama.cpp directly – most control, most fiddling. Skip unless you have a reason.
This guide walks Ollama in detail (the recommended path for most readers) and shows the vLLM commands at the end.
Install Ollama and pull Mixtral
On Linux, one line installs the daemon and CLI:
curl -fsSL https://ollama.com/install.sh | sh
ollama --version
On macOS, download the .dmg from ollama.com/download instead. On Windows, install the native build or use WSL2. Then pull the model – this is the 26 GB download:
ollama pull mixtral
ollama run mixtral "Say hello in three languages."
By default ollama pull mixtral grabs the Q4_0 quantization of Mixtral 8x7B Instruct. If you want a specific quant, name it: ollama pull mixtral:8x7b-instruct-v0.1-q4_K_M. The q4_K_M tag is usually a better quality/size trade than the default q4_0.
Docker install (if you prefer containers)
docker run -d --gpus=all
-v ollama:/root/.ollama
-p 11434:11434
--name ollama
ollama/ollama
docker exec -it ollama ollama pull mixtral
The canonical Docker command with GPU support maps /root/.ollama as a volume and exposes port 11434. Skip --gpus=all for CPU-only.
Verify it works
The HTTP API is the truest test – if curl gets a response, your stack is functional end-to-end:
curl http://localhost:11434/api/generate -d '{
"model": "mixtral",
"prompt": "Write one haiku about sparse activations.",
"stream": false
}'
You’ll get JSON with a response field. Ollama runs the local HTTP API at http://localhost:11434. First call is slow because the model loads into RAM; subsequent calls reuse it.
Pro tip: add
"keep_alive": "30m"to the JSON to pin the model in RAM longer than the default 5-minute idle timeout. Constantly reloading a 26 GB model wrecks throughput on shared boxes.
The MoE OOM trap nobody warns you about
Here’s the counterintuitive bug. A smaller quantization of Mixtral can fail to load while a bigger one succeeds on the same hardware. A reported case: mixtral:8x7b-instruct-v0.1-q4_K_M runs cleanly on an RTX 3090, but mixtral:8x7b-instruct-v0.1-q3_K_M throws CUDA out of memory on the same GPU.
Why? MoE layer scheduling. The Ollama scheduler estimates per-layer VRAM cost, then greedily packs layers onto GPUs in size order. For MoE models that estimate has historically been wrong. At Q8_0 each MoE layer weighs about 3.6 GiB in actual allocated buffers, but the per-layer cost estimator computes a much smaller value – so the scheduler can plan to fit 16 layers (~58 GiB) onto a 32 GiB GPU, and llama.cpp silently backs the overflow with pinned host RAM. The smaller quant tries to fit “more layers on GPU” based on a bad estimate and crashes.
If you hit this, force the offload manually:
# Drop GPU layers until it loads, starting at half the model's layers
OLLAMA_NUM_GPU=15 ollama run mixtral
Also worth knowing: the same model can stop fitting after an Ollama upgrade. Ollama v0.6.8 accepted num_gpu=41 for a Mistral model on an A10 24GB and ran at 85-90% GPU utilization, but v0.9.6+ only accepts num_gpu=40 because the compute graph allocation jumped from ~164 MB to ~9 GB, costing roughly 20% performance. If your config worked yesterday and OOMs today, check whether Ollama auto-updated.
Common errors and fixes
cudaMalloc failed: out of memory– dropOLLAMA_NUM_GPU, or shrink context with--num-ctx 2048. Reducing context from 8192 to 2048 can save 1-2 GB depending on model architecture.llama runner process no longer running: -1– almost always RAM exhaustion. Rundmesg | grep -i "out of memory"to confirm.- GPU detected but not used – Ollama loads only the CPU backend. Verify
nvidia-smiworks, check thatlibggml-cuda.sois present in the Ollama lib directory, and ensure CUDA_VISIBLE_DEVICES isn’t restricting access. - Model stuck loaded after use – Ollama keeps models loaded for 5 minutes after last use by default. Force unload:
curl http://localhost:11434/api/generate -d '{"model":"mixtral","keep_alive":0}'. - HF 401 on vLLM serve – vLLM sources weights from Hugging Face by default, so an HF_TOKEN with READ permission is required and you must accept the model card’s access conditions.
The vLLM path (briefly)
If you want a proper OpenAI-compatible endpoint and have a serious GPU:
uv pip install vllm # or: pip install vllm
export HF_TOKEN=hf_xxx
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1
--tokenizer_mode mistral
--load_format mistral
--config_format mistral
Use vllm version >=0.6.1.post1 to ensure maximum compatibility with all Mistral models. The server speaks the OpenAI REST protocol on port 8000, so existing OpenAI SDK code works with base_url="http://localhost:8000/v1". Worth reading the Mistral vLLM deployment docs before going to prod.
Upgrade and uninstall
Upgrade Ollama by re-running the install script – it replaces the binary in place. Before upgrading, pin your working version if you’ve tuned num_gpu: the compute-graph regression mentioned earlier means a newer Ollama may need a lower offload count for the exact same model.
Uninstall on Linux:
sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm /etc/systemd/system/ollama.service
sudo rm $(which ollama)
sudo rm -rf /usr/share/ollama ~/.ollama
That last line nukes the model cache – the 26 GB Mixtral blob lives in ~/.ollama/models. If you’re switching to vLLM but keeping the weights, move them first.
FAQ
Is Mixtral still worth deploying in 2026 when Mistral Large 3 exists?
For learning MoE deployment locally, yes. The tooling is mature and the hardware requirements are achievable on a single workstation.
Why does Mixtral need 48 GB of RAM when only two experts activate per token?
Because all eight experts have to be resident in memory – the router picks which two to use per token at runtime, so you can’t predict which expert weights are safe to evict. Each token from the hidden states is dispatched twice (top 2 routing), but every expert must still be loaded in RAM, hence the ~70B-like RAM requirement. The compute savings are real; the memory savings are not.
Can I run Mixtral without a GPU?
Yes, on CPU only, but expect 1-3 tokens/sec on a modern desktop CPU with 48 GB RAM. Fine for batch jobs or overnight runs, painful for interactive chat. If you have an Apple Silicon Mac with 64 GB+ unified memory, the GPU cores get used automatically through Metal and performance is noticeably better than pure x86 CPU – usually in the 8-15 tokens/sec range depending on chip generation.
Next: pull mixtral:8x7b-instruct-v0.1-q4_K_M, run the curl test, and watch nvidia-smi while it generates. If GPU utilization sits below 40% with VRAM near full, you’ve hit the scheduler issue – drop OLLAMA_NUM_GPU by 2 and try again.