Install Gemma 4 Locally: A Deployment Guide (2026)

Deploy Google's Gemma 4 open-weight LLM on your own hardware. Covers VRAM math, Ollama setup, the float16 trap, and Apache 2.0 licensing fine print.

If you’ve been watching the open-weight LLM space, you already know the problem: every six months a new model claims state-of-the-art, and every six months you discover the license has a clause your legal team won’t sign. Google’s Gemma 4, released April 2, 2026 by Google DeepMind, finally fixes that. The weights ship under a plain Apache 2.0 license – and the model is genuinely competitive, not a token release.

This guide is a deployment-decision document, not a copy-paste walkthrough. We’ll cover what breaks during install, how much VRAM each variant actually needs per quantization level, and why the licensing change matters more than any benchmark score.

Why this release changes the deployment math

Gemma 4 ships in four sizes: Effective 2B (E2B), Effective 4B (E4B), 26B Mixture of Experts (MoE), and 31B Dense. Context windows are 128K on the E2B and E4B, and 256K on the 26B and 31B – confirmed in Aniruddha Adak’s Gemma 4 deployment guide.

Here’s what most install tutorials bury: earlier Gemma releases shipped under a custom Gemma Terms of Use that included a Prohibited Use Policy. Enterprise legal teams routinely flagged those clauses as ambiguous – asking for indemnification or scope clarification before sign-off. That friction kept Gemma out of plenty of production stacks, regardless of benchmark scores. The switch to Apache 2.0 is the real story. Benchmarks are nice; legal sign-off is what puts a model into production.

If you’re pitching Gemma 4 internally, lead with the Apache 2.0 LICENSE file in the repo, not the Arena leaderboard. The license is a one-page conversation with legal. A benchmark comparison is a six-week procurement review.

System requirements (the real numbers)

Most install guides say “a GPU” and move on. The actual VRAM picture depends on variant and quantization. Unquantized bfloat16 for the 31B Dense fits on a single 80GB NVIDIA H100 – per the official Google announcement – but almost nobody deploying locally has that. The table below uses community-estimated figures for Q4 quantization; treat them as planning targets, not guarantees.

| Variant | BF16 VRAM (est.) | Q4 VRAM (est.) | Practical target |
|---|---|---|---|
| Gemma 4 E2B | ~5 GB | ~2 GB | Phone, laptop iGPU |
| Gemma 4 E4B | ~9 GB | ~3.5 GB | RTX 3060 / M2 |
| Gemma 4 26B MoE | ~52 GB | ~16 GB | Single RTX 4090 (Q4) |
| Gemma 4 31B Dense | ~62 GB | ~20 GB | 2× consumer GPUs or 1× H100 |
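
The BF16 column follows from simple arithmetic: parameter count times bytes per weight. A rough sketch of that calculation is below. The function name and the 4.5 bits-per-weight figure used for Q4 are illustrative assumptions, and the result covers weights only (no KV cache, activations, or runtime overhead), which is one reason real Q4 footprints land above the raw number.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in decimal GB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# BF16 stores each weight in 16 bits.
print(f"31B Dense, BF16: {weight_memory_gb(31, 16):.1f} GB")
print(f"26B MoE,   BF16: {weight_memory_gb(26, 16):.1f} GB")

# Assumption: ~4.5 bits per weight for a Q4-style quantization once
# scales and other metadata are counted. The exact figure varies by format.
print(f"31B Dense, Q4:   {weight_memory_gb(31, 4.5):.1f} GB")
```

Treat the output as a floor for planning: add headroom for the KV cache and the runtime before deciding a variant fits.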

One naming gotcha: the “E” prefix on the smaller models stands for “effective parameters.” These models use Per-Layer Embeddings (PLE) – a secondary embedding signal fed into every decoder layer – so the parameter count you see in the name is what gets activated at inference, not the total weight file size on disk. The E2B sits at roughly 2.3B effective parameters; the E4B at roughly 4.5B (per arshtechpro’s practical guide on dev.to). Size the storage allocation by the weight file, not the name.

Install with Ollama (recommended path)

Ollama bundles model management, GPU detection, and the inference runtime in a single install. Other paths exist – Gemma 4 has day-0 support for transformers, llama.cpp, MLX, ONNX, and WebGPU per the Hugging Face Gemma 4 announcement – but Ollama is the right call unless you’re integrating into an existing pipeline.

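# Install Ollama via the official install script
curl -fsSL https://ollama.com/install.sh | sh
# Confirm the binary is on your PATH
ollama --version
# Download the E4B weights
ollama pull gemma4:e4b
# Start an interactive session
ollama run gemma4:e4b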

For the 26B MoE: ollama pull gemma4:26b-moe. Use the instruction-tuned variants (suffix -it) for chat and tool use – base variants are for fine-tuning only.

First-time configuration that actually matters

The defaults Ollama ships with are not Google’s recommended sampling settings. Turns out the Gemma team published specific optimal values: temperature 1.0, top_k 64, top_p 0.95, min_p 0.0, and repetition penalty 1.0 – that last one means disabled in llama.cpp and transformers. Source: Unsloth’s Gemma inference docs, which cite the Gemma team directly.

FROM gemma4:e4b
PARAMETER temperature 1.0
PARAMETER top_k 64
PARAMETER top_p 0.95
PARAMETER min_p 0.0
PARAMETER repeat_penalty 1.0

Then ollama create gemma4-tuned -f Modelfile. Skip this step and you get Ollama’s generic defaults – temperature 0.8, top_k 40 – which is why outputs sometimes feel off compared to official demos.
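
If you call Ollama over its HTTP API instead of the CLI, the same sampling values can be sent per request in the `options` field, without creating a custom model. The sketch below only builds the request body; the helper name is hypothetical, and the model tag is the one used in this guide.

```python
import json

# Sampling values quoted above, using Ollama's option names.
GEMMA_SAMPLING = {
    "temperature": 1.0,
    "top_k": 64,
    "top_p": 0.95,
    "min_p": 0.0,
    "repeat_penalty": 1.0,
}


def build_generate_request(model: str, prompt: str) -> bytes:
    """Return a JSON body for Ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a token stream
        "options": GEMMA_SAMPLING,
    }
    return json.dumps(payload).encode("utf-8")


body = build_generate_request("gemma4:e4b", "Say hello in one sentence.")
print(body.decode("utf-8"))
```

POST the body to http://localhost:11434/api/generate with a JSON content type. Per-request options are handy when different callers need different settings; the Modelfile route is simpler when every caller should get the same ones.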

Verify it works

ollama list
ollama run gemma4:e4b "A train leaves station A at 3pm going 60mph. Another leaves station B at 4pm going 80mph toward A. Stations are 200 miles apart. When do they meet? Show your work."

If the model walks through the algebra step by step, you’re good. If you instead get a hallucinated number, no working shown, or an abrupt stop, suspect wrong sampling params or a base (non-IT) variant.
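
To check the model's answer rather than eyeball it, the expected result can be computed directly. This sketch assumes the first train is heading toward station B (the prompt only states this for the second train) and uses a placeholder date; the function name is illustrative.

```python
from datetime import datetime, timedelta


def meeting_time(distance_miles, speed_a, depart_a, speed_b, depart_b):
    """Time at which two trains heading toward each other meet.

    Assumes train A departs first and has not yet reached station B
    when train B departs.
    """
    head_start_hours = (depart_b - depart_a).total_seconds() / 3600
    # Gap still separating the trains at the moment B departs.
    remaining_miles = distance_miles - speed_a * head_start_hours
    # From then on the gap closes at the sum of the two speeds.
    hours_after_b = remaining_miles / (speed_a + speed_b)
    return depart_b + timedelta(hours=hours_after_b)


day = datetime(2026, 1, 1)  # placeholder date; only the clock time matters
meet = meeting_time(
    distance_miles=200,
    speed_a=60, depart_a=day.replace(hour=15),
    speed_b=80, depart_b=day.replace(hour=16),
)
print(meet.strftime("%H:%M"))  # prints 17:00
```

Under that reading, train A covers 60 miles before train B leaves, the remaining 140 miles close at 140 mph, and the trains meet at 5pm. A correct transcript should arrive at the same time.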

Three install errors and the fixes that work

Port 11434 already in use. The error reads: listen tcp 127.0.0.1:11434: bind: address already in use. Happens when the Ollama installer registers a background system service that keeps running after a reboot or reinstall. Fix: sudo systemctl stop ollama on Linux, or quit the menubar app on macOS, before running ollama serve manually.
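
A quick way to confirm the port conflict before touching the service is to check whether anything is already listening on 11434. This is a minimal sketch using only the standard library; the function name is illustrative, and it tells you only that something is listening, not that it is Ollama.

```python
import socket


def port_in_use(host: str = "127.0.0.1", port: int = 11434) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if port_in_use():
    print("Port 11434 is taken: an Ollama service is probably already running.")
else:
    print("Port 11434 is free.")
```

If the port is taken, the existing background service is usually the one you want to talk to, and running ollama serve by hand is unnecessary.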

Garbage output on T4 / V100 / old consumer GPUs. Almost no install tutorial covers this one. Gemma 3 and 4 weight values exceed float16’s maximum of 65504. Run those models in float16 mixed precision and gradients and activations quietly become infinity – the model emits NaN-poisoned tokens that look like garbage output with no error message. Fix: run in bfloat16 (Ampere or newer NVIDIA GPUs support this natively), or use a quantized GGUF via llama.cpp. Stuck on a free Colab T4? The Unsloth wrapper upcasts dynamically and sidesteps the issue. This is documented in Unsloth’s official Gemma docs.
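
The range difference behind this failure can be seen without a GPU. The sketch below round-trips a value through half precision with Python's struct module and emulates bfloat16 by keeping the top 16 bits of the float32 encoding. The truncation is a simplification (real hardware rounds), and the helper names are illustrative.

```python
import math
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE half-precision value


def to_float16(x: float) -> float:
    """Round-trip x through float16; values past the range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses to pack out-of-range values; model the overflow as inf.
        return math.copysign(math.inf, x)


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by zeroing the low 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


value = 70000.0  # just past FLOAT16_MAX
print("float16: ", to_float16(value))   # overflows to inf
print("bfloat16:", to_bfloat16(value))  # finite, but coarser than float32
```

bfloat16 keeps float32's 8 exponent bits and gives up mantissa precision, so it trades accuracy for range. That trade is what keeps large values finite where float16 overflows.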

transformers ImportError. The message cannot import name 'Gemma4ForCausalLM' means you’re on a 4.x version. Gemma 4 requires transformers 5.5.0 or later. Fix: pip install --upgrade "transformers>=5.5.0". Plain pip install transformers often pulls a cached 4.x on older environments.
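
To catch the version mismatch before the import fails, a startup check along these lines can help. It is a sketch using only the standard library: the helper names are hypothetical, the 5.5.0 minimum is the figure quoted above, and pre-release suffixes such as rc1 are ignored by the simple parser.

```python
import re
from importlib import metadata


def parse_version(version: str) -> tuple:
    """Leading numeric components, padded to three: '5.5' -> (5, 5, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple((parts + [0, 0, 0])[:3])


def meets_minimum(installed: str, minimum: str = "5.5.0") -> bool:
    """True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_transformers_version():
    """Return the installed transformers version string, or None if absent."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


found = installed_transformers_version()
if found is None:
    print("transformers is not installed")
elif not meets_minimum(found):
    print(f"transformers {found} is too old; upgrade to >=5.5.0")
else:
    print(f"transformers {found} is new enough")
```

Comparing parsed tuples rather than raw strings matters here: as strings, "5.10.0" sorts before "5.5.0".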

That covers the mechanical failures. The harder question is whether a single-machine deployment actually fits your workload – or whether the 26B MoE on shared infrastructure is the smarter call. VRAM and throughput math for that decision depends entirely on your concurrency target, which is worth mapping before you commit to hardware.

Upgrading from Gemma 3

One-liner: ollama pull gemma4:e4b. Old weights stay on disk, so ollama rm gemma3:4b (or whichever Gemma 3 tag you pulled) is worth running; the 27B weights in particular aren’t small.

Chat template format is unchanged between Gemma 3 and 4 for text-only input, so existing prompt tooling keeps working. What changed: context windows on the 26B and 31B now reach 256K (up from 128K). Raising your configured context cap toward that limit is possible – but KV-cache memory scales linearly with context length. Check VRAM headroom against the estimates in the table above before you do.
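
The linear scaling is easy to put numbers on. The sketch below uses the standard full-attention estimate (keys plus values, per layer, per KV head). The layer count, head count, and head dimension are placeholder values chosen for illustration, not Gemma 4's published architecture, and models that use sliding-window attention on some layers will need less than this upper bound.

```python
def kv_cache_gb(context_tokens: int, num_layers: int, num_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """Upper-bound KV-cache size in decimal GB for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1e9


# Placeholder architecture values (assumptions, not Gemma 4's real config).
arch = dict(num_layers=48, num_kv_heads=8, head_dim=128)

for context in (32_768, 131_072, 262_144):
    print(f"{context:>7} tokens: {kv_cache_gb(context, **arch):.1f} GB")
```

Whatever the real architecture numbers are, doubling the context doubles the cache, and that memory sits on top of the weights. Multiply again by the number of concurrent sequences you plan to serve.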

Uninstall

# Remove models
ollama rm gemma4:e4b
# Linux: stop the service, then remove the unit file, binary, and model directory
sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm /etc/systemd/system/ollama.service
sudo rm /usr/local/bin/ollama
sudo rm -rf /usr/share/ollama
# macOS
brew uninstall ollama # or drag the app to trash

FAQ

Is Gemma 4 open source or open weight?

Open weight. The weights and inference code are usable commercially under Apache 2.0, but training data and the full pipeline aren’t released – which is the standard OSI distinction between open-weight and open-source. For deployment it rarely matters; for redistribution claims, it does.

Which variant for a production chatbot with ~50 concurrent users?

The 26B MoE is the right call for most teams in that range. The reason: it activates only 3.8 billion of its total parameters at inference time, so tokens/sec are closer to a 4B model while quality sits much nearer the 31B Dense. On a single 80GB H100 you’ll have enough throughput; on a 24GB consumer card with Q4 quantization, a handful of concurrent sessions is feasible with acceptable latency – stress-test before committing. The 31B Dense is the better pick if output quality on complex reasoning tasks is the priority and you can absorb the VRAM cost.

How does Gemma 4 compare to Llama 4 or Qwen 3.5 for self-hosting?

On benchmarks published as of April 2026, the Gemma 4 31B ranks #3 on the Arena AI text leaderboard among open models and the 26B ranks #6 – outperforming models many times their size. If Apache 2.0 is non-negotiable for your use case (no license carve-outs, no custom ToU), Gemma 4 is the default winner on that axis alone. For direct quality comparisons against Llama 4 or Qwen 3.5 on your specific workload, run your own evals – leaderboard rankings shift and no published benchmark covers every use case.

Pull gemma4:e4b, run it against three prompts from your actual workload, and measure tokens/sec on your hardware. That’s the only benchmark that matters for a deployment decision.
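
One way to take that measurement is through Ollama's HTTP API, whose non-streaming /api/generate response reports eval_count (generated tokens) and eval_duration (nanoseconds). The sketch below assumes Ollama is listening on its default port; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def run_prompt(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming generate request and return the parsed JSON."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

With the server running, call run_prompt("gemma4:e4b", your_prompt) for each of your three workload prompts and pass each result to tokens_per_second. Run every prompt more than once: the first call includes model load time, which shows up in the wall clock but not in eval_duration.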