
Gemma 4 MTP Drafters: Faster Inference, Real Tradeoffs

Gemma 4 MTP drafters promise 3x faster inference. Here's what actually happens when you turn it on, plus the MoE batch-size trap most guides skip.

7 min read · Intermediate

Here’s the contrarian take nobody is saying out loud: the headline Gemma 4 MTP drafter speedup of 3x is a ceiling, not a typical result. Most readers running a single chat session on a single GPU will see something closer to 1.5x-2x – that’s an honest estimate based on the architecture constraints below, not a number from Google’s benchmarks. And on one specific Gemma 4 variant, you can get exactly zero improvement if you configure it the way every tutorial tells you to.

That doesn’t mean MTP isn’t a big deal. Google’s announcement confirms MTP drafters for the Gemma 4 family deliver up to 3x speedup without any degradation in output quality or reasoning. As of May 2026, weights are already on Hugging Face and Kaggle, supported by transformers, MLX, vLLM, SGLang, and Ollama. But the gap between the headline number and your actual workload deserves a real conversation.

The problem MTP drafters actually solve

When you run Gemma 4 31B on your GPU, the bottleneck isn’t math. It’s memory. Standard LLM inference is memory-bandwidth bound – the processor spends most of its time moving billions of parameters from VRAM to compute units just to generate a single token. The result: under-utilized compute and high latency, especially on consumer hardware.
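A quick back-of-the-envelope shows the ceiling. These are illustrative numbers, not a benchmark of any specific GPU:

params = 31e9            # Gemma 4 31B Dense
bytes_per_param = 2      # assuming bf16 weights
bandwidth = 1.0e12       # assuming ~1 TB/s of VRAM bandwidth

# Every new token streams the full weight set from VRAM once,
# so bandwidth, not FLOPs, caps the generation rate.
ceiling = bandwidth / (params * bytes_per_param)
print(f"~{ceiling:.0f} tokens/s, no matter how fast the ALUs are")

Roughly 16 tokens per second, and nothing about faster compute changes it. The only way out is to get more than one token per weight pass.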

Speculative decoding flips that. It’s a technique introduced by Google researchers in Fast Inference from Transformers via Speculative Decoding (2022). A small drafter guesses several tokens. The big model checks them all in parallel – same memory cost as one token, but you get many. If the target accepts the draft, it emits the full drafted sequence plus one additional token in the time of a single normal forward pass.
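The verify-and-accept step is easier to see in a toy sketch than in prose. This is a hand-rolled illustration of greedy speculative decoding, not Gemma 4’s actual implementation; target_forward and draft_tokens are hypothetical stand-ins for the big model’s forward pass and the drafter’s guesses:

def speculative_step(target_forward, draft_tokens, context):
    # One target forward pass scores the context plus all k drafted
    # tokens at once. logits[i] predicts the token at position i + 1.
    logits = target_forward(context + draft_tokens)

    accepted = []
    for i, drafted in enumerate(draft_tokens):
        # The target's own greedy choice at this position
        predicted = logits[len(context) + i - 1].argmax()
        if predicted != drafted:
            # First mismatch: keep the target's token and stop
            accepted.append(predicted)
            return accepted
        accepted.append(drafted)

    # Every draft accepted: the same pass also yields one bonus token
    accepted.append(logits[-1].argmax())
    return accepted

Worst case, one target pass still yields one valid token; best case, it yields all k drafts plus the bonus token.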

The trick that makes Gemma 4’s MTP version different: the draft models use the target model’s last-layer activations and share its KV cache, so they don’t recalculate context the larger model has already processed. The drafter itself is a lightweight 4-layer model, paired one-to-one with each target size.

Why the existing alternatives fall short

Before MTP, the choices were bleak – raw autoregressive decoding and living with the latency, or community drafters with real versioning and compatibility headaches. Thoughtworks released an EAGLE3 draft head shortly after Gemma-4-31B launched, measuring 1.72x speedup at TP=2 on 8× H200 without changing outputs. That’s solid work. But Gemma 4’s hybrid sliding-window plus full-attention architecture breaks standard speculative decoding pipelines, so third-party drafters carry ongoing compatibility risk.

Google’s first-party MTP drafters land under the same Apache 2.0 license as the base models, with checkpoints purpose-built per size. Versioning, licensing, KV-cache compatibility – all handled upstream.

Honest benchmark comparison: Thoughtworks measured 1.72x with EAGLE3 on H200 hardware. Google claims up to 3x with MTP. The truth for your setup is almost certainly somewhere between those numbers – treat 3x as best-case, not expected.

The recommended approach (and the trap most tutorials hide)

The minimal working setup is two model loads and one extra parameter:

pip install torch accelerate transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-E2B-it")
target_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E2B-it", torch_dtype=torch.bfloat16, device_map="auto"
)
assistant_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E2B-it-assistant", torch_dtype=torch.bfloat16, device_map="auto"
)

# Let the draft length adapt to the observed acceptance rate at runtime
assistant_model.generation_config.num_assistant_tokens_schedule = "heuristic"

inputs = tokenizer(
    "Explain speculative decoding in one sentence.", return_tensors="pt"
).to(target_model.device)

outputs = target_model.generate(
    **inputs,
    assistant_model=assistant_model,
    max_new_tokens=256,
    do_sample=False,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

That’s the official path from Google’s MTP docs. The line that matters most is the one nobody emphasizes: num_assistant_tokens_schedule = "heuristic".

Why? Drafting many tokens – say, 15 – carries a high chance that not all get accepted, burning compute on rejected guesses. The heuristic schedule auto-adjusts at runtime based on how often the target accepts the drafter’s output. Hardcoding a static draft length is the most common configuration mistake, and it’s the one every tutorial ignores.
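Concretely, the trap and the fix sit one line apart. num_assistant_tokens and num_assistant_tokens_schedule are standard transformers generation-config fields, so this should carry over to the Gemma 4 path, but verify against your installed version:

# The tutorial trap: a large, fixed draft length
assistant_model.generation_config.num_assistant_tokens = 15
assistant_model.generation_config.num_assistant_tokens_schedule = "constant"

# The fix: start modest and let acceptance feedback resize the draft window
assistant_model.generation_config.num_assistant_tokens = 5
assistant_model.generation_config.num_assistant_tokens_schedule = "heuristic"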

The MoE batch-size trap

Zero speedup. That’s what you can get from the 26B A4B MoE model at batch size 1 – and it’s buried at the bottom of Google’s official overview.

MoE models route each token through different expert subnetworks. Verifying drafted tokens can require loading additional expert weights from memory, directly offsetting the gains from drafting. At higher batch sizes, sequences overlap in which experts they activate, so loaded weights get reused across requests. At batch size 1, that overlap disappears – which is precisely why Google’s MTP docs warn the 26B A4B drafter may yield no speedup on hardware without good parallelism. Processing 4-8 requests simultaneously unlocks up to ~2.2x speedup locally on Apple Silicon for that model.

Running a single-user chat on the 26B MoE variant? MTP can break even or worse. That’s not a bug in MTP – it’s a fundamental property of sparse expert routing.
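A toy simulation makes the batch-size effect visible. This models nothing about Gemma 4’s actual router: the expert counts are made up, routing is assumed uniformly random, and the only point is how distinct expert loads scale with batch size:

import random

def distinct_experts(batch, k_draft=5, n_experts=64, active=4):
    # Each drafted token in each sequence activates `active` random
    # experts; every distinct expert is a weight set pulled from memory.
    touched = set()
    for _ in range(batch * k_draft):
        touched.update(random.sample(range(n_experts), active))
    return len(touched)

random.seed(0)
for batch in (1, 4, 8):
    loads = distinct_experts(batch)
    print(f"batch={batch}: {loads} distinct experts (~{loads / batch:.0f} per request)")

Distinct expert loads grow sublinearly with batch size, so the memory cost of verification amortizes across requests. At batch 1, there is nothing to amortize.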

Quick decision table

Your setup | Best Gemma 4 size | Expected MTP gain
Phone / on-device, single user | E2B or E4B | Good – smaller dense models suit single-user latency well
Single dev machine, single chat | 31B Dense | Varies – benchmark on your prompts
Server, batch ≥ 4 | 26B A4B (MoE) | Up to ~2.2x (per Google docs, May 2026)
Server, batch = 1, 26B MoE | Wrong combo | Often near zero

Numbers above reflect Google’s stated ranges as of May 2026 – your hardware and prompt mix will shift them.

A real-world scenario: code vs chat

Speedup isn’t a property of the model alone. It’s a property of how predictable your output tokens are.

Conversational replies follow common patterns; the drafter guesses well; acceptance is high. Code has unusual identifiers, novel logic, rare token sequences. The Thoughtworks EAGLE3 benchmark on Gemma-4-31B found exactly this: MT-Bench (conversational) showed the highest acceptance rate and speedup, SWEBench (code-heavy) showed the lowest. The same pattern applies to MTP because speculative decoding’s acceptance rate is driven by token predictability regardless of which drafter architecture you use.

So if you’re building a coding assistant on Gemma 4, expect lower acceptance and lower speedup than a customer-support chatbot on the same hardware. That’s not a flaw – it’s physics.

Pro tips that aren’t in the docs

  • Benchmark on your actual prompts. Synthetic 256-token completions hide acceptance-rate variance. Run 50 of your real production prompts and measure tokens/sec – a minimal harness is sketched after this list.
  • Don’t pair MTP with aggressive sampling. High-temperature outputs reduce drafter agreement. The more deterministic the decoding, the higher the acceptance.
  • Watch the drafter’s GPU memory cost. A 4-layer drafter sharing the KV cache is cheap but not free. On a maxed-out 24GB consumer GPU running 31B quantized, you may need to reduce max_new_tokens to fit.
  • If you already integrated a community EAGLE3 drafter, don’t rip it out blindly. Re-benchmark before switching – first-party MTP isn’t automatically faster than a well-tuned EAGLE3 setup.
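For that first tip, the harness can be a dozen lines. This sketch assumes the tokenizer, target_model, and assistant_model from the setup snippet above; prompts is a hypothetical list holding a sample of your real traffic:

import time

def tokens_per_sec(model, prompts, assistant=None):
    total_tokens, total_time = 0, 0.0
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        start = time.perf_counter()
        out = model.generate(
            **inputs,
            assistant_model=assistant,
            max_new_tokens=256,
            do_sample=False,
        )
        total_time += time.perf_counter() - start
        # Count only newly generated tokens, not the prompt
        total_tokens += out.shape[-1] - inputs["input_ids"].shape[-1]
    return total_tokens / total_time

baseline = tokens_per_sec(target_model, prompts)
with_mtp = tokens_per_sec(target_model, prompts, assistant=assistant_model)
print(f"Real speedup on your prompts: {with_mtp / baseline:.2f}x")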

FAQ

Does MTP change my model’s outputs at all?

No. Every draft token is verified by the full target model before it’s emitted. Same outputs, faster delivery.
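You can check this yourself under greedy decoding, reusing the models and inputs from the setup snippet:

plain = target_model.generate(**inputs, max_new_tokens=64, do_sample=False)
assisted = target_model.generate(
    **inputs, assistant_model=assistant_model, max_new_tokens=64, do_sample=False
)

# With do_sample=False, accepted drafts are exactly the tokens the
# target would have produced on its own, so the outputs are identical.
assert torch.equal(plain, assisted)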

Which Gemma 4 size should I start with?

If you’re new to running Gemma locally, start with E2B or E4B – smaller, lower VRAM requirement, and MTP still applies. Gemma 4 ships drafters for all four sizes (E2B, E4B, 26B A4B, 31B Dense) as of May 2026. The 31B Dense model gives more headroom for complex tasks but needs serious VRAM. Skip the 26B A4B unless you’re serving batched traffic – the MoE routing punishes single-user setups hard, as covered above.

Can I use MTP with Ollama or LM Studio without writing code?

vLLM, MLX, SGLang, and Ollama are all listed as supported backends in Google’s May 2026 announcement. LM Studio support depends entirely on their release cadence – check the changelog for your specific version before assuming it’s plug-and-play. When in doubt, the transformers path in the snippet above is the safest starting point.

Next action: open Google’s MTP code guide, copy the snippet above with the heuristic schedule, and run it twice – once with assistant_model set, once without. Time both. The number that comes out is your real speedup, not the headline.