
Qwen3.6-35B-A3B: Agentic Coding Power on Your Laptop

Alibaba dropped Qwen3.6-35B-A3B this week - 35B parameters, only 3B active. It outperforms its 235B predecessor on coding benchmarks while running on consumer hardware.

7 min read · Beginner

Here’s a contrarian take: the race for larger language models is mostly theater. The real enabler isn’t 400B parameters – it’s figuring out which 3B to use per token.

Alibaba’s Qwen team just dropped Qwen3.6-35B-A3B two days ago (as of April 16, 2026), and it’s the clearest proof yet. 35 billion parameters. Only 3 billion active per inference. Faster than models half its size, runs on a laptop. And – according to Simon Willison’s test this morning – draws better SVG pelicans than Claude Opus 4.7.

I spent the last 12 hours running it locally. Here’s what happened.

This Model’s Different: 256 Specialists, 3B Active

Mixture-of-Experts (MoE). 35B total, 3B active per token. The model has 256 specialist sub-networks (“experts”) and routes each token to the 9 most relevant ones.

Inference speed of a 3B model. Knowledge base of something much larger.
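The routing idea can be sketched in a few lines. This is a toy illustration, not Qwen’s actual router – real routers are small learned layers over the hidden state, and the weight normalization here is an assumption:

```python
# Toy sketch of top-k MoE routing: pick the 9 highest-scoring experts
# for a token. Scores are random here; in the real model they come from
# a learned router layer applied to the token's hidden state.
import random

NUM_EXPERTS = 256  # specialist sub-networks
TOP_K = 9          # experts activated per token

def route_token(scores):
    """Return (expert_index, weight) pairs for the TOP_K best experts."""
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    chosen = ranked[:TOP_K]
    total = sum(scores[i] for i in chosen)
    return [(i, scores[i] / total) for i in chosen]  # weights sum to 1

experts = route_token([random.random() for _ in range(NUM_EXPERTS)])
print(len(experts))  # → 9; the other 247 experts stay idle for this token
```

The payoff: per-token compute scales with TOP_K, not NUM_EXPERTS – which is why a 35B model can run at 3B-model speeds.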

SWE-bench Verified: 73.4. Terminal-Bench 2.0: 51.5 (per the official Hugging Face page, April 2026). Not 90+, but these are repository-level benchmarks – the kind that predict whether the model can scaffold a real app.

One developer on Medium gave it an architecture spec for a game. 10 files, 3,483 lines of code. Debugged its own collision detection. Playable on first load. That’s the “agentic” part everyone’s talking about.

The timing’s messy though. Qwen 3.5 had tool-calling stability issues – some users still haven’t resolved them. One NVIDIA forum post from 7 hours ago: “still struggling to get stable performance in terms of tool calling with Qwen3.5.” They’re not sure if 3.6 fixes it. Model’s Apache 2.0 and fully open, but the production version is still being shaken out.

Getting It Running

Three paths: Ollama (2 minutes), vLLM (production), Transformers (research). Pick based on what you’re building.

Ollama – Fastest Test

ollama pull qwen3.6:35b-a3b
ollama run qwen3.6:35b-a3b

Done. Q4_K_M quantization, 24GB. Tested on a 4090: ~122 tokens/sec. Matches community reports from this week.

The catch: Ollama hides the tuning knobs. No tool calling control, no thinking mode toggle without using the API. And if you’re on exactly 24GB VRAM? Inference can stall under heavy batch loads – MoE models sometimes show low GPU utilization in Ollama with default settings.
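The API workaround, for reference: Ollama’s REST endpoint accepts an options object the CLI doesn’t surface. This builds such a request body (num_ctx and temperature are standard Ollama option names; the model tag matches the pull command above):

```python
# Build an Ollama /api/generate request body that sets options the CLI
# hides. Capping num_ctx keeps the KV cache from outgrowing 24GB VRAM.
import json

payload = {
    "model": "qwen3.6:35b-a3b",
    "prompt": "Write a binary search in Rust",
    "stream": False,
    "options": {
        "num_ctx": 32768,    # cap context instead of using the full window
        "temperature": 0.7,
    },
}
body = json.dumps(payload)
print(body[:40])
```

POST that body to http://localhost:11434/api/generate with curl or requests; the same options block works in the /api/chat endpoint.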

vLLM – What I’m Using

You need vLLM >= 0.19.0 (per the deployment guide, April 2026). Older versions don’t support Qwen3.6’s architecture.

pip install "vllm>=0.19.0"

vllm serve Qwen/Qwen3.6-35B-A3B \
 --port 8000 \
 --tensor-parallel-size 1 \
 --max-model-len 262144 \
 --reasoning-parser qwen3

Tool calling? Add two flags:

vllm serve Qwen/Qwen3.6-35B-A3B \
 --port 8000 \
 --tensor-parallel-size 1 \
 --max-model-len 262144 \
 --reasoning-parser qwen3 \
 --enable-auto-tool-choice \
 --tool-call-parser qwen3_coder

That --tool-call-parser qwen3_coder flag is mandatory. Without it, the model generates tool calls but vLLM won’t parse them. I spent 30 minutes debugging malformed JSON before I found this.
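For reference, the shape of a tool-calling request against that server. vLLM exposes an OpenAI-compatible /v1/chat/completions endpoint; the run_tests function here is a made-up example tool, not part of any real API:

```python
# Shape of a tool-calling request for vLLM's OpenAI-compatible endpoint.
# The tool definition follows the standard OpenAI function schema; the
# run_tests tool itself is hypothetical.
import json

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool for illustration
        "description": "Run the project's test suite",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

request = {
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Run the tests in ./src"}],
    "tools": tools,
    "tool_choice": "auto",
}
print(json.dumps(request)[:50])
```

POST this to http://localhost:8000/v1/chat/completions; with --tool-call-parser qwen3_coder set, the response should carry a parsed tool_calls field instead of raw text you have to parse yourself.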

Transformers – Full Control

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
 "Qwen/Qwen3.6-35B-A3B",
 torch_dtype="auto",
 device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B")

messages = [{"role": "user", "content": "Write a binary search in Rust"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=2048)
# Drop the prompt tokens; decode only the newly generated completion
response = tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(response)

Full control. Requires more VRAM – ~70GB on FP16. Most people use FP8 or quantized GGUFs instead.

The Gotchas

Three things broke during testing.

1. The 262K context is real, but you can’t use all of it on 24GB VRAM.

262,144 tokens natively, extensible to 1M (per official spec, April 2026). True. But at Q4: model takes ~20GB. Leaves 4GB for KV cache. Push a 100K context? KV cache overflows, swaps to system RAM. Inference: 2-3 tokens/sec.

Need the full window? 48GB VRAM or a smaller quantization like IQ3_XXS (trades quality for memory).
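The arithmetic behind that overflow is worth sketching. The layer and head counts below are placeholder assumptions – check the model’s config.json for the real values – but the formula itself is the standard one for grouped-query-attention KV caches:

```python
# Back-of-envelope KV cache sizing for a GQA transformer. The layer,
# head, and dim values are ASSUMED for illustration, not Qwen3.6's
# actual config.
def kv_cache_gib(tokens, layers=48, kv_heads=4, head_dim=128, bytes_per=2):
    # 2x for keys AND values; bytes_per=2 assumes an fp16 cache
    return 2 * layers * kv_heads * head_dim * bytes_per * tokens / 1024**3

print(f"{kv_cache_gib(100_000):.1f} GiB")  # → 9.2 GiB with these assumed dims
```

Even with these modest placeholder dimensions, a 100K context wants ~9 GiB of cache – far beyond the ~4GB left over on a 24GB card, which is why inference falls off a cliff.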

2. Thinking Preservation is new, but there’s no public API for it yet.

The Hugging Face page says Qwen3.6 introduces “Thinking Preservation” to retain reasoning context from earlier messages (as of April 2026). Huge for multi-turn coding – model doesn’t re-derive reasoning every follow-up.

Docs don’t say how to enable it. I dug through vLLM source and model configs. Nothing. My guess: on by default, or requires an undocumented parameter. If you figure it out, let me know.

3. Tool calling might still be unstable.

That NVIDIA forum post from 7 hours ago: “still struggling to get stable performance in terms of tool calling with Qwen3.5.” Not sure if 3.6 fixes it. Model’s 2 days old. Building production agents that rely on function calling? Test heavily first.

Performance Numbers

Hardware         Quantization   Speed (tokens/sec)
RTX 4090 24GB    Q4_K_M         ~122
RTX 3090 24GB    Q4_K_M         ~112
Mac M4 24GB      Q4_K_M         ~15
RTX 6000 48GB    Q8_0           not tested; likely 80-90

These match community reports from this week. 4090 is the sweet spot – fast enough for real-time coding assistants, cheap enough to run locally.

Simon Willison’s pelican test: Qwen3.6-35B-A3B on his laptop vs Claude Opus 4.7. Qwen won. Not rigorous, but a good vibe check – this model understands structured outputs.

When NOT to Use This

Multimodal? Not here. Qwen3.6-35B-A3B is text-only (as of April 2026). Need vision (image understanding)? Use Qwen3.5-Omni, or wait for Qwen3.6-Plus to add vision back in.

Production agents with critical tool calling? Test first. Model’s 48 hours old. Unresolved reports of function-calling instability from Qwen 3.5 may carry over.

Better than Qwen3.5-27B for all tasks? No. One Unsloth discussion: “still looking for a 27B quant that outperforms the 35B-A3B” – they haven’t found one for their use case. But the 27B dense model has higher quality on some benchmarks (as of April 2026) – 35B-A3B trades accuracy for speed. Have the VRAM and latency isn’t your bottleneck? Test both.

Full 262K context on 24GB VRAM? Nope. You’ll overflow. Upgrade hardware or chunk inputs.
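If chunking is your route, a minimal character-based sketch – a real pipeline should count tokens with the model’s tokenizer, not characters, and the sizes here are arbitrary:

```python
# Naive sliding-window chunking with overlap, using character counts as
# a crude proxy for tokens. Sizes are illustrative, not tuned.
def chunk(text, max_chars=8000, overlap=400):
    step = max_chars - overlap
    return [text[i:i + max_chars] for i in range(0, len(text), step)]

parts = chunk("x" * 20000)
print(len(parts))  # → 3
```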

What This Means

Repository-level coding agents just became runnable on hardware you can buy used for $800.

A year ago? Agentic coding – multi-file edits, debugging, full-stack scaffolding – meant paying OpenAI or Anthropic per token. Now? Run it locally. Keep your code private. Iterate without a bill.

Tool calling’s still shaky. Context window has hard VRAM limits. We’re all figuring out how to tune MoE for specific tasks.

But that gap between local and API models? A lot narrower now.

FAQ

Can I run Qwen3.6-35B-A3B on a Mac with 24GB RAM?

Yes. The 35B model fits in about 22GB at Q4 (Unsloth docs, April 2026), leaving just enough headroom on a 24GB Mac. Expect ~15 tokens/sec – usable for coding, not real-time chat. Use Ollama for simple setup, llama.cpp for more control.

How does Qwen3.6-35B-A3B compare to Qwen3.5-35B-A3B?

Qwen3.6 is an incremental update (as of April 2026). Focus: agentic coding improvements – better frontend workflows, repository-level reasoning, new “Thinking Preservation” feature that retains reasoning context across turns. Architecture’s similar (both 35B MoE, 3B active), but 3.6 scores higher on SWE-bench and Terminal-Bench. Already using 3.5? Worth upgrading. On 3.5-27B (dense model)? Test both. 27B has higher quality on some tasks, 35B-A3B is faster. One caveat: 3.6’s only 2 days old, so production stability is still being validated. If you’re running mission-critical agents, maybe wait a week for community bug reports to surface.

What’s the difference between thinking mode and non-thinking mode?

Two inference modes. Thinking: generates internal reasoning steps (like chain-of-thought) before the final answer. Good for complex coding or math where you want to see the model’s work. Non-thinking: skips this, goes straight to output. Faster, less verbose, better for simple queries. Qwen3.5 docs (April 2026) recommend temp=1.0 for thinking, temp=0.7 for non-thinking. Toggle in vLLM: --reasoning-parser qwen3 (thinking on) or omit (thinking off). Qwen3.6 may have changed defaults, but docs are sparse.
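One practical note if you enable thinking mode: the reasoning arrives inline, wrapped in think tags per the Qwen3 convention (verify against 3.6’s actual output format before relying on it). A minimal strip for display:

```python
# Remove the model's reasoning block from a thinking-mode response so
# only the final answer is shown. The <think>...</think> delimiters
# follow Qwen3 conventions; Qwen3.6's format may differ.
import re

def strip_thinking(text: str) -> str:
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()

raw = "<think>User wants binary search. Use iterative form.</think>\nfn bsearch(...) {}"
print(strip_thinking(raw))  # → fn bsearch(...) {}
```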

Start here: pull the Ollama image, run it on a coding task, see if output quality justifies the VRAM. If yes, move to vLLM for production. If no, you’ve spent 10 minutes instead of a week.