An Ask HN thread just blew up – 477 upvotes, 245 comments – asking a question that doesn’t usually trend: has anyone actually swapped Claude or GPT for a local model as their daily coding driver? Not a weekend experiment. Daily.
Reading the comments, two approaches keep showing up. One is full replacement: cancel the subscription, run everything on-device, never touch a cloud API again. The other is the hybrid pattern – keep the agent harness (OpenCode, Aider, Crush), but point it at a local model. The hybrid wins. Full replacement sounds purer, but commenters describe local models that get stuck in loops, get tool calls wrong, and then waste thinking tokens re-reading files instead of retrying. The hybrid keeps the agent harness that already works and just changes the brain.
The scenario this is for
You’re on Claude Max. You’re working on code you’d rather not stream to a third party. You have a 32GB GPU, a 36GB+ Mac, or a Mac Studio sitting around. And you want to know whether running a local model for daily coding is realistic right now (June 2026), or whether you should wait another six months.
This guide picks one stack that the thread converged on, walks through setup, and is honest about where it breaks.
What the HN thread actually converged on
Forget vibes. Someone counted mentions across 500+ comments (Tomasz Tunguz’s breakdown):
| Component | Top mention | Share |
|---|---|---|
| Model | Qwen3 35B-A3B (MoE) | 33% |
| Model (runner-up) | Qwen 27B variant | 20% |
| Agent harness | Pi | 49% |
| Agent harness (runner-up) | OpenCode | 45% |
The pattern is mixture-of-experts models paired with lightweight, OpenAI-API-compatible agent harnesses. Qwen3-Coder 30B ships 30B total parameters with only 3.3B activated per token (as of June 2026, per the Ollama library page). That’s why it runs on consumer hardware that couldn’t handle a dense 30B model – inference cost lands close to a 3B, quality close to a 30B.
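To make the MoE trade-off concrete, here is a back-of-the-envelope calculation using only figures quoted in this article (30B total, 3.3B active, roughly 19GB on disk for the Q4_K_M tag). Treat it as a rough sketch: the bits-per-weight number is implied by those rounded figures, not taken from any quantization spec.

```shell
# Figures quoted in this article; rounded, so the results are approximate.
total_b=30      # total parameters, in billions
active_b=3.3    # parameters activated per token, in billions
disk_gb=19      # approximate size of the Q4_K_M download, in GB

# Share of the weights that participate in each token's forward pass.
active_pct=$(awk -v t="$total_b" -v a="$active_b" 'BEGIN { printf "%.1f", 100 * a / t }')

# Average bits stored per weight, implied by the download size.
bits_per_weight=$(awk -v g="$disk_gb" -v t="$total_b" 'BEGIN { printf "%.1f", g * 8 / t }')

echo "active per token: ${active_pct}% of weights"
echo "implied storage:  ${bits_per_weight} bits per weight"
```

The catch is that per-token compute scales with the 3.3B active parameters, but all 30B still have to sit in memory, because the set of experts in use changes from token to token. The hardware requirement is set by the 19GB file, not by the 3B-class speed.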
On the benchmark side, the 480B variant scores 61.8% on Aider Polyglot, comparable to Claude Sonnet-4, GPT-4.1, and Kimi K2 (per Unsloth’s documentation, June 2026) – but that variant needs 250GB of unified memory minimum. Nobody’s running that on a laptop. The 30B-A3B is the one you actually run.
Setup: OpenCode + Ollama + Qwen3-Coder 30B
Three commands, then one config edit.
```shell
# 1. Install Ollama if you haven't
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull Qwen3-Coder 30B (about 19GB on disk, as of June 2026)
ollama pull qwen3-coder:30b-a3b-q4_K_M

# 3. Confirm it's serving. If the server isn't already running, start it with
#    `ollama serve` in one terminal, then run this in another:
curl http://localhost:11434/api/chat -d '{"model":"qwen3-coder:30b-a3b-q4_K_M","messages":[{"role":"user","content":"hi"}]}'
```
Then OpenCode. It connects to Ollama through an OpenAI-compatible adapter – set provider.baseURL to your localhost endpoint ending in /v1. That’s the full integration; nothing else to configure on the OpenCode side.
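A minimal sketch of that config edit, written as a shell heredoc so it can be pasted after the commands above. The key layout (a custom provider entry that uses the `@ai-sdk/openai-compatible` package and an `options.baseURL`) is assumed from the custom-provider pattern in the OpenCode docs; the provider id `ollama` and the display names are arbitrary labels chosen for this example. Check the current OpenCode docs before relying on exact key names.

```shell
# Writes a project-level opencode.json that points OpenCode at local Ollama.
# Assumption: key names follow OpenCode's custom-provider schema; verify them
# against the current docs. "ollama" and the display names are arbitrary labels.

# Keep a backup if a config already exists in this directory.
[ -f opencode.json ] && cp opencode.json opencode.json.bak

cat > opencode.json <<'EOF'
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama (local)",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "qwen3-coder:30b-a3b-q4_K_M": {
          "name": "Qwen3-Coder 30B-A3B (Q4_K_M)"
        }
      }
    }
  }
}
EOF
```

The model key has to match the Ollama tag exactly, since that string is what gets sent as the model name to the `/v1` endpoint.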
Sampler settings matter more than people admit. Use what Qwen officially recommends: temperature 0.7, top_p 0.8, top_k 20, repetition_penalty 1.05. Drop temperature below 0.7 and the model gets repetitive on long edits.
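One way to apply those values is per request, through the `options` object of Ollama’s native `/api/chat` endpoint. This is a sketch rather than the only route. Note that Ollama’s name for the repetition penalty is `repeat_penalty`, and the payload file name here is only an illustration. The request is sent only if a local Ollama server answers.

```shell
# Request body carrying Qwen's recommended sampler settings.
# Ollama calls the repetition penalty "repeat_penalty".
cat > sampler-check.json <<'EOF'
{
  "model": "qwen3-coder:30b-a3b-q4_K_M",
  "stream": false,
  "messages": [
    {"role": "user", "content": "Write a one-line Python function that reverses a string."}
  ],
  "options": {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "repeat_penalty": 1.05
  }
}
EOF

# Send it only when a local Ollama server is reachable.
if curl -sf http://localhost:11434/api/tags > /dev/null 2>&1; then
  curl -s http://localhost:11434/api/chat -d @sampler-check.json
else
  echo "Ollama is not reachable on localhost:11434; payload saved to sampler-check.json"
fi
```

Harnesses that talk to the `/v1` endpoint may not forward `top_k` or the repetition penalty, since neither is a standard OpenAI chat parameter. Setting them as model defaults in a Modelfile is a common workaround.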
Pro tip: If tool calls silently fail, raise Ollama’s num_ctx. The OpenCode docs call this out specifically – default context is too small for an agent loop that’s reading two or three files plus a system prompt.
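A sketch of one way to raise the context window: derive a model from the curated tag with a larger `num_ctx` baked in, then point the agent at the derived name. The value 32768 and the name `qwen3-coder-agent` are assumptions for illustration. Pick a context size your memory can hold, since a larger window needs more memory for the KV cache.

```shell
# Modelfile that inherits the curated tag and only overrides the context size.
# 32768 is an illustrative value, not a documented recommendation.
cat > Modelfile.agent <<'EOF'
FROM qwen3-coder:30b-a3b-q4_K_M
PARAMETER num_ctx 32768
EOF

# Build the derived model when the ollama CLI is available.
if command -v ollama > /dev/null 2>&1; then
  ollama create qwen3-coder-agent -f Modelfile.agent \
    || echo "ollama create failed; is the server running and the base tag pulled?"
else
  echo "ollama CLI not found; Modelfile.agent written but not built"
fi
```

After the build, use `qwen3-coder-agent` in place of the original tag wherever the agent config names the model.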
The three traps that aren’t in any tutorial
This is where the HN thread is more useful than the docs.
Trap 1: the tool-calling bug that broke every GGUF
For weeks after release (fixed as of June 2026), tool-calling was broken universally across all Qwen3-Coder GGUF uploads – affecting llama.cpp, Ollama, LMStudio, Open WebUI, and Jan. Unsloth patched it and coordinated with the Qwen team. If you pulled the model before the fix and never re-pulled, your agent will look like it’s working but tool calls will quietly drop. Re-pull. It’s free.
Trap 2: the variant tag matters more than the quant
One developer documented this in detail: Crush with a custom Q5_K_M Modelfile hallucinated [uses view tool] brackets instead of executing the view tool. Switching to Ollama’s curated qwen3-coder:30b-a3b-q4_K_M tag with the exact same prompt drove the agentic loop without complaint. The lesson: when an agent harness misbehaves, try the curated tag before blaming the model or the harness.
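Both traps so far look the same from the outside: the model describes a tool call in prose instead of emitting one. A quick smoke test, sketched below, sends a single request with one tool defined to Ollama’s `/api/chat` and checks whether the response carries a structured `tool_calls` field. The `read_file` tool and the file names are hypothetical, made up for this check; nothing is actually executed.

```shell
# One-tool request; "read_file" is a made-up tool used only for this check.
cat > toolcall-check.json <<'EOF'
{
  "model": "qwen3-coder:30b-a3b-q4_K_M",
  "stream": false,
  "messages": [
    {"role": "user", "content": "Read the file README.md using the available tool."}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "read_file",
        "description": "Read a file from the project and return its contents",
        "parameters": {
          "type": "object",
          "properties": {
            "path": {"type": "string", "description": "Path relative to the project root"}
          },
          "required": ["path"]
        }
      }
    }
  ]
}
EOF

if curl -sf http://localhost:11434/api/tags > /dev/null 2>&1; then
  # A healthy build answers with a structured tool call, not bracketed prose.
  if curl -s http://localhost:11434/api/chat -d @toolcall-check.json | grep -q '"tool_calls"'; then
    echo "OK: structured tool call returned"
  else
    echo "WARN: no tool_calls in response; re-pull the model or try the curated tag"
  fi
else
  echo "Ollama is not reachable on localhost:11434; payload saved to toolcall-check.json"
fi
```

A single pass is weak evidence, because sampling is random. Run it a few times, and swap in the tag you suspect to compare against the curated one.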
Trap 3: some models can read but can’t write
A benchmark from the ollama-opencode-setup repo (June 2026) exposes a failure mode tutorials never mention: mistral-nemo:12b cannot create or modify files at all in OpenCode, while qwen3:8b handles file writes. The model will happily review your code and then never write the patch. If your agent stalls on writes, the model – not the harness – is the problem.
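A rough way to test this before committing to a model: ask the agent to create one trivial file in a scratch directory, then check the disk rather than the transcript. The sketch below assumes `opencode run` accepts a prompt plus a `--model` flag in provider/model form, and that a provider named `ollama` is configured; both are assumptions to adjust for your installed version. The `check_write` helper is hypothetical and only inspects the file system.

```shell
# Returns 0 if the expected file exists and is non-empty, 1 otherwise.
check_write() {
  # $1: directory the agent ran in, $2: file it was asked to create
  if [ -s "$1/$2" ]; then
    echo "PASS: $2 was written"
  else
    echo "FAIL: $2 missing or empty; the model may not be able to write files"
    return 1
  fi
}

workdir=$(mktemp -d)

# Assumption: `opencode run` takes a prompt and --model provider/model.
if command -v opencode > /dev/null 2>&1; then
  (cd "$workdir" && opencode run --model "ollama/qwen3:8b" \
    "Create a file named hello.txt containing the word hello.") || true
  check_write "$workdir" hello.txt || true
else
  echo "opencode not found; skipping the live check"
fi
```

Checking the file system matters because a model that cannot write will often still report success in its reply.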
Honest limitations
The HN thread is unusually candid. Read enough comments and a pattern emerges: you need to know exactly what you’re asking. The model doesn’t fill in gaps the way Claude does – leave an assumption open and it takes the path of least resistance, CSS inline in HTML being a classic example.
Where local still loses:
- Ambiguous prompts. Claude fills gaps with plausible defaults. Local 30B-A3B exposes the gap.
- Architectural decisions. “How should I structure this microservice boundary?” – local gives a generic answer.
- Speed for trivial tasks. A simple file write that Claude does in 2-5 seconds takes qwen3:8b roughly 15-30 seconds (imagewize benchmark, June 2026). Multiply by 200 small edits a day.
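The "multiply by 200" step is worth doing explicitly. The sketch below uses only the timings quoted above; keep in mind those local numbers were measured on qwen3:8b, so they are a stand-in rather than a measurement of the 30B-A3B.

```shell
edits_per_day=200

# Minutes per day spent waiting on small edits, given seconds per edit.
wait_minutes() {
  awk -v n="$edits_per_day" -v s="$1" 'BEGIN { printf "%.0f", n * s / 60 }'
}

echo "cloud: $(wait_minutes 2) to $(wait_minutes 5) minutes per day"
echo "local: $(wait_minutes 15) to $(wait_minutes 30) minutes per day"
# prints:
# cloud: 7 to 17 minutes per day
# local: 50 to 100 minutes per day
```

At that volume the gap is roughly three quarters of an hour to an hour and a half of extra waiting per day, which is why trivial interactive edits are a poor fit for the local model.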
Where local wins:
- Privacy-sensitive code. Nothing leaves the machine.
- Long-context refactors. 256K native context, extendable to 1M with YaRN – and you’re not paying per token to use it.
- Overnight batch work. Run it for eight hours. Cost: electricity.
Most of the HN replies claiming full replacement came from people on Mac Studios with 128GB unified memory. One commenter, Greenpants, runs Qwen3 35B-A3B on a Mac Studio 128GB and a MacBook 36GB, containerized and sandboxed for offline use – and rebuilt a Django/Wagtail site with it. That’s a significant hardware investment. The approach works, but the economics only make sense after that upfront spend pays off.
A realistic verdict
Is local ready to fully replace Claude for daily coding in June 2026? For most people, no. For specific workflows – privacy-sensitive code, long-context understanding, batch refactors, overnight runs – yes, today, with the stack above.
The hybrid pattern is the move: keep your cloud subscription for hard architectural calls, route everything else – boilerplate, tests, docstrings, mechanical refactors, code review on proprietary repos – through Qwen3-Coder 30B locally. How much you save depends entirely on your current usage pattern. Run the hybrid for two weeks and watch your cloud token count. That number tells you more than any projection.
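As a sketch, the routing rule fits in a few lines of shell. The category names are illustrative labels invented for this example, not part of OpenCode or any other tool, and in practice the "router" is usually just you choosing which model to select for a task.

```shell
# Maps a task category to "local" or "cloud".
# Category names are illustrative labels, not part of any tool.
route_task() {
  case "$1" in
    architecture|design)
      echo "cloud" ;;   # hard architectural calls stay on the subscription
    *)
      echo "local" ;;   # boilerplate, tests, docstrings, refactors, reviews
  esac
}

for task in architecture tests docstrings refactor; do
  echo "$task -> $(route_task "$task")"
done
```

Defaulting to local and escalating only named categories mirrors the advice above: the cloud model is the exception, so the cloud token count becomes the number to watch.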
FAQ
Do I need a Mac Studio to run this?
No. A 32GB GPU handles Qwen3-Coder 30B-A3B fine. Q5_K_M at 21.7GB is the sweet spot for 32GB cards; Q4_K_M is around 18GB with a 1-2% quality hit on coding tasks where token-level precision matters (per the ai.rs RTX 5090 guide, June 2026). Mac Studio is overkill unless you’re chasing the 480B variant – which needs 250GB of unified memory and is out of reach for virtually everyone.
Why Qwen3-Coder over DeepSeek or Gemma?
The HN thread voted with mention counts, and Qwen variants dominated: the 35B-A3B alone took 33% of model mentions, with a 27B variant at another 20%. That’s the short answer. The longer one: on the 480B model, the Unsloth Dynamic UD-Q4_K_XL quantization (276GB) scored 60.9% on Aider Polyglot versus 61.8% for the full bf16 (960GB) – that kind of quantization resilience matters for a daily driver where you’ll be running compressed versions. DeepSeek and Gemma are credible; try them after you’ve baselined Qwen.
Is the cost math actually real?
Depends entirely on your usage. If you’re hitting Claude Max limits regularly, running local can shift a meaningful chunk of that load – but your hardware was already capital you spent, and the math looks different if you’re a light user. Honest advice: run the hybrid for two weeks, watch what actually moves to local, then decide. Don’t cancel anything on day one.
Next action: Run the three install commands above, pull qwen3-coder:30b-a3b-q4_K_M, and point OpenCode at http://localhost:11434/v1. Then take your last real Claude task – not a toy prompt, an actual ticket – and run it through the local stack. That single comparison tells you more than any benchmark.