Here’s a detail that gets buried in every writeup: the person who built ds4 – the new DeepSeek V4 Flash local inference engine for Metal – wrote the entire thing in about two weeks. The README is upfront that GPT-5.5 provided strong assistance throughout; antirez (yes, the Redis guy) made the point that if AI-assisted code isn’t your thing, this project isn’t for you. Honest. Rare.
The repo hit 1.4k stars shortly after release. But stars don’t configure Metal kernels. Time to get into the actual setup.
Should you even try this? A 30-second decision tree
Whether ds4 is worth your weekend comes down to one number – how much unified memory you have.
| Your Mac | What’s realistic |
|---|---|
| <128GB unified memory | Skip it. The 2-bit quant alone is 81GB. |
| 128GB MacBook Pro / Studio | Works, but cap context at 100-300K tokens. |
| 192GB+ Mac Studio | Comfortable. Push to 4-bit, longer context. |
| 512GB Mac Studio M3 Ultra | The promised land – full context, fast prefill. |
Why 128GB is the floor: the 2-bit quants are 81GB, and full 1M-token context needs roughly 26GB on top of that (the compressed indexer alone accounts for ~22GB of it), per the antirez/ds4 README. So you bought a 1M-token model but can realistically run maybe 100-300K of it on a 128GB machine. Worth knowing before the download starts.
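The memory math above is easy to rerun for your own context size. This is a back-of-the-envelope sketch, not anything from the ds4 codebase: it uses the 81GB and 26GB figures quoted above and assumes, purely for illustration, that the context cost scales linearly with token count. The function name is made up.

```python
# Figures quoted in the article; linear scaling with context length is an ASSUMPTION.
WEIGHTS_GB = 81.0            # size of the 2-bit quant
FULL_CONTEXT_GB = 26.0       # extra memory quoted for the full 1M-token context
FULL_CONTEXT_TOKENS = 1_000_000


def estimated_memory_gb(ctx_tokens: int) -> float:
    """Weights plus a linearly scaled share of the full-context cost (hypothetical model)."""
    return WEIGHTS_GB + FULL_CONTEXT_GB * ctx_tokens / FULL_CONTEXT_TOKENS


for ctx in (100_000, 300_000, 1_000_000):
    print(f"{ctx:>9} tokens -> ~{estimated_memory_gb(ctx):.1f} GB")
```

The raw sum leaves out whatever macOS, other apps, and runtime buffers need, so treat it as a lower bound rather than a guarantee that a given context fits.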
What ds4 actually is (and isn’t)
Not Ollama. Not LM Studio. The README is explicit: ds4.c is a small native inference engine for DeepSeek V4 Flash, intentionally narrow – not a generic GGUF runner, not a wrapper around another runtime. One model, one architecture, one path through the GPU.
The wager behind that narrowness: a hand-tuned Metal graph for one specific model layout beats a generic runtime. In antirez’s April 2026 benchmarks, a 128GB MacBook Pro M3 Max running the 2-bit quant at 32K context hit 58.52 tokens/s prefill and 26.68 tokens/s generation. For a 284B-parameter model on a laptop, that’s not magic – but it’s usable.
The DeepSeek V4 Flash model card describes a Mixture-of-Experts design: 284B total parameters, 13B activated per token, 1M-token context window, MIT license, released April 24, 2026. All 284B parameters still have to sit in memory, but only 13B firing per token is what keeps the per-token compute and memory traffic workable on a Mac. Worth keeping in mind when the benchmarks feel surprisingly fast.
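Two quick ratios fall out of those numbers. The sketch below only does arithmetic on figures already quoted in this article; it assumes "81GB" means decimal gigabytes, and the variable names are illustrative.

```python
TOTAL_PARAMS = 284e9     # total parameters, from the model card
ACTIVE_PARAMS = 13e9     # parameters activated per token
Q2_FILE_BYTES = 81e9     # 2-bit quant size, ASSUMING decimal GB

# Share of the model that actually runs for each generated token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Average storage per parameter across the whole file.
bits_per_weight = Q2_FILE_BYTES * 8 / TOTAL_PARAMS

print(f"active per token: {active_fraction:.1%}")
print(f"average bits per weight: {bits_per_weight:.2f}")
```

An average a bit above 2 bits per weight would be consistent with a mixed quantization in which some tensors are stored at higher precision, which matches the "quantization mix" wording used later in this article.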
Setup: from zero to running in about an hour
You need Apple Silicon with at least 128GB of unified memory, Xcode command-line tools, and roughly 100GB of free disk. That last point trips people up – Hugging Face downloads fail and resume constantly, and if you’re tight on space the partial files will fill your drive without completing. Give yourself headroom.
1. Clone and build
git clone https://github.com/antirez/ds4.git
cd ds4
make
Two binaries come out: ds4 (CLI) and ds4-server (HTTP server). No CMake. No Python environment. That’s the whole build.
2. Download the model
The repo ships a download script that pulls from huggingface.co/antirez/deepseek-v4-gguf, stores files under ./gguf/, resumes partial downloads with curl -C -, and updates ./ds4flash.gguf to point at whichever quant you pick.
./download_model.sh q2 # 81GB, fits 128GB Macs
./download_model.sh q4 # larger, needs 192GB+
One thing the tutorials skip: this engine only works with the DeepSeek V4 Flash GGUFs published specifically for this project. It’s not a general GGUF loader – the tensor layout, quantization mix, metadata, and optional MTP state have to match what ds4 expects. Already downloaded a DeepSeek V4 GGUF from somewhere else? Start over. Even if it’s labeled IQ2.
3. Start the server
./ds4-server --ctx 100000 \
  --kv-disk-dir /tmp/ds4-kv \
  --kv-disk-space-mb 8192
100K-token context, 8GB of disk-backed KV cache, default port. Point any OpenAI SDK at http://localhost:8080/v1 and it works.
Sampling settings matter here: set temperature=1.0 and top_p=1.0 in your client. Those are DeepSeek’s officially recommended values for local deployment. The defaults that work fine for GPT-style models will give you noticeably worse output with this one.
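If you would rather launch the server from a script than retype the flags, the command from step 3 can be assembled as an argument list. This is a small convenience sketch: the flags are the ones shown above, while the helper name and its parameters are made up for illustration.

```python
def server_argv(ctx_tokens: int, kv_dir: str, kv_space_mb: int,
                binary: str = "./ds4-server") -> list[str]:
    """Build the ds4-server command line using only the flags shown in step 3."""
    return [
        binary,
        "--ctx", str(ctx_tokens),
        "--kv-disk-dir", kv_dir,
        "--kv-disk-space-mb", str(kv_space_mb),
    ]


argv = server_argv(100_000, "/tmp/ds4-kv", 8192)
print(" ".join(argv))

# To actually start the server from Python (blocks until it exits):
#   import subprocess
#   subprocess.run(argv, check=True)
```

Passing a list instead of a shell string sidesteps quoting and line-continuation mistakes.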
The killer feature nobody talks about: disk KV cache
Most local-LLM tutorials skip this entirely. They shouldn’t.
Every time you start a fresh chat with a long context, you pay a prefill cost – even if 90% of your prompt is identical system instructions, the same docs, the same codebase summary you sent yesterday. ds4 persists the KV cache to disk. The defaults (from the README) are conservative: store prefixes of at least 512 tokens, cold-save up to 30,000 tokens, trim 32 tail tokens, align to 2,048-token chunks.
Second time you ask a question against the same long preamble – prefill is essentially free. For coding agents that send the same repo context on every turn, this is the difference between unusable and useful. That’s not marketing copy; it’s just how KV reuse works when the prefix matches.
One catch: by default, checkpoints may be reused across the 2-bit and 4-bit routed-expert variants if the token prefix matches. Use --kv-cache-reject-different-quant when you want strict same-quant reuse only. Most people will never notice. If outputs feel subtly off after switching quants, that’s your culprit.
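Prefix reuse only pays off if your client keeps the shared material at the front of the prompt. The toy below is not ds4's cache logic. It counts how many leading tokens two prompts share, using whitespace splitting as a stand-in for the real tokenizer, and turns that into a rough prefill-time estimate using the 58.52 tokens/s figure quoted earlier, assumed constant here.

```python
PREFILL_TOK_PER_S = 58.52  # benchmark figure quoted earlier; ASSUMED constant


def toks(text: str) -> list[str]:
    """Whitespace split as a stand-in for the model's real tokenizer (assumption)."""
    return text.split()


def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


preamble = "You are a code assistant. " + "repo summary line. " * 200
q1, q2 = "Explain the parser.", "Where is the cache flushed?"

# Stable context first, question last: the whole preamble is reusable.
stable_first = shared_prefix_len(toks(preamble + q1), toks(preamble + q2))
# Question first: the prompts diverge at token 0, so nothing is reusable.
question_first = shared_prefix_len(toks(q1 + " " + preamble), toks(q2 + " " + preamble))

print(stable_first, question_first)
print(f"prefill skipped on a hit: ~{stable_first / PREFILL_TOK_PER_S:.1f} s")
```

The takeaway for agent authors: put system instructions and repo context first and the changing question last, so consecutive requests share as long a prefix as possible.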
Two API surfaces, one local server
ds4-server exposes /v1/chat/completions – the standard OpenAI-style endpoint – and /v1/messages, which is the Anthropic-compatible path used by Claude Code-style clients. Both support SSE streaming. Tool schemas on the OpenAI side get rendered into DeepSeek’s DSML tool format automatically.
Practically: point Claude Code at localhost:8080 and it connects. Same with the OpenAI Python SDK:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Refactor this Python function..."}],
    temperature=1.0,
    top_p=1.0,
)
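For the Anthropic-compatible path, a request can be built with nothing but the standard library. Treat this as a sketch under assumptions: the payload follows the general shape of Anthropic's Messages API (model, max_tokens, messages), the model name is copied from the example above, and whether ds4-server expects extra headers or honors every field is not something this article confirms. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # host and port as used in this article


def build_messages_request(prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build a POST to /v1/messages in the Anthropic Messages style (assumed accepted)."""
    payload = {
        "model": "deepseek-v4-flash",   # assumed model name, as in the SDK example
        "max_tokens": max_tokens,       # required by the Anthropic Messages format
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,             # sampling values recommended in this article
        "top_p": 1.0,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/messages",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON reply; needs a running ds4-server."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())


# With the server running:
#   reply = send(build_messages_request("Summarize what a KV cache does."))
```

Keeping request construction separate from sending makes it easy to inspect the payload before anything touches the network.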
As of April 2026, community reports on Hacker News put the hosted DeepSeek API at $0.14 in / $0.28 out per million tokens – verify that before committing to any cost comparison, since pricing changes. The local case isn’t really about saving money anyway. It’s about privacy, offline work, or not sending proprietary code over the wire.
Honest limitations
The README doesn’t hide these. Most tutorials do.
- Single-worker bottleneck. Request parsing and sockets run in client threads, but inference itself is serialized through one Metal worker – the server doesn’t batch independent requests. Concurrent requests wait in line. Don’t try to serve a team off this. One user at a time.
- MTP is not the speedup you’re hoping for. The MTP/speculative decoding path (per the mitsuhiko/ds4 fork README) is still experimental, correctness-gated, and provides at most a slight speedup. Don’t enable --mtp expecting a generation-speed win.
- Apple Silicon only, as of May 2026. No CUDA, no AMD, no Linux GPU path. Future CUDA support is hinted at but uncommitted.
- Think Max mode is impractical on 128GB. DeepSeek recommends at least 384K context for Think Max reasoning mode. The memory math above tells you that’s out of reach without 256GB+ machines.
That last one is the quiet killer. The reasoning mode that makes V4 Flash competitive with frontier models needs more context than most Macs can give it. You’re running the model – just not at full strength.
Is this the future of local AI?
Probably not in the form ds4 itself takes. The pattern feels right, though: a tiny model-specific engine that does three things well rather than one generic runtime doing ten things adequately. Whether that approach survives the next model generation is genuinely unclear – it’s a trade-off of maintenance burden against raw speed.
For now: clone the repo, kick off the model download tonight, have it ready for tomorrow. If your Mac is big enough, you’ll have a frontier-adjacent coding model on localhost by lunch.
FAQ
Can I run ds4 on a 64GB Mac?
No. The smallest quant is 81GB. There’s no path here for sub-128GB machines.
Why use ds4 instead of just running DeepSeek V4 Flash through llama.cpp?
Both are valid. antirez maintains a separate llama.cpp fork with V4 Flash support if you want the generic-runtime path. The case for ds4 specifically: it’s a model-specific Metal executor with disk KV cache persistence – a feature a generic runtime won’t give you – and it’s the only option with dual OpenAI/Anthropic API surfaces in one binary. If you’re on a Mac and running this one model, ds4 wins on convenience. If you switch models often, or you’re on Linux, the generic stack makes more sense.
Will my coding agent actually work, or is this still toy-grade?
It works for solo use. The catch is parallel tool calls – if your agent fans those out simultaneously, they queue behind each other on the single Metal worker rather than running in parallel. One agent, one conversation at a time: fine. Anything that looks like concurrent inference: broken by design.
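To put a number on the queueing, here is a tiny model of a single worker. It is not a measurement of ds4: it assumes all requests arrive at once and are served strictly in arrival order, and the per-call durations are invented for the example.

```python
def completion_times(durations: list[float]) -> list[float]:
    """Finish time of each request when one worker serves them in arrival order."""
    finished = []
    clock = 0.0
    for d in durations:
        clock += d              # the worker is busy with this request for d seconds
        finished.append(clock)  # so the request completes at the running total
    return finished


# Four tool calls fanned out together, each needing 20 s of inference (made-up figure).
calls = [20.0, 20.0, 20.0, 20.0]
print(completion_times(calls))  # [20.0, 40.0, 60.0, 80.0]
```

Under these assumptions the last of four parallel calls finishes at the same moment it would have if the agent had issued them one after another, so fanning out buys nothing.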