Two ways to start with DeepSeek V4 are floating around the community right now. The first: spin up V4-Pro because the spec sheet says 1.6 trillion parameters and you want the best. The second: start with V4-Flash, get your code working, and only graduate to Pro if your task actually needs it. The second one is correct – and not just because it’s cheaper.
V4-Pro is currently capacity-constrained. CFR’s analysis of the launch flags this directly: DeepSeek itself admits it cannot serve V4 Pro to most customers because it lacks the chips to do so. Build against Pro first and you’re building against a model that may queue, throttle, or quietly fall back. Start with Flash.
What DeepSeek V4 actually is
DeepSeek V4 dropped as a preview on April 24, 2026. It’s a Mixture-of-Experts pair: V4-Pro with 1.6T total parameters (49B activated) and V4-Flash with 284B total (13B activated), both supporting a one-million-token context length by default across all official services. Weights are MIT-licensed and live on Hugging Face.
The “almost on the frontier” framing comes straight from DeepSeek’s own paper, which concedes V4 trails state-of-the-art models by approximately 3 to 6 months – a candid admission you don’t often see in a launch doc. It’s not beating the current generation of frontier models. It’s roughly matching where they were half a year ago, at a fraction of the price.
The efficiency story is the interesting part. A hybrid attention mechanism means V4-Pro requires only 27% of single-token inference FLOPs and 10% of the KV cache compared with V3.2 – and that’s at the full 1M-token context setting, per the official Hugging Face model card. V4-Flash pushes further: 10% of FLOPs and 7% of KV cache vs V3.2 at the same window, as Simon Willison noted when reading the paper. Long context stopped being a luxury feature you pay 10x for.
Worth sitting with those numbers for a second. A 10% KV cache footprint at 1M tokens isn’t just a benchmark stat – it’s the difference between a model that’s theoretically capable of long context and one you can actually run at 1M in production without the memory bill eating your margin.
Pick a variant – and a mode
The decision is two-layered. First, which model. Second, which reasoning mode.
| Model | Total / Active | Best for | Availability |
|---|---|---|---|
| V4-Flash | 284B / 13B | Default daily driver, agents, code | Reliable |
| V4-Pro | 1.6T / 49B | Hard reasoning, long-doc analysis | Capacity-limited (as of April 2026) |
Both expose three reasoning effort modes. Think Max is the highest – maximum reasoning depth, recommended for math, multi-step planning, or hairy bug hunts. The lighter modes trade depth for speed on simpler queries. Think Max is not free: it burns tokens and time. Don’t use it to ask what the capital of France is.
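If you'd rather encode that decision than rediscover it per request, a tiny router works. This is a minimal sketch: the task categories are arbitrary placeholders, and the thinking toggle reuses the extra_body shape from the setup example below. Whichever field selects Think Max specifically isn't shown here, so check the current docs for it.
def route_request(task_type: str) -> dict:
    """Pick a V4 variant and reasoning mode for a request (illustrative only)."""
    # Flash is the default daily driver; reach for Pro only on hard reasoning / long docs.
    model = "deepseek-v4-pro" if task_type in {"hard-reasoning", "long-doc"} else "deepseek-v4-flash"
    kwargs = {"model": model}
    if task_type in {"math", "planning", "bug-hunt", "hard-reasoning"}:
        # Same thinking toggle the API example below uses; selecting Think Max
        # specifically may need an extra field - verify against the current docs.
        kwargs["extra_body"] = {"thinking": {"type": "enabled"}}
    return kwargs
Then client.chat.completions.create(messages=msgs, **route_request("bug-hunt")) keeps the mode choice in one place instead of scattered across call sites.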
The actual setup (fastest path)
If you’ve used DeepSeek before, almost nothing changes about your code. Keep the base_url and just update the model string to deepseek-v4-pro or deepseek-v4-flash. The official announcement confirms the API supports both OpenAI ChatCompletions and Anthropic formats.
from openai import OpenAI
client = OpenAI(
api_key="YOUR_KEY",
base_url="https://api.deepseek.com"
)
resp = client.chat.completions.create(
model="deepseek-v4-flash",
messages=[{"role": "user", "content": "Refactor this function..."}],
extra_body={"thinking": {"type": "enabled"}} # toggle thinking mode
)
print(resp.choices[0].message.content)
That’s it for hosted use. Three minutes if your key is already set. The announcement also confirms the API works with Claude Code, OpenClaw, and OpenCode out of the box, so existing agent harnesses just work.
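If your harness speaks the Anthropic Messages format instead, the same key should work through that surface. Here's a sketch using the anthropic SDK; the exact base URL path for DeepSeek's Anthropic-compatible endpoint is an assumption on my part, so confirm it against the announcement before wiring it in.
from anthropic import Anthropic

client = Anthropic(
    api_key="YOUR_KEY",
    base_url="https://api.deepseek.com/anthropic",  # assumed path - verify in the docs
)
resp = client.messages.create(
    model="deepseek-v4-flash",
    max_tokens=2048,
    messages=[{"role": "user", "content": "Refactor this function..."}],
)
print(resp.content[0].text)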
The local-deploy gotcha nobody warns you about
Running V4 locally – even via vLLM – has one trap that none of the launch posts highlight. This release does not include a Jinja-format chat template. Instead, DeepSeek ships a dedicated encoding folder with Python scripts demonstrating how to encode messages in OpenAI-compatible format (confirmed in the Hugging Face model card).
So if you do this:
tokenizer.apply_chat_template(messages, tokenize=False)
# returns None or errors out
You haven’t done anything wrong. There’s no template registered. You need the helper they ship with the model:
from encoding_dsv4 import encode_messages  # helper shipped in the model repo's encoding folder
prompt = encode_messages(messages, thinking_mode="thinking")  # serializes roles + thinking mode
tokens = tokenizer.encode(prompt)
This is a deliberate design choice – DeepSeek wanted finer control over how thinking and tool roles get serialized – but it breaks every “copy this snippet from a Llama tutorial” workflow. For local deployment, also set temperature = 1.0 and top_p = 1.0; the model card calls these out specifically, and default sampling values from other models will degrade output quality.
Pro tip: If you plan to use Think Max mode, the official guidance is to allocate at least 384K tokens of context window. The reasoning chains are long. Cap it at 32K and you’ll see truncated mid-thought outputs that look like model failures but are really config failures.
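Putting the local pieces together, here's a minimal vLLM sketch under the assumptions above. The Hugging Face repo id and parallelism settings are placeholders for your own deployment; the sampling values and the 384K budget for Think Max come straight from the guidance above.
from vllm import LLM, SamplingParams
from encoding_dsv4 import encode_messages  # helper from the model's encoding folder

# Repo id and parallelism are placeholders - size these for your own hardware.
llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",
    trust_remote_code=True,
    max_model_len=393216,  # >= 384K so Think Max chains don't truncate mid-thought
    tensor_parallel_size=8,
)

# Model-card sampling defaults; defaults carried over from other models will degrade output.
params = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=8192)

messages = [{"role": "user", "content": "Refactor this function..."}]
prompt = encode_messages(messages, thinking_mode="thinking")  # no Jinja template, remember
print(llm.generate([prompt], params)[0].outputs[0].text)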
Common pitfalls
Three things will burn you in the first week. None of them are obvious from the docs.
- The silent alias migration. As of April 2026, deepseek-chat and deepseek-reasoner already route to V4-Flash non-thinking and thinking respectively. Both aliases are scheduled for full retirement after July 24, 2026 at 15:59 UTC. If your old code still says model="deepseek-chat", you’re already on V4 – you just don’t know it. Update the strings now so future debugging isn’t archaeology.
- Pro requests may not actually run on Pro. Given the chip-shortage admission, behavior here may shift week to week. Log which model actually responded if you depend on Pro-specific quality – see the sketch after this list.
- 1M context isn’t 1M output. The context window is the input budget. Output length is capped separately – check the current docs for your specific deployment, as this may have changed since launch.
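For the Pro-fallback concern, the cheapest insurance is logging the model name the API reports back on every call. A sketch against the OpenAI-compatible response shape; whether DeepSeek surfaces a silent fallback in that field is an assumption worth verifying before you rely on it.
import logging

def create_logged(client, **kwargs):
    resp = client.chat.completions.create(**kwargs)
    requested, served = kwargs["model"], resp.model
    if served != requested:
        # May indicate a capacity fallback - or just an alias resolving to V4-Flash.
        logging.warning("requested %s but response reports %s", requested, served)
    return resp

resp = create_logged(
    client,
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Prove this invariant holds..."}],
)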
Where V4 fits – and where it doesn’t
The honest comparison isn’t V4 vs GPT-5. Counterpoint’s principal AI analyst Wei Sun said V4’s benchmark profile suggests excellent agent capability at lower cost – that’s the actual pitch, and it’s a real one. Long-context document work, agentic coding loops where token volume makes frontier models expensive, open-weight requirements (compliance, on-prem, data residency) – those are the cases where V4 earns its place.
For frontier-level reasoning where the 3-6 month gap shows up? That gap is real and you’ll feel it. Same for multimodal tasks – V4’s main release is text-only. And anything requiring vendor-grade SLA reliability: preview is preview, and that’s not a knock, just a fact.
FAQ
Is DeepSeek V4 actually free to use?
Self-hosting: yes, MIT license. The API is paid, but priced below frontier API tiers per the official announcement. No free tier details were confirmed at launch.
Should I migrate my production system to V4 right now?
Run it in shadow mode first – same prompts, both responses logged, one week of comparison against your actual workload. That’s the only honest way to know. If your current setup uses frontier models for summarization, code completion, or RAG over long docs, the cost difference will likely justify a switch. The 3-6 month reasoning gap only shows up on tasks that actually need frontier reasoning. Whether yours do is something only your eval data can answer.
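A minimal shadow-mode harness, assuming two OpenAI-compatible clients (your current provider and DeepSeek) and a JSONL log you diff offline; the prompt source and how you score the pairs are up to you.
import json, time

def shadow_run(prompts, current_client, current_model, shadow_client,
               shadow_model="deepseek-v4-flash", log_path="shadow_log.jsonl"):
    """Send the same prompts to both models and log raw outputs for offline comparison."""
    with open(log_path, "a") as log:
        for prompt in prompts:
            record = {"prompt": prompt, "ts": time.time()}
            for name, client, model in [("current", current_client, current_model),
                                         ("shadow", shadow_client, shadow_model)]:
                t0 = time.time()
                resp = client.chat.completions.create(
                    model=model, messages=[{"role": "user", "content": prompt}])
                record[name] = {"model": resp.model,
                                "latency_s": round(time.time() - t0, 2),
                                "output": resp.choices[0].message.content}
            log.write(json.dumps(record) + "\n")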
What’s the difference between V4 Think modes and the old R1 reasoner?
R1 was a separate model. V4 folds reasoning into the same model as a switchable mode – which is why deepseek-reasoner now just routes to V4-Flash thinking. One model, three effort levels, easier to manage in production than juggling two endpoints. The flip side: there’s no longer a dedicated reasoning-only checkpoint to tune against if that’s what your pipeline expected.
Next step: grab a key from the DeepSeek console, change one string in your existing OpenAI client to deepseek-v4-flash, and run your three hardest prompts through it. You’ll know within ten minutes whether it belongs in your stack.