You’ve been paying Anthropic or OpenAI for months. A teammate slacks you a link: DeepSeek V4 is out, the API costs a fraction of what you’re currently paying, and it ships with a 1M context window. Should you switch?
Honest answer: maybe – but probably not the way the launch posts are telling you to. V4 is not competitive with frontier U.S. models, and DeepSeek’s own technical paper admits V4 “trails state-of-the-art frontier models by approximately 3 to 6 months.” That’s the headline. The interesting part is what you can actually do with that gap when the price difference is this large.
What actually shipped on April 24
Two models, not one. According to DeepSeek’s April 24, 2026 release note, DeepSeek-V4-Pro has 1.6T parameters with 49B activated, and DeepSeek-V4-Flash has 284B parameters with 13B activated – both support a 1M-token context window. They’re released under MIT license and open-weights on Hugging Face.
The architectural detail that matters for your wallet: the Hugging Face model card notes that at 1M-token context, V4-Pro uses only 27% of the single-token inference FLOPs and 10% of the KV cache compared to V3.2. That’s why DeepSeek can price 1M context as a default floor instead of a premium tier.
Worth knowing: The 1M context isn’t a free pass to dump everything in. At V4-Pro rates, a single 1M-token prompt costs ~$0.44 just for input (as of the April 2026 launch). Long-context is now affordable, not free.
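That ~$0.44 is just the V4-Pro cache-miss input rate from the pricing table further down, applied to a full window:

```python
# Input-only cost of one full 1M-token prompt at V4-Pro's April 2026
# list rate of $0.435 per million input tokens (cache miss).
prompt_tokens = 1_000_000
print(prompt_tokens / 1_000_000 * 0.435)  # ≈ $0.44, before any output tokens
```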
The fastest way to actually try DeepSeek V4
Three paths, in order of effort:
- Free, in your browser: open chat.deepseek.com. Per DeepSeek’s release announcement, you get Expert Mode (reasoning chain on) and Instant Mode (fast, default). No subscription, no waitlist.
- API, ~10 lines of code: use the OpenAI SDK pointed at DeepSeek’s base URL (snippet below).
- Self-host: grab the weights from Hugging Face. Flash fits on serious consumer hardware at a quant; Pro needs multi-node.
For the API, here’s the minimum that works:
```python
import os

from openai import OpenAI

# DeepSeek's API is OpenAI-compatible: same SDK, different base_url.
client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/v1",
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",  # or "deepseek-v4-pro"
    messages=[{"role": "user", "content": "Summarize this PDF..."}],
    temperature=1.0,  # the model card's recommended defaults – resist the urge to lower them
    top_p=1.0,
)

print(response.choices[0].message.content)
```
Two things tutorials skip. First, the Hugging Face model card recommends temperature=1.0 and top_p=1.0 – most people instinctively lower these, which can hurt reasoning quality. Second, the legacy aliases work for now but the official docs confirm that deepseek-chat and deepseek-reasoner will be fully retired after July 24, 2026, 15:59 UTC, currently routing to deepseek-v4-flash non-thinking and thinking respectively. Update model strings now while it’s a one-line change.
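If your code still carries a legacy alias, the change really is that small – shown here for the chat alias; deepseek-reasoner maps to the thinking variant the same way:

```python
# Before: legacy alias, already routing to V4-Flash non-thinking,
# retired after July 24, 2026, 15:59 UTC.
model = "deepseek-chat"

# After: the explicit V4 model string – same behavior today, no cutoff surprise.
model = "deepseek-v4-flash"
```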
The reasoning_content gotcha that breaks your agent
Turns out this is the loudest practical V4 complaint on GitHub right now – and almost no tutorial covers it. Multiple open issues in OpenCode, OpenClaw, and Hermes all trace back to the same root cause.
When you turn on thinking mode, the API returns two fields: content (the answer) and reasoning_content (the chain-of-thought). Most clients capture content and discard reasoning_content. That’s fine for a single-turn chat. The moment a tool call enters the conversation, it isn’t.
From DeepSeek’s thinking mode official docs: for turns that perform tool calls, the reasoning_content must be fully passed back to the API in all subsequent requests. If your code doesn’t do this, the API returns a 400 error. This hits OpenCode, OpenClaw, Hermes, and basically every popular agent framework that wasn’t updated for V4.
Your three options:
- Disable thinking mode by passing `extra_body={"thinking": {"type": "disabled"}}` – works, but you lose the model's best capability.
- Switch to the Anthropic-compatible endpoint at `https://api.deepseek.com/anthropic` with `@ai-sdk/anthropic`, which handles reasoning blocks natively.
- Manually replay reasoning_content in your message history. In OpenCode, that's the `"interleaved": {"field": "reasoning_content"}` config; a rough sketch of the same idea in plain Python follows this list.
One more silent trap, also from the official thinking mode docs: temperature, top_p, presence_penalty, and frequency_penalty are all ignored in thinking mode – no error, just no effect. People tune those values for hours wondering why nothing changes.
Pricing math that actually applies to you
Forget the “35× cheaper” tweets for a second. Here’s what your invoice actually depends on.
| Model | Input (cache miss) | Output | Cache hit (approx.) |
|---|---|---|---|
| V4-Flash | $0.14/M tokens | $0.28/M tokens | ~$0.014/M tokens |
| V4-Pro (full price) | $0.435/M tokens | $0.87/M tokens | ~1/10 of input rate |
Prices from OpenRouter listings as of April 2026 – check current rates before budgeting. A 75% input discount on V4-Pro ran through May 31, 2026 15:59 UTC; verify whether it’s still active at time of reading. For comparison against frontier closed models like Claude Opus or GPT-5.5, check Anthropic’s and OpenAI’s official pricing pages directly – those numbers shift often enough that any third-party table goes stale fast.
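To turn those list rates into something that looks like an invoice, the whole calculation fits in a few lines. Every workload number below is a placeholder – swap in your own traffic:

```python
# Back-of-envelope monthly estimate at V4-Flash list rates (cache misses only).
calls_per_day = 20_000          # placeholder traffic
input_tokens_per_call = 3_000   # placeholder prompt + context size
output_tokens_per_call = 400    # placeholder response size

input_rate = 0.14 / 1_000_000   # $ per input token
output_rate = 0.28 / 1_000_000  # $ per output token

daily = calls_per_day * (
    input_tokens_per_call * input_rate + output_tokens_per_call * output_rate
)
print(f"~${daily * 30:,.0f}/month")  # ≈ $319/month with these placeholder numbers
```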
Cache-hit billing is where the math actually gets interesting for production workloads. Starting April 26, 2026, cache-hit input bills at roughly one-tenth of the standard rate – so a workload that reuses the same long system prompt or document across thousands of calls looks very different on your invoice than the headline number suggests. But it’s not automatic. Cache hits depend on the prefix actually persisting between requests, and DeepSeek treats it as best-effort. Track prompt_cache_hit_tokens and prompt_cache_miss_tokens in the response object – if your hit rate is below 50%, your prompt structure is the culprit, not the API.
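Here's a minimal way to watch that in practice, continuing from the API snippet earlier. The two usage fields are the cache counters named above; the rates are the V4-Pro list prices from the table, with the ~1/10 cache-hit rule applied:

```python
# Inspect the usage block on any chat.completions response from DeepSeek.
usage = response.usage
hits = getattr(usage, "prompt_cache_hit_tokens", 0) or 0
misses = getattr(usage, "prompt_cache_miss_tokens", 0) or 0
total = hits + misses

print(f"cache hit rate: {hits / total:.0%}" if total else "no input tokens counted")

# Effective blended input rate at V4-Pro list prices ($0.435/M miss, ~$0.0435/M hit).
if total:
    effective = (misses * 0.435 + hits * 0.0435) / total
    print(f"effective input rate: ${effective:.3f}/M tokens")
```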
What V4 isn’t good at
No images in the main release. The V4 you’re calling through the API is a language model. There’s a separate vision variant in development, but V4-Pro and V4-Flash don’t see images.
Your IDE may be quietly capping context. Docs say 1M – your client may say otherwise. A bug report on the Cursor community forum (as of late April 2026) showed V4 context limited to 200K tokens when connected through Cursor, while tools like OpenCode pass through the full 1M. The ceiling is a property of the API; the floor depends on your client. If you’re relying on long context, verify the client isn’t truncating before blaming the model.
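A cheap way to verify: plant a marker near the top of a prompt that's longer than the suspected cap and ask for it back. The sketch below (reusing the client from the snippet earlier) runs that probe straight against the API as a baseline – the marker, the filler, and the sizes are all arbitrary – then run the same probe through your IDE or agent framework and compare.

```python
# Crude truncation probe. The filler is roughly 300K–600K tokens depending on
# the tokenizer – comfortably past a 200K cap, well under 1M.
marker = "NEEDLE-7f3a9c"  # any string the model couldn't guess
filler = ("lorem ipsum " * 50 + "\n") * 2_500

probe = (
    f"Remember this code and nothing else: {marker}\n\n"
    f"{filler}\n\n"
    "What was the code I asked you to remember?"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": probe}],
)
print(marker in response.choices[0].message.content)  # False suggests something upstream is trimming
```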
The frontier gap is real. If your task sits at the absolute edge of what AI can currently do – frontier-level coding agents, very long-horizon reasoning, anything where the best closed models are still struggling – V4 will feel like a slightly older model. Because it is, by DeepSeek’s own admission.
So who should actually switch?
High-volume, repetitive workloads – classification, extraction, RAG, summarization, batch coding – are the obvious first candidates. If your current bill has four-plus digits in front of the decimal, V4-Flash is probably your next experiment. Cost-per-call is low enough that you can A/B test against your current model in production with real traffic and just compare quality on your specific tasks.
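The mechanical part of that A/B is small – roughly the sketch below, where the prompt file, the incumbent client, and its model string are placeholders for whatever you run today; judge the saved pairs however you already judge quality.

```python
import json
import os

from openai import OpenAI

deepseek = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/v1",
)
incumbent = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # placeholder: your current provider

# Placeholder file: a sample of real production prompts, one string per entry.
prompts = json.load(open("sampled_production_prompts.json"))

rows = []
for p in prompts:
    a = deepseek.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[{"role": "user", "content": p}],
    )
    b = incumbent.chat.completions.create(
        model="your-current-model",  # placeholder
        messages=[{"role": "user", "content": p}],
    )
    rows.append({
        "prompt": p,
        "deepseek": a.choices[0].message.content,
        "incumbent": b.choices[0].message.content,
    })

json.dump(rows, open("ab_results.json", "w"), indent=2)
```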
Agents with thinking-mode reasoning? Fine – but budget a day for the reasoning_content plumbing, or use the Anthropic-compatible endpoint and skip the headache.
If you’re a heavy ChatGPT user who just wants a free chat interface with a 1M context model that’s solid at reasoning, chat.deepseek.com is the answer. No subscription, no waitlist.
FAQ
Is DeepSeek V4 actually open-source?
Yes – weights are MIT-licensed on Hugging Face. Fine-tune and self-host freely. The training data and the chips used to train it are not public.
Should I migrate from deepseek-chat right now or wait?
Start now. The legacy aliases currently route to V4-Flash anyway, so behavior has already changed under your code since late April. If your tests have been stable, you’ve effectively already migrated to V4-Flash non-thinking – you just need to update the model string before July 24 to avoid the hard cutoff. Three months sounds like plenty of time until you remember holidays, on-calls, and the one teammate who owns that codepath being on PTO the week it matters.
Is V4-Pro worth it over V4-Flash?
For most production workloads, start with Flash. Pro earns its keep on complex multi-step coding, long agentic loops with hard reasoning, and synthesis across very long documents – tasks where you can measurably feel the depth difference. The honest answer on Flash vs. Pro performance is that DeepSeek hasn’t published a direct head-to-head breakdown with controlled thinking budgets, so the best test is your own workload. Run it on both, compare outputs, let your specific task decide.
Next step: grab a free DeepSeek API key, copy the snippet above, and run your single most expensive prompt from last month through V4-Flash. That one test will tell you more than any benchmark table.