
DeepSeek V4: What Just Dropped and How to Use It

China's DeepSeek just released V4 with 1.6T parameters, 1M context, and $0.14/M pricing. Here's the real story behind the hype, how to get started today, and the gotchas nobody's talking about.

9 min read · Beginner

DeepSeek dropped V4 yesterday. Same day OpenAI shipped GPT-5.5. That timing wasn’t accidental.

If you’re reading this, you probably saw the headlines: “1.6 trillion parameters,” “beats every open model,” “$0.14 per million tokens.” All true. Also not the full story.

Here’s what actually matters: V4 gives you a 1-million-token context window at a fraction of the cost of Claude or GPT. You can paste an entire codebase, a 300-page PDF, or a month of chat logs into a single API call. But there’s a catch nobody’s putting in the headlines.

Why This Release Happened Now

DeepSeek launched preview versions of V4-Pro and V4-Flash on Friday – the same Friday OpenAI released GPT-5.5. That's no coincidence: DeepSeek needed a launch window where "open-source 1M-context MoE at a fraction of the cost" would not be buried under a closed-source price hike.

This is DeepSeek’s first major model since R1 rattled Silicon Valley in January 2025. R1 triggered a $1 trillion tech stock selloff because it proved you didn’t need OpenAI’s budget to build frontier-class AI. V4 is the sequel.

One thing stood out when I tested the API this morning: the rate limits aren’t what the docs say they are.

What You’re Actually Getting

Two models: DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) – both supporting a context length of one million tokens. Both use Mixture-of-Experts (MoE), which means only a fraction of the model activates for each request. That’s how they keep inference cheap.

The architecture upgrade is real. In the 1M-token context setting, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2. Translation: processing a million tokens takes 90% less KV-cache memory, and roughly a quarter of the compute, compared with three months ago.

DeepSeek-V4-Pro beats all rival open models for maths and coding, and trails only Google’s Gemini 3.1-Pro, a closed model, for world knowledge, according to the company’s announcement. Independent benchmarks will confirm or contradict that in the coming weeks.

Flash vs. Pro: Which One You Need

Flash is the default. It’s faster, cheaper, and handles 90% of what people actually do with LLMs: drafting emails, code generation, summarization, chat. The “flash” model has similar reasoning abilities to the “pro” version, while offering faster response times and “highly cost-effective” usage pricing.

Pro is for repository-scale code analysis, multi-file refactoring, complex reasoning chains. If you’re not sure which you need, start with Flash.

Getting Started: API Access in 3 Minutes

The web chat is live at chat.deepseek.com. You can test V4 there for free, no credit card. But if you want repeatable results or you’re building something, you need the API.

  1. Sign up at platform.deepseek.com
  2. Top up at least $2 (minimum balance requirement)
  3. Generate an API key from the dashboard
  4. Export it: export DEEPSEEK_API_KEY="your-key-here"

Test with curl:

curl https://api.deepseek.com/v1/chat/completions \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": "Explain sparse attention in one sentence."}],
    "temperature": 0.2
  }'

If that works, you're in. The API is OpenAI-compatible, so if you've used the OpenAI Python SDK before, swap the base URL and you're done:

from openai import OpenAI

client = OpenAI(
 api_key="your-deepseek-key",
 base_url="https://api.deepseek.com/v1"
)

response = client.chat.completions.create(
 model="deepseek-v4-flash",
 messages=[{"role": "user", "content": "Fix this Python function..."}],
 temperature=0.2
)

print(response.choices[0].message.content)

Don't copy sampling defaults from GPT-5.5 or Claude: DeepSeek recommends temperature=1.0, top_p=1.0, because the model was tuned with those settings. Dialing temperature down to 0.2 in the coding examples here is a deliberate override for repeatable output.

The Pricing Trap Nobody Mentions

Every tutorial shows you the rate card. V4-Flash: $0.14 / M input (cache miss), $0.028 / M input (cache hit), $0.28 / M output. V4-Pro: $1.74 / M input (cache miss), $0.145 / M input (cache hit), $3.48 / M output. Great. But two things will surprise you.
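Before the surprises, a quick sanity check on what those numbers mean per request. A minimal sketch: the rates come straight from the card above, and the helper function is purely illustrative.

# Per-request cost from the published rate card (USD per million tokens).
RATES = {
    "deepseek-v4-flash": {"miss": 0.14, "hit": 0.028, "out": 0.28},
    "deepseek-v4-pro": {"miss": 1.74, "hit": 0.145, "out": 3.48},
}

def request_cost(model, miss_tokens, hit_tokens, output_tokens):
    r = RATES[model]
    return (miss_tokens * r["miss"] + hit_tokens * r["hit"] + output_tokens * r["out"]) / 1_000_000

# A 100K-token prompt with no cache hits and a 2K-token answer:
print(request_cost("deepseek-v4-flash", 100_000, 0, 2_000))  # ~$0.0146
print(request_cost("deepseek-v4-pro", 100_000, 0, 2_000))    # ~$0.1810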

First: the 1M context is for input, not output. Most answers fit in 2,000 output tokens. If you expect to generate a 50K-token response, you're capped by max_tokens, not by the context window.

Second: thinking modes change your token burn rate. V4 supports three reasoning modes: non-thinking, thinking, and thinking_max. The per-token prices are identical across modes – the model ID sets the rate – but thinking modes burn more tokens at that rate because the model writes reasoning traces before answering.

You pay per token. If the model thinks for 10K tokens before answering, you pay for 10K tokens. Check reasoning_tokens in the usage object to see the real cost.
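A minimal sketch of that check, reusing the client from the quick-start snippet. reasoning_tokens is the field named above; treat the exact field name as something to verify against your own responses.

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Prove that 2^n > n^2 for all n >= 5."}],
)

usage = response.usage
print("prompt tokens:    ", usage.prompt_tokens)
print("completion tokens:", usage.completion_tokens)
# Hidden reasoning traces are billed like any other output token.
print("reasoning tokens: ", getattr(usage, "reasoning_tokens", "not reported"))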

Cache Hits Are Your Best Friend

Cache-hit discount: roughly 80% off Flash, 92% off Pro on repeated prefixes. Caching is the single biggest cost lever on DeepSeek V4. The pattern is simple: anything that repeats across calls, especially long system prompts, agent tool schemas, and RAG context, gets billed at a fraction of the full input rate on the second and subsequent calls.

If you're running 100 queries with the same 20K-token system prompt, the first call bills that prompt at the cache-miss rate – about $0.0028 on Flash. Calls 2-100 bill it at the cache-hit rate, roughly $0.0006 each. Across the batch that's about $0.28 versus $0.06 for the same prompt: a 5x difference on identical workload.
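In code, the pattern is just "keep the expensive prefix byte-identical across calls." A sketch, assuming the cache keys off the repeated message prefix; the cache-accounting field names here are guesses, so check what your responses actually contain.

SYSTEM_PROMPT = open("system_prompt.txt").read()  # the same ~20K-token prefix, reused verbatim

def ask(client, question):
    resp = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[
            # Static content first, identical on every call, so it can be served from cache.
            {"role": "system", "content": SYSTEM_PROMPT},
            # Only the short user question changes between calls.
            {"role": "user", "content": question},
        ],
    )
    u = resp.usage
    # Assumed field names for cache accounting – verify against a real response.
    print("cache hit:", getattr(u, "prompt_cache_hit_tokens", "?"),
          "| cache miss:", getattr(u, "prompt_cache_miss_tokens", "?"))
    return resp.choices[0].message.content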

Three Gotchas That Will Break Your App

I ran V4-Flash through a batch of 50 coding tasks this morning. Three things broke that didn’t break on V3.

1. Rate Limits Are Dynamic and Invisible

"DeepSeek API does NOT constrain user's rate limit. We will try our best to serve every request." That's what the docs say. But the FAQ tells a different story: "The rate limit exposed on each account is adjusted dynamically according to our real-time traffic pressure and each account's short-term historical usage."

In practice: burst 10 requests in 2 seconds and you'll hit 429s. Space them over 10 seconds and you're fine. There's no published RPM cap – DeepSeek deliberately avoids a fixed limit and tries to serve every request – but if the service is under heavy load or a single account makes an unusually high number of calls, the system may flag it and throttle or reject some requests.

Solution: Add exponential backoff with jitter. Don’t retry 4xx errors – those are your fault. Retry 429 and 5xx with a delay that grows: 1s, 2s, 4s, 8s.
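Here's a sketch of that retry policy with the OpenAI SDK pointed at DeepSeek. The error classes are the SDK's own; the delays are just the ones suggested above.

import random
import time

from openai import APIStatusError, OpenAI

client = OpenAI(api_key="your-deepseek-key", base_url="https://api.deepseek.com/v1")

def chat_with_backoff(messages, model="deepseek-v4-flash", max_retries=5):
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except APIStatusError as e:
            # Retry 429s and 5xx; any other 4xx is a bug on our side, so re-raise.
            if e.status_code != 429 and e.status_code < 500:
                raise
            time.sleep(delay + random.uniform(0, delay))  # backoff with jitter
            delay *= 2  # 1s, 2s, 4s, 8s...
    raise RuntimeError("gave up after repeated 429/5xx responses")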

2. Your Old Model IDs Are Already on V4

If you're using deepseek-chat or deepseek-reasoner in production, you're already on V4: both aliases currently route to deepseek-v4-flash (non-thinking and thinking modes respectively) and will be fully retired and inaccessible after Jul 24, 2026, 15:59 UTC.

Check your usage dashboard. If token counts spiked on April 24, that’s why. Migrate to explicit deepseek-v4-flash or deepseek-v4-pro model IDs before July.

3. Think Max Mode Needs Huge Context Allocation

For the Think Max reasoning mode, we recommend setting the context window to at least 384K tokens. If you allocate 128K (the V3 default), Think Max will either fail silently or truncate reasoning chains.

This isn’t in the quick-start guides. It’s buried in the model card.
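How you "set the context window" depends on where the model runs. If you're self-hosting the open weights, it's the serving engine's max sequence length. A sketch with vLLM: the max_model_len knob is real, while the Hugging Face repo ID and GPU count are placeholders, so check the model card.

from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # hypothetical repo ID – use the one on the model card
    max_model_len=393_216,                  # >= 384K tokens so Think Max traces aren't truncated
    tensor_parallel_size=8,                 # adjust to your GPU count
)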

What People Are Saying (and Missing)

The community reaction split into two camps. One side: “This is unbelievable value.” The other: “Where’s the multimodal support?”

The models can only process text for now, with DeepSeek stating that it was “working on incorporating multimodal capabilities.” Users are still unsure whether the new model can generate images and videos like ChatGPT, or whether it simply supports multimodal input. As of April 24, V4 is text-only. Image/video generation is “coming.”

The coding benchmarks matter more. According to Counterpoint’s principal AI analyst, Wei Sun, V4’s benchmark profile suggests it could offer “excellent agent capability at significantly lower cost.” If you’re using Claude for agentic coding tasks, V4-Pro might cut your bill by 60% with comparable output quality.

The honest answer: we won't know until independent evals land. DeepSeek's self-reported numbers are promising. Wait for SWE-bench runs from people who aren't DeepSeek employees.

When to Use V4 (and When Not To)

Use V4 if…

  - You're processing full codebases (repository-scale analysis)
  - You need 100K+ token context regularly
  - Cost per token matters more than brand reputation
  - You want open weights for local deployment

Skip V4 if…

  - You need image/video generation (not supported yet)
  - You require guaranteed sub-second latency (rate limits vary)
  - Your compliance team bans Chinese AI providers
  - You need 24/7 SLA guarantees (best-effort API)

Its late arrival has a reason: DeepSeek migrated V4's training framework from NVIDIA to Huawei Ascend, and the model was trained entirely on Huawei chips. Some jurisdictions have restrictions on DeepSeek's API. Self-hosting with open weights sidesteps that, but you'll need serious GPU firepower.

What to Do Next

Don’t just read about V4. Test it on your actual workload.

Here’s the smallest useful experiment: Take a task you currently send to GPT-4 or Claude. Could be code review, document summarization, data extraction. Run it through V4-Flash with temperature=0.2. Log the token count, response time, and output quality. Compare the cost.
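A minimal version of that experiment, reusing the client from the quick-start snippet. sample_task.txt is a placeholder for whatever prompt you currently send elsewhere, and the cost math uses the Flash cache-miss rates from the pricing section, assuming no cache hits.

import time

task = open("sample_task.txt").read()  # whatever you currently send to GPT-4 or Claude

start = time.time()
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": task}],
    temperature=0.2,
)
elapsed = time.time() - start

u = response.usage
# Flash rates: $0.14/M input (cache miss), $0.28/M output.
cost = (u.prompt_tokens * 0.14 + u.completion_tokens * 0.28) / 1_000_000

print(f"latency {elapsed:.1f}s | {u.total_tokens} tokens | est. cost ${cost:.4f}")
print(response.choices[0].message.content)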

If quality is close and cost is 10x lower, you’ve found a use case. If quality drops, you know where V4’s ceiling is. Either way, you’ll have data instead of benchmarks.

The API key takes 3 minutes. The test takes 10. Do it before your next standup.

FAQ

Is DeepSeek V4 better than GPT-4 or Claude Opus?

Depends on the task. The “pro” version’s performance falls only “marginally short” of OpenAI’s GPT‑5.4 and Gemini 3.1-Pro, “suggesting a developmental trajectory that trails state-of-the-art frontier models by approximately 3 to 6 months.” For coding and math, V4-Pro is competitive. For general knowledge and nuanced writing, Claude Opus 4.6 still leads. For cost-per-token, V4 wins by a wide margin.

Can I run V4 locally on my own hardware?

Yes. The repository and model weights are licensed under the MIT License – fully open weights, continuing DeepSeek's tradition. The weights are on Hugging Face. You'll need dual RTX 4090s or a single RTX 5090 for V4-Flash. V4-Pro needs a serious cluster. Community reports suggest 4x H100 gets you 50-150 tokens/sec.

What’s the real difference between Flash and Pro?

DeepSeek-V4-Flash-Max achieves comparable reasoning performance to the Pro version when given a larger thinking budget, though its smaller parameter scale naturally places it slightly behind on pure knowledge tasks and the most complex agentic workflows. Flash is 13B active params, Pro is 49B. For simple tasks (chat, basic coding), Flash matches Pro. For complex multi-step reasoning and huge context, Pro pulls ahead. Start with Flash; upgrade only if you measure a quality gap on your specific use case.