DeepSeek V4 Pro Discount Goes Permanent: What It Means

DeepSeek V4 Pro's 75% promo just became the new permanent price. Here's how to actually use it, what changes June 1, and the gotchas nobody mentions.

8 min read · Beginner

The DeepSeek V4 Pro “75% off” banner you’ve been seeing since April? It’s not a promotion anymore. DeepSeek’s pricing page now confirms that when the discount expires on May 31, 2026 at 15:59 UTC, the model’s official price gets adjusted to exactly 1/4 of the original – which is the same number. The “discount” just becomes the price. That’s a quiet but real shift, and the Hacker News thread generated hundreds of comments within hours of the announcement, mostly debating whether this changes the build-vs-buy math for small teams.

So here’s the practical question: does the DeepSeek V4 Pro permanent price actually matter for what you’re building? If you’ve been waiting to commit to a stack because you didn’t trust the promo would last – yes, you can stop waiting. If you’ve already been using V4 Pro, the cost line in your June invoice will look identical to May’s. The interesting stuff is what changes around the headline number, and that’s what this post covers.

What “permanent” actually means on the pricing page

Read carefully. The official docs phrase the change as: the V4 Pro API price will be “officially adjusted to 1/4 of the original price after the 75% discount promotion ends.” That’s not a renewal. It’s a reclassification. The “original” price ($1.74/M input, $3.48/M output) effectively becomes a phantom – it remains in the docs as the reference point, but nothing bills at it anymore.

Here’s the current rate card you should plan against (as of May 2026):

| Token bucket | Price per 1M tokens (USD) |
| --- | --- |
| Input, cache miss | $0.435 |
| Input, cache hit | $0.003625 |
| Output | $0.87 |

The cache-hit number is the one to stare at. It dropped to 1/10th of the launch cache-hit price on April 26, 2026 – and that change applies across every DeepSeek model, not just V4 Pro. A repeated system prompt now costs roughly 120× less than the same tokens sent fresh. For agent loops and RAG pipelines, that’s not a footnote – it’s the whole reason to be here.
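If you want to check those numbers yourself, here’s a quick sketch that rederives the rate card from the “original” prices quoted above. The variable names are mine, and the figures are the ones in this post, so re-check them against the pricing page before you plan a budget around them.

```python
# Prices in USD per 1M tokens, as quoted in this post (not fetched live).
ORIGINAL_INPUT = 1.74
ORIGINAL_OUTPUT = 3.48
CACHE_HIT = 0.003625

# "Adjusted to 1/4 of the original price"
input_miss = ORIGINAL_INPUT / 4
output = ORIGINAL_OUTPUT / 4

# How much cheaper a cached input token is than a fresh one
cache_ratio = input_miss / CACHE_HIT

print(f"input (cache miss): ${input_miss:.3f} per 1M")
print(f"output:             ${output:.2f} per 1M")
print(f"cache miss vs hit:  {cache_ratio:.0f}x")
```

The ratio lands at 120, which is where the “roughly 120×” figure comes from.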

Why the cache-hit math changes how you’d use DeepSeek V4 Pro

The naive way to read the new prices: “$0.435 per million tokens is cheap.” The smart way: if your prefix is stable, those same tokens cost $0.003625 per million, which is a little over a third of a cent.

Think of it like a tollbooth that charges full fare the first time your car passes, then waves you through for a tiny fraction of the fare if it recognizes your license plate within a few minutes. Most production workloads – support copilots, code agents, doc analyzers – pass the same plate over and over. They were already paying full fare for that pattern; now most of that spend goes away.

Pro tip: Put your largest stable content (system prompt, tool definitions, retrieval prefix) at the start of every request, byte-identical across calls. The cache keys on a prefix match. Append the user’s new message at the end. Swap the order and you defeat the cache without any error telling you that you did.
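Here’s a minimal sketch of that ordering rule. The helper names (build_messages, prefix_fingerprint) and the sample prompts are hypothetical, and the hash is only a local way to spot “did my prefix change between calls?”. It is not how DeepSeek keys its cache internally.

```python
import hashlib
import json


def build_messages(stable_system_prompt, user_message):
    """Stable content goes first; the part that changes goes last."""
    return [
        {"role": "system", "content": stable_system_prompt},
        {"role": "user", "content": user_message},
    ]


def prefix_fingerprint(messages):
    """Hash everything except the final message (a local proxy only)."""
    prefix = json.dumps(messages[:-1], sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


SYSTEM = "You are a contract-review assistant. Follow the firm's style guide."

call_a = build_messages(SYSTEM, "Summarize clause 4.")
call_b = build_messages(SYSTEM, "List the termination conditions.")
# Anti-pattern: a timestamp in the system prompt changes the prefix every call.
call_c = build_messages(
    SYSTEM + " Current time: 2026-05-20T10:00:00Z", "Summarize clause 4."
)

print(prefix_fingerprint(call_a) == prefix_fingerprint(call_b))  # True
print(prefix_fingerprint(call_a) == prefix_fingerprint(call_c))  # False
```

Logging a fingerprint like this next to each request is a cheap way to catch a prefix that drifts, since the API itself won’t tell you.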

The fastest way to actually call it

The API is OpenAI-compatible, so if you already have an OpenAI client wired up, you’re only changing the API key, the base URL, and the model name. Grab a key from platform.deepseek.com, drop in $2, and:

from openai import OpenAI

# DeepSeek's API is OpenAI-compatible: point the client at its base URL.
client = OpenAI(
    api_key="sk-...",
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize this contract..."},
    ],
    temperature=1.0,  # sampling values the model was tuned with
    top_p=1.0,
    extra_body={"thinking": {"type": "enabled"}},  # turn reasoning on
    reasoning_effort="high",
)

print(response.choices[0].message.content)

Two things in that snippet actually matter. First, temperature=1.0, top_p=1.0 – those are the values DeepSeek tuned the model with, per the model card on Together AI. Paste in the lower values that a lot of sample code carries around (0.7, 0.95) and you’ll get outputs that degrade – weaker reasoning, less coherent structure – and you’ll probably blame the model instead of the parameters.

Second, the thinking block. V4 Pro has three reasoning modes – non-think, think-high, and think-max. Each step up burns more output tokens on reasoning before answering. Think-max can write tens of thousands of reasoning tokens before producing a single answer token, and you pay for all of them at the output rate. Use it deliberately.
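To put a number on “use it deliberately,” here’s a back-of-the-envelope sketch. The token counts are made up for illustration (30,000 reasoning tokens and a 500-token answer), and the only real input is the $0.87 per 1M output rate from the rate card.

```python
OUTPUT_PRICE_PER_M = 0.87  # USD per 1M output tokens, from the rate card


def output_cost(reasoning_tokens, answer_tokens, price_per_m=OUTPUT_PRICE_PER_M):
    """Reasoning tokens and answer tokens are both billed as output."""
    return (reasoning_tokens + answer_tokens) * price_per_m / 1_000_000


# Hypothetical request sizes, for illustration only.
cost_non_think = output_cost(reasoning_tokens=0, answer_tokens=500)
cost_think_max = output_cost(reasoning_tokens=30_000, answer_tokens=500)

print(f"non-think: ${cost_non_think:.6f} per request")
print(f"think-max: ${cost_think_max:.6f} per request")
print(f"ratio:     {cost_think_max / cost_non_think:.0f}x")
```

Under those assumptions the same 500-token answer costs about 61 times more with heavy reasoning. That’s still only a few cents per call, but it adds up fast inside an agent loop.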

Using it through Claude Code (the part most guides bury)

Routing V4 Pro through Claude Code, OpenCode, or any Anthropic-compatible agent? There’s a syntax detail that bites people. The model name must be deepseek-v4-pro[1m] – with the bracket suffix – to enable the full 1M context window. Without it you silently get a smaller window. No error, no warning, just truncation when your codebase gets too big.

# Quote the values so the shell never treats [1m] as a glob pattern.
export ANTHROPIC_BASE_URL="https://api.deepseek.com/anthropic"
export ANTHROPIC_AUTH_TOKEN="sk-..."
export ANTHROPIC_MODEL="deepseek-v4-pro[1m]"
export ANTHROPIC_DEFAULT_OPUS_MODEL="deepseek-v4-pro[1m]"
export ANTHROPIC_DEFAULT_SONNET_MODEL="deepseek-v4-pro[1m]"

The official integration guide spells this out, but it’s easy to miss because the bracket looks like prose, not syntax.

Common pitfalls that hit new users

The headline number isn’t where most people overspend. These are.

  • Confusing input context with output ceiling. 1M context is for what you send in. Output is capped at 384K tokens, and max_tokens decides how much of that cap a given request can use. Ask for a 50K-word essay with only 2K output tokens allocated and the response gets cut off at 2K; the large context window does nothing to raise that limit.
  • Cache misses from sneaky prefix changes. Add a timestamp to your system prompt for “freshness” and you’ve just guaranteed zero cache hits. Same with dynamic user names, session IDs, or rotating instructions at the top of the prompt. Keep variable content out of the prefix.
  • HTTP 429 under load with no escape hatch. DeepSeek’s rate limiting is dynamic – it scales with server load and short-term account behavior. There’s no paid tier to buy your way past it, and the rate limit docs only specify a per-user_id concurrency of 500 for V4 Pro. Build retries with exponential backoff from day one, not after the first incident.
  • Using deepseek-chat or deepseek-reasoner in new code. Both names still work but are scheduled for full retirement on July 24, 2026, 15:59 UTC. They route to V4 Flash today. Anything you ship now using them will need a code change before that date.
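For the 429 pitfall, a retry wrapper along these lines is a reasonable starting point. It’s a generic sketch: RateLimited is a hypothetical stand-in for whatever your client raises on HTTP 429 (with the OpenAI Python SDK that would be openai.RateLimitError), and the delay values are arbitrary defaults, not anything DeepSeek recommends.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for your client's HTTP 429 exception."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(); on RateLimited, wait 1s, 2s, 4s, ... (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimited:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps many workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

In practice you’d pass a zero-argument function that performs the API call and translate your SDK’s rate-limit exception into the except clause. The sleep parameter exists so the wrapper can be tested without actually waiting.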

When V4 Pro isn’t the right pick (even at this price)

Here’s where my recommendation gets uncomfortable: most teams reading this should default to V4 Flash, not V4 Pro.

Flash is cheaper – check current pricing at the DeepSeek docs for the exact ratio, as it shifts. For classification, extraction, templated generation, simple chat, and retrieval where the model is mostly summarizing what you already fetched, Flash should get you comparable quality at a lower cost. The Pro upgrade earns its premium on multi-step reasoning, agentic coding loops, and long-context analysis where the model actually needs to think across hundreds of thousands of tokens. V4 Pro’s technical report shows it hits 93.5% on LiveCodeBench and 80.6% on SWE-Bench Verified – those are the workloads it was built for, not “summarize this email.”

Against the closed frontier models: at $0.87/M output, V4 Pro undercuts Claude Opus 4 and comparable closed models by a substantial margin on per-token cost. The real question isn’t price anymore – it’s whether DeepSeek’s dynamic rate limits and occasional regional latency fit your traffic shape. Some teams route through third-party providers for more predictable throughput; whether that’s worth the added infrastructure layer depends on your SLA requirements.

And one quiet question worth sitting with: if a frontier-tier open-weight model now costs under 90 cents per million output tokens permanently, what does that imply about the closed-source pricing floor twelve months from now?

FAQ

Do I need to change anything in my code on June 1?

No. The model name stays the same, the endpoint stays the same, and the billed rate stays the same. The only thing that changes is the label in DeepSeek’s docs – “promotional” becomes “standard.”

If I have $50/month to spend, what’s a realistic workload I can run?

At the new V4 Pro rates with reasonable caching, a month of 50 million cached input tokens, 2 million fresh input tokens, and 5 million output tokens comes to about $5.40 ($0.18 + $0.87 + $4.35), so $50 covers roughly nine times that volume. Even the smaller mix is enough to run a daily code-review agent over a medium repo, or a small support copilot fielding a few thousand queries with a long shared system prompt. If you switch to V4 Flash for the same budget, you get substantially more headroom – Flash is where solo developers and small teams usually land.
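To check any token mix against a budget, a small calculator like this does the job. The function name is mine, the rates are the ones from the rate card in this post, and the 50M / 2M / 5M mix is just an illustrative workload.

```python
# USD per 1M tokens, from the rate card in this post.
RATES = {"cache_hit": 0.003625, "cache_miss": 0.435, "output": 0.87}


def monthly_cost(cached_m, fresh_m, output_m, rates=RATES):
    """Cost in USD for token volumes given in millions of tokens."""
    return (
        cached_m * rates["cache_hit"]
        + fresh_m * rates["cache_miss"]
        + output_m * rates["output"]
    )


mix_cost = monthly_cost(cached_m=50, fresh_m=2, output_m=5)
budget_multiple = 50 / mix_cost

print(f"50M cached + 2M fresh + 5M output: ${mix_cost:.2f}")
print(f"A $50 budget covers about {budget_multiple:.1f}x that mix")
```

Output tokens dominate the bill here. The 50 million cached input tokens account for about 18 cents of the total.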

Is V4 Pro actually as good as Claude Opus 4 or other frontier models?

It depends on the task, so test on your own data rather than relying on benchmark scores. On documented benchmarks it’s genuinely competitive – a Codeforces rating of 3206 puts it at Legendary Grandmaster level, which is a strange sentence to write but it’s real. Anecdotally, it’s strong on code and math, slightly behind Claude on nuanced English writing, and roughly even on agentic tool use. The community consensus is something like “not a replacement for a frontier closed model in every scenario, but an obvious primary for cost-sensitive workloads.” Worth A/B testing on your top three use cases before committing.

Your next step: open platform.deepseek.com, generate a key, run the snippet above against V4 Flash first (cheaper to experiment with), then re-run the same prompt against V4 Pro and compare outputs side by side. If Flash matches Pro for your task, you just saved your budget.
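If you’d rather script that side-by-side run, here’s a rough sketch. compare_models is a hypothetical helper that works with any OpenAI-compatible client object, such as the one built in the snippet earlier. Note that the Flash model ID isn’t given in this post, so the "deepseek-v4-flash" string in the usage comment is an assumption to verify against the DeepSeek docs.

```python
def compare_models(client, prompt, models, **params):
    """Send the same prompt to each model; collect the text and output tokens."""
    results = {}
    for model in models:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            **params,
        )
        results[model] = {
            "text": response.choices[0].message.content,
            "output_tokens": response.usage.completion_tokens,
        }
    return results


# Usage (needs a real client and API key; the Flash ID is an assumption):
# results = compare_models(
#     client,
#     "Summarize this contract...",
#     ["deepseek-v4-flash", "deepseek-v4-pro"],
#     temperature=1.0,
#     top_p=1.0,
# )
# for model, result in results.items():
#     print(model, result["output_tokens"], result["text"][:200])
```

Comparing output token counts alongside the text matters because a model that reasons longer can cost more per request even when the answers look similar.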