DSpark Speculative Decoding: A Beginner’s Hands-On Guide

DSpark just dropped from DeepSeek. Here's how speculative decoding works, how to run the open-source DeepSpec code, and where it actually breaks.

Jamie Lin · 2026-06-27 · 7 min read · Beginner

By the end of this guide you’ll know how to clone DeepSpec, train a DSpark draft model against a target like Qwen3-4B, and run the eval suite to see whether you actually got the promised speedup. You’ll also know which step is going to eat 38 TB of your disk if you don’t reroute it first.

DSpark dropped from DeepSeek in mid-2025 and is being passed around AI Twitter as the new state-of-the-art for speculative decoding. The framework powers DeepSeek-V4 Flash and Pro in production (as of mid-2025), and the codebase that trains its draft models was open-sourced on the same day. This tutorial is the hands-on version – what it is, how to run it, and the parts the announcement posts skipped.

What DSpark actually does (the 2-minute version)

Speculative decoding has been around since the late 2010s. A small draft model proposes a few tokens, the big target model verifies them in one parallel pass, and accepted tokens come out as if the target had generated them itself – same distribution, no quality loss. The whole point is to spend a single forward pass producing multiple tokens instead of one.
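To make the draft-then-verify loop concrete, here is a toy sketch of the greedy variant. Everything in it is an assumption for illustration: the "models" are plain Python functions over integer tokens, and the names speculative_step, generate, draft_model and target_model are made up, not taken from DeepSpec. A real engine scores all draft positions in one batched target pass; this toy calls the target in a loop for readability.

```python
from typing import Callable, List, Tuple

# A toy "model" is any function that maps a token prefix to the next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it emits."""
    # 1. The cheap draft model proposes k tokens, one after another.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))

    # 2. The target's own choice at each of the k + 1 positions.
    #    (One batched forward pass in a real engine; a loop here.)
    target_choice = [target(prefix + drafted[:i]) for i in range(k + 1)]

    # 3. Accept the longest run of draft tokens that agrees with the target.
    accepted = 0
    while accepted < k and drafted[accepted] == target_choice[accepted]:
        accepted += 1

    # 4. Always add one target token: a correction, or a bonus if all matched.
    return drafted[:accepted] + [target_choice[accepted]]


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, n_tokens: int) -> Tuple[List[int], int]:
    """Generate n_tokens and report how many verify rounds that took."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < n_tokens:
        out += speculative_step(out, draft, target, k)
        rounds += 1
    return out[len(prefix):len(prefix) + n_tokens], rounds


def target_model(tokens: List[int]) -> int:
    """Toy target: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_model(tokens: List[int]) -> int:
    """Toy draft: agrees with the target except when the answer is 7."""
    nxt = (tokens[-1] + 1) % 10
    return 0 if nxt == 7 else nxt


tokens, rounds = generate([0], draft_model, target_model, k=4, n_tokens=12)
print(tokens)   # identical to what the target alone would produce
print(rounds)   # verify rounds needed for 12 tokens
```

In this toy run the 12 tokens come out in 3 verify rounds, and the output matches plain target decoding token for token. The one round where the draft guesses wrong still makes progress, because the target's correction is emitted in place of the rejected token.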

DSpark’s contribution is two specific tricks layered on top of that idea. According to the DSpark technical report, the first is a semi-autoregressive draft: instead of generating all draft tokens fully in parallel (which loses accuracy toward the end of the block) or fully serially (which is slow), it adds a lightweight serial module that models dependencies between tokens inside the same block. Second – and this is the production-engineering part – a confidence head scores each draft token’s likelihood of being accepted, paired with a hardware-aware scheduler that decides, based on current GPU load, how many of those tokens are even worth sending to verification.

Under low load, verify everything. Under high load, drop the tail tokens that are probably going to be rejected anyway. That’s the whole pitch.
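A rough sketch of that idea, under heavy assumptions: this is not DSpark's scheduler, and the function name, the linear threshold and the 0-to-1 load value are all invented for illustration. It only shows the shape of the decision, which is to keep a draft token while the chance that the whole prefix up to it survives verification stays above a bar that rises with load.

```python
from typing import List


def tokens_to_verify(confidences: List[float], load: float,
                     min_threshold: float = 0.0,
                     max_threshold: float = 0.8) -> int:
    """Return how many leading draft tokens to send to the target.

    confidences[i] is the estimated probability that draft token i is
    accepted; load runs from 0.0 (idle) to 1.0 (saturated). Both the
    scale and the linear rule below are hypothetical.
    """
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be between 0 and 1")

    # The bar for "worth verifying" rises linearly with load.
    threshold = min_threshold + load * (max_threshold - min_threshold)

    # Token i only counts if tokens 0..i-1 were accepted too, so track
    # the probability that the entire prefix survives.
    survive = 1.0
    keep = 0
    for confidence in confidences:
        survive *= confidence
        if survive < threshold:
            break
        keep += 1
    return keep


block = [0.95, 0.90, 0.70, 0.40, 0.20]
for load in (0.0, 0.5, 1.0):
    print(load, tokens_to_verify(block, load))
```

With these made-up confidences, an idle GPU verifies all 5 tokens, a half-loaded one verifies 3, and a saturated one verifies 2.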

The numbers, with the caveats attached

Three figures are getting passed around. They’re not the same thing and the difference matters.

Metric	Value	What it actually means
Acceptance length vs Eagle3	+26.7% to +30.9%	On Qwen3 4B/8B/14B (as of mid-2025 benchmarks), draft tokens get accepted further into each block
Acceptance length vs DFlash	+16.3% to +18.4%	Same benchmark, comparing to the parallel-draft baseline
Generation speed vs MTP-1	+60% to +85%	Per-user wall clock, with system throughput held constant

That last row is where most write-ups go wrong. The 60-85% figure isn’t a single-user speedup on your laptop – it’s measured on the DeepSeek-V4 production serving stack against MTP-1 at fixed throughput. In a strict latency-constrained regime, DSpark avoids the throughput collapse that previous schemes hit. If you’re running a single chat in a Jupyter notebook, your speedup will look different. Possibly a lot different.

How to actually use it

Two paths. Pick based on whether you want the model or want to train your own draft.

Path A: Just use the DSpark-accelerated DeepSeek model

The simpler route. DeepSeek-V4-Pro-DSpark on Hugging Face (as of mid-2025) is not a new model – it’s the same checkpoint with a speculative decoding module bolted on, and a minimal inference example lives in the inference folder. The catch: DeepSeek-V4-Pro is 1.6T parameters with 49B activated, and DeepSeek-V4-Flash is 284B with 13B activated, both supporting 1M token context. There is no consumer-GPU version. Either you rent an H100 cluster or you don’t run this at all.
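For a sense of scale, here is a back-of-envelope estimate of how many GPUs it takes just to hold the weights. The parameter counts come from the paragraph above; the bytes per parameter (1 for an 8-bit format, 2 for a 16-bit one) and the 80 GB per GPU are assumptions, and the estimate ignores KV cache, activations and the draft module, so treat it as a floor.

```python
def min_gpus_for_weights(total_params: int, bytes_per_param: int,
                         gpu_mem_bytes: int = 80 * 10**9) -> int:
    """Smallest number of GPUs whose combined memory fits the weights alone."""
    weight_bytes = total_params * bytes_per_param
    # Integer ceiling division avoids any floating-point rounding.
    return (weight_bytes + gpu_mem_bytes - 1) // gpu_mem_bytes


MODELS = {
    "DeepSeek-V4-Pro": 1_600_000_000_000,   # 1.6T total parameters
    "DeepSeek-V4-Flash": 284_000_000_000,   # 284B total parameters
}

for name, params in MODELS.items():
    for bytes_per_param in (1, 2):          # assumed storage precisions
        gpus = min_gpus_for_weights(params, bytes_per_param)
        print(f"{name} at {bytes_per_param} byte(s)/param: at least {gpus} GPUs")
```

Note that it is the total parameter count that sets the memory bill, not the activated count: all experts have to be resident even though only a fraction run per token.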

Path B: Train your own DSpark draft model on a target you actually have

That’s the whole reason DeepSpec exists as a public repo. It contains data preparation, three draft model implementations (DSpark, DFlash, Eagle3), training code, and an eval harness, all MIT-licensed as of mid-2025. The three-stage pipeline:

Data prep – download prompts, regenerate the target model’s answers, build a target cache
Training – train the DSpark draft against the cached target outputs (train.sh spawns one worker per visible GPU)
Evaluation – eval.sh runs the draft checkpoint over benchmark tasks

Minimal eval command structure:

bash eval.sh \
 --target_name_or_path Qwen/Qwen3-4B \
 --draft_name_or_path ~/checkpoints/deepspec/dspark_block8_qwen3_4b/step_latest

Benchmark coverage in the repo (as of mid-2025): gsm8k, math500, aime25, humaneval, mbpp, livecodebench, mt-bench, alpaca, and arena-hard-v2 – math, code, and chat across the board.

The 38 TB problem nobody is mentioning

Read the README before you start downloading anything. The target cache step warns – right there in the DeepSpec README – that for the default Qwen/Qwen3-4B setting, it can produce roughly 38 TB of data. Thirty-eight terabytes. For a 4B target.

Before you run data prep: Override the default cache size. The 38 TB number is for full-coverage regeneration across all training prompts at full-precision logits. For most exploratory runs you want to subset the prompt list aggressively and cache only top-k logits – see scripts/data/README.md in the repo for the relevant flags.
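To see why trimming to top-k matters so much, here is the arithmetic as a small sketch. It assumes the cache stores one logit vector per token, as the paragraph above suggests. The token count and the k value are hypothetical, and the vocabulary size is my assumption for Qwen3, so confirm it against the model's config before trusting the totals. None of this reproduces DeepSpec's actual cache format.

```python
def cache_bytes(num_tokens: int, values_per_token: int,
                bytes_per_value: int) -> int:
    """Bytes needed to store values_per_token numbers for every cached token."""
    return num_tokens * values_per_token * bytes_per_value


VOCAB_SIZE = 151_936        # assumption: Qwen3 vocabulary size, check config.json
NUM_TOKENS = 100_000_000    # hypothetical total of regenerated answer tokens
TOP_K = 64                  # hypothetical top-k cutoff

# Full coverage: every logit for every token, stored as 4-byte floats.
full = cache_bytes(NUM_TOKENS, VOCAB_SIZE, 4)

# Top-k: each kept entry needs its value (4 bytes) plus its token id (4 bytes).
topk = cache_bytes(NUM_TOKENS, TOP_K, 4 + 4)

print(f"full logits: {full / 1e12:.1f} TB")
print(f"top-{TOP_K}:      {topk / 1e9:.1f} GB")
print(f"reduction:   {full // topk}x")
```

Under these assumptions the full cache lands in the tens of terabytes while the top-64 cache is tens of gigabytes, a reduction of roughly three orders of magnitude. Subsetting the prompt list shrinks NUM_TOKENS and multiplies the savings.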

This is the kind of detail that makes the difference between “I’ll try this tonight” and “my filesystem is on fire.”

Think of a draft model like a sous chef who pre-preps ingredients for the head chef to approve. A great sous chef speeds things up dramatically. A mediocre one creates more cleanup than they save. The target cache is the mise en place – and if you don’t scope it, you’re prepping for a restaurant you don’t have the kitchen for.

Other places this breaks

A bad draft model doesn’t just fail to help – it actively slows your target down. Turns out community-trained draft models for DeepSeek-V3 were hitting only 40-50% acceptance rates (as of mid-2025 community reports), and at that level, the verification overhead exceeds the gain from accepted tokens. You’ve spent GPU time to go slower. That’s not a theoretical risk; it showed up in real llama.cpp community testing.

So how do you know what acceptance rate is “good enough” for your hardware? Honestly – you don’t, until you measure. The paper’s hardware-aware scheduler exists precisely because the answer changes with GPU type, batch size, and load. There’s no universal threshold you can look up.
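Measuring is the real answer, but a simplified cost model helps you sanity-check what you measure. The sketch below uses the standard independent-acceptance model from the speculative decoding literature: each draft token is accepted with probability alpha, and a round of k draft tokens emits (1 - alpha^(k+1)) / (1 - alpha) tokens on average. The verify_cost parameter is my own addition to capture the case where verifying a block costs more than one ordinary decode step, and every number plugged in is made up.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Average tokens emitted per verify pass with k draft tokens.

    Assumes each draft token is accepted independently with probability
    alpha, and counts the one token the target always contributes.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speedup(alpha: float, k: int, draft_cost: float,
            verify_cost: float = 1.0) -> float:
    """Estimated speedup over plain decoding; below 1.0 means slower.

    draft_cost:  one draft step, relative to one plain target decode step.
    verify_cost: one verify pass over the block, relative to one plain
                 target decode step (1.0 if verification is "free").
    """
    round_cost = k * draft_cost + verify_cost
    return expected_tokens_per_round(alpha, k) / round_cost


# Hypothetical scenarios, not measurements.
print(f"{speedup(alpha=0.80, k=4, draft_cost=0.1):.2f}")
print(f"{speedup(alpha=0.45, k=4, draft_cost=0.1):.2f}")
print(f"{speedup(alpha=0.45, k=4, draft_cost=0.1, verify_cost=2.0):.2f}")
```

The third scenario is the one to watch: once verification stops being nearly free, which can happen under load or at large batch sizes, a mediocre acceptance rate tips the estimate below 1.0. The model is crude (real acceptance is not independent across positions), so use it to frame your measurements, not to replace them.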

DSpark vs the alternatives

The closest comparisons are Eagle3 (the previous autoregressive-draft SOTA), DFlash (parallel-draft baseline), and MTP-1 (multi-token prediction, which DeepSeek used in V3). DeepSpec ships Eagle3 and DFlash implementations alongside DSpark, so you can A/B them on your own target without rebuilding the eval harness from scratch.
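One low-effort way to set up that A/B is to generate one eval.sh invocation per draft checkpoint. The sketch below only builds and prints the commands; it does not run anything. It uses the two flags from the eval example earlier. The Eagle3 checkpoint path is hypothetical, and it is an assumption that eval.sh accepts an Eagle3 draft through the same flag, so check the repo's README before relying on it.

```python
import os
import shlex
from typing import List

TARGET = "Qwen/Qwen3-4B"

DRAFTS = {
    # Path from the eval example earlier in this guide.
    "dspark": "~/checkpoints/deepspec/dspark_block8_qwen3_4b/step_latest",
    # Hypothetical path: wherever your own Eagle3 training run wrote to.
    "eagle3": "~/checkpoints/deepspec/eagle3_qwen3_4b/step_latest",
}


def eval_command(target: str, draft_path: str) -> List[str]:
    """Build the argument list for one eval.sh run."""
    return [
        "bash", "eval.sh",
        "--target_name_or_path", target,
        # Expand "~" here: once the path is shell-quoted, the shell
        # would no longer expand it for us.
        "--draft_name_or_path", os.path.expanduser(draft_path),
    ]


for name, path in DRAFTS.items():
    print(f"# {name}")
    print(shlex.join(eval_command(TARGET, path)))
```

Keeping the target and benchmark set identical across runs is the whole point: the only thing that should change between the two commands is the draft checkpoint.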

Rough decision tree:

Your target is small (≤14B), single-user, latency-bound → DSpark or Eagle3, benchmark both
You’re serving a fleet under high concurrency → DSpark’s confidence scheduler is built for exactly this case
You want the simplest thing that works → MTP-1 is already integrated into many inference engines, no extra training required
You’re on consumer hardware running GGUF → none of the above directly applies; check the DeepSpec repo issues and discussions for llama.cpp-compatible draft options

FAQ

Does DSpark change the model’s outputs?

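No. Speculative decoding is lossless by construction: the target model checks every draft token, and the accept/reject rule is designed so that the final output follows the target’s own distribution. With greedy decoding, that means any token the target wouldn’t have produced itself gets rejected.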
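If you want to convince yourself of the "lossless" claim when sampling is involved, the standard rule from the speculative sampling literature can be checked exactly. The draft proposes token x from its distribution q, the target accepts it with probability min(1, p(x)/q(x)), and on rejection a replacement is drawn from the renormalised leftover max(p - q, 0). The sketch below computes the resulting output distribution with exact fractions. The two distributions are made up, and this illustrates the general rule rather than DeepSpec's implementation.

```python
from fractions import Fraction
from typing import Dict

Dist = Dict[str, Fraction]


def output_distribution(p: Dist, q: Dist) -> Dist:
    """Exact distribution of the token emitted by one accept/reject step.

    p is the target distribution, q the draft distribution, both over
    the same tokens.
    """
    # Probability of proposing x AND accepting it:
    # q[x] * min(1, p[x] / q[x]) simplifies to min(q[x], p[x]).
    accept_mass = {x: min(q[x], p[x]) for x in p}
    reject_prob = 1 - sum(accept_mass.values())

    # On rejection, resample from the leftover mass max(p - q, 0).
    residual = {x: max(p[x] - q[x], Fraction(0)) for x in p}
    total = sum(residual.values())

    out = {}
    for x in p:
        resampled = reject_prob * residual[x] / total if total else Fraction(0)
        out[x] = accept_mass[x] + resampled
    return out


target = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}
draft = {"a": Fraction(1, 5), "b": Fraction(1, 5), "c": Fraction(3, 5)}

print(output_distribution(target, draft) == target)
```

The draft here is badly miscalibrated (it puts most of its mass on the token the target likes least), and the emitted distribution still equals the target's exactly. A bad draft costs you speed, not correctness.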

Can I run DSpark on top of a non-DeepSeek model like Llama or Mistral?

Yes – that’s what DeepSpec is built for. The framework treats the target as a parameter (target_name_or_path) and you train a draft against it. The released DSpark checkpoints are tied to Qwen3 and DeepSeek-V4 targets, but the training pipeline is target-agnostic. Budget carefully for the data prep step (the 38 TB ceiling applies here too) and confirm your target’s tokenizer plays nicely with the draft architecture before committing GPU hours to a full training run.

How does this compare to Medusa or Eagle?

Medusa bolts extra prediction heads directly onto the target model – no separate draft model at all. Eagle goes the other direction: a separate small autoregressive model. DSpark is closest to Eagle but adds the semi-autoregressive block structure (to recover accuracy at later draft positions) and the confidence-scheduled verification layer. That’s where the acceptance-length gains over Eagle3 come from on the Qwen3 benchmarks. Worth noting: Eagle3 is also in the DeepSpec repo, so you can run both on your own target and see which one wins for your use case – rather than taking benchmark numbers at face value.

Next: clone github.com/deepseek-ai/DeepSpec, open scripts/data/README.md, and figure out your cache subset flags before you start downloading.