Skip to content

DeepSeek’s Open-Source Inference Kernels: A Practical Guide

DeepSeek open-sourced its inference optimizations powering DeepSeek-R1. Here's how FlashMLA, DeepEP and DeepGEMM actually speed up generation in your stack.

7 min readBeginner

You’re serving DeepSeek-R1 on rented H100s and watching the token counter. Not bad, but not where you want it. Then you flip three switches – a different attention kernel, a different GEMM library, and a draft model for speculative decoding – and the same hardware starts moving noticeably faster without touching the model weights. No new GPUs. No model changes. Just different code paths talking to the same silicon.

That win comes from DeepSeek’s open-source inference optimizations, the kernels that power their own production stack. They released them on GitHub during Open Source Week (Feb 24-28, 2025) and the community has been stitching them into SGLang and TensorRT-LLM ever since. This guide works backwards from the speedup: what to enable, why it works, and where it quietly breaks.

The bottleneck nobody warns you about

Most people self-hosting DeepSeek-V3 or R1 start with vLLM because that’s the default for every other open model. Baseten’s DeepSeek guide is blunt about this: vLLM has limited support for DeepSeek’s full feature set (like max context length, expert parallelism, and tool calling) and lower performance, so we don’t recommend using vLLM to serve DeepSeek in production.

The reason is structural. DeepSeek uses Multi-head Latent Attention (MLA) and a 671B-parameter Mixture-of-Experts with 37B active per token. Generic inference engines treat MLA like regular attention and route MoE tokens with naive all-to-all collectives. Both choices leave roughly half the GPU on the table.

What DeepSeek actually open-sourced

Five repos dropped over five days in late February 2025. Each one targets a specific bottleneck in the generation loop. The official open-infra-index is the source of truth – everything else is a re-skin of it.

Repo What it fixes Headline number
FlashMLA MLA decoding kernel for Hopper GPUs 3000 GB/s memory-bound, 580 TFLOPS compute-bound on H800
DeepEP All-to-all comms for MoE experts Low-latency decoding kernels, native FP8 dispatch
DeepGEMM FP8 matmul for dense + MoE 1350+ TFLOPS (1550 after April 2025 update on H800)
DualPipe Pipeline parallelism with compute/comm overlap Reduces pipeline bubbles in training and prefill
3FS Distributed file system for KV cache offload Targets large-scale KV cache across nodes

FlashMLA went viral almost immediately – it garnered over 5,000 stars within just six hours of its release. The community pulse on r/LocalLLaMA was a mix of “finally” and “wait, this only runs on Hopper?” Both reactions are correct.

Here’s the thing worth sitting with for a moment: five production-grade kernel libraries, released in one week, covering attention, expert routing, matrix multiplication, pipeline scheduling, and distributed storage. That’s not a blog post – that’s an entire inference stack. Whether the community can absorb all of it, or ends up only using two of the five repos in practice, is still an open question as of mid-2025.

The setup that actually delivers the speedup

Forget rebuilding from scratch. The fastest path today is SGLang, which has already integrated the three kernels you care about. Per the LMSYS blog (May 2025), SGLang’s PD-disaggregated implementation on 12 nodes of 8x H100 achieves 52.3k input tokens per second and 22.3k output tokens per second per node for 2000-token input sequences – the first open-source implementation to nearly match the throughput in DeepSeek’s official numbers.

For a typical single-node setup serving R1, the minimum useful command looks like this:

git clone https://github.com/sgl-project/sglang.git
cd sglang && pip install -e "python[all]"

# Pre-compile DeepGEMM kernels ahead of time
python3 -m sglang.compile_deep_gemm 
 --model deepseek-ai/DeepSeek-R1 
 --tp 8 --trust-remote-code

# Launch with MTP speculative decoding enabled
python3 -m sglang.launch_server 
 --model-path deepseek-ai/DeepSeek-R1 
 --tp 8 
 --speculative-algorithm EAGLE 
 --speculative-num-steps 3 
 --speculative-eagle-topk 1 
 --speculative-num-draft-tokens 4

The compile_deep_gemm step is the one most tutorials skip and then complain about cold-start latency. DeepGEMM is JIT-compiled – pre-compiling it once saves you the runtime overhead on every fresh server boot.

Why MTP is the biggest practical win

The flashy number in the news is DeepGEMM’s 1350 TFLOPS, but for an individual developer the bigger lever is Multi-Token Prediction. DeepSeek shipped MTP weights inside V3 and R1 themselves, so you don’t have to train a separate draft model. According to AMD’s reproducible SGLang tutorial, serving DeepSeek-V3 with MTP enabled boosts both the latency and throughput by 1.2 to 2.1 times across concurrency 1-64.

The catch – and this is the part the press releases skip – is that the 2.1x ceiling appears at low concurrency. As batch size climbs toward 64, the gain compresses toward 1.2x because the GPU is already saturated and there’s less idle compute for the draft model to fill. A solo developer running prototypes will see the big number. A high-traffic production endpoint will see the small one.

Pro tip: Benchmark MTP at YOUR expected concurrency, not at concurrency=1. Run two SGLang servers – one with --speculative-algorithm EAGLE, one without – and hit both with the same async client at your real traffic shape. The crossover point where MTP stops helping is real, and it depends on your input/output length ratio.

Where this quietly breaks

Three gotchas that haven’t made it into the hype posts yet.

  • Hopper-only, period. FlashMLA requires CUDA 12.3 or above (12.8 recommended) and PyTorch 2.0 or above, and the kernel itself is hand-tuned for Hopper Tensor Memory Accelerator. DeepGEMM is the same story – it exclusively supports NVIDIA Hopper tensor cores. If you’re on A100s, RTX 4090s, or anything older, none of this code runs. You’d fall back to FlashAttention 2 and a regular FP16 path.
  • Memory layout trap on Hopper builds. Per the DeepGEMM README, the SM90 implementation supports only the NT memory layout (row-major, col-major); the SM100 implementation supports all memory layouts (NT, TN, NN, TT). If your existing kernel feeds DeepGEMM TN-layout tensors on H100, you get garbage or a crash. The fix is either an explicit transpose or moving to the Blackwell build.
  • The 545% margin is a production number, not yours. The widely-quoted figure (as of Feb 27-28, 2025, per DeepSeek’s day_6 system overview) assumes their full PD-disaggregated cluster and 24/7 utilization. A single 8x H100 node hitting maybe 30% utilization is a different economic universe entirely.

What to do with this on Monday

If you already serve DeepSeek on SGLang: upgrade to a recent release, pre-compile DeepGEMM, and turn on MTP. That’s a 30-minute change and the cheapest speedup you’ll get this quarter. Measure your tokens/sec/user before and after – at your real concurrency, not at 1.

If you’re still on vLLM: switch the runtime first, then chase the kernels. The runtime swap alone closes most of the gap.

If you’re on non-Hopper GPUs: skip all of this and use the DeepSeek API directly. Self-hosting on older hardware doesn’t get you the kernel gains, and the economics of renting H100s rarely favor small-scale deployments.

FAQ

Can I use FlashMLA on my RTX 4090?

No. It’s Hopper-only.

How much faster is DeepGEMM than what I’m already running?

Depends on what you’re already running. Against CUTLASS-based FP8 kernels, the DeepGEMM repo reports 1.1x-2.7x speedups depending on matrix shape. Against the default FP16 path in older vLLM builds, the gap is much larger because you’re also moving from FP16 to FP8. As of the April 2025 update, the library reportedly hits 1550 TFLOPS on H800 – though numbers in this space change month to month, so re-benchmark before quoting.

Do I need all five repos or can I cherry-pick?

Cherry-pick. For most people running a single-node or small-cluster setup, FlashMLA + DeepGEMM + MTP gets you the majority of the gain with a fraction of the integration work. DeepEP only matters once you’re doing cross-node expert parallelism, and 3FS only matters if your KV cache is large enough to justify offload. Start with the runtime (SGLang or TensorRT-LLM), confirm it picks up the kernels, then layer on speculative decoding. Adding DualPipe and 3FS is a project for when you have a real distributed deployment, not a weekend.