Mercury 2 just clocked ~736 tokens per second on the Artificial Analysis leaderboard (as of early 2026). Cerebras is quoting 3,000. Every week brings a new “world’s fastest” claim, and your timeline is full of screenshots.
So how fast is N tokens per second really? More importantly: which N actually matters for what you’re doing? This guide skips the theory and gives you a way to map the number on a benchmark to the experience you’ll actually have. Sometimes paying for more speed buys you nothing.
The scenario: you’re staring at a tok/s number and don’t know if it’s good
Two providers. One says 80 tok/s, the other says 300. The 300 costs more. Do you need it?
It depends entirely on who – or what – is reading the output. If it’s a human, anything above your reading speed is invisible. If it’s another program, every millisecond shows up in the wall-clock total. Same number, completely different value.
Here’s the conversion most guides skip. On average there are about 1.3 tokens per word, and the average human reads at 3-5 words per second (msandbu.org). Multiply those out (3 × 1.3 = 3.9 and 5 × 1.3 = 6.5): comfortable reading lands somewhere between roughly 4 and 7 tokens per second. At 50 tok/s, the model is already outrunning you by roughly 8-13×.
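The conversion is simple enough to check yourself. Here is a minimal sketch, assuming the 1.3 tokens-per-word average quoted above; the function names are illustrative, and the real ratio varies by tokenizer and language.

```python
# Assumed constant from the text: ~1.3 tokens per English word (varies by tokenizer).
TOKENS_PER_WORD = 1.3


def reading_tok_s(words_per_second: float, tokens_per_word: float = TOKENS_PER_WORD) -> float:
    """Convert a reading speed in words/s into the equivalent tokens/s."""
    return words_per_second * tokens_per_word


def speedup_over_reader(model_tok_s: float, words_per_second: float) -> float:
    """How many times faster the model writes than the reader reads."""
    return model_tok_s / reading_tok_s(words_per_second)


slow, fast = reading_tok_s(3), reading_tok_s(5)
print(f"reading speed: {slow:.1f}-{fast:.1f} tok/s")                     # reading speed: 3.9-6.5 tok/s
print(f"50 tok/s vs a fast reader: {speedup_over_reader(50, 5):.1f}x")  # 7.7x
print(f"50 tok/s vs a slow reader: {speedup_over_reader(50, 3):.1f}x")  # 12.8x
```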
Tok/s tiers translated into what you’ll actually feel
Community testing of local models has settled into a fairly consistent set of thresholds (as of 2025-2026). These map well to API speeds too:
- Under 5 t/s – You’re waiting for individual sentences. Noticeable.
- 5-15 t/s – Usable; at the low end, a fast reader still catches up with the model.
- 20-40 t/s – Comfortable; text arrives faster than even a fast reader consumes it.
- 60-100 t/s – The model outpaces reading; whole sentences pop in at once.
- 100+ t/s – Paragraphs appear nearly instantly. You stop noticing speed entirely.
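If you want those tiers in a script (say, to label benchmark results), a small lookup is enough. This is a sketch, not a standard: the label strings are shorthand for the list above, and because the list leaves gaps (15-20 and 40-60 t/s), assigning those gaps to the lower tier is an assumption made here.

```python
from bisect import bisect_right

# Lower bound (tok/s) at which each successive tier starts.
# Assumption: the gaps in the published list (15-20, 40-60) fall into the lower tier.
_TIER_STARTS = [5, 20, 60, 100]
_TIER_LABELS = [
    "waiting on sentences",  # under 5
    "usable",                # 5 up to 20
    "comfortable",           # 20 up to 60
    "outpaces reading",      # 60 up to 100
    "near-instant",          # 100 and above
]


def feel(tok_s: float) -> str:
    """Map a tokens-per-second figure to a rough perceived-speed tier."""
    if tok_s < 0:
        raise ValueError("tok_s must be non-negative")
    return _TIER_LABELS[bisect_right(_TIER_STARTS, tok_s)]


for rate in (3, 12, 30, 80, 736):
    print(f"{rate:>4} tok/s -> {feel(rate)}")
```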
For context on where the industry sits: as of early 2026, most vendors run GPT-OSS-120B at 100-300 tok/s on H100 GPUs. Cerebras delivers roughly 1,800 tok/s on Llama 3.1 8B and 450 tok/s on Llama 3.1 70B using native 16-bit weights, and around 3,000 tok/s on GPT-OSS-120B (Cerebras official announcement). Baseten hit 650 tok/s on GPT-OSS-120B using Nvidia’s Blackwell B200 GPUs, which was the strongest GPU-based result at the time of writing. The frontier moved from “readable” to “faster than any human can process” in roughly two years.
How to actually pick a speed: a 30-second check
Don’t trust the leaderboard for your use case. Here’s the faster way to find your real threshold.
Search “token speed visualizer” – several free tools come up. Run it at 10, 20, 40, 80, and 150 tok/s. Read along as if you’re waiting on a real answer. The point where it stops feeling like waiting? That’s your human floor. Anything above it is wasted on you alone.
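If you would rather not rely on a web tool, the same check fits in a few lines of terminal code. The sketch below is a rough stand-in, not a real tokenizer: it approximates tokens by words using the 1.3 tokens-per-word average, and the names `word_delay` and `stream_at` are made up for this example.

```python
import time


def word_delay(tok_s: float, tokens_per_word: float = 1.3) -> float:
    """Seconds to wait per word so that output averages `tok_s` tokens per second."""
    if tok_s <= 0:
        raise ValueError("tok_s must be positive")
    return tokens_per_word / tok_s


def stream_at(text, tok_s, sleep=time.sleep, emit=lambda s: print(s, end="", flush=True)):
    """Print `text` word by word at roughly `tok_s` tokens per second.

    `sleep` and `emit` are injectable so the pacing can be checked without waiting.
    Returns the total simulated streaming time in seconds.
    """
    delay = word_delay(tok_s)
    words = text.split()
    for word in words:
        emit(word + " ")
        sleep(delay)
    emit("\n")
    return len(words) * delay
```

To run the check, call `stream_at(sample_text, rate)` for each rate in `(10, 20, 40, 80, 150)` with a paragraph you have not read before, and note where the waiting stops.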
Then think about what else consumes that output. A code agent, a voice TTS pipeline, a batch summarizer – if anything downstream reads the stream programmatically, the floor doesn’t apply to that path. Morph’s analysis puts the human ceiling at roughly 250 words per minute (about 4 words, or 5-6 tokens, per second at 1.3 tokens per word); past that, faster tokens feel identical because the bottleneck has shifted to your cognition. Coding agents have no such bottleneck – they consume output the moment it arrives and immediately fire the next action.
Quick rule: Picking a provider for a chat UI where a human reads every word? Anything past ~80 tok/s is mostly paying for bragging rights. Picking one for an agent chaining 30 sequential calls? Every extra tok/s compounds – that’s where 1,000+ tok/s providers earn their price premium.
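To see why the agent case differs, put numbers on the chain. This is a back-of-the-envelope sketch: the 500 tokens per call and 0.5 s time-to-first-token are invented for illustration, and real chains also spend time on tool calls and prompt processing.

```python
def chain_seconds(calls: int, tokens_per_call: int, tok_s: float, ttft_s: float = 0.0) -> float:
    """Wall-clock time for `calls` sequential generations, each waiting on the previous one."""
    if tok_s <= 0:
        raise ValueError("tok_s must be positive")
    return calls * (ttft_s + tokens_per_call / tok_s)


# Hypothetical workload: 30 sequential calls, 500 output tokens each, 0.5 s TTFT.
for rate in (80, 300, 1000):
    total = chain_seconds(calls=30, tokens_per_call=500, tok_s=rate, ttft_s=0.5)
    print(f"{rate:>5} tok/s -> {total:6.1f} s for the whole chain")
# 80 tok/s -> 202.5 s, 300 tok/s -> 65.0 s, 1000 tok/s -> 30.0 s
```

Under these assumptions the gap between 80 and 1,000 tok/s is almost three minutes per run, which no human reader would ever notice in a single chat reply.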
When more tokens per second is wasted money
Faster isn’t always better. Three situations where premium speed gives you nothing:
Streaming chat to a single user. They cap out at reading speed. The first token (TTFT) is what they notice; the rest stack up in the buffer. Optimize TTFT, not throughput.
Async tasks. If your app delivers results by email or notification, the user never sees generation in real time. A 2-second response and a 0.4-second one look identical to a notification inbox.
Reasoning models where the answer is short. Reasoning models can have a time-to-first-token of 10-150 seconds while the model generates internal thinking tokens before responding (BenchLM). The visible output could arrive at 500 tok/s and you’d still be sitting through most of the wait. The headline speed is nearly irrelevant here.
For these workloads: optimize cost-per-token. The cheap, slower providers usually win.
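The reasoning-model case is the easiest to quantify. A rough sketch, with a 30-second thinking phase and a 200-token visible answer chosen purely as example values:

```python
def wait_breakdown(ttft_s: float, visible_tokens: int, tok_s: float):
    """Return (total wait in seconds, fraction of that wait spent before the first token)."""
    total_s = ttft_s + visible_tokens / tok_s
    if total_s <= 0:
        raise ValueError("total wait must be positive")
    return total_s, ttft_s / total_s


# Hypothetical: 30 s of hidden thinking, then a 200-token answer.
for rate in (50, 500):
    total, share = wait_breakdown(ttft_s=30, visible_tokens=200, tok_s=rate)
    print(f"{rate:>4} tok/s: {total:.1f} s total, {share:.0%} spent waiting for the first token")
# 50 tok/s: 34.0 s total, 88% ...   500 tok/s: 30.4 s total, 99% ...
```

In this example a 10× faster output rate shortens the wait by 3.6 seconds out of 34, which is why the headline number barely registers.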
Why the leaderboard number lies to you
The catch is, even picking the right tier doesn’t mean the published figure holds in production.
Concurrency erodes everything. A model benchmarked at 200 tok/s on a quiet server can drop to 60 tok/s under moderate concurrent traffic – and TTFT can spike from 0.5s to 5s at peak load. The benchmark is one request on an uncontested endpoint. Your production traffic isn’t that.
Tokenizers aren’t comparable across families. Turns out the same English sentence might be 10 tokens for GPT-4 and 12 tokens for Llama. Artificial Analysis normalizes this via a common tokenizer (tiktoken o200k_base) – raw cross-family tok/s comparisons without that normalization are misleading by default.
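The idea behind normalization can be shown with plain arithmetic. The sketch below is not Artificial Analysis’s code; it assumes you have already counted the same text with both the model’s native tokenizer and a reference tokenizer, and the 240 tok/s figure is hypothetical.

```python
def normalized_tok_s(native_tok_s: float, native_tokens: int, reference_tokens: int) -> float:
    """Re-express a speed measured in a model's own tokens in reference-tokenizer tokens.

    `native_tokens` and `reference_tokens` are counts for the *same* text.
    """
    if native_tokens <= 0:
        raise ValueError("native_tokens must be positive")
    return native_tok_s * reference_tokens / native_tokens


# Same sentence: 12 tokens natively, 10 tokens in the reference tokenizer.
# A reported 240 native tok/s is then 200 reference tok/s.
print(normalized_tok_s(240, native_tokens=12, reference_tokens=10))  # 200.0
```

A model whose tokenizer splits text into more, smaller tokens looks faster in raw tok/s while delivering the same amount of text per second.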
Speed sometimes hides a quality trade. Groq runs most models at 8-bit quantization for speed; Cerebras runs native 16-bit weights (Cerebras CS-3 vs Groq LPU). Groq’s inference on 16-bit models runs noticeably slower – a drop of roughly 30-40% in practice. “Fastest” on a leaderboard can mean “a lower-precision version of the same model name.”
And sometimes tok/s just doesn’t predict the experience. An academic paper, “Speed and Conversational Large Language Models: Not All Is About Tokens per Second” (arXiv:2502.16721), found that existing token-based speed metrics don’t necessarily correlate with the time needed to complete different conversational tasks. Total task completion is what users care about. Tok/s is one variable inside it.
The diffusion wildcard that broke the leaderboard
Mercury 2’s ~736 tok/s (as of early 2026) is interesting not because of the number but because of how it’s reached. It doesn’t work like a traditional autoregressive LLM. Built on a diffusion architecture, it generates full draft sequences and refines them in parallel, rather than producing tokens one at a time – which is why it sits so far above the autoregressive pack on the speed chart.
Autoregressive decoding is memory-bandwidth-bound: each new token requires reading the full model weights from VRAM, and each step depends on the previous one, so a single sequence can’t be parallelized across token positions. Generating 10 tokens per second on a 140 GB model requires roughly 1.4 TB/s of memory bandwidth (140 GB read per token × 10 tokens per second). Diffusion sidesteps the one-token-per-pass constraint by refining many positions in each pass.
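The bandwidth figure follows from one multiplication. This sketch assumes the simplest possible model: a single sequence, every weight read exactly once per token, and no allowance for KV-cache traffic, batching, or sparse (mixture-of-experts) activation. The 3.0 TB/s accelerator is a hypothetical round number, not a specific product.

```python
def required_bandwidth_tb_s(model_gb: float, tok_s: float) -> float:
    """Memory bandwidth (TB/s) needed to read all weights once per generated token."""
    return model_gb * tok_s / 1000.0


def max_tok_s(model_gb: float, bandwidth_tb_s: float) -> float:
    """Upper bound on single-sequence decode speed for a given memory bandwidth."""
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return bandwidth_tb_s * 1000.0 / model_gb


print(required_bandwidth_tb_s(model_gb=140, tok_s=10))         # 1.4
print(round(max_tok_s(model_gb=140, bandwidth_tb_s=3.0), 1))   # 21.4
```

Under these assumptions, even a 3 TB/s device tops out near 21 tok/s for one sequence on a 140 GB model, which is why batching, quantization, and alternative decoding schemes matter so much for headline speeds.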
What does this mean long-term? Honestly, unclear. If diffusion models scale to harder tasks without quality loss, the 1,000+ tok/s tier stops being exotic hardware territory and becomes the new normal. But as of this writing, Mercury 2 is a single fast entry on the board, not a category – and the quality tradeoffs at scale are still being tested.
Honest limits of tok/s as a metric
Tokens per second is a useful first filter and a terrible final word.
It tells you nothing about reasoning quality, nothing about TTFT, nothing about how the system degrades under your actual concurrency, and nothing about cost when the bill arrives. An academic paper (arXiv:2502.16721) makes this concrete: tok/s doesn’t reliably predict how long a full conversational task actually takes. For agents and pipelines, latency per step matters more than peak throughput. For human-facing chat, TTFT under 1 second usually matters more than whether the model outputs at 80 or 200 tok/s.
Use the number to rule things out. Don’t use it to make the final pick.
FAQ
What’s a “good” tokens-per-second number in 2026?
For reading chat output: anything above ~30 tok/s is fine. For agent pipelines, 100+ tok/s is a reasonable floor – and 500+ if you’re chaining many calls.
Why does my provider feel slower than the leaderboard says?
200 tok/s on a benchmark can land at 60 in your actual app during peak hours. Leaderboards measure one quiet request – yours shares the endpoint with everyone else. Concurrency, geographic routing, and time of day all eat into the number. Moral: run your own measurement on your real prompts at your real load before committing to a plan.
Is Cerebras’s 3,000 tok/s claim comparable to OpenAI’s API speed?
Not directly. Cerebras runs open-weight models (GPT-OSS-120B, Llama) on its own hardware; OpenAI’s hosted models are different model families with different tokenizers and different capability profiles. A higher tok/s number on a different model doesn’t automatically mean a faster experience on the task you need done. Compare on identical models when possible – and always look at end-to-end task completion time, not just the headline rate. The arXiv paper cited above (2502.16721) makes exactly this point.
Next step: pick one workflow you run weekly – chat, code generation, summarization, whatever – and time how long the model takes from the moment you hit enter to the moment you have a usable answer. That number is your real baseline. Any new provider gets compared against it, not against their landing page.
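A stopwatch works, but a few lines of code give you TTFT and streaming rate separately. The sketch below is provider-agnostic by design: `time_stream` takes any iterator that yields pieces of a streamed response, so you wrap whatever your client library returns. The names are hypothetical, and note that streamed chunks are not guaranteed to be single tokens, so treat the rate as chunks per second unless you count tokens yourself.

```python
import time
from typing import Callable, Iterable, NamedTuple


class StreamTiming(NamedTuple):
    ttft_s: float        # time from start until the first chunk arrived
    total_s: float       # time from start until the last chunk arrived
    chunks: int          # number of chunks received
    chunks_per_s: float  # rate after the first chunk (0.0 if fewer than 2 chunks)


def time_stream(chunks: Iterable[str], clock: Callable[[], float] = time.perf_counter) -> StreamTiming:
    """Consume a stream and record when the first and last chunks arrive.

    Start this right as the request is sent; if the request was issued
    earlier, the measured TTFT will be too small.
    """
    start = clock()
    first = last = None
    count = 0
    for _ in chunks:
        last = clock()
        if first is None:
            first = last
        count += 1
    if first is None:
        raise ValueError("stream produced no chunks")
    gen_s = last - first
    rate = (count - 1) / gen_s if count > 1 and gen_s > 0 else 0.0
    return StreamTiming(first - start, last - start, count, rate)


def fake_stream(n: int = 5, ttft_s: float = 0.05, gap_s: float = 0.01):
    """Stand-in for a real provider stream, used only to demonstrate the timer."""
    time.sleep(ttft_s)
    for i in range(n):
        yield f"chunk{i} "
        time.sleep(gap_s)


print(time_stream(fake_stream()))
```

Run it against your real prompts at your real load, several times across the day, and keep the results as the baseline any new provider has to beat.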