By the end of this tutorial you’ll paste a prompt into a free browser tool, watch it get chopped into numbered pieces, and predict – before hitting Enter – why a particular request is going to fail. Not a lecture on neural networks. A working mental model of how LLMs work, built from one 30-second experiment.
Andrej Karpathy’s tokenizer deep dive recently got translated into a full written guide on fast.ai, and the AI community has been re-circulating it ever since. That post makes one blunt argument: almost every weird LLM behavior you’ve cursed at – bad spelling, broken math, JSON it can’t quite produce – traces back to one place. So that’s where we start.
The 30-second demo that explains everything
Open tiktokenizer.vercel.app in a new tab. Pick gpt-4o from the dropdown. Type this:
Hello, world!
Four colored chunks appear: “Hello”, “,”, “ world” (note the leading space), and “!” – each with a number underneath. Those are token IDs. The model never sees the letters “Hello”. It sees an integer. The same phrase sent to Claude? 3 tokens, not 4 – different provider, different tokenizer, different bill. (That gap is documented in machinelearningplus’s BPE breakdown.)
Now type strawberry. Count the r’s the tool shows per chunk. The model isn’t being dumb when it tells you there are two r’s – it’s reading two or three opaque integers, not letters. IBM’s docs call this inference: tokenize, embed, run through the transformer, predict the next token, repeat. The model’s universe is integers, start to finish.
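To make the “integers, not letters” point concrete, here is a minimal sketch of a tokenizer. The vocabulary and IDs are invented for illustration, and the greedy longest-match rule is a simplification: real BPE tokenizers build their chunks through learned merges, so the actual split of strawberry will differ.

```python
# Toy vocabulary: pieces and IDs are invented, not taken from a real tokenizer.
TOY_VOCAB = {"str": 0, "aw": 1, "berry": 2, "s": 3, "t": 4, "r": 5,
             "a": 6, "w": 7, "b": 8, "e": 9, "y": 10}


def toy_encode(text: str, vocab: dict) -> list:
    """Greedy longest-match split of text into vocabulary IDs."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, shrinking until one matches.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece covers {text[i]!r}")
    return ids


ids = toy_encode("strawberry", TOY_VOCAB)
print(ids)  # [0, 1, 2]
```

Three r’s went in, and what comes out is three integers with no letters attached. Counting r’s from `[0, 1, 2]` would require the model to have memorized what each ID spells.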
What just happened inside – in reverse
Here’s the full chain, from the answer you receive back to the prompt you sent:
- Stop signal. The model emits a special end token and halts. In the GPT-5 tokenizer, turns out that’s token ID 199999 – one specific number in a vocabulary slot, nothing more. (ngrok’s prompt caching post, Dec 2025)
- Token-by-token generation. Each new token gets appended back to your original prompt, and the whole thing re-runs. Context windows matter because the model needs to see everything to produce the next thing.
- Probability over the vocabulary. At each step the network scores every possible next token, then samples one. “Predicts the next word” is the cocktail-party version.
- Transformer layers. Numbers from your prompt flow through stacks of attention layers – each token can “look at” relevant earlier ones, which is why a pronoun in sentence 12 correctly refers to the noun in sentence 2.
- Embeddings. Each token ID maps to a vector – a long list of numbers. Similar tokens cluster nearby in that space.
- Tokenization. The step you just watched.
Everything else – RLHF, instruction tuning, system prompts, function calling – sits on top of those six steps.
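The first three steps in that chain (score the vocabulary, sample one token, append, stop on the end token) fit in a short loop. This is a sketch under loud assumptions: `toy_next_token_probs` is a hypothetical stand-in for the network, the four-entry vocabulary is made up, and `END = 3` is an arbitrary ID chosen for this toy.

```python
import random

VOCAB = ["the", "cat", "sat", "<end>"]  # made-up vocabulary
END = 3                                 # hypothetical end-token ID for this toy


def toy_next_token_probs(context: list) -> list:
    """Stand-in for the network: one probability per vocabulary entry.

    A real model computes this from the full context; this toy only
    looks at how many tokens the context holds.
    """
    if len(context) >= 3:
        return [0.0, 0.0, 0.0, 1.0]  # certain to emit the end token
    probs = [0.1, 0.1, 0.1, 0.0]
    probs[len(context)] = 0.8
    return probs


def generate(prompt: list, max_new_tokens: int = 10, seed: int = 0) -> list:
    """Sample one token at a time, appending each to the context."""
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(context)
        next_id = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        if next_id == END:  # stop signal: halt without appending it
            break
        context.append(next_id)
    return context


print(" ".join(VOCAB[i] for i in generate([])))
```

Change the seed and the words can change, but the loop always halts the same way: the moment the end token is drawn.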
The transformer, in one paragraph
The architecture under all of this comes from a 2017 Google paper: “Attention Is All You Need.” Before it, language models read text sequentially, one word at a time. Transformers process all tokens in parallel and let each one decide which others matter – self-attention. That parallelism is why training on “the internet” became feasible at all.
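Here is a stripped-down sketch of self-attention, assuming tiny hand-picked embeddings. It leaves out things real transformers have: the learned query/key/value projections, multiple heads, and the mask that stops a token from looking at later ones. What remains is the core move, where every token scores every other token and takes a weighted average.

```python
import math

# Hypothetical 2-dimensional embeddings for three token IDs.
EMBEDDINGS = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}


def softmax(scores: list) -> list:
    """Turn raw scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors: list) -> tuple:
    """Scaled dot-product attention with queries = keys = values = inputs."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this token against every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Blend all token vectors according to those weights.
        blended = [sum(w * vec[i] for w, vec in zip(weights, vectors))
                   for i in range(dim)]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


vectors = [EMBEDDINGS[token_id] for token_id in [0, 1, 2]]
outputs, weights = self_attention(vectors)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights is one token deciding how much the others matter. Nothing in the loop over `query` depends on the previous iteration, which is the parallelism the paragraph above refers to.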
Here’s a question worth sitting with: if the model has no concept of letters – only token IDs – what does it even mean to say it “understands” your prompt? The answer matters less than you’d think for day-to-day use, but it reframes every frustrating output you’ve ever gotten.
Three gotchas – try each one yourself
1. A single extra space doubles your token count
Type hello with one leading space: one token. Add a second leading space, so the input is two spaces followed by hello, and it becomes 2 tokens – confirmed by tiktoken’s o200k_base encoding. If you copy-paste indented prompts from a Markdown file or a code editor that adds soft tabs, you can quietly inflate input costs on long batches and never notice it happening.
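One way to guard against this is to normalize whitespace before a prompt leaves your code. The helper below is a sketch, and `normalize_prompt_whitespace` is a hypothetical name. It assumes the prompt is prose, where indentation and repeated spaces carry no meaning; don’t run it on code or anything else that is whitespace-sensitive.

```python
import re


def normalize_prompt_whitespace(prompt: str) -> str:
    """Strip indentation and collapse runs of spaces or tabs to one space.

    Assumption: the prompt is prose, so interior whitespace is not meaningful.
    """
    cleaned_lines = []
    for line in prompt.splitlines():
        collapsed = re.sub(r"[ \t]+", " ", line)  # "a   b" -> "a b"
        cleaned_lines.append(collapsed.strip())   # drop leading/trailing space
    return "\n".join(cleaned_lines).strip()


messy = "    Summarize  this\n    paragraph.  "
print(repr(normalize_prompt_whitespace(messy)))  # 'Summarize this\nparagraph.'
```

Paste the before and after versions into Tiktokenizer to see whether the cleanup moves the count for your prompts.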
2. Numbers don’t chunk the way you’d expect
Try 42, then 1000, then 123456. They split into different numbers of chunks – sometimes by digit pairs, sometimes weirder. The model isn’t seeing “one hundred twenty-three thousand four hundred fifty-six.” It’s seeing two or three blobs with no arithmetic meaning attached. That’s a big part of why it fumbles multi-digit multiplication – often less a reasoning failure than a tokenization one.
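Here is a rough sketch of why the chunks ignore place value. It assumes a splitting rule that groups digits left to right in runs of up to three, which is my understanding of how the split patterns behind tiktoken’s cl100k_base and o200k_base treat digit runs. Other tokenizers chunk differently, so treat `chunk_digits` as an illustration and check real splits in Tiktokenizer.

```python
import re


def chunk_digits(number_text: str) -> list:
    """Split a digit string left to right into groups of up to three digits.

    Assumed rule for illustration; real tokenizers may chunk differently.
    """
    return re.findall(r"\d{1,3}", number_text)


for text in ["42", "1000", "123456", "1234567"]:
    print(text, "->", chunk_digits(text))
# 42 -> ['42']
# 1000 -> ['100', '0']
# 123456 -> ['123', '456']
# 1234567 -> ['123', '456', '7']
```

Look at 1000: a human groups it as 1 and 000, counting from the right. Left-to-right chunking gives 100 and 0, so the pieces don’t line up with thousands, hundreds, or anything else arithmetic cares about.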
3. YAML beats JSON on token count
Paste a small JSON object, then paste the equivalent YAML. Count tokens. YAML usually wins: braces and quotation marks each spawn their own tokens in JSON. If you’re building a pipeline where the model reads and emits structured data, switching to YAML can shave a noticeable chunk off your bill at zero quality cost. (Karpathy covers this directly in his tokenization lecture; the Prompt Engineering Guide echoes it.)
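You can get a feel for the gap before opening the tool. The sketch below uses a crude proxy, not a tokenizer: it counts every word and every punctuation mark as one piece. `to_flat_yaml` is a hypothetical helper that only handles flat records of plain strings, and the sample record is made up. Real token counts will differ, so confirm with Tiktokenizer.

```python
import json
import re


def to_flat_yaml(record: dict) -> str:
    """Hypothetical helper: emit a flat dict of plain strings as YAML lines."""
    return "\n".join(f"{key}: {value}" for key, value in record.items())


def rough_piece_count(text: str) -> int:
    """Crude proxy, NOT a tokenizer: one piece per word or punctuation mark."""
    return len(re.findall(r"\w+|[^\w\s]", text))


record = {"name": "Ada", "role": "engineer", "team": "platform"}  # sample data
as_json = json.dumps(record)
as_yaml = to_flat_yaml(record)

print(as_json, "->", rough_piece_count(as_json))  # 25 pieces
print(as_yaml, "->", rough_piece_count(as_yaml))  # 9 pieces
```

The words are identical in both versions. The whole difference is braces, quotes, and commas. One caveat: YAML values that look like numbers or booleans need quoting to stay strings, which claws back some of the savings.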
Before you optimize a prompt for “clarity”: run it through Tiktokenizer first. Half the time the win isn’t rewording – it’s removing duplicate whitespace, spelling out numbers for math tasks, or swapping JSON for YAML in the I/O layer.
Pitfalls people keep hitting
- Conflating ChatGPT with GPT-4. ChatGPT is the product layer – instruction tuning, a chat wrapper, safety filters – sitting on a raw model. Hand raw GPT the string “Summarize this paragraph” and it continues the paragraph. It doesn’t follow the instruction.
- Thinking the model “knows” facts. It stored statistical patterns over tokens, not a database. A fact that appeared 3 times in training data is unreliable. One that appeared 3 million times is still not guaranteed.
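- Treating context windows as free. Every token already in the context is attended to each time a new token is generated. Caching spares the model from recomputing earlier tokens from scratch, but each step still has more to attend over, so long prompts get slower as the response grows – not just at the start.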
- Expecting identical outputs. Sampling is stochastic. The same prompt sent twice can return different text. Temperature 0 makes it less random, not deterministic.
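The temperature point is easier to see with numbers. This sketch applies a temperature-scaled softmax to three made-up scores. It covers positive temperatures only: dividing by zero isn’t defined, and a setting of 0 is usually handled as a special case that picks the top-scoring token.

```python
import math


def softmax_with_temperature(logits: list, temperature: float) -> list:
    """Convert raw scores to probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("this sketch only handles temperature > 0")
    scaled = [score / temperature for score in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for temperature in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature drops, probability piles onto the top candidate. The others shrink but never reach exactly zero at any positive temperature, which is one reason low temperature reduces variation without guaranteeing identical outputs.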
Where this mental model actually pays off
Tokens-first thinking changes a handful of practical decisions:
Prompt debugging. Instruction being ignored? Check whether unusual formatting – bullet asterisks, smart quotes, non-breaking spaces – is fragmenting it into tokens the model treats differently than plain text.
Cost estimation. Word count lies. Token count doesn’t. Run large jobs through Tiktokenizer before you submit them.
Multilingual work. Non-English text often costs 2-4x more tokens than English for the same meaning. Budget for that, or translate at the edges.
Model selection. Reasoning models like DeepSeek-R1 – 671 billion parameters, open-weight, released January 2025 – burn far more tokens per query because they generate long internal chains of thought. That’s a feature if you need deep reasoning. Expensive if you don’t.
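The cost, multilingual, and reasoning-model points all reduce to the same arithmetic. Here is a sketch with invented numbers: the prices, the batch size, the 3x multilingual multiplier, and the 2,000 extra reasoning tokens are all placeholders. Swap in your provider’s real rates and your own measured token counts.

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      usd_per_million_input: float,
                      usd_per_million_output: float) -> float:
    """Token counts times per-million prices, in dollars."""
    return (input_tokens * usd_per_million_input
            + output_tokens * usd_per_million_output) / 1_000_000


# Hypothetical prices and workload, for illustration only.
PRICE_IN, PRICE_OUT = 2.00, 8.00   # USD per million tokens
REQUESTS = 1_000
IN_PER_REQUEST, OUT_PER_REQUEST = 500, 200

baseline = estimate_cost_usd(REQUESTS * IN_PER_REQUEST,
                             REQUESTS * OUT_PER_REQUEST, PRICE_IN, PRICE_OUT)

# Same prompts in a language that takes 3x the input tokens (assumed multiplier).
multilingual = estimate_cost_usd(REQUESTS * IN_PER_REQUEST * 3,
                                 REQUESTS * OUT_PER_REQUEST, PRICE_IN, PRICE_OUT)

# Reasoning model emitting 2,000 extra chain-of-thought tokens per request (assumed).
reasoning = estimate_cost_usd(REQUESTS * IN_PER_REQUEST,
                              REQUESTS * (OUT_PER_REQUEST + 2_000),
                              PRICE_IN, PRICE_OUT)

print(f"baseline:     ${baseline:.2f}")      # $2.60
print(f"multilingual: ${multilingual:.2f}")  # $4.60
print(f"reasoning:    ${reasoning:.2f}")     # $18.60
```

Under these made-up numbers, the long chains of thought cost far more than the language penalty, because output tokens are priced higher and there are many more of them.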
When this model isn’t enough
Tokenization gets you 80% of practical intuition. The other 20% is genuinely harder.
It won’t explain why a model refuses a specific prompt – that’s safety training, a separate layer entirely. It won’t predict which model handles legal reasoning better; that needs benchmarks and your own evals. And it says nothing about emergent behaviors like in-context learning, which researchers themselves haven’t fully unpacked. For AI safety work, interpretability research, or fine-tuning? You need the deeper layers. For prompt engineering and daily use? Tokens are enough.
Your next 5 minutes
Open Tiktokenizer. Paste in your three most-used prompts. For each: look for leading double-spaces, raw numbers you could spell out, JSON you could convert to YAML. Fix what you find, save the cleaned versions. One sitting. Compounds every time you hit Enter.
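If you’d rather script the first pass, here is a sketch of that checklist as a function. `audit_prompt` is a hypothetical helper, and its thresholds (two or more spaces, four or more digits) are arbitrary starting points. It only flags candidates; whether a flagged number or JSON block should change is still your call.

```python
import json
import re


def audit_prompt(prompt: str) -> list:
    """Flag whitespace, number, and JSON patterns worth a second look."""
    findings = []
    if re.search(r" {2,}", prompt):
        findings.append("repeated spaces")
    if "\u00a0" in prompt:
        findings.append("non-breaking space")
    if re.search(r"\d{4,}", prompt):  # arbitrary threshold: 4+ digits in a row
        findings.append("long raw number")
    # Check whether the outermost {...} span parses as JSON.
    start, end = prompt.find("{"), prompt.rfind("}")
    if 0 <= start < end:
        try:
            json.loads(prompt[start:end + 1])
            findings.append("embedded JSON")
        except json.JSONDecodeError:
            pass
    return findings


print(audit_prompt("Add  123456 to the total"))  # ['repeated spaces', 'long raw number']
print(audit_prompt('Parse {"a": 1}'))            # ['embedded JSON']
print(audit_prompt("Plain prompt."))             # []
```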
FAQ
Do LLMs actually “understand” what I’m asking?
Not in any human sense. They have no intent, no beliefs, no experience. But for most practical purposes, the distinction stops mattering – treat them as extremely sophisticated pattern-matchers and move on.
Why does the same prompt cost different amounts on different providers?
Each company trains its own tokenizer. OpenAI’s o200k_base, Anthropic’s tokenizer, and Google’s all carve text into different chunk sizes – “Hello, world!” is 4 tokens on GPT-4o and 3 on Claude. Multiply that across millions of API calls and the difference becomes real money. Which is why switching providers for cost reasons usually requires re-benchmarking, not just a price comparison. You’re not comparing apples to apples; you’re comparing different slicing strategies applied to the same fruit.
Is bigger always better when picking a model?
The 2023-era assumption that parameter count predicts quality has weakened considerably. DeepSeek-R1 and several of Mistral’s smaller models showed competitive performance through better training data and techniques rather than raw size. Pick based on benchmark performance for your actual use case. The headline parameter number is marketing, not a spec sheet.