Where Does Next-Token Prediction Leave Us? A Practical Guide

LeCun just raised $1B saying next-token prediction is a dead end. Here's what that actually means for how you use ChatGPT, Claude, and Gemini today.

8 min read · Beginner

So, where does next-token prediction leave us – the people actually typing prompts into ChatGPT, Claude, and Gemini? That’s the question buried under the headlines this month. Yann LeCun left Meta, raised $1.03 billion for AMI Labs at a $3.5B pre-money valuation, and called the core mechanism behind every major LLM a dead end. Cool. But your prompts still need to work tomorrow morning.

This post is the practical version. We’ll skip the AGI debate and the JEPA explainer that every other article is running this week. Instead: what does the next-token-prediction limitation actually do to your outputs, and what can you change in your prompts today to fight it?

The hook: why your long answers fall apart

You’ve probably noticed this. A model gives you a brilliant first paragraph, a decent second, and by paragraph four it’s contradicting itself or inventing a function that doesn’t exist. People call it "hallucination" and move on. The research community has a sharper name for it: error compounding.

Here’s the math that LeCun and a growing pile of papers keep pointing at. The 2024 ICML paper "The Pitfalls of Next-Token Prediction" spells it out plainly: even if the per-token error rate is as low as 0.01, the probability of hitting at least one erroneous token compounds exponentially along the way, and by the end of 200 tokens it blows up to 0.86 (that is, 1 − 0.99^200, treating each token’s error as independent). Eighty-six percent. From a one-percent per-token error rate.

That’s not a quirk you can prompt away with "please be accurate." It’s baked into the generation process.
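The 0.86 figure is easy to reproduce. Here is a minimal sketch, assuming every token fails independently with the same probability (a simplification, since real models do not behave that neatly). The function name is made up for illustration.

```python
def prob_any_error(per_token_error: float, num_tokens: int) -> float:
    """Chance that at least one of num_tokens tokens is wrong.

    Assumes each token fails independently with the same probability,
    which is the simplified model behind the compounding argument.
    """
    return 1.0 - (1.0 - per_token_error) ** num_tokens


if __name__ == "__main__":
    # Same 1% per-token error rate, different answer lengths.
    for n in (50, 200, 1000):
        print(n, round(prob_any_error(0.01, n), 3))
```

For 200 tokens this comes out at roughly 0.866, which matches the paper’s 0.86. The useful part is how quickly the number climbs with length: shorter generations are the only lever this formula gives you.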

Why "just write a better prompt" falls short

Most prompt-engineering advice treats the model as a black box you can sweet-talk. Be specific. Give examples. Set the role. That works – until the task requires the model to look ahead.

The pitfalls paper isolates a failure that no amount of prompt polish fixes. Bachmann and Nagarajan describe this as the "Clever Hans cheat," where the model exploits trivial cues from the prefix and fails on lookahead tasks such as their path-star graph problem. Translation: when the right next token depends on the end of the answer rather than the beginning, models trained on next-token prediction (NTP) get confused and grab whatever cue is closest at hand.

And before you assume this is a Transformer-only quirk – the authors observe failure for both the Transformer and the Mamba architecture, a structured state space model. They also find that a form of teacherless training that predicts multiple future tokens is, in some settings, able to work around this failure. So they pinpoint a precise scenario where it is next-token prediction during training that is at fault – not the architecture itself.

So the architecture isn’t the villain. The single-token training objective is. Which means the fix has to either change training (not your problem) or change how you ask the model to generate (very much your problem).

The recommended approach: prompt around the failure modes

Three concrete moves. Each maps to a specific NTP weakness that’s been documented in the literature.

1. Force the answer outline before any prose

This directly attacks error compounding. If you ask for a 1,500-word answer in one shot, you’re rolling the dice at least 1,500 times: once per token, and an answer usually runs to more tokens than words. Instead, get the model to commit to structure first:

Step 1: Output ONLY a numbered outline of your answer.
        No prose yet. 5-8 items.
Step 2: Wait for me to say "go".
Step 3: Expand each item in 2-3 paragraphs, one item at a time.

Each section now starts from a clean, human-verified anchor instead of from drift accumulated over the previous thousand tokens. You’re effectively resetting the error counter.
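If you drive the model from code rather than a chat window, the same pattern can be sketched like this. Everything here is an assumption for illustration: `generate` stands in for whatever function sends a prompt to your model and returns text, and `approve` stands in for the human "go" step. Neither is a real library call.

```python
import re
from typing import Callable, List


def parse_outline(text: str) -> List[str]:
    """Extract items from a numbered outline such as '1. Intro' or '2) Setup'."""
    items = []
    for line in text.splitlines():
        match = re.match(r"\s*\d+[.)]\s+(.*\S)", line)
        if match:
            items.append(match.group(1))
    return items


def outline_first(
    task: str,
    generate: Callable[[str], str],        # hypothetical: prompt in, model text out
    approve: Callable[[List[str]], bool],  # hypothetical: the human "go" check
) -> str:
    """Get an outline, have it approved, then expand one item per call."""
    outline_text = generate(
        f"{task}\n\nOutput ONLY a numbered outline of your answer. "
        "No prose yet. 5-8 items."
    )
    items = parse_outline(outline_text)
    if not approve(items):
        raise ValueError("Outline rejected; fix the task or the outline first.")

    numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))
    sections = []
    for i, item in enumerate(items, start=1):
        # Every expansion starts from the approved outline, not from the
        # text generated for earlier items.
        sections.append(generate(
            f"{task}\n\nApproved outline:\n{numbered}\n\n"
            f"Expand ONLY item {i} ({item}) in 2-3 paragraphs."
        ))
    return "\n\n".join(sections)
```

The design choice that matters is in the loop: each call sees the outline but not the previous sections, so a mistake in section two cannot leak into section five. If your sections need to reference each other, you would have to pass earlier text back in and accept some of the drift again.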

2. Put the constraint at the end of the prompt, not the start

The Clever Hans cheat is about models grabbing whatever cue in the prefix is closest at hand. In practice, if your "most important rule" is buried in paragraph one of a long system prompt, expect it to fade by the time the model starts writing. Restate the hard constraint right before the model generates. It’s not a hack – it’s working with how the model generates.
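A minimal sketch of a prompt builder that does this automatically. The function and parameter names are made up for illustration, and the exact reminder wording is an assumption you should tune for your own model.

```python
def build_prompt(system_rules: str, context: str, task: str, hard_constraint: str) -> str:
    """Assemble a prompt that repeats the hard constraint as its final line."""
    parts = [
        system_rules.strip(),
        context.strip(),
        task.strip(),
        # Restated last, so it is the closest thing to the generation point.
        f"Before you answer, remember the hard constraint: {hard_constraint.strip()}",
    ]
    # Skip empty parts so the prompt has no stray blank blocks.
    return "\n\n".join(part for part in parts if part)


if __name__ == "__main__":
    print(build_prompt(
        system_rules="You are a careful analyst. Output valid JSON only.",
        context="(long document text goes here)",
        task="Summarize the three main risks.",
        hard_constraint="Output valid JSON only.",
    ))
```

The rule still lives in the system prompt; the builder just makes sure it also appears right before the model starts writing.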

3. Use reasoning models for lookahead problems

Models trained with reinforcement learning on full output chains aren’t escaping NTP at the token level, but they’re optimized against a longer horizon. As Derek Shiller laid out in his October 2025 analysis, RL post-training gives models a clear signal from the long-term course of text – a whole chain can be evaluated at once, depending on how well it turns out. Practically: if your task needs the model to plan (multi-step math, agent workflows, code that depends on a final invariant), pay for a reasoning model. If it’s a stylistic rewrite, don’t bother.

Pro tip: When a model contradicts itself across a long answer, don’t ask it to "double-check." That’s just more NTP on top of the broken output. Instead, paste the answer back and ask: "Find every claim in this text that depends on a fact stated earlier. List them." You’re forcing it to operate on the text as data, not as a continuation.

A real example: code that compiles vs. code that almost compiles

Last week I asked a frontier model to write a 300-line Python script that scrapes an API, deduplicates results, and writes to SQLite. One shot, no outline. Result: the function signature on line 40 used session, the call on line 220 used self.session, and a helper introduced on line 180 returned a tuple where the caller expected a dict. Classic compounding – by line 220 the model had "forgotten" its own line 40 conventions.

Same task, outline-first. I made it commit to a class skeleton with explicit method signatures before any implementation, then asked for one method at a time. More total tokens? Yes. But the naming was consistent start to finish, no type mismatches, and I didn’t spend the next 40 minutes hunting down a variable scope bug that appeared only at runtime. The outline acts as a contract the model keeps checking against – that’s the practical payoff.
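You can make that contract checkable instead of trusting the model to honor it. The sketch below is not the script from my test; the class and method names are hypothetical stand-ins. It compares an implementation’s method signatures against the agreed skeleton using the standard library’s `inspect` module.

```python
import inspect
from typing import List


class ScraperSkeleton:
    """Hypothetical skeleton: signatures agreed on before any implementation."""

    def fetch_page(self, page: int) -> list:
        raise NotImplementedError

    def parse_item(self, raw: dict) -> dict:
        raise NotImplementedError

    def save(self, rows: list, db_path: str) -> int:
        raise NotImplementedError


def signature_mismatches(skeleton: type, implementation: type) -> List[str]:
    """List every public method that is missing or deviates from the skeleton."""
    problems = []
    for name, expected in inspect.getmembers(skeleton, inspect.isfunction):
        if name.startswith("_"):
            continue
        actual = getattr(implementation, name, None)
        if not callable(actual):
            problems.append(f"{name}: missing")
        elif inspect.signature(actual) != inspect.signature(expected):
            problems.append(
                f"{name}: expected {inspect.signature(expected)}, "
                f"got {inspect.signature(actual)}"
            )
    return problems
```

Run it after each generated method. A helper that is annotated as returning a tuple where the skeleton promised a dict shows up immediately, rather than 40 lines later at runtime. It only checks declared signatures and annotations, so it will not catch a method that claims to return a dict and does something else.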

This is the practical answer to LeCun’s critique. He’s right that the underlying mechanism is fragile. You can either wait for AMI Labs to ship something, or you can structure your prompts around what we already know NTP can’t do.

The training side is already moving

Worth knowing, because it shapes what you’ll be using in 12 months. DeepSeek didn’t wait for the debate to resolve. Multi-Token Prediction (MTP) is a training objective that DeepSeek adopted for DeepSeek-V3 (technical report, December 2024), building on earlier multi-token prediction research: the model learns to predict multiple future tokens at each position during pre-training. Instead of learning to predict only the next token at each position, MTP adds auxiliary prediction heads that predict tokens 2, 3, or more positions ahead.
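To make the difference concrete, here is a toy sketch of what the training targets look like. It only shows the data side (which tokens each position is asked to predict) and says nothing about how DeepSeek-V3 wires up its prediction modules. The function name is made up for illustration.

```python
from typing import List, Sequence, Tuple


def mtp_targets(tokens: Sequence, k: int) -> List[Tuple[list, list]]:
    """Pair each prefix with the next k tokens it should predict.

    With k=1 this is ordinary next-token prediction. Positions near the
    end that do not have k tokens after them are dropped.
    """
    pairs = []
    for i in range(len(tokens) - k):
        prefix = list(tokens[: i + 1])
        targets = list(tokens[i + 1 : i + 1 + k])
        pairs.append((prefix, targets))
    return pairs


if __name__ == "__main__":
    sentence = ["the", "cat", "sat", "on", "mat"]
    for prefix, targets in mtp_targets(sentence, k=2):
        print(prefix, "->", targets)
```

With `k=2`, the prefix `["the"]` has to account for both `"cat"` and `"sat"`, which is the "pre-planning" pressure that plain next-token training never applies.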

Per that same report: because the acceptance rate of MTP1 is above 80%, speculative decoding reaches about 1.8× speedup in generation throughput. Faster, and probably more coherent too – because the model is forced to "pre-plan" its representations during training rather than stumbling forward one token at a time.
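The speedup number is not magic; it falls out of simple arithmetic. A back-of-the-envelope sketch, assuming each drafted token is accepted with the same independent probability and that checking drafts costs nothing extra (both are simplifications, so treat the result as an upper bound). The function name is hypothetical.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_tokens: int = 1) -> float:
    """Expected tokens emitted per decoding step with speculative drafts.

    One token is always produced. Draft j only counts if every earlier
    draft in the same step was accepted too.
    """
    expected = 1.0
    survive = 1.0
    for _ in range(draft_tokens):
        survive *= acceptance_rate
        expected += survive
    return expected


if __name__ == "__main__":
    for rate in (0.5, 0.8, 0.9):
        print(rate, expected_tokens_per_step(rate))
```

With one drafted token and an 80% acceptance rate you get 1.8 tokens per step, which lines up with the reported throughput gain. Real deployments pay some overhead for verification, so measured numbers depend on the serving stack.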

Approach               | What it predicts                     | Status (as of early 2026)
Standard NTP           | Next 1 token                         | GPT-4, Claude, Gemini (mostly)
Multi-Token Prediction | Next k tokens (k=2-4 typical)        | DeepSeek-V3, in production
JEPA / world models    | Abstract representations, not tokens | AMI Labs research, no product shipped yet

The interesting takeaway isn’t that NTP is doomed. It’s that the pure version is already being patched, quietly, inside the models you use.

Community reactions worth filtering

The Twitter and Substack reactions to the AMI announcement split into two camps. One: "LeCun was right all along, LLMs are toast." Two: "He raised a billion dollars on a slide deck with no product." Both are overconfident.

The honest read, from AMI’s own CEO Alexandre LeBrun in his TechCrunch interview: "My prediction is that ‘world models’ will be the next buzzword. In six months, every company will call itself a world model to raise funding." When the CEO is openly hedging, take the breathless commentary with a grain of salt.

FAQ

Does this mean ChatGPT and Claude are obsolete?

No. They’re production tools that work well for most text tasks. The critique is about ceiling, not floor.

If I’m a developer building on the OpenAI or Anthropic API, should I change anything?

Yes – specifically for long-form structured outputs. If you’re generating anything over ~500 tokens that needs internal consistency (a report, a multi-file codegen, an agent plan), break it into stages with verification between them. Generate the schema first, validate it, then generate content per field. This is the same idea as the outline-first prompt but enforced in code. You’ll see fewer downstream failures and your token bill might actually drop because retries get cheaper.
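A minimal sketch of that staged pipeline, under stated assumptions: `generate` is a hypothetical stand-in for your own wrapper around the OpenAI or Anthropic client (it is not a real SDK function), and the "schema" here is just a JSON array of section names to keep the example short.

```python
import json
from typing import Callable, Dict, List


def generate_staged(
    task: str,
    generate: Callable[[str], str],  # hypothetical: prompt in, model text out
    required_fields: List[str],
    max_retries: int = 2,
) -> Dict[str, str]:
    """Stage 1: get and validate a schema. Stage 2: fill one field per call."""
    schema = None
    for _ in range(max_retries + 1):
        raw = generate(f"{task}\n\nReturn ONLY a JSON array of section names.")
        try:
            candidate = json.loads(raw)
        except json.JSONDecodeError:
            continue  # cheap retry: only the short schema call is repeated
        is_string_list = isinstance(candidate, list) and all(
            isinstance(name, str) for name in candidate
        )
        if is_string_list and set(required_fields) <= set(candidate):
            schema = candidate
            break
    if schema is None:
        raise ValueError("Model never produced a valid schema.")

    # Each field is generated against the validated schema, one call per field.
    return {
        field: generate(
            f"{task}\n\nSections: {schema}\nWrite ONLY the section '{field}'."
        )
        for field in schema
    }
```

The retry loop is where the savings come from: a failed schema costs a few dozen tokens to redo, while a failed one-shot report costs the whole report. You would likely want a per-field validation step as well; this sketch leaves that out.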

Are world models actually going to replace LLMs?

Unclear, and anyone telling you with certainty is selling something. World models are aimed at physical-world tasks – robotics, video understanding, planning under uncertainty – where LLMs are weakest. For document-shaped problems (writing, coding, summarization), LLMs are likely to keep winning, possibly with MTP or similar tweaks bolted on. The two approaches may end up complementary rather than competitive. No comparable benchmarks exist yet – AMI Labs has no product as of March 2026.

Try this next

Open whatever model you used yesterday. Take a long task you struggled with – one where the output drifted or contradicted itself. Re-run it with the outline-first pattern: numbered structure first, then expand one section at a time with explicit anchors back to the outline. If it works, you’ve just routed around the NTP limitation that a billion dollars of venture capital is trying to solve at the architecture layer.