The Last Six Months in LLMs: What Actually Changed

Simon Willison's PyCon 2026 lightning talk just dropped. Here's what the last six months in LLMs actually mean for your workflow - with a hands-on test you can run today.

Drew Sullivan · 2026-05-19 · 9 min read · Beginner

The supposedly “best” LLM changed hands five times in a single month – November 2025 – between three providers. If you blinked, you missed it. That’s the opening punch of Simon Willison’s five-minute PyCon 2026 lightning talk, which dropped on May 19 and is currently doing laps around AI Twitter, Hacker News, and every developer Slack worth being in.

This piece is the homework after the talk. We’re not going to recap the slides – go watch them. Instead, here’s how to translate the last six months in LLMs into four things you should actually do this week to your own setup.

The quick context (90 seconds)

Simon Willison – Django co-creator, the guy who coined “prompt injection” – has been doing these recap talks for years. The new one covers what he calls the November 2025 inflection point: the month he identifies as a turning point for LLMs, especially for coding.

Two things changed at once. First, the leaderboard went berserk: at the start of November the widely acknowledged best model was Claude Sonnet 4.5, then it was overtaken by GPT-5.1, then Gemini 3, then GPT-5.1 Codex Max, and then Anthropic took the crown back with Claude Opus 4.5. Five models at the top. One month. Second, and more importantly, coding agents went from often-work to mostly-work – crossing a quality barrier where you could use them as a daily driver without spending most of your time fixing their mistakes. Willison puts the success rate at around 9 out of 10, up from something that felt more like a coin flip.

The why is boring but matters: per Willison, OpenAI and Anthropic spent 2025 pouring compute into reinforcement learning against simulated software environments – generate code, run it, see if it works, learn from the result. That’s it. That’s the whole trick. Google and xAI, for their part, largely skipped this step in 2025, which is why Willison says they’re roughly 12 months behind on coding as of mid-2026.

Step 1: Run the pelican test on whatever you’re using right now

Open whatever LLM you pay for. Paste this prompt:

Generate an SVG of a pelican riding a bicycle.

Save the SVG. Open it in a browser. Look at it.

This is the pelican-bicycle benchmark, and Simon’s reasoning is deliberately ridiculous: pelicans are hard to draw, bicycles are hard to draw, pelicans can’t ride bicycles, and there’s zero chance any AI lab would train a model for such a ridiculous task. The point isn’t to score the result – it’s to give you a 30-second vibe check you can repeat every time you switch models. When I ran this across four models in the same session, the variation was striking: one produced a recognizable bird on a two-wheeled shape, two drew blobs with spokes, and one generated valid SVG that rendered as a white rectangle. That spread tells you more than any leaderboard position.

Pro tip: Run the same prompt three times on the same model. If you get wildly different results each time, the model is effectively rolling dice on this task. Three coherent pelicans in a row suggests the model handles the task consistently rather than by luck, which is a hint (not proof) that it will be steadier on harder tasks too.
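If you want a slightly less eyeball-dependent version of this check, here is a minimal Python sketch that catches the “valid SVG, white rectangle” failure mode. It only uses the standard library. The function names, the list of drawing tags, and the three-shape threshold are my own illustrative assumptions, not anything from Willison’s talk, and counting shapes says nothing about whether the result looks like a pelican.

```python
import xml.etree.ElementTree as ET

# Tags that actually draw something. This list is an illustrative choice, not a standard.
DRAWING_TAGS = {"path", "circle", "ellipse", "line", "polygon", "polyline", "rect"}


def summarize_svg(svg_text: str) -> dict:
    """Report whether the SVG parses as XML and how many shapes of each kind it has."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return {"parses": False, "shapes": {}, "total": 0}

    counts: dict = {}
    for element in root.iter():
        # Strip the XML namespace: "{http://www.w3.org/2000/svg}circle" -> "circle".
        tag = element.tag.rsplit("}", 1)[-1]
        if tag in DRAWING_TAGS:
            counts[tag] = counts.get(tag, 0) + 1
    return {"parses": True, "shapes": counts, "total": sum(counts.values())}


def looks_blank(summary: dict, min_shapes: int = 3) -> bool:
    """Flag outputs that fail to parse or draw almost nothing (threshold is arbitrary)."""
    return (not summary["parses"]) or summary["total"] < min_shapes
```

Run it over the three outputs from the same model: if one is flagged as blank and the other two have dozens of shapes, that is the dice-rolling behavior described above. You still have to open the files and look at them.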

There’s something worth sitting with here. The pelican test has no right answer. No rubric. No score out of 100. In a field drowning in leaderboards and percentage-point debates, a drawing of a bird on a bike is almost a philosophical protest – what does “better” even mean when the task is this open-ended? That discomfort is useful. It’s a reminder that most of what we actually use these models for is equally hard to score.

Step 2: Pick the right model for the right job (not the leaderboard winner)

Reading a benchmark, picking #1, and using it for everything – that’s the biggest mistake right now. The post-November reality is messier. Here’s a rough cheat sheet based on what Willison and other practitioners have been saying:

Task	What to reach for	Why
Writing/refactoring code agentically	Claude or OpenAI Codex models	Anthropic and OpenAI spent 2025 on RL loops for code; xAI and Gemini largely didn’t, and are roughly 12 months behind as of mid-2026 (per Willison’s Heavybit podcast)
Long-context reasoning, multimodal	Gemini 3 / 3.1 Pro	Gemini 3.1 Pro drew a pelican with a fish in its basket in February 2026 – but pelicans aren’t everything. Google’s strengths are in context length and multimodal tasks.
Running offline / privacy	Gemma 4, Qwen3.6, GLM-5.1 if you have the hardware	Gemma 4 is the most capable open-weight model from a US company; GLM-5.1 is a 1.5TB open-weight model from Chinese lab Z.ai – very effective if you can afford the hardware

On local models specifically: the two main themes of the past six months are that coding agents got reliable, and that laptop-available models – while a lot weaker than the frontier – have started punching well above their weight class. A 16.8GB file on your laptop in April 2026 (Qwen3.6-27B) produced what Willison called his best-ever pelican from a local model. That didn’t exist as a sentence you could write in late 2024.
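To see why a 27B model fits in 16.8GB at all, a rough calculation helps. The sketch below assumes that “27B” means about 27 billion parameters, that 1 GB is 10**9 bytes, and that the file is almost entirely weights. Those are my assumptions for illustration; the source does not say how the file was quantized.

```python
def bits_per_weight(file_size_gb: float, params_billions: float) -> float:
    """Average bits stored per parameter, treating 1 GB as 10**9 bytes."""
    total_bits = file_size_gb * 1e9 * 8
    total_params = params_billions * 1e9
    return total_bits / total_params


def size_at_16_bits_gb(params_billions: float) -> float:
    """File size in GB if every parameter were stored in 16 bits (2 bytes)."""
    return params_billions * 2


print(round(bits_per_weight(16.8, 27), 2))  # → 4.98
print(size_at_16_bits_gb(27))               # → 54
```

Under those assumptions the file averages roughly 5 bits per weight, against about 54GB for the same model at 16 bits. That compression is what gets it onto a laptop.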

Step 3: Audit your agent setup for the lethal trifecta

This is the part most “six months in LLMs” posts skip, and it’s the most important one if you’ve been bolting MCP servers onto your editor.

Willison’s lethal trifecta (coined June 2025) is three capabilities that, when combined in a single agent, create a reliable data-theft path:

Access to private data – your inbox, repo, files, database
Exposure to untrusted content – anything an attacker can send you (email, web page, issue comment, PDF)
An exfiltration vector – the ability to make HTTP requests, post comments, send messages, or render images with attacker-controlled URLs

Pull up the list of MCP servers connected to your coding agent. If any single agent context contains all three, you’re holding a loaded gun. The unsettling part: Willison has been flagging this risk for years, and as of 2026 no headline-grabbing incident has occurred – not because the attack is hard, but because it hasn’t yet been worth executing at scale.

Translation: nothing is protecting you except attacker apathy. Cut one leg of the trifecta per agent. Usually the easiest one to cut is the exfiltration vector – don’t give the same agent both “reads my email” and “can fetch arbitrary URLs.”
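The audit itself can be sketched in a few lines of Python. This assumes you tag each tool’s capabilities by hand after reading what it does. The `Tool` class, the capability labels, and the example tool names are hypothetical and not part of MCP or any agent framework.

```python
from dataclasses import dataclass

PRIVATE_DATA = "private_data"
UNTRUSTED_CONTENT = "untrusted_content"
EXFILTRATION = "exfiltration"
TRIFECTA = frozenset({PRIVATE_DATA, UNTRUSTED_CONTENT, EXFILTRATION})


@dataclass(frozen=True)
class Tool:
    """One tool or server in an agent's context, tagged by hand."""
    name: str
    capabilities: frozenset


def missing_legs(tools) -> frozenset:
    """Return the trifecta legs that this set of tools does NOT cover."""
    covered = set()
    for tool in tools:
        covered |= tool.capabilities
    return TRIFECTA - covered


def has_lethal_trifecta(tools) -> bool:
    """True when the combined tools cover all three legs."""
    return not missing_legs(tools)


# Hypothetical example: an inbox reader plus a URL fetcher in one agent context.
email_reader = Tool("email_reader", frozenset({PRIVATE_DATA, UNTRUSTED_CONTENT}))
url_fetcher = Tool("url_fetcher", frozenset({UNTRUSTED_CONTENT, EXFILTRATION}))

print(has_lethal_trifecta([email_reader, url_fetcher]))  # → True
print(has_lethal_trifecta([email_reader]))               # → False
```

The check is per agent context, not per tool: neither hypothetical tool is dangerous alone, but together they cover all three legs. The hard part is the honest tagging, which no script can do for you.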

Common pitfalls to avoid

Don’t take the pelican benchmark too seriously. It’s now actively gamed. A repo called scosman/pelicans_riding_bicycles exists specifically to pollute the training set, and Willison admits most of his own published examples count as poisoning too. Google’s Jeff Dean tweeted a video of an animated pelican riding a bicycle alongside Gemini 3.1 in February 2026 – the labs are clearly paying attention. Use the pelican as a vibe check, not a leaderboard.

Don’t pick a “best model” and stick with it for six months. The crown changed hands five times in one month. The skill you want isn’t “choose the best LLM” – it’s “switch quickly when something better drops.” Use a router (OpenRouter, LiteLLM, or Simon’s own llm CLI) so changing providers is one config line, not a refactor.
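Here is what “one config line” can look like, as a minimal Python sketch. It deliberately calls no real router API: the task names, the model ID strings, and the `echo_backend` stand-in are placeholders I made up, and in practice the backend would wrap whichever router you picked.

```python
from typing import Callable, Dict

# Hypothetical task -> model mapping. The model ID strings are placeholders.
MODEL_FOR_TASK: Dict[str, str] = {
    "coding": "provider-a/coding-model",
    "long_context": "provider-b/long-context-model",
}

# A backend is any function that takes (model_id, prompt) and returns text.
Backend = Callable[[str, str], str]


def complete(task: str, prompt: str, backend: Backend) -> str:
    """Look up the model for a task and hand the prompt to the backend."""
    model_id = MODEL_FOR_TASK[task]
    return backend(model_id, prompt)


def echo_backend(model_id: str, prompt: str) -> str:
    """Stand-in backend for local testing. A real one would call your router."""
    return f"[{model_id}] {prompt}"


print(complete("coding", "Refactor this function", echo_backend))
# → [provider-a/coding-model] Refactor this function
```

When something better drops, you edit one value in `MODEL_FOR_TASK` and nothing else in your code has to know.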

Don’t assume “agent” means “reliable”. Mostly-works is not always-works. November was the point where you could start using a coding agent as a reliable partner and get working code about 9 out of 10 times. That tenth time can still nuke a branch.
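A back-of-the-envelope calculation shows how quickly that tenth time catches up with you. The sketch assumes each task succeeds independently with probability 0.9. That is my simplification of the “9 out of 10” figure, not a model Willison proposes, and real failures are rarely independent.

```python
def chance_all_succeed(per_task_success: float, tasks: int) -> float:
    """Probability that every one of `tasks` independent tasks succeeds."""
    return per_task_success ** tasks


def expected_failures(per_task_success: float, tasks: int) -> float:
    """Average number of failed tasks out of `tasks` independent attempts."""
    return (1 - per_task_success) * tasks


for n in (1, 5, 10):
    print(n, round(chance_all_succeed(0.9, n), 2))
# → 1 0.9
# → 5 0.59
# → 10 0.35
```

Under these assumptions, a run of ten unreviewed agent tasks comes through clean only about a third of the time, which is the case for reviewing every change before it lands.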

What the results actually look like in practice

The local-model story is the one the cloud providers don’t want to talk about. In April 2026, Qwen3.6-27B – a 16.8GB download – gave Willison his best-ever pelican from any local model. That’s a concrete result: a file that fits on a mid-range laptop producing output that would have required a paid API call in late 2024. The gap is closing faster than the benchmarks show, because the benchmarks are measuring things labs optimize for, and local models are getting good at the things nobody bothered to test.

When NOT to use this stuff

Three situations where the November inflection point doesn’t help you:

Anything that touches money or production data without a human in the loop. 9-out-of-10 reliability is not 10-out-of-10. A trading bot, a payment processor, an auto-merging deploy pipeline – wrong target.
Tasks where you can’t read the output and tell if it’s right. If you don’t know enough to spot the bug, the agent’s confidence will smother your doubt.
Anything involving the lethal trifecta on actual private data. If you wouldn’t paste your inbox into a public forum, don’t connect it to an agent that can also browse the web.

Why does this feel different from previous “AI is here” moments? Maybe because the change finally happened on the boring axis: not flashier demos, just fewer broken outputs. That’s the kind of progress nobody throws a launch event for, and it’s also the kind that quietly reorganizes how work gets done.

FAQ

Do I need to watch the whole Simon Willison talk?

It’s five minutes. Watch it.

Is the pelican benchmark actually meaningful or is it a joke?

Both, on purpose. The joke is the point – it’s a task no lab would deliberately train on, which makes it a clean test of general capability. But as of 2026 the labs clearly are paying attention (Jeff Dean’s animated pelican video in February is hard to read any other way), so treat a great pelican as evidence the lab cares about Simon’s blog, not necessarily that the model is best-in-class. Use it for first impressions, not final decisions.

If I only do one thing from this article, what should it be?

Audit your agent for the lethal trifecta. Model choice is reversible; a data exfiltration incident isn’t.

Next action: open the LLM you used most this week, paste the pelican prompt, save the SVG, and start a folder. Re-run it every time you change models. In six months you’ll have your own visual diary of how fast this stuff is moving – which, judging by the last six months, is faster than any tutorial can keep up with.
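If you would rather script the folder than drag files around, here is a minimal sketch using only the standard library. The `save_pelican` name and the date-plus-model filename scheme are my own conventions, chosen for illustration.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def save_pelican(svg_text: str, model_name: str, folder: Path,
                 day: Optional[date] = None) -> Path:
    """Write one SVG into the diary folder as <date>_<model>.svg and return its path."""
    day = day or date.today()
    # Replace characters that are awkward in filenames (slashes, colons, spaces).
    safe_model = "".join(c if c.isalnum() or c in "-._" else "_" for c in model_name)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{day.isoformat()}_{safe_model}.svg"
    path.write_text(svg_text, encoding="utf-8")
    return path
```

ISO dates sort chronologically in any file browser, so the folder reads as a timeline. Note that a second run for the same model on the same day overwrites the first, so add a run suffix if you are doing the three-run check from Step 1.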