
Amateur Solves Erdős Problem with ChatGPT: How to Try It

An amateur just used ChatGPT Pro to crack a 60-year-old Erdős problem. Here's how to actually replicate the 'vibe maths' workflow yourself.

10 min read · Beginner

The story is everywhere this week: a 23-year-old with no math training feeds an open Erdős problem into ChatGPT, and out comes a proof that 60 years of professional mathematicians missed. Scientific American ran the headline. Hacker News lit up. Terence Tao weighed in. The reaction has been a mix of awe, skepticism, and a lot of people quietly opening a ChatGPT tab.

So let’s get past the hero story and talk about what you can actually do with this. Because the interesting part isn’t that a ChatGPT user solved an Erdős problem – it’s how, and what conditions made the run produce something novel instead of the usual confident-sounding nonsense.

Two ways to copy this workflow (one is clearly better)

Approach A: Open ChatGPT, paste a famous unsolved problem, ask for a proof, hit send. This is what most people will do after reading the news. It almost never works – and when it produces something that looks like a proof, you have no way to tell if it’s real.

Approach B: Pick a problem from a curated list (so the bar isn’t “unsolved by all of humanity”), give the model an explicit constraint to not search the web, run the same prompt several times, and have someone competent check the output. This is closer to what actually happened with Erdős #1196.

Approach B wins for a simple reason. The prompt that ended up on Hacker News explicitly told the model not to search the internet and framed the task as crafting a non-trivial, novel and creative proof for a number theory problem on primitive sets. That instruction isn’t decoration. Per Tao’s tracker, AI tools have helped move about 100 Erdős problems into the “solved” column since October 2025 – and the bulk of that count has been a souped-up literature search. Forcing the model offline is what flips it from “find the answer in training data” to “actually try to derive something.”

What actually got solved (the short version)

Erdős #1196 is a 1968 conjecture of Erdős, Sárközy, and Szemerédi about primitive sets supported on large integers. Prior work had progressively tightened the relevant sum. The new GPT-5.4 Pro proof, prompted by Liam Price, shows that for any primitive set A, the sum of 1/(a log a) over the elements a > x is at most 1 + O(1/log x), closing the gap to the conjectured bound of 1. The proof has since been formally verified in the Lean proof assistant, so this isn’t vibes. (A self-contained note on the proof, which works with von Mangoldt weights and a divisibility-chain framing, is hosted at ulam.ai.)
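In symbols, that is the same claim as the sentence above, with the sum running over the elements of A exceeding x:

\[
\sum_{\substack{a \in A \\ a > x}} \frac{1}{a \log a} \;\le\; 1 + O\!\left(\frac{1}{\log x}\right),
\]

with 1 being the bound the 1968 conjecture predicts.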

What made it possible is worth sitting with for a second. Decades of researchers, collectively, had steered toward the same opening move: transfer the problem into a continuous setting. Clean, classical, reasonable. The AI runs never made that move. They stayed in the discrete world and found tools – LYM-inequality-related machinery – that the literature had essentially stopped reaching for. Tao’s framing: “the literature had managed to focus on a somewhat suboptimal approach” and “the AI runs consistently stayed in the discrete world.” That’s not the AI being smarter. That’s the AI having no habits to break.

Which raises an honest question: how many other open problems are stuck not because the math is too hard, but because everyone learning the field absorbs the same wrong first instinct? We don’t know. That’s the genuinely interesting thing about this result – not the tool, but what the tool revealed about the literature.

Hands-on: the ChatGPT Erdős problem workflow you can actually run

Here’s the workflow, distilled from what Price and others on the erdosproblems.com forum appear to have done. You need a ChatGPT Pro subscription for the GPT-5.4 Pro “thinking” model – the regular tier won’t reproduce this.

Step 1 – Pick a problem from the open list

Go to erdosproblems.com and filter for problems still marked open. Don’t pick one with a $1,000 prize attached – those are open for a reason. Look for ones flagged as approachable, recently discussed, or in an area with active tooling (combinatorics, number theory).

Step 2 – Use a constraint-heavy prompt

Here’s a clean version of the structure that has been working:

You will work on a math research problem. Do NOT search
the internet or rely on retrieval. Do not assume the
problem has a known solution.

This is a test of whether you can craft a non-trivial,
novel proof from first principles. Provide a full
unconditional proof or a clear disproof.

Problem:
[paste the exact statement, with all definitions]

If you cannot prove the full statement, prove the
strongest partial result you can, and explicitly mark
any unverified steps.

Two pieces of that prompt are doing real work. The “do not search” line stops the model from regurgitating existing literature dressed up as a proof – which is how most of the “AI solved an Erdős problem” claims from late 2025 turned out to be re-discoveries of published proofs. The “mark unverified steps” line is what makes the output checkable later. Without it, the model will paper over its weakest link with confident prose.

Step 3 – Let it think. Seriously, let it think.

Price’s single run took roughly 80 minutes of AI reasoning. If your model finishes in 30 seconds, you’re not on the thinking variant – you’re on the regular one, and it’s not going to solve anything novel.

Step 4 – Run it more than once

This is the step every other tutorial skips. Tao himself raised the survivor-bias issue on the Erdős forum: thousands of model instances are being thrown at these problems, only the successes get reported, and one useful experiment would be running additional instances on the same problem with internet access disabled to test how reproducible the chain of thought actually is. Run your prompt 3-5 times in fresh chats. Compare. If the model proposes wildly different approaches each time, that’s a signal it’s hallucinating direction. If it converges on similar structure, that’s a weak signal something real is there.
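If you would rather script those repeat runs than open fresh chats by hand, a minimal sketch with the OpenAI Python client looks something like the following. Treat the model identifier as a placeholder: the run in the news used the GPT-5.4 Pro thinking model inside the ChatGPT UI, and whether an equivalent model is exposed through the API (and what it is called there) is an assumption here, not something the reporting confirms.

import os
from openai import OpenAI

MODEL = "gpt-5.4-pro"   # placeholder -- use whatever thinking-tier model your account exposes
N_RUNS = 5              # one fresh context per run, mirroring "separate chats"

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompt = """You will work on a math research problem. Do NOT search
the internet or rely on retrieval. Do not assume the problem has a
known solution.

Problem:
[paste the exact statement, with all definitions]
"""

attempts = []
for i in range(N_RUNS):
    # Each call is an independent attempt; no chat history is shared between runs.
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    attempts.append(response.choices[0].message.content)

# Compare the openings by eye: convergent structure across runs is the weak
# positive signal; five unrelated strategies is the hallucination signal.
for i, text in enumerate(attempts, 1):
    print(f"--- run {i} ---")
    print(text[:600])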

Step 5 – Find someone who can actually check it

Jared Duker Lichtman’s blunt assessment: “The raw output of ChatGPT’s proof was actually quite poor. So it required an expert to kind of sift through and actually understand what it was trying to say.” He and Tao later shortened the proof to better distill the model’s key insight. If you don’t have a domain expert in your contacts, post the output on a relevant forum and ask. The Erdős Problems site has a comment section under each problem.

Pro tip: Before posting anything publicly as a “solution,” paste the proof back into a fresh ChatGPT session and ask: “Find the weakest step in this proof. Where would a referee push back?” The model is much better at finding holes in arguments than producing whole proofs.
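If you are scripting through the API anyway, the same adversarial pass is a one-call sketch. It runs in a fresh session with nothing carried over from the run that produced the proof; the file name is just a stand-in for wherever you saved the candidate output, and the model name carries the same caveat as the sketch in Step 4.

import os
from openai import OpenAI

MODEL = "gpt-5.4-pro"   # placeholder model identifier, as above

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# The output you want stress-tested, saved from an earlier run.
candidate_proof = open("candidate_proof.txt").read()

review = client.chat.completions.create(
    model=MODEL,
    messages=[{
        "role": "user",
        "content": "Find the weakest step in this proof. Where would a referee "
                   "push back?\n\n" + candidate_proof,
    }],
)
print(review.choices[0].message.content)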

Common pitfalls

  • Skipping the “no search” instruction. Without it, the model often quietly retrieves a known result and presents it as fresh work.
  • Trusting confidence. The model’s prose tone is identical when it’s right and when it’s wrong. There is no built-in uncertainty signal in a long proof output.
  • Running on the wrong tier. The non-thinking GPT models will produce something that looks like a proof in 10 seconds. It will be wrong. The thinking variants are the only ones with a track record here.
  • Treating one good run as validation. A single chain of reasoning that ends in a plausible-looking conclusion isn’t a proof. It’s a draft.

Performance and cost

As of mid-2026, ChatGPT Pro runs $200/month and includes up to 3,000 “thinking” messages per week with access to GPT-5.4 Pro (check OpenAI’s pricing page – these numbers shift). That sounds like enough headroom to attempt hundreds of problems. It isn’t. One serious run can chew through 80 minutes of compute, and you’ll want multiple runs per problem. Realistic throughput: maybe 20-40 substantive attempts per week if you’re disciplined about it.
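To put rough numbers on that: at 80 minutes of reasoning per run, 30 runs in a week is already about 40 hours of model thinking time, and at five runs per problem that covers only six problems. Add your own time reading and comparing the outputs, and the binding constraint is patience and checking bandwidth long before the 3,000-message cap.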

And the hit rate is the part nobody puts in headlines.

Benchmark                      GPT-5.2 score
Competition mathematics        77%
Open-ended research problems   25%

That 77%-vs-25% gap – per byteiota’s reporting on the GPT-5.2 benchmarks – means three out of four serious attempts at genuinely open problems will fail. Plan accordingly. The model doesn’t tell you which three.

When NOT to use this approach

Some categories where this workflow is a waste of time:

  • Famous open problems with prize money. If Riemann or P vs NP could fall to one prompt, it would have happened in the first week of GPT-5. The selection effect is brutal.
  • Problems requiring heavy symbolic computation. ChatGPT can describe a calculation; it cannot reliably perform a 10,000-term symbolic manipulation. Use a CAS for that (see the short sketch after this list).
  • Anything you cannot get independently verified. If there is no Lichtman-equivalent in your network, you have no way to distinguish a real proof from a beautiful-sounding one.
  • Problems where the literature is sparse or in a niche language. The model’s prior knowledge of obscure subfields is thin, and offline reasoning from scratch in a sparse area rarely produces anything correct.
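On the symbolic-computation point above: a computer algebra system is the right tool for exact grinding, and the model’s job is only to propose and interpret. A minimal SymPy sketch of that division of labor (the identity here is just a stand-in example, not anything from the Erdős problem):

import sympy as sp

n, k = sp.symbols('n k', positive=True, integer=True)

# Let the CAS verify an exact identity the model proposes, instead of
# trusting a hand-waved expansion inside the chat transcript.
claimed = n * (n + 1) * (2 * n + 1) / 6
difference = sp.simplify(sp.summation(k**2, (k, 1, n)) - claimed)
print(difference)   # 0 means the closed form holds exactly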

The thing that made #1196 work was specific: it was a problem where the existing literature had collectively taken a wrong turn, and the AI happened to stay on the path where the right tools lived. That’s a structural setup, not a generic one. Most open problems don’t have a hidden “everyone went the wrong way” gap waiting to be exploited – or if they do, we can’t identify which ones in advance. That’s the honest limit of this whole approach.

FAQ

Did ChatGPT really solve the problem on its own?

Not exactly. The novel move came from the model; Lichtman and Tao had to rewrite the proof for it to be usable. Call it a collaboration where the AI supplied the idea nobody else tried, and the humans made it rigorous.

Can I do this with the free version of ChatGPT?

You can try the prompt structure on toy problems – free tier is fine for that. But the run that made the news used GPT-5.4 Pro on an extended-thinking session (roughly 80 minutes of reasoning). The free tier doesn’t give you that model. If you want to seriously attempt open problems rather than just poke at the format, the Pro tier (as of mid-2026, $200/month) is the floor. One more thing: even on Pro, the 3,000 thinking messages/week cap sounds generous until you’re burning 80 minutes per run and want five attempts at the same problem.

Is this going to replace mathematicians?

The 25% score on open-ended research problems is the honest answer. Tao flagged the survivor-bias problem publicly the same week the news broke – we’re seeing the wins, not the thousands of failed runs sitting quietly in someone’s chat history. What’s changed is narrower than the headlines suggest: an interested amateur can now contribute a useful seed of an idea. Turning that seed into a verified proof still requires a domain expert, and probably always will. The Lean verification step alone took human mathematicians to set up and interpret. “Replacement” is the wrong frame. “New kind of first draft” is closer.

Next step: open erdosproblems.com, pick one problem from the open list that looks approachable to you, and run the prompt template above three times in separate chats. Compare the outputs. That’s the actual entry point – not reading another article about Liam Price.