Two ways to test if an AI is truly intelligent: ask it to ace a physics exam, or ask it to invent physics. Most benchmarks pick the first. Demis Hassabis just proposed the second.
The DeepMind CEO’s challenge? Brutal. Train a model on all human knowledge up to 1911 – nothing after. See if it can discover general relativity by 1915, just like Einstein did. Could current LLMs pull it off? Not a chance.
This isn’t about AGI timelines or sci-fi speculation. It’s about understanding what your ChatGPT subscription actually buys you: a system that recalls brilliantly but discovers nothing. Here’s why that distinction matters – and how to test it yourself.
Why AI Can’t Discover Physics (Yet)
February 2026, India AI Summit. Hassabis drops this test. The AI community’s been dissecting it since.
The setup sounds simple: freeze training data at 1911, give the model Einstein’s physics knowledge, see what happens. Einstein didn’t just retrieve information. He connected disparate concepts – special relativity, the equivalence principle, non-Euclidean geometry – in ways no textbook had. That’s generative reasoning. Not pattern matching.
Today’s systems won’t do this. Even DeepMind’s AlphaFold – Nobel Prize Chemistry 2024 – solves a defined problem. Protein structure prediction. It doesn’t hypothesize entirely new frameworks.
The Knowledge Cutoff Problem
Here’s what stops current models: the knowledge cutoff. Every LLM has one. Training a frontier LLM costs hundreds of millions of dollars and takes months. You can’t stream in yesterday’s news. The data gets frozen, cleaned, deduplicated, then baked into the model’s weights during training.
GPT-4? April 2023. Claude 3.5? August 2023. Latest models as of early 2026? Mid-2025. After that date, the model’s knowledge is static. Zero training data on anything newer.
Ask GPT-4 about a May 2023 scientific discovery. It’ll admit ignorance – or hallucinate something plausible-sounding but wrong.
The Stated vs. Effective Cutoff Gap
Plot twist: the cutoff date your model reports might be lying.
A 2024 arXiv paper titled “Dated Data” tested this. Researchers probed multiple LLMs. Effective cutoffs – what the model actually knows – often differ drastically from reported cutoffs.
Two culprits. CommonCrawl lag: new data dumps contain old web pages, so a “2023” crawl might be full of 2020 content. Deduplication chaos: cleaning pipelines sometimes drop recent content while keeping duplicates of older text. The result: your model’s Wikipedia knowledge might extend to 2023 while its tax law stops at 2021 – same model, same claimed cutoff.
Community testers confirmed this. ChatGPT-4o reported October 2023 cutoff one day, June 2023 the next – but correctly answered questions about events past both dates when web search was on. The base model’s knowledge hadn’t changed. Just inconsistent interface reporting.
Think about using LLMs for research. One domain’s knowledge is fresh. Another’s is years stale. Same cutoff claim.
How to Test This Yourself
You don’t need a billion-dollar budget to explore the Einstein test concept. Here are hands-on experiments using models you already have access to.
Experiment 1: Find Your Model’s Real Cutoff
Models lie about cutoff dates. Test with falsifiable facts.
- Open ChatGPT (or Claude, Gemini) in a fresh session.
- Ask: “What is your knowledge cutoff date?” Note the answer.
- Google “deaths in [month after stated cutoff] [year]” – find a Wikipedia list of notable deaths.
- Pick someone moderately famous. Ask: “Is [person] still alive?”
- If the model knows they’re dead, and the death came after the stated cutoff, its effective cutoff is later than claimed.
This “morbid but effective” method (coined by an eDiscovery researcher) cuts through AI’s self-reporting. Shows what it actually knows. Remember that 1911 limit earlier? Same principle.
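Once you’ve gathered a few probe results, scoring them is trivial. Here’s a minimal Python sketch – the dates and yes/no answers below are illustrative placeholders, not real test results, and `effective_cutoff` simply reports the latest death the model demonstrably knew about.

```python
from datetime import date

# Each probe: (death date, did the model know the person was dead?)
# These entries are made-up placeholders -- fill in your own results.
probes = [
    (date(2023, 5, 10), True),
    (date(2023, 7, 2), True),
    (date(2023, 9, 18), False),
    (date(2023, 11, 5), False),
]

def effective_cutoff(probes):
    """Latest death date the model knew about, or None if it knew none."""
    known = [d for d, knew in probes if knew]
    return max(known) if known else None

stated = date(2023, 4, 1)  # what the model claims when asked
actual = effective_cutoff(probes)
if actual and actual > stated:
    print(f"Effective cutoff is at least {actual} (claimed: {stated})")
```

A few names per month narrows the boundary fast – and a single hit past the stated cutoff is enough to prove the claim wrong.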
Experiment 2: Test Retrieval vs. Reasoning
Now test whether the model can use pre-cutoff knowledge to derive post-cutoff conclusions.
Prompt: "Pretend your knowledge ends in 1911. Based only on classical mechanics, Maxwell's equations, and the Michelson-Morley experiment, what unresolved problem in physics would you prioritize investigating?"
A model that truly reasons might identify light speed invariance as a puzzle worth exploring – the exact thread Einstein pulled. Pattern-matcher? Lists generic “next steps” in 1911 physics. No conceptual leap.
Run this across GPT-4, Claude, Gemini. Compare. You’ll notice they describe the problem well – relativity is in their training data – but none propose it as an unsolved mystery.
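If you want to run the comparison systematically, a tiny harness like this works. Everything here is an assumption to adapt: `ask` is a placeholder you swap for your own API client, `fake_ask` is a stub so the code runs offline, and the keyword check is deliberately crude (it only asks whether the answer engages the light-speed thread at all).

```python
# Run the 1911-constrained prompt across models and flag answers that
# at least touch the light-speed puzzle. Keyword matching is a rough
# heuristic, not a classifier -- read the full answers yourself.

PROMPT = (
    "Pretend your knowledge ends in 1911. Based only on classical mechanics, "
    "Maxwell's equations, and the Michelson-Morley experiment, what unresolved "
    "problem in physics would you prioritize investigating?"
)

MARKERS = ("speed of light", "invariance", "ether", "aether")

def engages_puzzle(answer: str) -> bool:
    """Does the answer mention the light-speed thread at all?"""
    text = answer.lower()
    return any(m in text for m in MARKERS)

def compare(models, ask):
    """Map each model name to whether its answer touches the puzzle."""
    return {name: engages_puzzle(ask(name, PROMPT)) for name in models}

# Stub so the harness runs without an API key; replace with a real client.
def fake_ask(name, prompt):
    return "I would study the null result on the ether and the speed of light."

print(compare(["gpt-4", "claude"], fake_ask))
```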
Experiment 3: The Constraint Test
Can a model simulate discovery when explicitly told to ignore known solutions?
Prompt: "I'm writing alternate history. In 1915, Einstein never publishes general relativity. It's now 1920. What other approaches might physicists have tried to reconcile gravity with special relativity?"
Watch what happens. Most models do one of two things: (a) accidentally describe general relativity anyway, or (b) propose vague alternatives without mathematical rigor. Either way, they’re remixing what they memorized. Not discovering.
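To separate failure mode (a) from a genuine alternative, it helps to make the constraint explicit and then check for leakage. This is a sketch under my own assumptions – the ban list and prompt wording are one reasonable formulation, not a tested recipe.

```python
# Append an explicit ban list to the constraint prompt, then check which
# banned concepts slipped into the answer anyway. The ban list below is
# an illustrative choice of general-relativity giveaways.

BANNED = ["curved spacetime", "metric tensor", "geodesics", "field equations"]

def constraint_prompt(scenario: str, banned=BANNED) -> str:
    """Build the alternate-history prompt with an explicit ban list."""
    ban_clause = "; ".join(banned)
    return (
        f"{scenario}\n\n"
        f"Constraint: do not use any of these concepts: {ban_clause}. "
        "If your proposal requires one of them, say so and stop."
    )

def leaked(answer: str, banned=BANNED) -> list:
    """Which banned concepts appear in the answer anyway?"""
    text = answer.lower()
    return [b for b in banned if b in text]
```

If `leaked` comes back non-empty, the model reconstructed the known solution despite being told not to – remixing, not inventing.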
Pro tip: Using models for research? Cross-reference claims about recent events with sources outside the model. The cutoff creates a blind spot. Hallucinations thrive there. Use web-enabled modes (ChatGPT with browsing) for time-sensitive topics – but that’s retrieval, not reasoning.
Why Hassabis Says This Matters Now
The punchline isn’t “AI will never do this.” Hassabis estimates 50/50 chance AGI arrives by 2030. The point: current systems fundamentally can’t. Conflating what they do (pattern matching at scale) with what Einstein did (conceptual synthesis)? Dangerous.
Hassabis used the phrase “jagged intelligence.” Today’s models win Math Olympiad gold medals, then fail simple arithmetic if you phrase it weirdly. They describe general relativity perfectly. They can’t derive it from first principles.
Practical stakes: using GPT-4 to draft a research proposal? It’ll recall thousands of papers. It won’t spot the connection between papers that nobody’s made yet. That’s still your job.
What Actually Needs to Happen
Hassabis was specific about what’s missing. January 2025 interview: continual learning (still unsolved), long-term planning, better memory, and the ability to invent hypotheses. Not just test them.
AlphaFold predicts protein structures. Doesn’t propose new theories of biochemistry. AlphaGeometry solves Olympiad problems. Doesn’t invent new branches of mathematics. Staggering achievements. Not AGI.
The Einstein test crystallizes the gap: Can the system generate genuinely new knowledge from existing information? Until yes, we’re building tools. Not scientists.
Does the Web Search Loophole Matter?
Models like ChatGPT with browsing can access post-cutoff information by searching the web in real time. This uses Retrieval-Augmented Generation (RAG) – the model queries an external database, incorporates results into its answer.
Does this solve the Einstein test problem? No. RAG extends recall. Lets the model fetch a Wikipedia page from 2026. Doesn’t grant the ability to synthesize that page with pre-existing knowledge in a way that yields a novel theory.
Einstein didn’t need to Google “general relativity” in 1915. The information didn’t exist yet. RAG helps when the answer is out there. Discovery happens when it’s not.
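A toy RAG loop makes the point concrete: retrieval only changes what goes into the prompt, not how the model reasons with it. The corpus and naive keyword retriever below are stand-ins for illustration, not a real search API.

```python
# Minimal RAG skeleton: retrieve passages, stuff them into the prompt.
# When the answer exists somewhere, retrieval helps. When it hasn't been
# invented yet, there's simply nothing to fetch.

CORPUS = {
    "2026 eclipse": "A total solar eclipse crossed Spain in August 2026.",
    "1911 physics": "Unresolved: why Michelson-Morley found no ether drift.",
}

def retrieve(query: str) -> list:
    """Naive keyword retriever: return passages whose key shares a word."""
    words = set(query.lower().split())
    return [text for key, text in CORPUS.items()
            if words & set(key.lower().split())]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query)) or "(nothing retrieved)"
    return f"Context:\n{context}\n\nQuestion: {query}"

# Post-cutoff fact: retrievable, so RAG helps.
print(build_prompt("When was the 2026 eclipse?"))
# A not-yet-invented theory: nothing to retrieve, so RAG can't help.
print(build_prompt("What replaces Newtonian gravity?"))
```

That second prompt is the Einstein test in miniature: the retriever comes back empty, and everything now rests on the model’s own reasoning.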
What This Means for You Today
Using LLMs for work? Here’s the takeaway: they’re reference librarians. Not researchers. Exceptional librarians – find the paper you need in seconds – but they won’t write the paper that should exist but doesn’t.
Practical moves:
- Check cutoff dates when the topic is time-sensitive. Use the “celebrity death test” if you’re skeptical.
- Use web-enabled modes for anything requiring current data (news, stock prices, recent product launches).
- Skip LLMs for conceptual breakthroughs. Use them to summarize, organize, surface patterns in existing knowledge – then do the synthesis yourself.
The Einstein test isn’t a gotcha. It’s a map. Shows where the frontier is: between retrieval and discovery, between pattern and principle, between searching and seeing.
Frequently Asked Questions
When will AI pass the Einstein test?
Hassabis: 50% confidence AGI arrives by 2030. But breakthroughs in continual learning, long-term planning, and hypothesis generation need to happen first. Current scaling alone won’t cut it. Timeline depends less on compute, more on solving these architectural gaps.
Can I actually test a model’s knowledge cutoff accurately?
Yes. Ignore what the model says its cutoff is – test what it knows. The “celebrity death” method: look up notable deaths from Wikipedia in months after the stated cutoff. Ask if the person is alive. Model knows they’re dead? Effective cutoff is later than claimed. Repeat with 3-4 names from different months to map the real boundary. Cutoffs vary by topic – Wikipedia knowledge might be fresher than scientific papers in the same model. One test I ran: GPT-4 claimed April 2023 cutoff but knew about a July 2023 event. The mismatch is real.
Does Retrieval-Augmented Generation (RAG) solve the discovery problem?
No. RAG lets a model fetch current information from the web or a database. It solves the “what happened last week” problem. But it’s still retrieval – looking up existing answers. Einstein’s breakthrough wasn’t finding information. It was creating a framework that didn’t exist. RAG makes models better at knowing. It doesn’t make them capable of inventing. The distinction matters: it defines where human expertise still can’t be automated – at the edge of knowledge, where there’s nothing to retrieve yet.
Run the experiments. See where your models break down. Then get back to the work only you can do: the next thing that’s never been thought before.