Can your AI agent figure out how to win a game it’s never seen, with zero instructions?
ARC-AGI-3 dropped on March 25, 2026. Humans: 100%. Best AI: 0.37%. Released at Y Combinator with a fireside chat between François Chollet and Sam Altman, the benchmark isn’t just hard – it’s a reset.
Why Everyone’s Talking About This Benchmark
ARC-AGI-3 is the first fully interactive reasoning benchmark. Unlike static puzzles where you show an AI a pattern and ask it to complete the next one, this drops agents into turn-based games with their own hidden rules. No tutorial. No win condition stated. The agent explores, hypothesizes, and adapts – on the fly.
Per the technical paper (arXiv:2603.24621v1, March 2026), hundreds of handcrafted environments exist, each with 8-10 levels that progressively introduce new mechanics. Three went public during the preview: ls20 (map navigation with symbol transformations), ft09 (pattern matching across overlapping grids), and vc33 (volume adjustment to match target heights). François Chollet created the original ARC benchmark in 2019 to measure the gap between memorization and actual learning. Current AI systems pattern-match at scale, but they can’t reason through something they’ve never encountered.
The Scoring System That Punishes Brute Force
RHAE (Relative Human Action Efficiency): (human actions / AI actions)².
Human solves in 10 steps, your agent takes 100? You get 1%, not 10%. Take 200? 0.25%. Take 500? 0.04%.
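The worked examples above can be reproduced in a few lines (a from-scratch illustration of the published formula, not the official scorer):

```python
def rhae(human_actions: int, ai_actions: int) -> float:
    """Relative Human Action Efficiency: (human actions / AI actions), squared."""
    return (human_actions / ai_actions) ** 2

# Human baseline of 10 actions vs. increasingly wasteful agents:
for ai_actions in (100, 200, 500):
    print(f"{ai_actions} actions -> {rhae(10, ai_actions):.2%}")
# prints: 100 actions -> 1.00%, 200 actions -> 0.25%, 500 actions -> 0.04%
```

Note how halving your action count quadruples your score – the quadratic curve rewards every efficiency gain twice over.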
This squared penalty blocks brute force. Earlier agents could try every possible operation until one stumbled on the answer; now that strategy obliterates your score. The human baseline is the second-best performer out of ten first-time players per environment – filtering outliers while keeping a realistic reference (as of the March 2026 scoring docs).
Pro tip: Optimize for action efficiency from day one, not just task completion. The scoring curve is quadratic – small inefficiencies compound fast.
Think about how you’d approach a new board game. You don’t try every possible move. You test one thing, notice what changes, form a theory, test that. If your AI takes 50x more attempts than you would, it’s not learning – it’s guessing with compute.
What Actually Works (And What Doesn’t)
Preview competition ran July 18 – August 19, 2025. Top three: all non-LLM.
StochasticGoose (12.58%) – CNN with RL that predicts which actions cause frame changes; a four-layer convolutional network encoded the 64×64 frames.
Blind Squirrel (6.71%) – directed state graphs built from observed frames.
Third place – graph-based exploration solving a median of 30 out of 52 levels across 6 games (per the technical report).
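The directed-state-graph idea is simple to sketch from scratch: hash each observed frame into a node, label edges with the actions that connected them, and BFS to the nearest state with an untried action. This is an illustration of the general technique only, not Blind Squirrel's actual code:

```python
from collections import deque

class StateGraph:
    """Directed graph over observed frames, edges labelled by actions
    (a from-scratch sketch of the technique, not Blind Squirrel's code)."""

    def __init__(self):
        self.edges = {}  # frame key -> {action: next frame key}

    @staticmethod
    def key(frame):
        # Frames are 2D lists of ints; hash a tuple-of-tuples view.
        return hash(tuple(map(tuple, frame)))

    def record(self, frame, action, next_frame):
        self.edges.setdefault(self.key(frame), {})[action] = self.key(next_frame)

    def plan(self, frame, actions):
        """BFS for the shortest action sequence ending in an untried action."""
        start = self.key(frame)
        seen, queue = {start}, deque([(start, [])])
        while queue:
            node, path = queue.popleft()
            tried = self.edges.get(node, {})
            for a in actions:
                if a not in tried:
                    return path + [a]  # reach this node, then try something new
            for a, nxt in tried.items():
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [a]))
        return []  # every reachable state is fully explored
```

Because the plan always targets the *nearest* unexplored state, the agent spends few actions per new observation – exactly what RHAE rewards.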
Frontier LLMs: Gemini 3.1 Pro 0.37%. GPT 5.4 at 0.26%. Opus 4.6 at 0.25%. Grok-4.20 scored 0.00% (all as of March 2026).
The winner’s Medium post explains why: average gameplay stretches into hundreds of steps = hundreds of thousands of tokens. A CNN processing visual grids frame-by-frame is just more efficient for this task.
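The token-overhead claim is easy to sanity-check with back-of-envelope arithmetic; the per-frame token counts below are assumptions for illustration, not figures from the winner's post:

```python
# All numbers below are assumed for illustration, not measured figures.
cells = 64 * 64            # 4,096 cells per observation frame
steps = 300                # "hundreds of steps" per game
naive = cells * steps      # one token per cell, no compression
compact = 1_000 * steps    # ~1k tokens per frame with a compressed encoding

print(naive, compact)      # 1228800 300000
```

Even the compressed case lands in the hundreds of thousands of tokens per game – context an LLM must re-read on every turn, while a CNN just consumes the raw grid.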
How to Get Started Building an Agent
Official toolkit, MIT license on GitHub:
```shell
pip install arc-agi
# or
uv add arc-agi
```
You’ll need an ARC_API_KEY. Unlike ARC-AGI-1 and ARC-AGI-2 (fully offline datasets), version 3 requires an API key even for local execution. Register at three.arcprize.org. No key? Anonymous access – but you won’t get all games at release.
Basic agent structure:
```python
import arc_agi
from arcengine import GameAction

arc = arc_agi.Arcade()
env = arc.make("ls20", render_mode="terminal")

for _ in range(10):
    env.step(GameAction.ACTION1)  # take ten identical moves, just to see what happens

print(arc.get_scorecard())
```
64×64 grid, 16 colors. Six core actions: RESET (restart level), ACTION1-4 (directional moves: up/down/left/right), ACTION5 (general interaction – select/rotate/execute), ACTION6 (click with x,y coordinates from 0-63). ACTION7 (undo) exists but wasn’t available during competition.
Agents receive game observation frames, select an action, get feedback about state changes. Turn-based environment – it doesn’t change asynchronously. You have time to reason between moves.
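One way to use that feedback, echoing the frame-change idea behind the preview winner, is to track which actions actually change the observation and prefer untried or high-change actions. A minimal from-scratch sketch, not tied to the real toolkit API:

```python
from collections import Counter

class ChangeTracker:
    """Prefer actions that have changed the frame before (a from-scratch sketch
    inspired by the frame-change idea, not StochasticGoose's implementation)."""

    def __init__(self):
        self.tried = Counter()    # action -> times taken
        self.changed = Counter()  # action -> times the frame changed afterwards

    def update(self, action, frame_before, frame_after):
        self.tried[action] += 1
        if frame_before != frame_after:
            self.changed[action] += 1

    def best(self, actions):
        # Untried actions first; among tried ones, highest observed change rate.
        def score(a):
            n = self.tried[a]
            return (0, 0.0) if n == 0 else (1, -self.changed[a] / n)
        return min(actions, key=score)
```

In a real loop you would call update() after each env.step() with the frames before and after, then feed best() the six core actions.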
The Gotcha Nobody Mentions
Custom harnesses don’t transfer.
Duke University tested Opus 4.6 with a hand-crafted harness on a known environment – 97.1%. On an unfamiliar environment? 0%. Perceiving the game environment and understanding the API format aren't the bottlenecks. Generalization is – the ability to build a world model from scratch when dropped into something new (per The Decoder's coverage of the Duke testing, March 2026).
Benchmark contamination: the technical report shows Gemini 3 models using correct ARC-AGI color mapping in reasoning without the prompt mentioning “ARC-AGI” or the integer-to-color scheme. Quote from the report: “This strongly suggests ARC-AGI data is well represented in the underlying model.”
For anyone establishing clean baselines, that’s a problem. Reported scores from models trained after mid-2025 may be inflated.
ARC-AGI-3 vs. The Previous Versions
| Feature | ARC-AGI-1/2 | ARC-AGI-3 |
|---|---|---|
| Format | Static image-in, image-out puzzles | Interactive turn-based games |
| Goal visibility | Demonstrated via examples | Hidden – must be discovered |
| Scoring | Accuracy (correct output) | RHAE (efficiency vs. human baseline) |
| What it tests | Pattern recognition & abstraction | Exploration, goal acquisition, planning |
| Access | Fully offline dataset | Requires API key (even locally) |
| Scores comparable? | No – different metrics | RHAE incomparable with v1/v2 |
By 2025, frontier models hit 90%+ on ARC-AGI-1. The team released version 2 with harder compositional puzzles. Best systems there scored in the low teens. Version 3 changes the game by requiring temporal reasoning – intelligence measured across time, not just final answers.
The $2 Million Competition
ARC Prize 2026: live on Kaggle, $2M+ total. Two tracks – ARC-AGI-3 (new interactive benchmark) and ARC-AGI-2 (static compositional puzzles from last year, grand prize unclaimed).
Milestone #1: June 30, 2026. Milestone #2: September 30, 2026. Submissions close November 2, 2026. Results announced December 4, 2026.
Grand prize ($700K) goes to the first agent scoring 100% on ARC-AGI-3 evaluation. If not won, it rolls over to next year. Milestone prizes ($25K, $10K, $2.5K at each checkpoint) go to top open-source solutions.
All participants must open-source solutions under MIT or CC0. Kaggle evaluation runs with no internet – no API calls to external inference endpoints during scoring. You can use closed models during development, but final submission runs offline within Kaggle’s compute constraints.
Why This Feels Different
Most AI benchmarks test superhuman capabilities or specialized knowledge – they reward systems that memorize vast amounts of data.
ARC-AGI does the opposite. It focuses on tasks that are easy for humans yet hard or impossible for AI. If a task requires a PhD to solve, you’re testing crystallized intelligence (accumulated knowledge). ARC-AGI tests fluid intelligence – the ability to reason through novel problems using only core cognitive building blocks present at birth or acquired very early in development.
A five-year-old learns the rules of a new board game in minutes. GPT-5 can’t.
Will that gap close in six months (as some predict) or remain unsolved for years? The answer will tell us something real about what these systems can and can’t do.
Frequently Asked Questions
Can I use GPT-5 or Claude to build an ARC-AGI-3 agent?
Yes during development. Final Kaggle submission must run offline. LLMs haven’t performed well on ARC-AGI-3 – all scored under 1% during preview. Token overhead from hundreds of gameplay steps makes them inefficient.
How does the public leaderboard differ from the Kaggle competition?
Public leaderboard at arcprize.org: unlimited compute, internet access, closed models allowed. Rapid experimentation and directional exploration. Kaggle: strict constraints – no internet, ~$50 compute per submission, mandatory open-sourcing for prizes. Private evaluation set stays hidden to prevent overfitting. Core ARC-AGI tenet: solution creators can’t know the test in advance, or they risk encoding their own intelligence rather than building something that generalizes.
Why do humans score 100% if this is supposed to test intelligence?
Benchmark calibrated so untrained humans with no prior exposure solve every environment. Over 400 members of the general public tested in San Diego early 2025 to confirm difficulty. Humans solve these tasks easily while frontier AI struggles – reveals a core gap in how current systems learn. ARC-AGI measures skill-acquisition efficiency on unknown tasks. Human intelligence excels at this. Scaled pattern-matching still can’t match it. If an AI had general intelligence, it should do what a human can: drop into an unfamiliar game and figure it out.
Start with the official toolkit docs. Play a few public games in your browser at three.arcprize.org. Clone the GitHub repo, run a random agent locally. Fastest way to understand the gap: watch how inefficiently random exploration performs compared to even basic heuristics.