Google’s Gemini 3 Deep Think just posted an 84.6% score on ARC-AGI-2 (as of February 2026), the toughest abstract reasoning test for AI. That’s 24 points higher than humans (who average 60%) and crushes GPT-5.2 by over 30 points. The AI community is losing its mind. But here’s what you actually need to know: it costs $13.62 per task to run, and that number matters more than the benchmark.
This isn’t a victory lap post. It’s a walkthrough of what that score means in practice, how to actually use Deep Think right now, and the three gotchas nobody’s talking about – including evidence that the model may have seen ARC data during training.
Why the 84.6% Score Is Different
ARC-AGI-2 tests fluid intelligence, not memorization. Each puzzle is a grid-based visual reasoning task: you see two or three examples, infer the abstract rule, and apply it to a new grid. ARC-AGI measures a model’s ability to learn new skills and generalize to novel tasks it has never encountered. Humans average about 60% on these visual puzzles, while previous AI models often struggled to break 20% (as of March 2025, when ARC-AGI-2 launched). Gemini Deep Think’s 84.6% isn’t just better – it’s operating at a level most researchers thought was years away.
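To make that concrete, here’s a toy ARC-style task sketched in Python. The grids and the hidden rule are invented for illustration – this is not an actual ARC-AGI-2 puzzle – but the shape of the problem is the same: each integer is a color, and solving means inferring the rule from the example pairs and applying it to the test grid.

```python
# Toy ARC-style task (invented for illustration, not a real ARC-AGI-2 puzzle).
# Grids are lists of lists of ints; each int is a color code.

train_examples = [
    ([[1, 0], [0, 2]], [[2, 0], [0, 1]]),   # (input, output) pairs
    ([[2, 2], [1, 1]], [[1, 1], [2, 2]]),
]
test_input = [[0, 1], [2, 0]]

def apply_rule(grid):
    """The hidden rule a solver must infer from the examples: swap colors 1 and 2."""
    swap = {1: 2, 2: 1}
    return [[swap.get(cell, cell) for cell in row] for row in grid]

# Verify the inferred rule against the training pairs, then solve the test grid.
assert all(apply_rule(inp) == out for inp, out in train_examples)
print(apply_rule(test_input))  # [[0, 2], [1, 0]]
```

The hard part, of course, is that the model gets no `apply_rule` – it has to invent one per puzzle from two or three examples.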
The catch? The ARC Prize Foundation’s verification found that Deep Think uses correct ARC color mappings in its reasoning without being told, suggesting the training data includes ARC tasks. More on that contamination issue in a bit.
The Real Cost Trap
Verified cost: $13.62 per task on ARC-AGI-2 (as of February 2026). Sounds abstract until you realize what a “task” means in production.
One developer on the Gemini CLI GitHub hit a $66 bill in three days. Thought they were in the free tier. Gemini charges for both the visible output and internal “thinking” tokens. Deep Think takes this to the extreme: on one task, Gemini 3 Pro used 96 reasoning tokens while Gemini 3 Deep Think used 138,000.
138,000 tokens. That’s not the answer – that’s the internal monologue the model ran before giving you the answer. On a coding task Flash handles in seconds? Deep Think burns through tokens equivalent to processing an entire novel.
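The back-of-envelope math on those numbers (token counts and the $13.62 figure are from the benchmark reporting above; Deep Think has no public per-token API price, so per-token cost is left out):

```python
# Token counts from the ARC verification write-up; Deep Think has no
# public per-token API price, so we only compare counts and per-task cost.
PRO_TOKENS = 96
DEEP_THINK_TOKENS = 138_000

ratio = DEEP_THINK_TOKENS / PRO_TOKENS
print(f"Deep Think used {ratio:,.0f}x more reasoning tokens")  # ~1,438x

# At the verified $13.62 per ARC task, a modest batch adds up fast:
cost_per_task = 13.62
print(f"1,000 tasks: ${cost_per_task * 1000:,.2f}")  # $13,620.00
```

That ~1,438x ratio is where the “1,400x” figure below comes from.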
Pro tip: Run the same prompt through Gemini 3 Pro first. If Pro solves it, you just saved yourself 1,400x the reasoning cost. Deep Think is for problems that actually need deep reasoning – not email summaries.
Think about what this means for production. You’re not paying for answers. You’re paying for exhaustive internal search processes that may or may not be necessary. The model doesn’t yet know when to stop reasoning early. For most use cases, that’s like hiring a PhD to answer customer service tickets – technically capable, economically nonsensical.
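One way to act on that pro tip in code: route every prompt through Pro first and only escalate to Deep Think when the cheap answer fails a check. A minimal sketch of that cascade – the function names and the verification step are stand-ins I’ve made up, not Google’s actual API:

```python
# Cascade sketch: cheap model first, expensive reasoning model only on failure.
# call_pro / call_deep_think / looks_correct are illustrative stubs,
# not real Gemini API calls.

def call_pro(prompt: str) -> str:
    return f"pro-answer({prompt})"        # stub: imagine a Gemini 3 Pro call

def call_deep_think(prompt: str) -> str:
    return f"deep-answer({prompt})"       # stub: imagine a Deep Think call

def looks_correct(answer: str) -> bool:
    return True                           # stub: run tests, check schema, self-critique

def solve(prompt: str) -> tuple[str, str]:
    answer = call_pro(prompt)
    if looks_correct(answer):
        return "pro", answer              # Pro solved it: ~1,400x cheaper reasoning
    return "deep_think", call_deep_think(prompt)

model, answer = solve("fix the off-by-one error in this loop")
print(model, answer)
```

The whole design hinges on `looks_correct`: if you can verify answers cheaply (unit tests, schema validation), the cascade is nearly free; if you can’t, you’re back to guessing which model a problem needs.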
What Ultra Subscription Gets You
Google AI Ultra subscribers ($250/month as of December 2025) can now access “Deep Think” from the Tools menu. Here’s what Google doesn’t advertise: there are limits on how many times you can use Deep Think, and the docs don’t say how many. Community reports suggest 10-40 queries per day depending on complexity. No official number.
That $250/month? Not unlimited Deep Think. It’s “some amount of Deep Think plus everything else in Ultra.” For most people, fine – you’re not burning 40 reasoning-heavy queries every day. Building a product around it? Production blocker.
How to Enable Deep Think in the Gemini App
Every guide links to API docs. Here’s how you turn it on as a subscriber:
- Open the Gemini app (web, iOS, or Android – live on all three as of December 2025).
- Click the model dropdown in the top-left. Select Gemini 3 Pro.
- In the prompt bar, you’ll see a new “Deep Think” toggle. Turn it on.
- Type your prompt. Hit send.
- Wait. Deep Think queries take 30 seconds to several minutes – Gemini will notify you in the web app next to the chat thread, or via mobile notification when the response is ready.
That’s it. No code. No API keys (yet). Just a toggle.
What to Actually Ask It
Use Deep Think for:
- Multi-constraint optimization: “Design a database schema for X with these 8 conflicting requirements.”
- Novel problem-solving: “Here’s a bug that only happens under condition A + B + C. What’s the root cause?”
- Research synthesis: Upload 3 PDFs and ask it to find contradictions across all three.
- Code architecture: “Refactor this 500-line file to follow SOLID principles without breaking these 12 edge cases.”
The model works best when the problem doesn’t have an obvious answer in its training data. Gemini Flash handles “write me a Python function to reverse a string” in 2 seconds. Save Deep Think for the hard stuff.
The Overfitting Controversy
The ARC Prize verification team found that Deep Think used correct ARC color mappings (magenta=6, green=3) in its reasoning, despite the prompt never mentioning ARC tasks or color format. Their conclusion (as of February 2026): “ARC data is well represented in the underlying model – enough to make correct ARC inferences based on just the structure.”
They believe this new type of “overfitting” is helping models solve ARC, but aren’t sure how much. The model has seen enough ARC-like tasks during training that it’s not purely solving these puzzles from scratch.
Does that invalidate the score? Not entirely. It’s still generalizing across novel patterns better than any other model. But the 84.6% isn’t a pure test of zero-shot reasoning – it’s reasoning augmented by task familiarity.
For your work: if you’re using Deep Think for actual novel problems in your domain (medical imaging, financial modeling, custom engineering), it won’t have that same task familiarity. Performance might not transfer at the same level.
Which raises a bigger question: are we measuring what we think we’re measuring? If the model has implicit knowledge of ARC’s structure, does that mean it’s “reasoning” or just pattern-matching at a higher level? The line keeps getting blurrier.
Comparing the Numbers
| Model | ARC-AGI-2 Score | Cost per Task | Access |
|---|---|---|---|
| Gemini 3 Deep Think | 84.6% | $13.62 | Ultra subscribers |
| Claude Opus 4.6 (Thinking Max) | 68.8% | ~$60 (est.) | Publicly available |
| GPT-5.2 (Thinking xhigh) | 52.9% | Not disclosed | ChatGPT Plus |
| Gemini 3 Pro (no Deep Think) | 31.1% | $0.81 | Free tier |
| Humans (average) | 60% | Free | Always |
Gap between Deep Think and Claude: 15.8 percentage points. Nearly double the distance between Claude and average human performance. The step-change is real.
Notice Gemini 3 Pro without Deep Think: 31.1%. The base model isn’t magic. The reasoning mode does all the heavy lifting. You’re paying for every token of that lift.
Three Things That Will Break Your Workflow
1. Processing time variability. Queries can take between 10 and 20 minutes to process (as of December 2025) for truly complex tasks. Building a user-facing feature? That latency kills UX. This is a background reasoning tool, not a real-time assistant.
2. Prompt sensitivity. The tool performs best when provided with clear and concise prompts. Vague questions → vague reasoning chains. Be specific about constraints, edge cases, and success criteria, or the model wastes tokens exploring irrelevant branches.
3. API access is gated. Deep Think is available via the Gemini API to select researchers, engineers, and enterprises through an early access program (as of February 2026). Not in that program? You’re limited to the Ultra subscription UI. No programmatic access yet for most developers.
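On the prompt-sensitivity point: it can help to assemble Deep Think prompts from explicit sections for constraints, edge cases, and success criteria, rather than free-form questions. A small helper along these lines – the section layout is just one reasonable convention, not anything Google prescribes:

```python
def build_deep_think_prompt(task: str, constraints: list[str],
                            edge_cases: list[str], success: str) -> str:
    """Assemble a tightly scoped prompt so the model doesn't waste
    reasoning tokens exploring irrelevant branches."""
    lines = [f"Task: {task}", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    lines.append("Edge cases that must keep working:")
    lines += [f"- {e}" for e in edge_cases]
    lines.append(f"Success criteria: {success}")
    return "\n".join(lines)

prompt = build_deep_think_prompt(
    task="Refactor payment.py to follow SOLID principles",
    constraints=["no new dependencies", "public API unchanged"],
    edge_cases=["zero-amount refunds", "currency rounding"],
    success="all 12 existing edge-case tests still pass",
)
print(prompt)
```

The filename and example constraints above are hypothetical; the point is that every token the model doesn’t spend guessing your intent is a token you don’t pay for.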
When Poetiq’s System Beat Deep Think
Plot twist: Poetiq’s refinement system scored 54% on the Semi-Private Test Set at $30.57 per problem – the previous best was 45% by Gemini 3 Deep Think at $77.16 (earlier test set, not the 84.6% run).
Different test sets, different versions. But the lesson: Poetiq’s approach suggests that progress in AI reasoning is moving away from purely scaling model size and toward well-engineered systems at the application layer.
You might get better results wrapping Gemini 3 Pro (without Deep Think) in a smart refinement loop than using Deep Think raw. Test both before committing to the expensive option.
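A refinement loop in the spirit of that approach looks roughly like this – generate a candidate with the cheaper model, score it, feed the critique back, and keep the best. Poetiq’s actual internals aren’t public, so every function here is an illustrative stub:

```python
# Refinement-loop sketch: iterate cheap-model attempts instead of one
# expensive Deep Think call. All functions are illustrative stubs --
# Poetiq's real system is not public.

def call_model(prompt: str) -> str:
    return f"attempt({len(prompt)})"      # stand-in for a Gemini 3 Pro call

def score(answer: str) -> float:
    return min(1.0, len(answer) / 20)     # stand-in: run tests, check constraints

def refine(problem: str, max_rounds: int = 3) -> tuple[str, float]:
    prompt, best, best_score = problem, "", -1.0
    for _ in range(max_rounds):
        answer = call_model(prompt)
        s = score(answer)
        if s > best_score:
            best, best_score = answer, s
        if best_score >= 1.0:             # good enough: stop early, save tokens
            break
        # Feed the critique back so the next attempt can improve.
        prompt = f"{problem}\nPrevious attempt scored {s:.2f}; improve it."
    return best, best_score

print(refine("find the root cause of the flaky test"))
```

The scorer is doing the work here, just like the verifier in a Pro-then-Deep-Think cascade: with a cheap, reliable way to judge answers, several Pro rounds can beat one Deep Think call on cost.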
FAQ
Can I use Deep Think without paying $250/month?
Not right now (as of February 2026). Deep Think is available in the Gemini app only to Google AI Ultra subscribers; if you’re not one, you won’t see the model for a while. API access is limited to an early access program. No free tier, no pay-per-use option yet for individuals.
Does the 84.6% score mean we’ve hit AGI?
No. As good as AI reasoning systems are, they still exhibit many flaws and inefficiencies, and lack capabilities necessary for AGI – we still need new ideas, like how to separate knowledge from reasoning. François Chollet (ARC’s creator) has said repeatedly: ARC is a milestone, not the finish line. The model is superhuman at this one benchmark but still fails at plenty of common-sense reasoning humans find trivial. Example: it can solve abstract grid puzzles but might struggle to explain why you can’t put a square peg in a round hole without calculating dimensions. The gap between specialized reasoning and general intelligence remains vast.
Why does Deep Think use 138,000 tokens when Pro uses 96 for the same problem?
Higher reasoning modes are strongly correlated with more reasoning tokens even when not strictly needed – these longer natural language programs allow more refinement (further exploration and verification). It’s exploring multiple solution paths in parallel, double-checking itself, refining. You’re paying for that exhaustive internal search, not just the final answer. Sometimes overkill. Model doesn’t yet know when to stop reasoning early. Here’s a concrete example: ask it to fix a simple off-by-one error in a loop. Pro spots it in 96 tokens. Deep Think? It’ll explore whether the error is in the loop condition, the increment, the array bounds, the calling code, potential race conditions, and whether the specification itself is ambiguous – burning 138,000 tokens to confirm what Pro found immediately. That exhaustive verification is the feature and the cost trap.
The 84.6% is real (as of February 2026). The cost is real. The access limits are real. And the overfitting concerns are real. If you’re an Ultra subscriber, turn on Deep Think for the problems that actually need it – architecture decisions, research synthesis, novel debugging. For everything else, stick with Pro or Flash and save the reasoning budget for when it counts.