Did we just hit AGI without noticing?
On March 22, 2026, NVIDIA CEO Jensen Huang told Lex Fridman “I think we’ve achieved AGI.” The clip went viral. AI-linked crypto tokens jumped 10-20%. NVIDIA shares rose 1.7%. Reddit exploded with reactions – “finally!” versus “he’s just selling GPUs.”
Actually, there are tests for this. Benchmarks. Numbers. You don’t have to take anyone’s word for whether AGI has arrived – you can check the same measurements researchers use. This tutorial shows you how.
Why bother? Because if Huang’s right, you’re living through the most important technological shift in human history. If he’s wrong, you’re watching a trillion-dollar marketing campaign. The difference matters.
What Huang Said (Then Immediately Contradicted)
Fridman’s question: how long until AI starts, grows, and runs a billion-dollar tech company? Huang’s answer: “I think it’s now.” But the definition he used – that’s where it gets weird.
Huang’s billion-dollar company doesn’t have to last. AI creates a viral app, a few billion people use it for 50 cents each, then it collapses. That’s his AGI benchmark. Temporary virality hitting $1B in revenue counts, even if it dies the next week.
The contradiction came seconds later. Could AI build something like NVIDIA? Huang: “zero percent.” Not unlikely. Zero.
AGI is here, but it can’t do what he does. Make that make sense.
Test 1: Run the ARC-AGI-2 Benchmark
Ignore the tweets. Check the ARC-AGI-2 leaderboard. François Chollet designed this benchmark to measure the one thing Narrow AI can’t fake: learning efficiency on tasks you’ve never seen.
You get 2-4 example grid puzzles showing input-output transformations. Then you solve a brand-new puzzle using the same rule. Humans – ordinary, untrained test-takers recruited in San Diego – handle these easily: every ARC-AGI-2 task was solved by at least two people in under two attempts. Trivially easy for us.
Best AI systems as of January 2026? Claude Opus 4.5: 37.6%. Top Kaggle entry: 24%. OpenAI’s o3 broke records on the older ARC-AGI-1 at 87.5%, but that required 172x normal compute. Cost per task: $27. Humans do it for free, instantly.
ARC-AGI measures efficiency, not just accuracy. Human solves a puzzle in 10 actions, AI takes 100? The AI doesn’t get 10% credit – it gets 1%. Scoring formula: (human actions / AI actions)². Deliberately punishes brute-force strategies.
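To feel how hard that quadratic penalty bites, here's a toy calculation – a sketch of the formula exactly as described above, not the official ARC Prize scoring code:

```python
# Toy version of the efficiency-weighted scoring described above.
def efficiency_score(human_actions: int, ai_actions: int) -> float:
    """Credit shrinks quadratically as the AI uses more actions than a human."""
    # Cap at 1.0 (an assumption: no bonus for beating human efficiency).
    return min(1.0, (human_actions / ai_actions) ** 2)

print(efficiency_score(10, 10))   # 1.0  -> full credit at human efficiency
print(efficiency_score(10, 100))  # 0.01 -> 1% credit, not 10%
```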
You can try ARC-AGI puzzles yourself at the ARC Prize website. Load a puzzle. Spot the transformation rule from the examples. Check the leaderboard to see where frontier models actually stand.
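If you'd rather poke at tasks programmatically, the puzzles are plain JSON. A minimal sketch, assuming you've cloned the public ARC-AGI repo (github.com/fchollet/ARC-AGI), where training tasks live under data/training/:

```python
# Load and print one ARC task: a few demonstration pairs plus a test input.
import json
from pathlib import Path

task_file = next(Path("ARC-AGI/data/training").glob("*.json"))
task = json.loads(task_file.read_text())

for pair in task["train"]:
    print("input: ", pair["input"])
    print("output:", pair["output"], "\n")

# Your job (and the model's): infer the rule, then solve this.
print("test input:", task["test"][0]["input"])
```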
If AGI means “human-level intelligence,” this test says we’re not even halfway.
Test 2: The Economic Output Benchmark (Huang’s Version)
Huang’s definition: can AI autonomously create a billion-dollar business? Let’s examine that.
He pointed to OpenClaw, an open-source AI agent platform (reportedly being acquired by OpenAI), as proof that AI could theoretically launch and manage such a venture. What the data actually shows:
- Short-term revenue generation: Plausible. AI builds apps, runs marketing campaigns, processes payments. A viral moment generating $1B in gross revenue over a few weeks? Technically possible.
- Sustained institutional management: Zero evidence. No AI has demonstrated long-term strategic planning, competitive market adaptation, or judgment calls that keep companies alive through economic cycles.
- Building complex organizations: Huang rated this at 0% probability. Same interview.
His AGI benchmark isn’t “can AI run a company?” It’s “can AI create a temporary viral product that monetizes before collapsing?”
Not general intelligence. Very specific pattern matching applied to one domain.
Test 3: Check for Cross-Domain Transfer
Real AGI transfers knowledge across totally different fields without retraining. A human learning chess applies strategic thinking to business negotiations. Can AI do that?
Try this experiment with any frontier LLM:
- Ask it to solve a logic puzzle in a visual grid format (like ARC-AGI tasks)
- Ask it to apply the same logical rule to a completely different context – organizational structure or music composition
- Watch what happens
Current models will either fail to transfer the rule or need extensive prompting to manually bridge the domains. That’s Narrow AI behavior. You’re doing the generalizing, not the model.
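If you want to script the experiment instead of pasting prompts by hand, here's a minimal sketch using the OpenAI Python SDK. The model name, prompts, and toy puzzle are all illustrative assumptions – substitute any frontier LLM you have access to:

```python
# Cross-domain transfer probe: same abstract rule, two unrelated domains.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment
history = []       # one running conversation, so step 2 can reference step 1

def ask(prompt: str) -> str:
    history.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: swap in whichever model you're testing
        messages=history,
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

# Step 1: a grid puzzle whose hidden rule is "flip the grid top-to-bottom".
puzzle = ask(
    "Grid transformations: [[1,0],[0,0]] -> [[0,0],[1,0]] and "
    "[[0,2],[0,0]] -> [[0,0],[0,2]]. State the rule, then transform [[3,3],[0,0]]."
)

# Step 2: the same abstract rule, applied to a completely different domain.
transfer = ask(
    "Apply the same rule you just identified to this org chart, treating each "
    "level as a row: CEO / VPs / Managers / Analysts. What results, and why?"
)

print(puzzle)
print(transfer)  # Does it name and reuse the rule, or does it flounder?
```

Score it the way you'd score a person: did the model carry the abstraction over on its own, or did your prompting do the bridging?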
According to IBM’s AI taxonomy, AGI would “use previous learnings and skills to accomplish new tasks in a different context without the need for human beings to train the underlying models.” We’re not there.
What the Researchers Actually Think
76% of AI researchers surveyed by the Association for the Advancement of Artificial Intelligence said scaling up current AI approaches is “unlikely” or “very unlikely” to result in AGI. That survey: 475 researchers – people actually building these systems.
Their consensus: more compute, more data, bigger models won’t get us to general intelligence. We need fundamentally different architectures.
Huang runs the company selling the compute. His $1 trillion chip sales projection through 2027 depends on companies believing AGI is just around the corner – and that reaching it takes ever more GPUs. The incentive? Not subtle.
The Efficiency Problem Nobody’s Talking About
Even when AI matches human performance on a task, it's using 10x to 1000x more resources. OpenAI's o3: $27 per task on ARC-AGI tests. Humans solve the same tasks in seconds, running on about 20 watts – the energy budget of a light snack.
ARC-AGI-2 now explicitly tracks this with cost-per-task metrics. Intelligence isn't just whether you can solve a problem – it's how efficiently you solve it. A system that needs a data center to replicate what your brain does on glucose isn't exhibiting general intelligence. It's brute-forcing a solution.
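A back-of-envelope sketch of that gap, using the article's $27 figure plus assumed values for solve time and electricity price:

```python
# Rough comparison of AI vs. human cost per ARC-AGI task. All numbers are
# coarse estimates; the $27 figure is the one cited above for o3.
AI_COST_PER_TASK = 27.00        # USD
BRAIN_WATTS = 20                # standard whole-brain power estimate
SECONDS_PER_TASK = 30           # generous; humans often solve these faster
ELECTRICITY_USD_PER_KWH = 0.15  # assumed typical retail rate

human_kwh = BRAIN_WATTS * SECONDS_PER_TASK / 3_600_000  # watt-seconds -> kWh
human_cost = human_kwh * ELECTRICITY_USD_PER_KWH

print(f"Human energy cost per task: ${human_cost:.6f}")  # ~$0.000025
print(f"AI cost is roughly {AI_COST_PER_TASK / human_cost:,.0f}x higher")  # ~1,080,000x
```

Treat the brain-as-electricity framing as a metaphor, but the orders of magnitude are the point.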
This gap – most AGI declarations ignore it. Huang’s definition sidesteps it entirely by focusing on economic output rather than cognitive process.
How to Evaluate Future AGI Claims
Next time someone declares AGI has arrived, run this checklist:
| Test | What to Check | AGI Threshold |
|---|---|---|
| ARC-AGI Benchmark | Score on novel reasoning tasks | ≥85% at human-level efficiency |
| Cross-Domain Transfer | Can it apply skills to unrelated fields without retraining? | Yes, autonomously |
| Learning Efficiency | Resource cost per task vs. human cognitive cost | Within 10x of human efficiency |
| Novel Problem Solving | Performance on tasks it wasn’t designed or trained for | Human-level accuracy |
| Sustained Reasoning | Long-term strategic planning without hallucination | Months-long coherence |
If the claim doesn’t meet all five, it’s not AGI. Might be impressive Narrow AI. Might be economically valuable. Not general intelligence.
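To make the all-or-nothing rule concrete, here's the checklist as a trivial sketch – the field names are just hypothetical shorthand for the table rows:

```python
# Five tests, one verdict. Flip a flag only when a claim clears that row's
# threshold from the table above.
AGI_CHECKLIST = {
    "arc_agi_85_plus": False,              # >=85% at human-level efficiency
    "cross_domain_transfer": False,        # unrelated fields, no retraining
    "within_10x_human_efficiency": False,  # resource cost per task
    "novel_problem_solving": False,        # human-level on untrained tasks
    "sustained_reasoning": False,          # months-long coherent planning
}

def verdict(checklist: dict[str, bool]) -> str:
    return "AGI" if all(checklist.values()) else "Not AGI (maybe impressive Narrow AI)"

print(verdict(AGI_CHECKLIST))
```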
Narrow AI vs. AGI: The Actual Difference
Current AI – everything from ChatGPT to Claude to GPT-5 – is Narrow AI. Extremely capable in specific areas, useless outside them. You can’t ask ChatGPT to physically navigate a kitchen or learn a new skill by watching a YouTube video once. It processes text patterns. That’s it.
AGI would be different in kind. It would:
- Learn new skills from minimal examples (few-shot learning across any domain)
- Understand context the way humans do (not just pattern matching)
- Transfer knowledge flexibly without human guidance
- Adapt to situations it was never trained for
- Exhibit genuine reasoning, not retrieval
We’re making progress. But “progress toward AGI” and “achieved AGI”? Very different claims.
Why This Matters Beyond the Hype
Declaring AGI prematurely has real consequences.
Contracts at OpenAI and Microsoft include clauses tied to AGI achievement. Regulations worldwide are being written based on AGI timelines. Investment flows toward or away from AI companies depending on whether people think the goal is 2 years away or 20.
If you’re building products, hiring AI talent, or deciding where to invest your learning time, you need accurate information about what AI can actually do – not what a CEO with a trillion-dollar chip roadmap says it can do.
Huang’s right about one thing: AI can probably create a billion-dollar app tomorrow. Optimize ad spending, generate viral content, automate customer service at scale. Valuable capabilities.
But not AGI. Narrow AI getting really, really good at specific things we’ve trained it to do.
Can AI actually create a billion-dollar company right now?
Short-term revenue? Maybe. An AI agent could build and launch a viral app that generates massive revenue briefly. Sustained institutional management – navigating competitive markets, long-term strategy, adapting to unknown challenges – Huang rated at 0%. One-off viral success: plausible. Running an actual company: not happening.
How do I know if a benchmark is testing real AGI or just better Narrow AI?
Check the training data. If benchmark tasks could plausibly appear in the model’s training set, it’s testing memorization. Real AGI benchmarks like ARC-AGI-2 use novel tasks explicitly designed to be unsolvable through pattern retrieval. The ARC Prize keeps private evaluation sets hidden to prevent overfitting. A test that allows retraining on similar examples? Not measuring general intelligence. Also: does the benchmark test efficiency or just accuracy? Humans learn from 2-3 examples. If AI needs 10,000, that’s not general intelligence – it’s brute force.
What would it look like if AGI actually arrived tomorrow?
You’d notice. An AGI system could learn to play a new video game by watching one playthrough, then beat human experts. Read a medical textbook and immediately practice surgery (given a robot body). Debug its own code, redesign its architecture, teach itself skills from fields it’s never encountered. Current AI can’t do any of this without massive training data in that specific domain. When AGI arrives, the capability gap will be obvious – not debatable. The difference between “AI writes better marketing copy” and “AI independently invents new physics” is everything.
Run the tests yourself. Check the benchmarks. The data’s all public. Next time someone with a financial interest in the answer declares AGI has arrived, you’ll know exactly how to verify the claim.