AGI Has Arrived: Two Approaches to Working With It (One Works)

Jensen Huang says AGI is here. Sequoia agrees. Gary Marcus calls it redefinition. Skip the debate - here's what actually works with today's 'AGI-class' systems.

9 min read · Beginner

Two people sit down to use what NVIDIA’s CEO calls AGI. One treats it as a magic answer machine. The other treats it as a reasoning tool with guardrails. Guess which one gets fired after their AI agent emails the wrong client list to competitors?

In March 2026, Jensen Huang told Lex Fridman “I think we’ve achieved AGI.” Sequoia Capital published “2026: This is AGI” in January. UC San Diego researchers argued in Nature that current LLMs already meet the bar.

Then Gary Marcus fired back: these claims redefine AGI, they don’t achieve it.

Here’s what matters more than the debate: people are already using o3-class models and long-horizon agents to do work that was impossible 6 months ago. Some are getting spectacular results. Others are creating expensive disasters.

The difference isn’t the tech. It’s the approach.

Why Treating It Like AGI Fails

The hype says AGI can figure anything out. So people hand it complex tasks with zero scaffolding and expect human-level judgment.

What actually happens: o3 costs $17-20 per reasoning task at its low-compute setting, where it scores 75.7% on ARC-AGI. The headline 87.5% requires the high-compute configuration, which burns thousands of dollars per problem – on a benchmark where humans routinely hit 100% on the first try.

Even Huang admitted in the same interview that current agents have “essentially zero” odds of running a company like NVIDIA independently. That’s a pretty big asterisk on “AGI has arrived.”

The economic definition Huang uses – can it build a billion-dollar company? – sounds impressive until you read the fine print. He’s talking about a viral app that makes money for a few months, like the dot-com era’s flash-in-the-pan websites. Not sustained strategic execution.

The Three Failure Modes Nobody Mentions

Failure mode 1: No verifier. o3 and similar models generate multiple reasoning paths and pick the best one. But “best” according to what? If you don’t define success criteria, the model optimizes for plausibility, not correctness. A client proposal that sounds great but misses the actual requirements.

Failure mode 2: Infinite task scope. Long-horizon agents can run for 30+ minutes. Sequoia’s example agent took 31 minutes to find a recruitment candidate. Sounds good – until the agent decides “find candidates” means scraping your entire CRM, crossing into three related but off-scope research tangents, and racking up API costs because you didn’t set boundaries.

Failure mode 3: The easy-task cliff. François Chollet, creator of the ARC-AGI benchmark, noted that o3 “still fails on some very easy tasks.” You’ll get genius-level performance on a complex reasoning puzzle, then watch it faceplant on a straightforward data formatting request. Unpredictability kills trust.

The Approach That Actually Works

Forget AGI. Call it “high-capability narrow reasoning.” Then design around that.

Here’s the shift: instead of asking the model to own the outcome, you own the outcome and use the model to scale your judgment. You stay in the loop. You verify. You constrain.

Step 1: Decompose the Task Into Verified Steps

Don’t ask: “Write a go-to-market strategy for our product.”

Do ask:

  1. “List 10 potential customer segments for [product]. For each, specify the pain point we solve.”
  2. Review the list. Pick 3.
  3. “For [segment A], draft 3 messaging angles. Use [competitor X]’s positioning as a reference but differentiate on [Y].”
  4. Review. Iterate.
  5. “Generate a 4-week campaign timeline for [messaging angle 2]. Include channels, budget breakdown, success metrics.”

Each step is verify-then-proceed. The model never runs unsupervised for more than one decision point.
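The verify-then-proceed loop is simple enough to sketch in code. This is a minimal illustration, not a real client: `call_model` is a hypothetical stand-in for whatever LLM API you use, and `approve` is the human checkpoint (here auto-approved only so the example runs).

```python
# Sketch of a verify-then-proceed loop. `call_model` is a hypothetical
# placeholder for a real LLM client; the prompts mirror the go-to-market
# decomposition above.

def call_model(prompt: str) -> str:
    # Placeholder: swap in a real API call (OpenAI, Anthropic, etc.).
    return f"[model output for: {prompt[:40]}...]"

def run_decomposed_task(steps, approve):
    """Run each step, but only proceed if the human reviewer approves."""
    outputs = []
    for prompt in steps:
        result = call_model(prompt)
        if not approve(result):  # human checkpoint: verify, then proceed
            raise RuntimeError(f"Step rejected, stopping early: {prompt[:40]}")
        outputs.append(result)
    return outputs

steps = [
    "List 10 potential customer segments for [product]...",
    "For [segment A], draft 3 messaging angles...",
    "Generate a 4-week campaign timeline for [messaging angle 2]...",
]
# Auto-approve for illustration; in practice `approve` is you reading the output.
results = run_decomposed_task(steps, approve=lambda out: True)
```

The point of the structure: the loop cannot reach step 3 without a human (or at least a check) signing off on steps 1 and 2.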

Step 2: Set Explicit Constraints and Verification Rules

When you use a long-horizon agent, you’re giving it autonomy. That autonomy needs fences.

Example: if you’re using an agent to research competitors, specify:

  • Data sources it can access (public web only, no internal databases)
  • Time limit (stop after 15 minutes, surface progress so far)
  • Output format (markdown table with columns: Competitor, Pricing, Key Feature, Gap vs Us)
  • Verification step (flag any claim without a source URL)

The Sequoia recruiting agent worked because the task had clear boundaries: 3 candidates, specific criteria, defined output (a draft email). It wasn’t asked to “hire someone.”
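Those fences are easiest to enforce when they live in code rather than in the prompt alone. A minimal sketch, assuming a hypothetical agent runner that consults an `AgentConstraints` object before every action:

```python
# Agent "fences" as data. AgentConstraints is an illustrative structure,
# not part of any real agent framework.
from dataclasses import dataclass

@dataclass
class AgentConstraints:
    allowed_sources: list    # e.g. public web only, no internal databases
    time_limit_s: int        # hard stop; surface progress so far
    output_columns: list     # required columns for the markdown table
    require_source_url: bool # flag any claim without a source URL

    def action_allowed(self, source: str, elapsed_s: float) -> bool:
        """Gate every agent action on source and elapsed time."""
        return source in self.allowed_sources and elapsed_s < self.time_limit_s

constraints = AgentConstraints(
    allowed_sources=["public_web"],
    time_limit_s=15 * 60,
    output_columns=["Competitor", "Pricing", "Key Feature", "Gap vs Us"],
    require_source_url=True,
)

print(constraints.action_allowed("public_web", elapsed_s=120))   # True
print(constraints.action_allowed("internal_db", elapsed_s=120))  # False: off-limits source
```

The agent's tool-use layer calls `action_allowed` before each fetch; anything outside the fence fails fast instead of quietly expanding scope.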

Step 3: Use CoT Prompting, But Make It Testable

o3’s breakthrough is chain-of-thought reasoning at scale. You want that. But raw CoT gives you a 10-paragraph explanation with no way to verify correctness mid-stream.

Better approach:

You are analyzing [problem]. Break your reasoning into numbered steps.

For each step:
1. State your assumption
2. State the evidence supporting it
3. State what would disprove it

After stating all three, pause and ask: does the evidence actually support the assumption? If not, backtrack.

Final answer only after all steps check out.

This forces the model to surface its logic in a way you can audit. You catch errors at step 2 instead of after the final output.
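Because the template is structured, the audit can even be partly mechanical. A sketch, assuming the model labels each field exactly as prompted (the label names here are illustrative):

```python
# Mechanically audit a structured CoT response: every step should carry
# an Assumption / Evidence / Disproven-by triple per the template above.
import re

def audit_cot(response: str) -> list:
    """Return the step numbers missing any of the three required fields."""
    required = ("Assumption:", "Evidence:", "Disproven by:")
    steps = re.split(r"(?m)^Step \d+", response)[1:]  # split on "Step N" headers
    return [i + 1 for i, step in enumerate(steps)
            if not all(label in step for label in required)]

response = """Step 1
Assumption: churn is driven by pricing.
Evidence: exit surveys cite cost in 60% of responses.
Disproven by: churn staying flat after a price cut.
Step 2
Assumption: annual plans reduce churn.
Evidence: cohort data.
"""
print(audit_cot(response))  # step 2 lacks a "Disproven by:" line
```

A missing field doesn't mean the reasoning is wrong, but it tells you exactly where to look before trusting the final answer.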

Pro tip: If you’re using o3 or similar reasoning models for high-stakes decisions, run the same prompt twice with different temperature settings (0.3 and 0.8). If the outputs diverge significantly, the problem is under-specified or the model is guessing. Refine your prompt or add constraints.
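A rough way to quantify "diverge significantly" is to diff the two answers. This sketch uses Python's standard-library `difflib`; the sample answers are invented, and in practice you'd feed in the two actual model outputs:

```python
# Rough self-consistency check per the pro tip above: sample the same
# prompt at two temperatures and measure how far the answers diverge.
from difflib import SequenceMatcher

def divergence(answer_a: str, answer_b: str) -> float:
    """0.0 = identical answers, 1.0 = completely different."""
    return 1.0 - SequenceMatcher(None, answer_a, answer_b).ratio()

a = "Migrate: the workload is write-heavy and multi-region."
b = "Migrate: the workload is write-heavy and multi-region."
c = "Stay on Postgres: single-region reads dominate."

print(divergence(a, b))                      # identical answers -> 0.0
print(f"{divergence(a, c):.2f}")             # large value -> under-specified prompt
```

Character-level similarity is a blunt instrument (two phrasings of the same conclusion will register as different), so treat a high score as a prompt to read both answers, not as an automatic verdict.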

Step 4: Accept That Some Tasks Still Need Humans

The o3 benchmark results are impressive – until you look at the cost. At high compute, o3 burns 172x the resources of its low-compute configuration – billions of tokens across the benchmark – to hit 87.5% accuracy. A human solving the same task costs roughly $2 and gets 100%.

Use the model where speed and scale matter more than perfection. Research synthesis, draft generation, code scaffolding, scenario analysis. Don’t use it where mistakes are expensive and verification is hard. Legal contract review, medical diagnosis, financial forecasting without human oversight.

Real-World Example: Using o3 for Technical Research

Let’s say you need to evaluate whether to adopt a new database technology. Here’s how you’d use an o3-class model effectively.

Bad approach: “Should we migrate from Postgres to CockroachDB? Give me a recommendation.”

The model will give you a confident-sounding answer based on whatever training data bias it has. You won’t know if it’s right.

Good approach:

Step 1:
"List the top 5 technical trade-offs between Postgres and CockroachDB.
For each trade-off, cite the source (official docs, benchmark, or research paper)."

Step 2 (after review):
"Our workload has these characteristics: [X]. For each trade-off you listed,
analyze whether it's favorable or unfavorable for our workload. Show your reasoning."

Step 3 (after review):
"Based on steps 1-2, identify 3 risk factors if we migrate and 3 risk factors if we don't.
For each risk, suggest a mitigation strategy."

Step 4 (you do this part):
Take the model's output. Cross-check the citations. Run the migration risks by your senior engineer.
Make the decision yourself.

The model amplifies your research capacity. It doesn’t replace your judgment.
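Step 4's "cross-check the citations" can also start with a mechanical pass. A sketch with an invented output snippet: pull every URL out of the model's research output so a human can spot-check it, and flag bulleted claims that carry no source at all.

```python
# Flag model claims that lack a source URL, per the citation rule in step 1.
import re

URL = re.compile(r"https?://\S+")

def split_claims(output: str) -> list:
    """Treat each '-' bullet in the model output as one claim."""
    return [line.strip("- ").strip() for line in output.splitlines()
            if line.lstrip().startswith("-")]

def unsourced(claims: list) -> list:
    """Claims with no URL anywhere in the line."""
    return [c for c in claims if not URL.search(c)]

output = """- CockroachDB serializes all transactions (https://www.cockroachlabs.com/docs)
- Postgres logical replication is single-threaded
"""
claims = split_claims(output)
print(unsourced(claims))  # the replication claim has no source URL
```

Flagged claims go back to the model ("cite a source or retract") or to you for manual verification – either way, nothing unsourced reaches the decision.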

The Hidden Cost Nobody Talks About

The AGI hype cycle has a weird side effect: people underinvest in prompt engineering and workflow design because they assume the model will “just figure it out.”

Then they get mediocre results and blame the tech.

Using o3 or long-horizon agents effectively requires more upfront design work than using GPT-3.5. You’re defining verification loops, setting constraints, decomposing tasks. That’s overhead. But it’s the difference between a $20 research task that saves you 4 hours and a $2000 runaway agent that generates plausible nonsense.

The people getting value from these systems right now aren’t treating them as AGI. They’re treating them as unreliable-but-fast reasoning engines that need human-designed guardrails.

What This Means Going Forward

Whether we call it AGI or not, the capability jump is real. On ARC-AGI, OpenAI's models went from roughly 5% (GPT-4o) to 87.5% (o3) in under a year. That's not incremental.

But the gap between benchmark performance and production reliability is also real. Chollet himself said o3 “still fails on some very easy tasks, indicating fundamental differences with human intelligence.”

So here’s the play: use these models where their strengths shine (synthesizing information, generating options, accelerating repetitive reasoning) and add scaffolding where they’re weak (verification, constraint enforcement, task decomposition).

The AGI debate will continue. Researchers will publish rebuttals. CEOs will make claims. You don’t need to wait for consensus. You can start using these systems today – if you design your workflows to compensate for what they can’t do.

Next step: pick one task you currently do manually that involves research + reasoning. Break it into 3-5 verifiable sub-tasks. Try running each sub-task through Claude, GPT-4, or o1 (o3 isn’t public yet, but o1 has similar reasoning characteristics). Verify each output before moving to the next step. See if the total time beats doing it yourself.

If it does, you’ve just found your first AGI-class use case. If it doesn’t, refine the decomposition and try again.

FAQ

Has AGI actually arrived, or is this just hype?

Depends on your definition. Huang uses an economic lens (can it build a billion-dollar company?). Traditional AI researchers use a cognitive lens (can it match human intelligence across all domains?). By Huang’s narrow definition, arguably yes – models can generate viral apps. By the cognitive definition, no – o3 still fails on easy tasks and can’t autonomously run a complex organization. The practical answer: the capabilities took a real leap, but calling it “AGI” is more marketing than science.

Should I wait for o3 to be released publicly, or can I use current models the same way?

o1 and Claude 3.5 Sonnet already have strong multi-step reasoning capabilities. You won't get o3's 87.5% ARC-AGI score, but you'll get the same workflow benefits: multi-step reasoning, chain-of-thought transparency, ability to handle longer contexts. Start with what's available. The techniques in this article apply to any reasoning-focused model. When o3 drops, you'll already know how to use it effectively instead of learning from expensive mistakes.

What’s the single biggest mistake people make when using these “AGI-class” models?

Giving them end-to-end ownership of a task without verification checkpoints. The models are great at local reasoning (solving a specific sub-problem) but unreliable at global coherence (stitching 10 sub-problems into a correct final answer). If you let them run for 30 minutes unsupervised, you’ll get output that looks complete but contains subtle errors you won’t catch until it’s too late. Always decompose, verify, iterate. Never hand off and walk away.