
Anthropic’s GAN-Inspired Agent Architecture: Build Apps That Actually Work

Anthropic just revealed the multi-agent architecture behind Claude Code's success. Here's the GAN-inspired Generator-Evaluator pattern that produces working apps where solo agents fail.

9 min read · Beginner

Most tutorials about Anthropic’s new agent architecture will tell you the breakthrough is splitting work between multiple agents. They’re wrong.

The real insight? Your AI agent is lying to you about the quality of its own work. And until you design around that fact, you’ll keep shipping broken apps that look functional in demos but fail the moment a user clicks anything.

Here’s what actually changed.

The $191 Difference Between Working and Broken

Anthropic ran a test in late March 2026. Same prompt, same model (Opus 4.5), two different architectures.

Solo agent: 20 minutes, $9. Produced a 2D retro game maker that looked great. The UI loaded. Menus appeared. Then you tried to play the game, and nothing worked. Entities appeared on screen but didn’t respond to input. The wiring between game logic and runtime had silently failed.

Full harness: 6 hours, $200. Built the same app. Physics had rough edges, but the core loop worked. You could create sprites, define behaviors, build a level, and actually play it.

That $191 gap isn’t overhead. It’s the cost of shipping something that works.

According to Anthropic’s engineering blog published March 24, 2026, the difference came down to one architectural choice: separating the agent that builds from the agent that evaluates.

Why Your Agent Can’t Grade Its Own Homework

Anthropic’s research team discovered two failure modes that naive long-running agents hit every time.

First: context anxiety. As the context window fills up, models start wrapping up work prematurely. They “feel” like they’re running out of space and rush to finish, even when the actual limit is nowhere close. Sonnet 4.5 exhibited this so severely that Anthropic had to build full context resets into their harness.

Second, and more insidious: self-evaluation failure. Ask an AI to evaluate work it just created, and it will confidently praise mediocre output. The same model that generated broken code is cognitively primed to defend it.

This isn’t a prompt engineering problem. You can’t fix it by asking nicely. The model literally cannot see its own mistakes.

Pro tip: If you’re building with Claude Code and asking it “does this look right?” after it writes code, you’re getting a useless answer. The agent will say yes even when a human would immediately spot the bug. Design your workflow assuming self-evaluation is broken.

The GAN-Inspired Solution: Generator vs. Evaluator

Anthropic’s breakthrough borrows from Generative Adversarial Networks – the machine learning architecture where two models compete. One generates, one discriminates.

Applied to agents:

  • Generator agent: Builds the code, implements features, writes the application
  • Evaluator agent: A separate, skeptical agent that rigorously critiques the Generator’s output against strict criteria

The Evaluator doesn’t just read code. It runs it. Using Playwright MCP tools, it opens a live browser, clicks buttons, types into forms, and watches what happens. It’s grading against four weighted criteria: design quality, originality, craft, and functionality.

In Anthropic’s retro game maker test, the Evaluator caught dozens of real bugs – including a FastAPI routing error where a route defined in the wrong order caused the server to try parsing the string “reorder” as an integer. Static code review would have missed it. The Evaluator found it because it actually ran the app.
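To make that failure mode concrete, here is a toy path matcher (not FastAPI itself, and the route names are hypothetical) showing the mechanism: in frameworks that match routes in declaration order, a dynamic route declared first shadows a literal one, so the literal path falls through to the wrong handler.

```python
import re

def first_match(routes: list[str], path: str) -> str:
    """Return the first declared route pattern that matches the path."""
    for pattern in routes:
        # Turn "/tasks/{task_id}" into a regex with a named wildcard segment.
        regex = "^" + re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", pattern) + "$"
        if re.match(regex, path):
            return pattern
    raise LookupError(path)

# Buggy order: the dynamic route swallows "/tasks/reorder", so its handler
# tries to parse the string "reorder" as an integer task_id.
buggy = ["/tasks/{task_id}", "/tasks/reorder"]

# Fixed order: declare literal routes before dynamic ones.
fixed = ["/tasks/reorder", "/tasks/{task_id}"]
```

A static review sees two valid routes; only running the app (or a matcher like this) reveals which handler actually fires.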

The Full Three-Agent Architecture

For complete applications, Anthropic deploys three specialized agents:

  1. Planner: Takes a brief prompt (“build a project management app with a Kanban board”) and expands it into a detailed product spec with features, user stories, and technical requirements
  2. Generator: Works in sprints, implementing one feature at a time, committing progress to git incrementally
  3. Evaluator: Runs end-to-end Playwright testing with hard pass/fail thresholds, catches bugs the Generator missed
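Stripped of the model calls, the three-agent pipeline reduces to a simple control loop. In this sketch, `plan`, `generate`, and `evaluate` are hypothetical stand-ins for role-prompted model invocations; a real harness would replace each with an API call.

```python
def plan(brief: str) -> list[str]:
    """Planner: expand a short brief into features (what/why, not how)."""
    return [f"{brief}: feature {i}" for i in range(1, 4)]

def generate(feature: str) -> str:
    """Generator: implement one feature per sprint, committing as it goes."""
    return f"code implementing {feature}"

def evaluate(artifact: str) -> bool:
    """Evaluator: hard pass/fail against runtime checks."""
    return artifact.startswith("code")

def run_harness(brief: str) -> list[str]:
    accepted = []
    for feature in plan(brief):        # 1. Planner expands the brief
        artifact = generate(feature)   # 2. Generator implements a sprint
        if evaluate(artifact):         # 3. Evaluator gates acceptance
            accepted.append(artifact)
    return accepted
```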

One critical lesson: the Planner should focus on what and why, not how. When Anthropic’s Planner included granular technical details like “use WebSockets with a pub/sub pattern on Redis,” those premature decisions cascaded errors through the Generator’s work. A Planner that says “implement real-time collaboration” outperformed one that specified the exact architecture – because the Generator found better approaches that the Planner’s rigidity would have prevented.

When the Harness Makes Things Worse

Here’s what most tutorials won’t tell you: harnesses are temporary scaffolding, not permanent architecture.

Opus 4.6 shipped with a 1,000,000-token context window (per Anthropic’s official announcements). That’s enormous. Most multi-hour autonomous sessions never come close to filling it. Result? Anthropic dropped context resets entirely. What used to require a complex multi-session handoff system now runs as one continuous session.

The team went from a full harness with sprints, context resets, and contract negotiations to a simplified version with just Planner + Generator + end-of-run Evaluator. Same quality output, dramatically less complexity.

The cost of over-engineering is real: unnecessary harness complexity adds token costs, latency, and debugging surface area. A harness built for Sonnet 4.5’s limitations will waste money if you run it on Opus 4.6.

Recommended practice from Anthropic’s research: run a harness audit every time a major new model is released. Ask which components exist because of model limitations that may no longer apply. Then simplify.

How to Actually Implement This

The principles work across any model and framework. You don’t need Anthropic’s Claude Agent SDK, though it provides convenient abstractions.

Step 1: Define Evaluation Criteria

The Evaluator can’t grade “is this beautiful?” – that’s too vague. But it can grade:

  • “Does this follow our 5 design principles?” (score 0-5)
  • “Is there any AI slop – generic gradients, stock-photo vibes, predictable layout?” (score 0-3)
  • “Does the primary CTA have sufficient visual weight?” (yes/no)

Decompose subjective quality into explicit, checkable criteria. This works for design, legal analysis (“does this cite relevant case law?”), writing (“is tone consistent with the style guide?”), and code (“are error paths handled?”).
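A decomposed rubric like the one above can be encoded directly as data, so the pass/fail decision is arithmetic rather than vibes. The criteria names, weights, and threshold below are illustrative, not Anthropic’s actual values.

```python
# Hypothetical rubric: decompose "quality" into explicit, checkable criteria,
# each with a max score and a weight that sums to 1.0 across the rubric.
RUBRIC = {
    "design_principles": {"max": 5, "weight": 0.3},
    "no_ai_slop":        {"max": 3, "weight": 0.2},
    "cta_visual_weight": {"max": 1, "weight": 0.2},
    "functionality":     {"max": 5, "weight": 0.3},
}

def weighted_score(scores: dict[str, int]) -> float:
    """Combine per-criterion scores (0..max) into a single 0-1 grade."""
    return sum(
        spec["weight"] * (scores[name] / spec["max"])
        for name, spec in RUBRIC.items()
    )

def passes(scores: dict[str, int], threshold: float = 0.8) -> bool:
    """Hard gate: the Evaluator accepts only above the threshold."""
    return weighted_score(scores) >= threshold
```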

Step 2: Give the Evaluator Real Tools

Don’t ask the Evaluator to judge code by reading it. Give it Playwright (or Selenium, or Puppeteer) to actually run the app and interact with it. The Evaluator should click buttons, submit forms, resize windows, and verify behavior.

For backend code, give it pytest or equivalent to run the full test suite. For data pipelines, give it sample datasets to process end-to-end.
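A runtime smoke test the Evaluator might drive with Playwright’s Python API could look like the sketch below. The URL and selectors are hypothetical; `smoke_test` takes any page-like object, and the Playwright import is deferred so the harness only needs the browser when it actually evaluates.

```python
def smoke_test(page) -> list[str]:
    """Exercise the live app and return observed failures (empty = pass)."""
    failures = []
    page.goto("http://localhost:3000")      # hypothetical dev-server URL
    page.click("#new-sprite")               # interact, don't just read code
    page.fill("#sprite-name", "player")
    page.click("#save")
    if page.locator(".sprite-list .item").count() == 0:
        # UI rendered but the logic/runtime wiring silently failed.
        failures.append("sprite was not created")
    return failures

def evaluate_app() -> list[str]:
    # Deferred import: Playwright is only required at evaluation time.
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        try:
            return smoke_test(page)
        finally:
            browser.close()
```

Because `smoke_test` only depends on the page interface, you can also run it against a stub in unit tests before pointing it at a real browser.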

Step 3: Separate the Roles Completely

Use separate model instances or separate system prompts. The Generator’s prompt should focus on implementation. The Evaluator’s prompt should be explicitly skeptical and include phrases like “assume the code is broken until proven otherwise” and “prioritize finding failure modes over approving output.”
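In practice the role split is just two different system prompts wired to separate model calls. The wording below is a hypothetical example, not Anthropic’s actual prompts; the point is that the Evaluator’s instructions are adversarial by design, not a variation of the Generator’s.

```python
# Hypothetical role prompts for the two separated agents.
GENERATOR_SYSTEM = (
    "You are the Generator. Implement the requested feature completely. "
    "Commit progress incrementally and keep the app runnable after every change."
)

EVALUATOR_SYSTEM = (
    "You are the Evaluator. Assume the code is broken until proven otherwise. "
    "Run the application, exercise every interaction, and prioritize finding "
    "failure modes over approving output. Report concrete, reproducible defects."
)

def system_prompt(role: str) -> str:
    """Select the role-specific prompt for a model instance."""
    return {"generator": GENERATOR_SYSTEM, "evaluator": EVALUATOR_SYSTEM}[role]
```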

Step 4: Iterate in Loops

Generator produces → Evaluator tests and scores → feedback goes back to Generator → Generator fixes issues → repeat 5-15 rounds per feature until the Evaluator’s score crosses your threshold.

Anthropic’s harness ran each generation cycle through 5-15 Evaluator rounds before accepting output.
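The loop itself is a few lines of control flow. Here `generate` and `evaluate` are hypothetical stand-ins for the two role-prompted model calls; a real harness would feed the Evaluator’s critique back into the Generator’s context each round.

```python
def refine(feature, generate, evaluate, max_rounds: int = 15, threshold: float = 0.8):
    """Run the Generator-Evaluator loop until the score clears the threshold
    or the round cap is hit. Returns (artifact, rounds_used)."""
    feedback = ""
    artifact = None
    for round_num in range(1, max_rounds + 1):
        artifact = generate(feature, feedback)  # Generator sees prior critique
        score, feedback = evaluate(artifact)    # Evaluator tests and scores
        if score >= threshold:
            return artifact, round_num          # accepted: move to next feature
    return artifact, max_rounds                 # cap hit: escalate to a human
```

The round cap matters: without it, a Generator that can’t satisfy the criteria burns tokens indefinitely instead of surfacing the feature for human review.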

The Dark Side: When Harnesses Enable Attacks

This architecture doesn’t just build useful apps. According to Fortune’s reporting, a Chinese state-sponsored hacking group used Claude Code to infiltrate approximately 30 organizations – tech companies, financial institutions, government agencies – before Anthropic detected the campaign and banned the accounts.

The Generator-Evaluator loop that produces polished software also produces sophisticated exploits. The Evaluator catches bugs in attack code the same way it catches bugs in legitimate apps.

Anthropic’s leaked Mythos model draft (confirmed by the company as real, though names may change) warns that the new model “poses unprecedented cybersecurity risks” and is “currently far ahead of any other AI model in cyber capabilities.” The draft states it “presages an upcoming wave of models that can exploit vulnerabilities in ways that far outpace the efforts of defenders.”

Anthropic is deliberately slowing the release. The model is expensive to run, not ready for general availability, and being trialed only with select early-access enterprise customers.

If you’re building agents that write code autonomously, assume adversaries are using the same architecture you are – but pointed at finding vulnerabilities instead of building products.

What This Means for You Right Now

Forget the hype about “autonomous agents” that will replace developers. That’s not what this architecture enables.

What it actually does: lets a small team ship complex software faster by offloading the tedious parts – the 20-step BIM documentation workflows, the compliance audit scripts, the UI polish iterations – to agents that can work for hours without losing context.

The human stays in the loop at judgment points: which features to build, whether the Evaluator’s criteria are correct, when to ship. The harness automates execution, not decisions.

If you’re building with Claude Code, Cursor, or any agentic coding tool right now:

  • Stop asking the agent if its own work is good – design a separate evaluation step
  • Use actual runtime testing (Playwright, pytest) instead of asking the agent to self-review
  • Commit progress incrementally so failures don’t wipe hours of work
  • Audit your harness complexity every model upgrade – you might be over-engineering

The next step is hands-on. Pick a small project – a landing page, a CLI tool, a data pipeline – and implement the Generator-Evaluator split. You’ll see the quality difference immediately.

Frequently Asked Questions

Do I need the Claude Agent SDK to use this architecture?

No. The principles are model-agnostic and framework-agnostic. You can implement the same Generator-Evaluator pattern with any model API, a Python orchestration layer, and standard tools like Playwright for evaluation. The Claude Agent SDK provides convenient abstractions, but the core pattern works anywhere.

How do I know when my harness is too complex?

Run the same task with and without the harness components. If removing a component (like context resets or sprint-based planning) produces the same quality output faster and cheaper, that component has become unnecessary. Anthropic simplified their harness when upgrading from Sonnet 4.5 to Opus 4.6 because the model’s capabilities made some scaffolding obsolete. Audit after every major model release.

Can the Evaluator agent be tricked into approving bad work if the Generator learns to game the evaluation criteria?

Yes, and this is a real concern. The Evaluator’s criteria must be concrete and verifiable – ideally grounded in runtime tests, not subjective judgment. If your criteria are “does the UI look polished,” a Generator can optimize for surface aesthetics while hiding functional bugs. If your criteria are “does this pass all 47 integration tests and handle the three documented edge cases,” gaming becomes much harder. Use automated testing wherever possible, and keep human review for the final gate before production.