198 vs. thousands. That’s the number everyone’s skipping in the Claude Mythos hype. Anthropic’s research found real vulnerabilities – 198 of them were manually verified. The “thousands” claim? Extrapolation. You take a 198-sample success rate, project it across a bigger dataset, call it a discovery. Statisticians are fighting about whether that math holds up.
Announced April 2026 (hypothetical scenario for this analysis). Security community split. Some see a breakthrough. Others see a sales pitch with a sample-size asterisk.
What 198 Actually Proves
The research team scanned open-source code. Found vulnerabilities. Manually reviewed 198 CVEs. Those are confirmed – documented, verified, real.
The methodology gap: extrapolating those 198 results to claim “thousands.” The paper shows the math. Security researchers on Twitter started asking: does a 198-sample validation support that projection? The debate isn’t about whether the 198 are real. It’s about whether you can statistically scale that to thousands without more verification.
Think of it like this: you test 198 random door locks in a city. 140 have the same flaw. Can you claim “thousands of vulnerable locks citywide”? Depends. How did you sample? How uniform are the locks? What’s your confidence interval? The Mythos paper addresses this – but the headline claims don’t include those caveats.
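The lock analogy can be made concrete. A minimal sketch of the confidence-interval question, using a Wilson score interval (the 140-of-198 numbers come from the analogy above; the 50,000-lock city total is an invented illustration):

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

# 140 of 198 sampled locks share the flaw
lo, hi = wilson_interval(140, 198)
print(f"sample flaw rate: {140/198:.1%}, 95% CI: [{lo:.1%}, {hi:.1%}]")

# Extrapolating to a hypothetical 50,000 locks citywide gives a wide range,
# and only holds if the 198 were a truly random sample:
print(f"projected flawed locks: {int(50_000 * lo):,} to {int(50_000 * hi):,}")
```

Even with a clean random sample, 198 observations give you a roughly 64–77% band, not a point estimate. If the sampling wasn't uniform, the projection is weaker still.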
Your AI Security Claim Checklist
Any “AI discovers X” announcement:
- Sample size vs. headline – How many findings were human-verified? 198 verified vs. thousands claimed = extrapolation, not proof.
- Independent reproduction – Has anyone outside the company replicated results? As of mid-2026, no third-party security lab has published Mythos verification.
- Product launch timing – Anthropic rolled out enterprise security tooling around the same time. Research can be real AND be a demo.
- Severity breakdown – Critical RCE flaws or low-severity edge cases? The paper has this. Headlines skip it.
This pattern repeats across every AI security announcement. The checklist doesn’t change.
What You Can Actually Run Today
Mythos isn’t a product. You can’t access it. But standard Claude (API or web) can review code if you prompt it right.
What works:
```
Analyze this [language] code for security vulnerabilities.

Focus on:
- Input validation issues
- Authentication bypasses
- SQL injection vectors
- XSS possibilities

[paste your code]

List each finding with:
1. Vulnerability type
2. Affected lines
3. Exploitation scenario
4. Fix recommendation
```
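If you're scanning regularly, it helps to assemble that prompt programmatically. A minimal sketch; the helper name and the sample snippet are mine, and the commented-out API call is an assumption about the `anthropic` SDK shape, so check the current docs before using it:

```python
def build_security_prompt(language: str, code: str) -> str:
    """Assemble the security-review prompt shown above."""
    return (
        f"Analyze this {language} code for security vulnerabilities.\n\n"
        "Focus on:\n"
        "- Input validation issues\n"
        "- Authentication bypasses\n"
        "- SQL injection vectors\n"
        "- XSS possibilities\n\n"
        f"{code}\n\n"
        "List each finding with:\n"
        "1. Vulnerability type\n"
        "2. Affected lines\n"
        "3. Exploitation scenario\n"
        "4. Fix recommendation"
    )

sample = 'query = "SELECT * FROM users WHERE id=" + user_id'
prompt = build_security_prompt("Python", sample)

# Sending it via the API (model name illustrative -- verify against current docs):
# import anthropic
# reply = anthropic.Anthropic().messages.create(
#     model="claude-sonnet-4-20250514",
#     max_tokens=2048,
#     messages=[{"role": "user", "content": prompt}],
# )
```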
This catches common mistakes. OWASP Top 10 stuff – the boring vulnerabilities that actually get you breached. It won’t find novel zero-days. For most developers? That’s fine. The exploit that hits you is usually the unpatched known issue, not the latest attack.
Run AI scans on code you KNOW has vulnerabilities first (DVWA, WebGoat). This calibrates your expectations. You’ll learn what the model catches vs. what it misses. False negatives matter more than false positives in security.
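The calibration step reduces to set arithmetic. A sketch of scoring a scan against a deliberately vulnerable app; the vulnerability IDs are invented stand-ins for whatever labels your test target (DVWA, WebGoat) documents:

```python
# Vulnerabilities you know are planted in the test app (illustrative IDs)
known = {"sqli-login", "xss-search", "csrf-profile", "idor-orders", "lfi-viewer"}
# What the AI scan actually flagged
flagged = {"sqli-login", "xss-search", "weak-headers", "idor-orders"}

caught = known & flagged
missed = known - flagged   # false negatives: the dangerous gap
extra = flagged - known    # possible false positives: verify by hand

recall = len(caught) / len(known)
print(f"caught {len(caught)}/{len(known)} (recall {recall:.0%})")
print(f"missed: {sorted(missed)}, needs manual check: {sorted(extra)}")
```

Track `missed` over a few runs. That's your false-negative profile, and it's the number that tells you where the model cannot be trusted alone.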
The Sample Size Debate
The weird part: the methodology is public. The 198-sample limitation is in the paper. Anthropic didn’t hide it. But press coverage ran with “thousands.”
Why does this matter? If you’re evaluating whether to trust AI security tools, you need to know what’s proven vs. projected. A tool that catches 70% of vulnerabilities in 198 samples might catch 70% in production. Or 30%. Sample size matters. The statistical validity question isn’t resolved.
Security tools evolve through this cycle. Early research shows promise (small sample). Community tests it (larger sample). Either the results hold or they don’t. We’re in phase 1 with Mythos. Phase 2 hasn’t happened yet.
Common Mistakes
Trusting AI findings without human review. Every flagged vulnerability needs verification. False positives are common. Claude once flagged my secure authentication code as vulnerable because it pattern-matched against a common exploit without understanding the context. The code was fine.
Assuming “thousands” means “all critical.” Severity distribution matters. 10 critical RCE bugs > 1,000 low-severity info disclosures. The paper breaks this down. Headlines don’t.
Single-tool reliance. Don’t use one AI model for security. Run static analysis (Semgrep, Bandit, SonarQube) alongside AI scans. Overlap = probably real. Divergence = needs human judgment.
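The overlap-vs-divergence triage is also mechanical. A sketch, assuming you've normalized each tool's output into `(file, line, type)` tuples (the findings below are made up):

```python
# Normalized findings from two independent tools (illustrative data)
ai_findings = {
    ("app.py", 42, "sqli"),
    ("auth.py", 10, "hardcoded-secret"),
    ("views.py", 88, "xss"),
}
semgrep_findings = {
    ("app.py", 42, "sqli"),
    ("views.py", 88, "xss"),
    ("utils.py", 5, "weak-hash"),
}

confirmed = ai_findings & semgrep_findings     # both tools agree: probably real
needs_review = ai_findings ^ semgrep_findings  # flagged by one tool only

print(f"{len(confirmed)} overlapping findings, {len(needs_review)} for human review")
```

In practice the normalization is the hard part (tools disagree on line numbers and category names), but the triage logic stays this simple.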
Performance Data
| Metric | Mythos Research | Standard Claude | Traditional SAST |
|---|---|---|---|
| Verified Findings | 198 CVEs | Not disclosed | Varies by tool |
| Claimed Potential | Thousands (extrapolated) | N/A | Based on ruleset |
| False Positive Rate | Not published | High (community feedback) | Medium to High |
| Independent Verification | None yet (as of mid-2026) | N/A | Extensive |
The missing false positive rate is telling. Security tools live or die by this. A tool that finds 1,000 vulnerabilities but generates 900 false positives? Worse than useless. It’s noise.
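The "worse than useless" claim is just precision arithmetic, using the hypothetical 1,000-findings/900-false-positives tool from the paragraph above:

```python
def precision(true_positives: int, total_findings: int) -> float:
    """Fraction of flagged findings that are real."""
    return true_positives / total_findings

# Hypothetical tool: 1,000 findings, 900 false positives
print(f"precision: {precision(1000 - 900, 1000):.0%}")  # nine in ten alerts are noise

# A tool reporting 50 findings with 10 false positives is far more usable:
print(f"precision: {precision(40, 50):.0%}")
```

A 10%-precision tool means your engineers burn nine verification cycles for every real bug, which is why vendors that won't publish this number deserve skepticism.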
When to Skip AI Security Scanning
Compliance audits: SOC 2, PCI DSS frameworks require specific tooling and manual pen testing. AI-generated reports won’t satisfy auditors.
Production-only security: AI tools are reconnaissance during development. They don’t prevent runtime exploitation. Don’t treat them as defense.
Closed-source proprietary code with cloud AI: You’re sending your code to external servers. Data leak risk. Use local static analysis for sensitive codebases.
The Commercial Angle
Timing matters. This research positions Anthropic in enterprise security right as companies panic about AI-powered attacks. The 198 verified findings are real. The research is also a product demo.
Google publishes AI research, launches products. OpenAI publishes safety research, launches models. Anthropic publishes security research, launches security tooling. Research and sales pitch coexist. Always have.
For developers: use the tools, verify results, check the methodology before buying the hype.
Frequently Asked Questions
Can I use Mythos to scan my code now?
No. Research project, not a product. Use standard Claude for basic review (API or web). It’s not the specialized system from the paper. No public release timeline as of mid-2026.
Why does everyone say “thousands” if only 198 were verified?
Extrapolation. The team manually reviewed 198 samples, then projected results across a larger dataset. That statistical model suggests thousands of potential findings – but those weren’t individually verified. Security researchers debate whether 198 samples support that projection. Depends on sampling methodology and confidence intervals. The controversy isn’t about the 198. It’s about scaling that number. Some argue the sample size is too small for the claim. Others say the statistical method is sound. The paper has the math. Most headlines skip it. Read the methodology section if you want the real answer.
Should I worry about AI autonomously finding zero-days?
No. AI-assisted vulnerability detection has existed for years. This is incremental improvement, not a paradigm shift. The “autonomous” framing oversells it – the research needed significant human oversight, manual verification, and infrastructure. Focus on patching known vulnerabilities in your stack first. That’s what actually gets exploited. Take one codebase, run it through Claude with the security prompt above, and manually verify each finding. The ratio of real issues to false positives tells you whether AI security tools are worth your time.