OpenAI’s CEO can’t code. Or can’t code well, anyway.
Per The New Yorker’s investigation (published April 6-7, 2026), Sam Altman mixes up basic ML terms. Engineers who worked with him noticed. One board member told reporters he’s “unconstrained by truth.” Another used the word sociopath.
The revelation spread across Reddit and Hacker News within hours. But here’s the part nobody’s talking about: You don’t need to understand neural networks to make smart decisions about AI. Altman’s lack of technical depth didn’t stop him from building a trillion-dollar company. Your lack of a CS degree shouldn’t stop you from evaluating the tools claiming to change your work.
This guide teaches you the evaluation framework Altman’s critics are using – and that you can apply to any AI product, even if you’ve never written code.
Why This Actually Matters
Altman dropped out of Stanford’s CS program after two years. Never developed deep ML expertise, per The New Yorker’s sources. Engineers caught him confusing terminology during technical discussions.
OpenAI has 400 million weekly users (as of mid-2025, per MIT Technology Review). You’re probably one of them.
The real question: if the guy steering the world’s most influential AI company isn’t technical, what does “technical expertise” mean when you’re choosing tools?
Not as much as you’d think. 70% of AI projects fail in production – not because users lack coding skills, but because they skip structured evaluation (Galileo AI deployment data). Meanwhile, 89% of small businesses use AI without formal vetting (KumoHQ report).
The 5-Question Framework
Forget “what model does it use?” Ask these five questions instead – they catch the problems vendor demos hide.
1. Can You Name Three Ways This Tool Breaks?
Not “could it fail,” but actual documented failures.
OpenAI’s safety team once pledged 20% of computing resources to preventing catastrophe. The New Yorker’s reporting: they got 1-2% on outdated hardware before the team dissolved.
When evaluating a tool, search “[tool name] issues” or “[tool name] limitations.” Vendor docs don’t mention failure modes? Red flag. Tools that hide weaknesses aren’t being honest.
Test this: Pick an AI tool you use. Google “[tool] fails” or check its status page. Find three specific scenarios where it breaks. Can’t find them? You’re flying blind.
2. What Happens When You Feed It Garbage?
The ROBOT test – a framework from The LibrAIry project (documented by Delaware County Community College) – includes a key check: how does the tool handle edge cases?
Give it bad data. Ambiguous prompts. Contradictory instructions.
Does it fail gracefully with a clear error, or hallucinate confidently?
ChatGPT has been caught fabricating citations when asked for sources. Claude can sometimes be steered past its safety guardrails with indirect phrasing. Midjourney generated recognizable copyrighted characters until users called it out.
Test AI tools with intentionally vague or contradictory prompts before relying on them. “Write a blog post that’s formal but casual” or “Analyze this dataset” (without uploading one). Tools that admit confusion beat tools that pretend to understand.
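If the tool exposes an API, you can script the garbage test instead of pasting prompts by hand. Here’s a minimal sketch using OpenAI’s Python SDK – it assumes the openai package is installed and an OPENAI_API_KEY is set, and the model name and prompts are just stand-ins for whatever you’re actually evaluating:

```python
# stress_test.py -- run deliberately bad prompts and eyeball how the tool responds.
# Sketch only: assumes the `openai` package and an OPENAI_API_KEY env var.
from openai import OpenAI

client = OpenAI()

# Intentionally vague, contradictory, or underspecified prompts.
GARBAGE_PROMPTS = [
    "Write a blog post that's formal but casual.",
    "Analyze this dataset.",            # no dataset provided
    "Summarize the attached report.",   # nothing attached
    "Give me the exact revenue figure for a company I haven't named.",
]

for prompt in GARBAGE_PROMPTS:
    response = client.chat.completions.create(
        model="gpt-4o-mini",            # swap in whatever model you're evaluating
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content
    print(f"\nPROMPT: {prompt}\nRESPONSE: {answer[:300]}")
    # Good sign: the reply asks for clarification or admits it lacks the data.
    # Bad sign: it confidently invents a dataset, a report, or a number.
```

The script doesn’t score anything. You read the output and judge whether the tool asked for the missing dataset or invented one.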
3. Who Pays When It Screws Up?
The New Yorker investigation asks this implicitly about Altman’s leadership. When OpenAI allegedly broke safety commitments or reneged on deals, who bore the cost?
For you: if an AI coding assistant ships buggy code, does the vendor take responsibility? If a content generator produces plagiarized text, who’s liable?
Read the Terms of Service – the “limitation of liability” section. Most AI tools explicitly disclaim responsibility for output accuracy. That means you own the risk.
Watch for: “Provided as-is,” “no warranty of accuracy,” “not suitable for mission-critical applications.” See these? Factor error-checking into your workflow from day one.
4. Can You Measure What Changed?
Non-technical doesn’t mean non-rigorous.
Purdue’s AI evaluation framework puts it simply: the most important question is which measurable outcome improves.
Distinguish process-centric AI (saves your team time) from product-centric AI (improves customer experience). Each needs different success metrics.
- Process-centric: “This tool cut report generation from 6 hours to 2 hours” – measure time saved, error rate before/after
- Product-centric: “This tool increased conversions by 12%” – measure user behavior, revenue impact, churn
Can’t define success before adopting a tool? You’ll never know if it worked. Demos showcase best-case scenarios, not your Tuesday afternoon reality.
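The before/after math is simple enough to keep in a scratch script. Every number below is a placeholder – plug in your own baseline and pilot measurements, and write the baseline down before you adopt the tool, not after:

```python
# baseline_vs_pilot.py -- toy before/after comparison; all numbers are placeholders.

# Process-centric: did the tool save time without raising the error rate?
hours_per_report_before = 6.0
hours_per_report_after = 2.0
errors_before, reports_before = 4, 50      # errors found in 50 pre-tool reports
errors_after, reports_after = 3, 50        # errors found in 50 tool-assisted reports

time_saved_pct = (1 - hours_per_report_after / hours_per_report_before) * 100
error_rate_before = errors_before / reports_before
error_rate_after = errors_after / reports_after

print(f"Time saved per report: {time_saved_pct:.0f}%")
print(f"Error rate: {error_rate_before:.1%} -> {error_rate_after:.1%}")

# Product-centric: did customer behavior actually change?
conversions_before, visitors_before = 240, 10_000
conversions_after, visitors_after = 269, 10_000

lift = (conversions_after / visitors_after) / (conversions_before / visitors_before) - 1
print(f"Conversion lift: {lift:.1%}")
# If you can't fill in the "before" numbers, you haven't defined success yet.
```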
5. What Does the Tool Not Know About Itself?
Ask the AI to explain its own limitations. Seriously.
Prompt: “What are the three biggest weaknesses of your training data?” or “When should I not trust your output?”
GPT-4 tells you its knowledge cutoff date and that it can’t access real-time data. Claude admits it can’t verify citations. Well-designed tools acknowledge boundaries. Tools that dodge the question? Hiding something.
What This Changes
The Altman story broke because engineers compiled evidence he made decisions without technical depth.
Real issue: the gap between appearing technical and making good decisions.
You don’t need ML expertise to ask:
- Does this tool’s documentation list failure modes?
- How does it behave under stress?
- Who’s responsible when it’s wrong?
- Can I measure the impact?
- Does it acknowledge its limits?
Works for ChatGPT, Midjourney, Jasper, Notion AI, or anything claiming to “change your workflow.”
The irony? In the AI-first era, domain expertise matters more than coding skills. Non-technical founders now have a competitive edge because they understand the problems better than the models do. You don’t need to know how transformers work. You need to know what happens when they fail.
The 10-Minute AI Audit
Pick one AI tool you’ve used this week. Set a timer. Answer these:
| Question | Your Answer |
|---|---|
| What are 3 documented ways this tool fails? | |
| What happens when I give it bad input? | |
| What does the ToS say about liability? | |
| What metric proves it’s working? | |
| Can the tool describe its own limits? | |
Can’t answer at least three? You’re using the tool blind.
Fine for low-stakes tasks – drafting an email, brainstorming ideas. Dangerous for anything affecting your business, your reputation, or someone else’s outcomes.
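If you’d rather keep the audit as something you can rerun for each tool, here’s the same checklist as a trivial Python sketch (the questions mirror the table above; the answer strings are yours to fill in):

```python
# ai_audit.py -- the ten-minute audit as a rerunnable checklist for one tool.
# Leave any answer you don't know as an empty string.
AUDIT = {
    "Three documented ways the tool fails": "",
    "Behavior on bad / ambiguous input": "",
    "What the ToS says about liability": "",
    "Metric that proves it's working": "",
    "Can it describe its own limits?": "",
}

answered = sum(1 for answer in AUDIT.values() if answer.strip())
print(f"Answered {answered}/5 audit questions.")
if answered < 3:
    print("You're using this tool blind -- fine for low stakes, risky for anything else.")
```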
What Matters
Sam Altman allegedly can’t code. Still built OpenAI into a juggernaut.
The lesson isn’t “technical skills don’t matter.” It’s that the right questions matter more than the right background.
Think of it this way: you don’t need to be a mechanic to know when your car’s making a bad noise. You just need to recognize the sound, describe it accurately, and find someone who can diagnose it. Same with AI tools. Pattern recognition beats ML knowledge.
This week’s controversy proves you don’t need to be an ML researcher to demand accountability from AI tools. You need to stop trusting marketing and start testing assumptions.
When a vendor demo shows you a flawless AI workflow, ask about the 1% of cases it breaks. Ask who pays when it hallucinates. Ask how they’d measure success in your workflow, not their demo environment.
Then decide if you trust them. Not because they can code, but because they’re honest about what could go wrong.
FAQ
Do I really need to test AI tools if millions of people already use them?
Yes. Popularity ≠ suitability.
OpenAI: 400 million weekly users. The New Yorker investigation: internal safety protocols allegedly misrepresented. Scale hides problems; it doesn’t solve them. Test every tool for your use case – what works for a student writing essays might catastrophically fail for a lawyer drafting contracts.
What if I don’t understand the technical jargon in a tool’s documentation?
That’s actually useful information. If the docs require ML expertise to understand safety limitations, the vendor is gatekeeping. Demand plain-language explanations of failure modes. Purdue’s evaluation framework recommends assessing “explainability” – can the tool describe what it’s doing in terms you understand? No? You can’t evaluate when it’s wrong. Walk away or ask for better docs before committing. I spent 20 minutes once trying to figure out if a tool’s “confidence threshold” setting was something I needed to adjust. Docs assumed I knew what cross-entropy meant. I didn’t. Switched to a competitor with a UI that said “Low/Medium/High accuracy” instead.
How do I know if an AI tool’s “failure rate” is acceptable for my use?
Reverse-engineer from consequences. Tool drafts social media posts and a mistake means deleting a tweet? High error rates are tolerable. Tool screens job applications and bias means illegal discrimination? Even 1% failure is unacceptable. Match the tool’s documented accuracy to your risk tolerance. Most vendors bury accuracy metrics or test on ideal data – insist on real-world performance numbers or run your own pilot with diverse, messy inputs before scaling.
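A rough way to put a number on “risk tolerance”: multiply the error rate by what one mistake costs you, then compare that to what the tool is worth per month. Every figure below is invented – the point is the shape of the calculation, not the values:

```python
# error_budget.py -- back-of-envelope risk check; every number here is made up.
tasks_per_month = 400            # how often you'll rely on the tool
error_rate = 0.02                # vendor-claimed or pilot-measured failure rate
cost_per_error = 50.0            # $ to catch and fix one mistake (or a reputational proxy)
value_per_task = 15.0            # $ the tool saves or earns per task

expected_error_cost = tasks_per_month * error_rate * cost_per_error
expected_value = tasks_per_month * value_per_task

print(f"Expected monthly error cost: ${expected_error_cost:,.0f}")
print(f"Expected monthly value:      ${expected_value:,.0f}")
# Social-media drafts: cost_per_error is tiny, so a high error rate is tolerable.
# Hiring screens or contracts: cost_per_error is huge (or unbounded), so even
# a 1% failure rate can swamp the value -- don't adopt without a human check.
```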