The most interesting number in the Harvard Science paper isn’t 67%. It’s 89% versus 34%. That’s the gap between OpenAI’s o1 and human doctors on management reasoning – the part of medicine that comes after the diagnosis. Headlines are obsessed with the diagnostic horse race. The real story is that the model is better at the messy thinking that follows.
Every news outlet led with the same stat this week: o1 offered the exact or very close diagnosis in 67% of triage cases, compared to 55% and 50% for the two human physicians. Fair enough – it’s a clean number from a real ER. But if you actually want to use what this study teaches, the headline is the wrong place to start. Let’s rebuild the prompt structure that produced these results, then talk about where it breaks.
What the study actually did (the 90-second version)
A Harvard / Beth Israel / Stanford team published six experiments on April 30, 2026, led by Brodeur, Buckley, Manrai, and Rodman. The one that hit the news used 76 real ER cases pulled straight from the electronic health record at Beth Israel Deaconess – no preprocessing, no cleanup. The model got the same data the doctors did: vitals, demographics, the nurse’s intake note.
Two attendings produced differentials. So did o1 and GPT-4o. A separate pair of attendings then graded everything blind. One rater couldn’t tell whether a differential came from a human or the AI in 83.6% of cases – the other was at 94.4%, per the arXiv preprint. That detail is buried in the supplement and almost no one is quoting it. It matters more than the 67% if you’re building anything with this.
What does it mean that trained physicians can’t distinguish o1’s reasoning from a colleague’s? That’s genuinely worth sitting with. It doesn’t settle the deployment question – but it does shift the frame from “AI as calculator” to something harder to categorize. Whether that’s exciting or unsettling probably depends on where you sit in the healthcare system.
The hands-on part: how to use o1 for clinical-style reasoning
Scope, plainly: this is for learning, drafting, second opinions on your own case write-ups, or building research tools. Not for diagnosing real patients without a licensed clinician in the loop. Here’s how to replicate the study’s prompt pattern.
Step 1: Pick the right model variant
The paper used o1-preview for most experiments – not the full o1 release. Per the study’s supplementary materials, o1-preview launched September 12, 2024 and full o1 followed December 5, 2024. Both carry an October 2023 pretraining cutoff. That cutoff is the hidden trap: any post-2023 guideline – new sepsis bundles, updated antibiotic recommendations – is invisible to the model unless you paste it in yourself.
Also worth knowing: o1 is now a legacy model in OpenAI’s catalog. For most reasoning tasks today, the standard o1 model in ChatGPT Plus is close enough to replicate the study’s behavior. o1-pro went live in the API in March 2025 at $150 per million input tokens and $600 per million output – which is why most independent replications stick to the cheaper tier. Check the OpenAI platform page before committing to any specific snapshot; the catalog moves fast.
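If you're working through the API rather than ChatGPT, a quick way to see which o1 snapshots your key can actually reach is to list the model catalog. A minimal sketch, assuming the official OpenAI Python SDK and an OPENAI_API_KEY in your environment:

```python
# Minimal sketch: list which o1-family snapshots your API key can see.
# Assumes the official OpenAI Python SDK and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

# Filter the catalog for o1 variants. Names and availability change often,
# so treat this as a quick check, not a guarantee of study-equivalent behavior.
o1_models = [m.id for m in client.models.list().data if m.id.startswith("o1")]
print("Available o1 variants:", sorted(o1_models))
```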
Step 2: Mirror the study’s input structure
The Harvard team didn’t fine-tune anything. They just pasted the chart. Here’s a sanitized template based on what they fed the model:
You are reviewing an emergency department patient encounter.
Produce a ranked differential diagnosis (top 5) and explain your reasoning.
=== TRIAGE INFO ===
Age / sex / arrival mode:
Chief complaint:
Vitals (HR, BP, RR, SpO2, Temp):
Nurse intake note (verbatim):
=== PMHx / Meds / Allergies ===
=== HPI ===
=== Exam findings ===
=== Labs / imaging available ===
Rank by likelihood. Flag any 'cannot-miss' diagnoses
separately even if low probability.
The explicit “cannot-miss” instruction does real work here. Asking for it forces the model to widen its net rather than converge on the most likely answer; early convergence is exactly the wrong behavior at triage, where ruling out the deadly option matters more than naming the probable one. The verbatim nurse note is the other key: the study’s point was that messy real-world text didn’t break the model, so don’t sanitize your input before you paste it.
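If you want to run this programmatically rather than pasting into ChatGPT, here’s a minimal sketch of wiring the template into an API call, assuming the OpenAI Python SDK. The chart fields are illustrative placeholders, not data from the study:

```python
# A minimal sketch of the study-style prompt, assuming the OpenAI Python SDK.
# The case fields below are made-up placeholders, not the study's data.
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = """You are reviewing an emergency department patient encounter.
Produce a ranked differential diagnosis (top 5) and explain your reasoning.

=== TRIAGE INFO ===
Age / sex / arrival mode: {demographics}
Chief complaint: {chief_complaint}
Vitals (HR, BP, RR, SpO2, Temp): {vitals}
Nurse intake note (verbatim): {nurse_note}

Rank by likelihood. Flag any 'cannot-miss' diagnoses separately even if low probability."""

case = {
    "demographics": "62 M, arrived by ambulance",
    "chief_complaint": "Chest pain radiating to the back",
    "vitals": "HR 110, BP 168/94 mmHg, RR 22, SpO2 96% on room air, Temp 37.1 C",
    "nurse_note": "pt c/o sudden tearing chest pain x1hr, diaphoretic, anxious",
}

# o1-family models take a plain user message; no system prompt is used here.
response = client.chat.completions.create(
    model="o1",  # or "o1-preview" to match the paper's main experiments
    messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(**case)}],
)
print(response.choices[0].message.content)
```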
Step 3: Stage the information like the study did
The Harvard team didn’t dump everything at once. Cases were structured into stages mirroring actual ED workflow – triage, ED workup, admission – and at each stage the model produced a differential. Do the same. Run o1 on triage info alone first, then add labs, then imaging. You’ll watch the differential narrow in real time, which is far more instructive than handing it the full chart and reading the answer.
After o1 produces a differential, paste it back in a second prompt and ask: “What disconfirming evidence would you look for to rule out diagnosis #1? What single test would change your ranking the most?” This mimics the management-reasoning task where the model hit 86-89% median against ~34% for doctors using UpToDate.
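Here’s a sketch of that staged loop under the same assumptions as above. The stage text is a made-up example; the point is that the conversation history is carried forward so each new differential revises the last, ending with the disconfirming-evidence probe:

```python
# A sketch of the staged workflow: feed the case in study-style stages and
# re-request a differential each time. Stage contents are hypothetical examples.
from openai import OpenAI

client = OpenAI()

stages = [
    ("Triage", "62 M, tearing chest pain x1hr, HR 110, BP 168/94 mmHg, SpO2 96% RA."),
    ("ED workup", "Labs: troponin <0.01 ng/mL, D-dimer 4.2 ug/mL FEU, Cr 1.4 mg/dL."),
    ("Imaging", "CT angiogram report (text): intimal flap in the descending aorta."),
]

messages = []
for stage_name, stage_data in stages:
    messages.append({
        "role": "user",
        "content": f"New information ({stage_name}): {stage_data}\n"
                   "Update your ranked differential (top 5) and flag cannot-miss diagnoses.",
    })
    reply = client.chat.completions.create(model="o1", messages=messages)
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    print(f"--- Differential after {stage_name} ---\n{answer}\n")

# Second-pass management probe from the article.
messages.append({
    "role": "user",
    "content": "What disconfirming evidence would you look for to rule out diagnosis #1? "
               "What single test would change your ranking the most?",
})
followup = client.chat.completions.create(model="o1", messages=messages)
print(followup.choices[0].message.content)
```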
Common pitfalls people are already hitting
The Hacker News thread on this paper is full of valid criticism. Read it before you get excited.
- Don’t paste lab values without units. o1’s chain-of-thought will sometimes assume metric, sometimes US units, and quietly contradict itself. Always include units explicitly (a quick pre-paste check is sketched after this list).
- Don’t ask “is this X?” – confirmation-bias the prompt and the model will agree with you. Ask for a ranked differential first, then probe.
- Don’t skip the cannot-miss flag. Without it, the model produces tighter, more confident lists – which is exactly the wrong behavior in triage.
- Don’t expect it to read images. The study used text inputs only. Paste a chest X-ray and you’re testing something the paper never tested.
- Don’t trust the 67% on your own messy chart. The Beth Israel notes were written by trained ED clinicians. Your dictated voice memo is not the same input distribution.
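On the units pitfall, a crude check before pasting is enough to catch most slips. The unit list below is an illustrative heuristic I’m assuming, not a clinical standard:

```python
# Rough heuristic for the units pitfall: warn if a lab line has a number but no
# recognizable unit. The unit list is illustrative, not exhaustive.
import re

COMMON_UNITS = r"(mg/dL|mmol/L|ng/mL|g/dL|mEq/L|U/L|mmHg|%|K/uL|x10\^9/L)"

def lines_missing_units(lab_text: str) -> list[str]:
    flagged = []
    for line in lab_text.splitlines():
        has_number = re.search(r"\d", line)
        has_unit = re.search(COMMON_UNITS, line, re.IGNORECASE)
        if has_number and not has_unit:
            flagged.append(line.strip())
    return flagged

labs = "Sodium 138 mEq/L\nCreatinine 1.4\nTroponin 0.02 ng/mL"
print(lines_missing_units(labs))  # ['Creatinine 1.4'] -- add the unit before pasting
```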
The performance numbers in context
One stat per row, because the comparisons matter more than any single number.
| Task | o1-preview | Best human baseline |
|---|---|---|
| ER triage diagnosis (76 cases) | 67% | 55% |
| Full ED workup (same cases) | ~80%+ | ~80%+ (converges) |
| NEJM CPC diagnosis included in differential (143 cases) | 78.3% | Physicians with search tools (prior studies; no direct comparison available) |
| Management reasoning (5 expert vignettes) | 86-89% median | ~34% (humans with UpToDate) |
Notice how the gap shrinks at full workup and explodes on management reasoning. Within the diagnostic task, the gap is biggest where the information is sparsest – counterintuitive only if you assume physicians have some special bedside intuition that activates under pressure. The data suggest humans generate narrow differentials when they should be casting a wide net. The model isn’t wired that way.
When NOT to use this
Three situations where o1 is the wrong tool, regardless of how shiny the headline number looks.
Real patients, no clinician in the loop. The authors are not arguing the model is ready to call the shots when someone is actively dying. The paper calls for prospective trials before any clinical deployment. And the 67% also means 33% wrong – with no data yet on what the failure modes look like outside a controlled case set.
Pediatrics, pregnancy, or rare-disease populations underrepresented in the training data. The study sampled adult ED admissions in Boston. Generalization to other populations is unproven.
Anything time-critical where you need a confidence score. As Buckley acknowledged at the press briefing, o1 does hallucinate – though in most cases it at least suggests something helpful. “Helpful” is not “safe.” There’s a meaningful difference between a hallucination that’s in the right ballpark and one that misses a cannot-miss diagnosis entirely, and the study doesn’t yet characterize that failure distribution.
The Hacker News critique is sharper still: human doctors don’t normally diagnose from notes alone, yet the study’s setup required exactly that. The task asked them to do something they aren’t trained to do in this exact form, then concluded the AI outperforms them. Worth keeping in your back pocket when someone shares the headline as if it’s settled science.
So what should you actually do this week?
Take a de-identified case from any teaching collection – NEJM Clinicopathological Conferences are free for the first paragraph – and run the staged prompt above through o1 in ChatGPT. Compare the differential at triage stage to the published answer. Then re-run with one extra lab added. Watch how the ranking shifts. That ten-minute exercise will teach you more about reasoning models than another think-piece will.
FAQ
Can I use o1 for actual clinical decisions right now?
No. The authors said the technology needs prospective trials before bedside use, and regulators haven’t cleared it for diagnostic decisions.
Why did o1 do so much better on management reasoning than diagnosis?
Management reasoning – antibiotic choice, which test to order next, goals-of-care decisions – requires holding many competing factors simultaneously and generating a structured plan, not just naming a disease. That maps well to how reasoning models build chains of thought. The study’s own data shows the gap inverts as information gets richer: at triage (sparse info), o1 leads by 12 percentage points. By full workup, both sides are above 80%. The management task never gets simpler, so the model’s advantage holds. If you’re going to use o1 for anything clinical-adjacent, workup planning is probably the higher-value use case.
Should I use o1, o1-pro, or wait for the next model?
Standard o1 in ChatGPT Plus replicates the study’s behavior closely enough for learning. o1-pro at $150/$600 per million tokens is overkill unless you’re running large-batch evaluations. Honestly? By the time you read this, there’s a reasonable chance a successor model already does the same thing cheaper – check the OpenAI platform page before committing.
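If you do price out a batch run, the arithmetic is simple enough to sanity-check in a few lines. The per-case token counts below are my assumptions, not measurements from the paper:

```python
# Back-of-envelope cost check for the o1-pro pricing quoted above
# ($150 / $600 per million input / output tokens). Token counts are assumed.
INPUT_PRICE_PER_M = 150.0    # USD per million input tokens
OUTPUT_PRICE_PER_M = 600.0   # USD per million output tokens

cases = 76                    # size of the ER triage case set
input_tokens_per_case = 2_000    # assumed: chart text plus instructions
output_tokens_per_case = 3_000   # assumed: reasoning plus ranked differential

cost = cases * (
    input_tokens_per_case / 1e6 * INPUT_PRICE_PER_M
    + output_tokens_per_case / 1e6 * OUTPUT_PRICE_PER_M
)
print(f"Estimated o1-pro cost for {cases} cases: ${cost:,.2f}")  # ~ $159.60
```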