When Frontier LLMs Disagree on Fact-Checks: A Practical Guide

Frontier LLMs disagree on 67% of real-world fact-checks. Here's a hands-on panel method to spot when a single model is quietly wrong.

Drew Sullivan · 2026-05-28 · 7 min read · Beginner

A new snapshot from Lenz.io is making the rounds on Hacker News this week, and the headline number is uncomfortable: five frontier LLMs disagree on 67% of real-world fact-check claims. Not synthetic benchmarks. Real user-submitted questions. The kind you’d actually paste into ChatGPT.

Before we get into the study, the practical question: if frontier LLMs disagree two-thirds of the time, how should you actually use them to check things? There are two reasonable approaches. One is clearly better.

The key takeaway, upfront

If you only remember one thing: a single LLM’s verdict on a claim tells you little about what a different LLM would say. In the Lenz snapshot, a five-model panel failed to reach a unanimous verdict on about two-thirds of claims. The fix isn’t “use a better model” – it’s running a tiny panel and treating the disagreement itself as the signal.

The same 4-bucket rubric the researchers used is below. You can run this in three browser tabs.

What the study actually measured

The Lenz team took 1,000 recent claims submitted by real users to their fact-checking platform and asked the five top frontier LLMs to each pick one of four verdicts: True, Mostly True, Misleading, or False. On 67% of those claims, at least one model broke from the majority – or no majority formed at all.

672 of 1,000 claims (95% CI: 64-70%) had at least one model dissenting from the majority. More telling: 343 claims (95% CI: 31-37%) showed a 2+ bucket gap between the most-disagreeing pair – the difference between “True” and “Misleading,” not calibration noise.
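
The quoted intervals can be reproduced approximately from the raw counts. The sketch below uses the normal-approximation (Wald) interval for a proportion. The snapshot does not say which interval method it used, so that choice is an assumption, and the function name `proportion_ci` is illustrative.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) 95% interval for a proportion.

    Assumption: the source does not state its interval method; Wald is
    used here only as the simplest option.
    """
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


any_dissent = proportion_ci(672, 1000)  # at least one model dissenting
wide_gap = proportion_ci(343, 1000)     # 2+ bucket gap between some pair

print([round(100 * x) for x in any_dissent])  # [64, 70]
print([round(100 * x) for x in wide_gap])     # [31, 37]
```

With 1,000 claims the half-width is about 3 percentage points either way, so the headline figures are not artifacts of a small sample.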

The corpus matters as much as the number. Lenz’s claims are structurally fresh: real-user submissions from the past 180 days, never paired with canonical verdicts in any public training set, spanning health, science, finance, history, tech, and legal questions. That’s why this hits harder than older results – the models can’t have memorized the answers.

Snapshot v1.0, data as of May 21, 2026 (DOI: 10.5281/zenodo.20344847). Numbers may shift as model versions update.

Method A vs. Method B: which to use

You have two realistic options when you want an LLM to check a claim.

Approach	What it is	What it catches	What it misses
A. Re-roll same model	Ask GPT-5 the same question 3 times, take majority	Stochastic noise from sampling	Systematic bias baked into that model’s training
B. Panel of different models	Ask 3-5 different frontier models once each, look for agreement and spread	Both stochastic noise and model-specific biases	Shared blind spots across all frontier models (training data overlap)

Method B wins. Frontier LLMs are non-deterministic – a re-run with the same model and prompt can produce a somewhat different answer – but those differences are sampling jitter. The 67% disagreement rate is across different models, which means there’s a systematic component you can’t shake out by re-rolling.
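
A back-of-the-envelope calculation shows why re-rolling helps with noise but not with bias. The sketch below simplifies each sample to "right" or "wrong" and assumes samples are independent; the probabilities are illustrative, not taken from the study, and `majority_wrong` is a hypothetical helper name.

```python
from math import comb


def majority_wrong(p_wrong, samples=3):
    """Probability that a majority of independent samples is wrong.

    Assumption: each sample is wrong with the same probability p_wrong,
    independently of the others (a binary simplification of the rubric).
    """
    needed = samples // 2 + 1
    return sum(
        comb(samples, k) * p_wrong**k * (1 - p_wrong) ** (samples - k)
        for k in range(needed, samples + 1)
    )


# Sampling noise: wrong 20% of the time per sample, majority-of-3 helps.
print(round(majority_wrong(0.2), 3))  # 0.104

# Systematic bias: the model is wrong on this claim every time.
print(majority_wrong(1.0))  # 1.0
```

Majority voting roughly halves the error in the noisy case and does nothing in the biased case, which is the gap Method B is meant to cover.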

Same pattern, different domain. The Benchmark Illusion paper found similar behavior on academic benchmarks: disagreement among comparable models runs 16-66% on MMLU-Pro and 17-65% on GPQA. Even among top frontier models with accuracy above 60%, disagreement stays between 16% and 38%.

How to run a panel fact-check in 5 minutes

Three browser tabs. No API budget required.

Step 1: Pick three models from different labs

Don’t pick GPT-5 + GPT-5-mini + GPT-4o. That’s one perspective. Pick across labs – one from OpenAI, one from Anthropic, one from Google. The point is to sample training distributions, not parameter counts.

Step 2: Use the same prompt the researchers used

Copy this verbatim into each tab:

Classify this claim as of [today's date]: "[your claim here]"

Output exactly one label: True, Mostly True, Misleading, or False.
No explanations, no qualifiers.

The prompt looks simple, but two details in it are doing real work. “As of [date]” pins the temporal frame – a model with an older knowledge cutoff would otherwise silently answer about a different moment. Forbidding explanations forces commitment: once a model can hedge, every verdict becomes “it depends.” (This is the exact format used in the Lenz study, surfaced in the Hacker News discussion thread.)
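
If you would rather fill in the template programmatically than by hand, a minimal sketch follows. The ISO date format, the helper names `build_prompt` and `parse_label`, and the strict parsing rule are all assumptions for illustration; nothing here calls a model API.

```python
from datetime import date

LABELS = ("True", "Mostly True", "Misleading", "False")


def build_prompt(claim, as_of=None):
    """Fill the template with a claim and an explicit date.

    Assumption: ISO format (YYYY-MM-DD) for the date; the template itself
    only says "[today's date]".
    """
    as_of = as_of or date.today()
    return (
        f'Classify this claim as of {as_of.isoformat()}: "{claim}"\n'
        "\n"
        "Output exactly one label: True, Mostly True, Misleading, or False.\n"
        "No explanations, no qualifiers."
    )


def parse_label(reply):
    """Return the matching label, or None if the reply is anything else."""
    cleaned = reply.strip().strip(".\"'").casefold()
    for label in LABELS:
        if cleaned == label.casefold():
            return label
    return None  # the model hedged or explained; treat as no verdict
```

Rejecting anything that is not exactly one label keeps a hedged reply such as "True, but it depends" from being counted as a clean "True".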

Step 3: Read the spread, not just the majority

Three outcomes to watch:

All three agree – proceed, but still verify the underlying source if stakes are high.
2-1 split with adjacent buckets (e.g., True / Mostly True / Mostly True) – soft signal. Probably true with caveats worth digging into.
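Gap of two or more buckets anywhere (e.g., True vs. Misleading) – stop. The Lenz data puts 34% of claims in this category (measured on a five-model panel, so a three-model panel may see it somewhat less often). You don’t have an answer; you have a research project.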
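
The three cases above reduce to one number: the distance between the most optimistic and the most pessimistic verdict on the ordered rubric. A minimal sketch, assuming a three-model panel; the function name and the returned strings are illustrative labels, not terms from the study.

```python
# Rubric in order, so list position doubles as a severity score.
ORDER = ["True", "Mostly True", "Misleading", "False"]


def read_spread(verdicts):
    """Classify a panel's verdicts by the widest gap between any two."""
    positions = [ORDER.index(v) for v in verdicts]
    gap = max(positions) - min(positions)
    if gap == 0:
        return "agree"           # proceed, verify the source if stakes are high
    if gap == 1:
        return "adjacent split"  # soft signal, look for caveats
    return "wide gap"            # stop, this needs real research


print(read_spread(["True", "Mostly True", "Mostly True"]))  # adjacent split
print(read_spread(["True", "Misleading", "True"]))          # wide gap
```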

Pro tip: When you see a 2-bucket gap, ask each model the same follow-up: “What single source would change your verdict?” The models that name a real, checkable source are usable. The ones that gesture vaguely are guessing.

Step 4: Log what you found

A notes file with claim, three verdicts, and your resolution. After 20 claims you’ll see patterns – which model leans “Misleading” on health claims, which one over-trusts numeric specificity, which one is consistently the outlier on recent events. That calibration is the actual skill.
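
A plain CSV file is enough for the log. The sketch below appends one row per claim and counts how often each model was the lone dissenter in a 2-1 split. The column names (`model_a`, `model_b`, `model_c`) and both function names are placeholders; rename them to whichever models you actually use.

```python
import csv
from collections import Counter
from pathlib import Path

MODELS = ["model_a", "model_b", "model_c"]  # placeholder column names
FIELDS = ["claim", *MODELS, "resolution"]


def append_entry(path, entry):
    """Append one row to the log, writing the header if the file is new."""
    log = Path(path)
    is_new = not log.exists() or log.stat().st_size == 0
    with log.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(entry)


def lone_outliers(path):
    """Count, per model, how often it was the single dissenter in a 2-1 split."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            verdicts = {m: row[m] for m in MODELS}
            tally = Counter(verdicts.values())
            if sorted(tally.values()) == [1, 2]:  # exactly a 2-1 split
                for model, verdict in verdicts.items():
                    if tally[verdict] == 1:
                        counts[model] += 1
    return counts
```

After a few dozen rows, `lone_outliers("factchecks.csv")` gives a rough picture of which model most often breaks from the other two. The file name is just an example.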

The catch nobody talks about

Here’s the part that the trending tweets miss. Verdict ambiguity is partly a task property, not just an LLM property. On AVeriTeC (4,568 real-world claims drawn from fact-checks by 50 organizations and annotated through multi-round review), inter-annotator agreement reaches κ=0.619. Substantial, but well short of perfect.

Human fact-checkers don’t fully agree either. When your three-model panel splits 2-1 on “Mostly True” vs. “Misleading,” it’s not always proof a model is wrong. Sometimes the claim itself is genuinely ambiguous, and the right move is to rewrite the question, not to keep asking models.

That’s a humbling thing to sit with.

Edge cases worth knowing

Temporal framing collapses to factual disagreement. A claim like “the World Bank’s active portfolio in Nigeria is over $16.4 billion” can be True in 2025 and False in 2023. Models with different knowledge cutoffs will produce a 2-bucket gap that looks like factual disagreement but is really a date mismatch. Always include the “as of [date]” line.

Don’t test your panel on AVeriTeC or PolitiFact. Famous fact-check corpora have been publicly available for years and are almost certainly in current frontier-model training data. Measured disagreement on them confounds true inference disagreement with memorization. Use claims from the last 30 days, from sources the models are unlikely to have ingested.

Krippendorff’s α is a reference point, not a target. If your own panel hits α ≈ 0.6, you’re in line with the Lenz panel, which scored 0.639 across 5 raters on 1,000 items. Hitting 0.9 on your own claims isn’t necessarily a good sign. It usually means one model is dominating, or the questions are too easy to be worth checking.
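
To compute α on your own log, a minimal sketch follows. It implements the nominal version (any two different labels count as equally far apart) and assumes every model rated every claim. Whether the Lenz figure used a nominal or an ordinal distance is not stated here, so your number may not be directly comparable.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels, no missing ratings.

    `units` is a list of per-claim verdict lists, one verdict per model.
    """
    # Coincidence matrix: each ordered pair of ratings within a unit
    # contributes 1 / (m - 1), where m is the number of raters.
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a single rating carries no agreement information
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more ratings")

    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        return float("nan")  # only one label was ever used; alpha is undefined
    return 1 - observed / expected


panel = [["True", "True"], ["False", "False"], ["True", "False"]]
print(round(krippendorff_alpha_nominal(panel), 3))  # 0.444
```

An ordinal variant would penalize True vs. False more heavily than True vs. Mostly True, which fits the 4-bucket rubric better but needs a distance function this sketch leaves out.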

Knowledge gaps run deeper than verdicts suggest. The Mining the Mind research on GPT-4.1’s extracted knowledge base reported 75% overall factual accuracy (as of the paper’s publication). For relational knowledge specifically – “spouse,” “sibling” – completeness drops to 16-23%. A confident verdict isn’t a knowledgeable verdict.

FAQ

Does this mean I shouldn’t use LLMs for fact-checking at all?

No. It means don’t use one LLM as a verdict machine. A panel works fine as a triage step before real source-checking.

I’m a free-tier user and can’t access multiple frontier models. What now?

ChatGPT free, Claude.ai free, and Gemini free all give you access to capable models – usage limits vary and change often, so check each product’s current plan page. For the claims that actually matter, run the same prompt across three tabs. You won’t get volume, but you’ll get the panel effect where it counts. For everything else: treat a single-model verdict as a strong hypothesis, check the source the model cites before quoting it, and move on.

Why four buckets instead of just True / False?

Most real claims aren’t binary. “Vitamin D prevents colds” has a kernel of truth, a misleading framing, and a false implication – all in one sentence. Binary classification hides exactly that kind of asymmetry. Hoes et al. (2023) showed this directly: ChatGPT scored 72% accuracy on 12,784 fact-checked statements, but accuracy on true claims was higher than on false ones. A binary label would have buried that gap. The 4-bucket rubric is a compromise – granular enough to catch the messy middle, simple enough that models commit to one option.

Next: open three tabs right now – one ChatGPT, one Claude, one Gemini – paste the prompt template from Step 2, and run it on the last “is this actually true?” question you wondered about this week. The first time you watch three frontier models land on different buckets, the rest of this article stops feeling abstract.