Five frontier LLMs disagree on 67% of 1k real-world fact-check claims

https://news.ycombinator.com/rss Hits: 2
Summary

Lenz Research · Snapshot v1.0 · data as of May 21, 2026

On 67% of real fact-checks, top AI models don't agree on the answer. 1,000 claims, rated by 5 frontier LLMs.

Beyond Benchmarks: Disagreement Among Frontier LLMs on Real-World Fact-Checks
Jordanov, Kosta · Lenz Research · kosta@lenz.io

We presented 1,000 recent claims from real users to five top frontier LLMs and asked each one for a verdict. These aren't benchmark items with public answer keys; they're claims real users submitted for verification to a fact-checking platform. Only one verdict bucket can be correct per claim, so any disagreement among the panel means at least one model's verdict is label-inconsistent under this 4-bucket rubric (True / Mostly True / Misleading / False). On 67% of claims, the panel splits.

Key findings

67% of claims (672 / 1,000; 95% CI: 64–70%) have at least one frontier model dissenting from the panel majority, or no majority forms at all.

34% of claims (343 / 1,000; 95% CI: 31–37%) involve a 2+ bucket gap between the most-disagreeing pair of frontier verdicts: a substantive disagreement on the answer, not just a calibration shift.

Krippendorff's α (ordinal) = 0.639 across 5 raters on 1,000 items: nontrivial but limited agreement.

The panel converges on definitive verdicts; the middle of the rubric is where it fractures. Within the 328 unanimous claims, only 4 are unanimous-Misleading and 0 are unanimous-Mostly-True.

Some models concentrate verdicts at the True/False poles; others spread across the middle two buckets.
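The two headline metrics and their intervals can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the summary: the buckets are ranked in the order the rubric lists them, a claim counts as "split" whenever the five verdicts are not unanimous (which matches 672 + 328 = 1,000), and the intervals are plain normal-approximation (Wald) intervals. The report does not say which interval method it used; the Wald form is chosen here only because it reproduces the rounded figures. All function names are hypothetical.

```python
import math

# Assumed ordinal ranking: the order in which the rubric lists its buckets.
BUCKETS = ["True", "Mostly True", "Misleading", "False"]
RANK = {name: i for i, name in enumerate(BUCKETS)}


def is_split(verdicts):
    """True if the panel is not unanimous on this claim."""
    return len(set(verdicts)) > 1


def max_gap(verdicts):
    """Bucket distance between the two most-disagreeing verdicts."""
    ranks = [RANK[v] for v in verdicts]
    return max(ranks) - min(ranks)


def wald_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Hypothetical panel: four models say "True", one says "Misleading".
panel = ["True", "True", "Misleading", "True", "True"]
print(is_split(panel), max_gap(panel))  # split, with a 2-bucket gap

# Counts taken from the key findings above.
print(wald_ci(672, 1000))  # roughly (0.64, 0.70)
print(wald_ci(343, 1000))  # roughly (0.31, 0.37)
```

Under these assumptions a one-bucket gap (say, True vs. Mostly True) counts toward the 67% figure but not the 34% figure, which is the distinction the report draws between calibration shifts and substantive disagreement.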
Contents

How often the frontier disagrees
Substantive vs nuance disagreement
Model-vs-model agreement
Per-model behavior (verdict distribution + agreement-with-rest)
Detailed results (by domain, majority verdict, unanimous)
Data
Methodology
Reproducibility
Limitations
FAQ
Ethics & data use
Changelog
Appendix: Example claims where the frontier fractures

1. How often the frontier disagrees

On 67% of claims (672 / 1,000; 95% CI: 64–70%), the frontier panel doesn't agree —...

First seen: 2026-05-28 13:11

Last seen: 2026-05-28 14:12