Executive TL;DR:
- Five frontier LLMs disagree on 67% of 1k real-world fact-check claims.
- The disagreements highlight the need for more precise prompts and rubrics.
- The study raises questions about the reliability of LLMs in fact-checking tasks.
The Buzz Score
The Internet’s Verdict: 70% Hyped, 30% Skeptical
Forum Voices
Experts are weighing in on the study, with some pointing out the limitations of the prompt and harness used.
Here’s the prompt they used:
Classify this claim as of <date>: "<atomic claim>" Output exactly one label: True, Mostly True, Misleading, or False. No explanations, no qualifiers.
Others are highlighting the importance of including more models in the study, such as Grok.
Why did they exclude Grok? Given the published philosophical differences in how Grok is trained, it would provide an interesting data point.
Implications and Concerns
The study raises concerns about the reliability of LLMs in fact-checking tasks and the need for more precise prompts and rubrics.
These aren’t benchmark items with public answer keys — they’re claims real users submitted for verification to a fact-checking platform.
The use of LLMs in the production of the report itself is also a topic of discussion.
Why did they exclude Grok? Given the published philosophical differences in how Grok is trained, it would provide an interesting data point.
Focus Keyword: LLM Disagreements