Mon Jun 02 / Puneet Anand
This isn't a third-party leaderboard. It's a snapshot of what OpenAI, Google, and Anthropic self-report about their models. These factuality and hallucination metrics are often buried in system cards, so we surfaced them to shed light on a persistent issue: even the most advanced models still struggle with the truth.
To be clear: the accuracy metrics below are NOT based on a third-party (or AIMon’s) performance benchmark. They reflect what the AI labs themselves chose to report. These figures are not widely publicized and are often buried in models’ system-card PDFs, so we decided to do some tech journalism and bring them to you. 🙂
While the public marvels at AI models writing code, essays, and bedtime stories, the responsible builders of these innovative models quietly acknowledge an ongoing issue: hallucination.
OpenAI, Google, and Anthropic have all produced leading models in humanity’s race to artificial general intelligence. But when it comes to factuality, they’re still far from perfect. The data that follows is pulled right from their system cards and disclosures.
Below is a snapshot of hallucination and factuality metrics self-reported by OpenAI, Google, and Anthropic, covering benchmarks like SimpleQA, PersonQA, and FACTS Grounding.
Last Updated: June 1, 2025
Model | Benchmark | Metric Type | Score | Source | Announced |
---|---|---|---|---|---|
OpenAI O1 | PersonQA | Accuracy | 47% | O1 Model Card | December 5, 2024 |
OpenAI O1 | PersonQA | Hallucination Rate | 16% | O1 Model Card | December 5, 2024 |
OpenAI O1 | SimpleQA | Accuracy | 47% | O1 Model Card | December 5, 2024 |
OpenAI O1 | SimpleQA | Hallucination Rate | 44% | O1 Model Card | December 5, 2024 |
OpenAI O3 | PersonQA | Accuracy | 59% | O3 Model Card | April 16, 2025 |
OpenAI O3 | PersonQA | Hallucination Rate | 33% | O3 Model Card | April 16, 2025 |
OpenAI O3 | SimpleQA | Accuracy | 49% | O3 Model Card | April 16, 2025 |
OpenAI O3 | SimpleQA | Hallucination Rate | 51% | O3 Model Card | April 16, 2025 |
OpenAI O4-mini | PersonQA | Accuracy | 36% | O4 Mini Model Card | April 16, 2025 |
OpenAI O4-mini | PersonQA | Hallucination Rate | 48% | O4 Mini Model Card | April 16, 2025 |
OpenAI O4-mini | SimpleQA | Accuracy | 20% | O4 Mini Model Card | April 16, 2025 |
OpenAI O4-mini | SimpleQA | Hallucination Rate | 79% | O4 Mini Model Card | April 16, 2025 |
OpenAI GPT-4o | QA Estimation | Hallucination Rate | ~52% | GPT 4o Model Card | August 8, 2024 |
OpenAI GPT-4.5 | PersonQA | Accuracy | 78% | GPT-4.5 Model Card | February 27, 2025 |
OpenAI GPT-4.5 | PersonQA | Hallucination Rate | 19% | GPT-4.5 Model Card | February 27, 2025 |
OpenAI GPT-4.5 | SimpleQA | Accuracy | 62.5% | OpenAI Announcement | February 27, 2025 |
OpenAI GPT-4.5 | SimpleQA | Hallucination Rate | 37.1% | OpenAI Announcement | February 27, 2025 |
Claude Opus 4* | Internal QA | Hallucination Rate | N/R | Anthropic Opus and Sonnet | N/A |
Claude Sonnet 4* | Internal QA | Hallucination Rate | N/R | Anthropic Opus and Sonnet | N/A |
Gemini 1.5 Flash | SimpleQA | Accuracy | 8.6% | Gemini 2 Flash model card | April 15, 2025 |
Gemini 1.5 Flash | FACTS Grounding | Accuracy | 82.9% | Gemini 2 Flash model card | April 15, 2025 |
Gemini 1.5 Pro | SimpleQA | Accuracy | 24.9% | Gemini 2 Flash model card | April 15, 2025 |
Gemini 1.5 Pro | FACTS Grounding | Accuracy | 80.0% | Gemini 2 Flash model card | April 15, 2025 |
Gemini 2.0 Flash-Lite | SimpleQA | Accuracy | 21.7% | Gemini 2 Flash model card | April 15, 2025 |
Gemini 2.0 Flash-Lite | FACTS Grounding | Accuracy | 83.6% | Gemini 2 Flash model card | April 15, 2025 |
Gemini 2.0 Flash (GA) | SimpleQA | Accuracy | 29.9% | Gemini 2 Flash model card | April 15, 2025 |
Gemini 2.0 Flash (GA) | FACTS Grounding | Accuracy | 84.6% | Gemini 2 Flash model card | April 15, 2025 |
† Hallucination and factuality metrics with sources: self-reported hallucination rates and accuracy scores for leading models from OpenAI, Anthropic, and Google on benchmarks such as SimpleQA, PersonQA, and FACTS Grounding. * N/R = not reported: Anthropic does not publish hallucination rates on these public benchmarks (more on this below).
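A note for readers reconciling the two metric types above: on SimpleQA-style evaluations, a grader labels each answer correct, incorrect, or not attempted, so accuracy and hallucination rate are two slices of the same distribution and need not sum to 100%. Here is a minimal Python sketch of that arithmetic. The grading labels follow OpenAI's published SimpleQA methodology as we understand it; the code itself is illustrative, not any lab's actual evaluation harness.

```python
from collections import Counter

def factuality_metrics(grades: list[str]) -> dict[str, float]:
    """Compute accuracy and hallucination rate from per-question grades.

    Each grade is "correct", "incorrect", or "not_attempted", mirroring
    SimpleQA-style grading. Accuracy is the share of all questions
    answered correctly; hallucination rate is the share answered
    confidently but wrongly. They need not sum to 100%: the remainder
    is questions the model declined to attempt.
    """
    counts = Counter(grades)
    total = len(grades)
    return {
        "accuracy": counts["correct"] / total,
        "hallucination_rate": counts["incorrect"] / total,
        "not_attempted": counts["not_attempted"] / total,
    }

# Example: 47 correct, 44 incorrect, 9 declined out of 100 questions
# roughly reproduces O1's self-reported SimpleQA numbers above.
grades = ["correct"] * 47 + ["incorrect"] * 44 + ["not_attempted"] * 9
print(factuality_metrics(grades))
```

Read this way, the table tells you more than either number alone: O3's 49% accuracy plus 51% hallucination rate on SimpleQA implies it attempted essentially every question, while O1's figures leave roughly 9% of questions declined.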
To its credit, OpenAI reports more metrics than its peers. But transparency comes at a cost: it reveals how tough the hallucination problem remains.
Google’s Gemini models are evolving rapidly, but factuality remains a persistent weak spot.
Claude models are among the most respected for their reasoning ability. But we found that Anthropic didn’t self-report benchmark metrics on established factuality datasets like TruthfulQA.
Unlike OpenAI or Google, Anthropic does not publish hallucination rates from standard evaluations. It does, however, report various metrics related to bias and adherence in its Opus and Sonnet system cards, which are worth reviewing.
We’ve added a task to our internal backlog to test Anthropic’s models on the available public benchmarks and on AIMon’s HDM Bench dataset, and to report the results here.
Factuality breakdowns in AI are not just academic errors; they carry serious consequences for real-world applications, whether that’s a support bot inventing a refund policy or an assistant fabricating a legal citation.
While vendors showcase selective benchmarks, most public benchmarks fail to reflect the unique context, vocabulary, and factual expectations of enterprise domains. Relying solely on public scores like TruthfulQA or SimpleQA creates a false sense of readiness.
Organizations need to:
- Build evaluation datasets that reflect their own domain’s context, vocabulary, and factual expectations
- Measure accuracy and hallucination on that internal data, not just public leaderboards
- Monitor models continuously as vendors ship updates

With internal metrics like these, enterprises can quantify risk, track model progress, and choose vendors that meet their own trust standards.
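As a sketch of what that can look like in practice, the snippet below grades a model against a file of domain question-answer pairs and reports the same accuracy / hallucination-rate split used in the table above. The JSONL format, the `ask_model` callable, and the exact-match `grade_answer` helper are all hypothetical stand-ins; in practice you would plug in your own inference call and a stronger grader such as an LLM judge.

```python
import json
from collections import Counter

def grade_answer(response: str, reference: str) -> str:
    """Naive exact-match grader; swap in an LLM judge for real use."""
    if not response.strip():
        return "not_attempted"
    return "correct" if reference.lower() in response.lower() else "incorrect"

def run_internal_benchmark(ask_model, qa_path: str) -> dict[str, float]:
    """Grade a model over a JSONL file of domain QA pairs
    ({"question": ..., "answer": ...} per line, a hypothetical format)
    and compute accuracy and hallucination rate as in the table above."""
    grades = []
    with open(qa_path) as f:
        for line in f:
            item = json.loads(line)
            grades.append(grade_answer(ask_model(item["question"]), item["answer"]))
    counts = Counter(grades)
    total = len(grades)
    return {
        "accuracy": counts["correct"] / total,
        "hallucination_rate": counts["incorrect"] / total,
    }
```

Even a harness this simple, run against a few hundred questions from your own knowledge base, will tell you more about production readiness than any of the public scores above.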
Backed by Bessemer Venture Partners, Tidal Ventures, and other notable angel investors, AIMon is the one platform enterprises need to drive success with AI. We help you build, deploy, and use AI applications with trust and confidence, serving customers from fast-moving startups to Fortune 200 companies.
Our benchmark-leading ML models support over 20 metrics out of the box and let you build custom metrics using plain English guidelines. With coverage spanning output quality, adversarial robustness, safety, data quality, and business-specific custom metrics, you can apply any metric as a low-latency guardrail, for continuous monitoring, or in offline evaluations.
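To illustrate the guardrail pattern in the abstract (this is a hypothetical shape of the idea, not AIMon’s actual SDK), a guardrail sits between generation and the user: score the draft for hallucination risk against the retrieved context, and fall back rather than ship a likely fabrication.

```python
HALLUCINATION_THRESHOLD = 0.5  # illustrative cutoff; tune per application

def guarded_answer(generate, score, question: str, context: str) -> str:
    """Guardrail pattern: generate a draft, score its hallucination risk
    against the retrieved context, and fall back rather than return a
    likely fabrication. `generate` and `score` are caller-supplied
    stand-ins (e.g., `score` could be a grounding/entailment model
    returning a 0-1 risk); neither is a real SDK call.
    """
    draft = generate(question, context)
    if score(draft, context) > HALLUCINATION_THRESHOLD:
        return "I can't answer that confidently from the available sources."
    return draft
```

The same scoring call can run synchronously as a low-latency gate, asynchronously for monitoring, or in batch for offline evaluations.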
Finally, we offer tools to help you iteratively improve your AI, including capabilities for bespoke evaluation and training dataset creation, fine-tuning, and reranking.