Mon Jun 02 / Puneet Anand

The Unleaderboard: Google, OpenAI, and Anthropic Self-Reported Hallucination and Accuracy Scores

This isn't a third-party leaderboard. It's a snapshot of what OpenAI, Google, and Anthropic self-report about their models. These factuality and hallucination metrics are often buried in system cards, so we surfaced them to shed light on a persistent issue: even the most advanced models still struggle with the truth.

[Image: Example of an LLM architecture]

The Unleaderboard

The updated accuracy metrics below for the leading model providers are NOT based on a third-party (or AIMon’s) performance benchmark. Instead, they reflect what the AI labs themselves chose to report. These metrics are not widely known and are often buried in models’ system card PDFs. So we decided to do some tech journalism and bring them to you. 🙂

While the public marvels at AI models writing code, essays, and bedtime stories, the responsible builders of these innovative models quietly acknowledge an ongoing issue: hallucination.

OpenAI, Google, and Anthropic have all produced leading models in humanity’s race to artificial general intelligence. But when it comes to factuality, they’re still far from perfect. The data that follows is pulled right from their system cards and disclosures.

The Unleaderboard Hallucination Scorecard Overview

Below is a snapshot of hallucination and factuality metrics self-reported by OpenAI, Google, and Anthropic, covering benchmarks like SimpleQA, PersonQA, and FACTS Grounding.

Last Updated: June 1, 2025

| Model | Benchmark | Metric Type | Score | Source | Announced |
|---|---|---|---|---|---|
| OpenAI O1 | PersonQA | Accuracy | 47% | O1 Model Card | December 5, 2024 |
| OpenAI O1 | PersonQA | Hallucination Rate | 16% | O1 Model Card | December 5, 2024 |
| OpenAI O1 | SimpleQA | Accuracy | 47% | O1 Model Card | December 5, 2024 |
| OpenAI O1 | SimpleQA | Hallucination Rate | 44% | O1 Model Card | December 5, 2024 |
| OpenAI O3 | PersonQA | Accuracy | 59% | O3 Model Card | April 16, 2025 |
| OpenAI O3 | PersonQA | Hallucination Rate | 33% | O3 Model Card | April 16, 2025 |
| OpenAI O3 | SimpleQA | Accuracy | 49% | O3 Model Card | April 16, 2025 |
| OpenAI O3 | SimpleQA | Hallucination Rate | 51% | O3 Model Card | April 16, 2025 |
| OpenAI O4-mini | PersonQA | Accuracy | 36% | O4 Mini Model Card | April 16, 2025 |
| OpenAI O4-mini | PersonQA | Hallucination Rate | 48% | O4 Mini Model Card | April 16, 2025 |
| OpenAI O4-mini | SimpleQA | Accuracy | 20% | O4 Mini Model Card | April 16, 2025 |
| OpenAI O4-mini | SimpleQA | Hallucination Rate | 79% | O4 Mini Model Card | April 16, 2025 |
| OpenAI GPT-4o | QA Estimation | Hallucination Rate | ~52% | GPT-4o Model Card | August 8, 2024 |
| OpenAI GPT-4.5 | PersonQA | Accuracy | 78% | GPT-4.5 Model Card | February 27, 2025 |
| OpenAI GPT-4.5 | PersonQA | Hallucination Rate | 19% | GPT-4.5 Model Card | February 27, 2025 |
| OpenAI GPT-4.5 | SimpleQA | Accuracy | 62.5% | OpenAI Announcement | February 27, 2025 |
| OpenAI GPT-4.5 | SimpleQA | Hallucination Rate | 37.1% | OpenAI Announcement | February 27, 2025 |
| Claude Opus 4* | Internal QA | Hallucination Rate | N/R | Anthropic Opus and Sonnet | N/A |
| Claude Sonnet 4* | Internal QA | Hallucination Rate | N/R | Anthropic Opus and Sonnet | N/A |
| Gemini 1.5 Flash | SimpleQA | Accuracy | 8.6% | Gemini 2 Flash model card | April 15, 2025 |
| Gemini 1.5 Flash | FACTS Grounding | Accuracy | 82.9% | Gemini 2 Flash model card | April 15, 2025 |
| Gemini 1.5 Pro | SimpleQA | Accuracy | 24.9% | Gemini 2 Flash model card | April 15, 2025 |
| Gemini 1.5 Pro | FACTS Grounding | Accuracy | 80.0% | Gemini 2 Flash model card | April 15, 2025 |
| Gemini 2.0 Flash-Lite | SimpleQA | Accuracy | 21.7% | Gemini 2 Flash model card | April 15, 2025 |
| Gemini 2.0 Flash-Lite | FACTS Grounding | Accuracy | 83.6% | Gemini 2 Flash model card | April 15, 2025 |
| Gemini 2.0 Flash (GA) | SimpleQA | Accuracy | 29.9% | Gemini 2 Flash model card | April 15, 2025 |
| Gemini 2.0 Flash (GA) | FACTS Grounding | Accuracy | 84.6% | Gemini 2 Flash model card | April 15, 2025 |

† Hallucination and factuality metrics with sources: self-reported hallucination rates, accuracy, and factuality scores for leading models from OpenAI, Anthropic, and Google on benchmarks such as SimpleQA, PersonQA, and FACTS Grounding.
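A note on reading these numbers: accuracy and hallucination rate are reported separately and need not sum to 100%, because a model can also decline to answer. Under one plausible reading of the system cards, both rates are computed over all questions, with the remainder being abstentions. Here is a minimal scoring sketch under that assumption (the grading labels and helper function are ours, not the labs'):

```python
from collections import Counter

# Each graded answer is "correct", "incorrect" (a confident wrong answer,
# i.e., a hallucination), or "not_attempted" (the model declined).
# This three-way grading mirrors benchmarks like SimpleQA.
grades = ["correct", "incorrect", "not_attempted", "correct", "incorrect"]

def factuality_scores(grades: list[str]) -> dict[str, float]:
    counts = Counter(grades)
    total = len(grades)
    return {
        # Share of all questions answered correctly.
        "accuracy": counts["correct"] / total,
        # Share of all questions answered with a made-up answer.
        "hallucination_rate": counts["incorrect"] / total,
        # Share of questions the model declined to answer -- this is
        # why accuracy + hallucination rate can fall short of 100%.
        "abstention_rate": counts["not_attempted"] / total,
    }

print(factuality_scores(grades))
# O1 on SimpleQA, for example: 47% accuracy + 44% hallucination rate
# leaves roughly 9% of questions unattempted.
```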

OpenAI

To its credit, OpenAI reports more metrics than its peers. But transparency comes at a cost: it reveals how tough the hallucination problem remains.

  • GPT-4.5 shows a 37.1% hallucination rate on SimpleQA.
  • Reasoning models like O1 and O3 show hallucination rates ranging from 16% to 51%, depending on the task.

Google

Google’s Gemini models are evolving rapidly, but factuality remains a persistent weak spot.

  • Gemini 2.0 Flash (GA) scored only 29.9% on SimpleQA.
  • Despite strong grounding in document-based QA (FACTS Grounding at ~84%), these models still get the majority of open-ended factual questions wrong; a sketch of the closed-book vs. grounded setup follows below.
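The gap between the two benchmarks comes down to what the model is given: SimpleQA is closed-book, so the model answers from its weights alone, while FACTS Grounding supplies a source document the response must stay faithful to. A minimal sketch of the difference (the prompt templates and example question are illustrative, not Google's evaluation code):

```python
def build_prompt(question: str, source_doc: str | None = None) -> str:
    """Build a closed-book or a grounded QA prompt."""
    if source_doc is None:
        # Closed-book (SimpleQA-style): the model answers from
        # whatever it memorized during training.
        return f"Answer concisely: {question}"
    # Grounded (FACTS Grounding-style): the answer must come from the
    # supplied document, which is why grounded scores run far higher.
    return (
        "Answer using ONLY the document below. If the document does "
        "not contain the answer, say so.\n\n"
        f"Document:\n{source_doc}\n\n"
        f"Question: {question}"
    )

# The same question, asked both ways:
question = "When was the refund policy last updated?"
closed_book_prompt = build_prompt(question)
grounded_prompt = build_prompt(question, source_doc="Refund policy (updated 2023): ...")
```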

*Anthropic

Claude models are among the most respected for their reasoning ability. But we found that Anthropic didn’t self-report benchmark metrics on established factuality datasets like TruthfulQA.

Unlike OpenAI or Google, Anthropic does not publish hallucination rates from standard evaluations. It does, however, report various metrics related to bias and adherence, which are worth reviewing.


We’ve added a task to our internal backlog to test their models on the available public benchmarks and AIMon’s HDM Bench dataset, and we’ll report the results here.

The Real-World Risk: what might it mean for you?

Factuality breakdowns in AI are not just academic errors; they carry serious consequences for real-world applications. For instance:

  • Support chatbots that hallucinate can return incorrect policies or poor technical advice, frustrating customers and undermining trust.
  • Healthcare assistants suggesting incorrect treatment guidelines can lead to patient harm.
  • Financial summarizers misreporting risk assessments, metrics, or earnings figures can mislead investors and distort business decisions.

What should you do?

Vendors showcase selective benchmarks, but most of those benchmarks fail to reflect the unique context, vocabulary, and factual expectations of enterprise domains. Relying solely on public scores like TruthfulQA or SimpleQA creates a false sense of readiness.

Organizations need to:

  • Build evaluation sets from their own domain data, vocabulary, and factual expectations.
  • Define and track internal hallucination and factuality metrics rather than relying solely on public benchmarks.
  • Continuously monitor deployed models against those metrics.

With internal metrics, enterprises can quantify risk, track model progress, and choose vendors that meet their own Trust standards.
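As a concrete starting point, an internal factuality evaluation can be as simple as replaying a curated set of domain questions with known answers and scoring the responses. A rough sketch, where the eval items, `call_model`, and `judge` are placeholders for your own data, model client, and grader:

```python
from dataclasses import dataclass

@dataclass
class EvalItem:
    question: str
    reference_answer: str

# A domain-specific eval set: your policies, your vocabulary, your
# factual expectations -- not a public benchmark's.
EVAL_SET = [
    EvalItem("What is our standard refund window?", "30 days"),
    EvalItem("Which plan includes SSO?", "Enterprise"),
]

def call_model(question: str) -> str:
    """Placeholder: call your deployed model or vendor API here."""
    raise NotImplementedError

def judge(answer: str, reference: str) -> bool:
    """Placeholder grader. Substring matching is crude; in practice
    use an LLM judge or domain-specific rules."""
    return reference.lower() in answer.lower()

def internal_hallucination_rate() -> float:
    # Fraction of domain questions the model gets wrong.
    wrong = sum(
        not judge(call_model(item.question), item.reference_answer)
        for item in EVAL_SET
    )
    return wrong / len(EVAL_SET)
```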

The one platform you need to drive success with AI

Backed by Bessemer Venture Partners, Tidal Ventures, and other notable angel investors, AIMon is the one platform enterprises need to drive success with AI. We help you build, deploy, and use AI applications with trust and confidence, serving customers from fast-moving startups to Fortune 200 companies.

Our benchmark-leading ML models support over 20 metrics out of the box and let you build custom metrics using plain English guidelines. With coverage spanning output quality, adversarial robustness, safety, data quality, and business-specific custom metrics, you can apply any metric as a low-latency guardrail, for continuous monitoring, or in offline evaluations.
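To make the guardrail idea concrete, here is a generic sketch of the pattern (the `score_hallucination` stub is hypothetical, not AIMon's actual SDK): a draft response is scored against its retrieved context before it reaches the user, and risky answers are deflected.

```python
HALLUCINATION_THRESHOLD = 0.5  # tune per application and risk tolerance

def score_hallucination(response: str, context: str) -> float:
    """Hypothetical checker returning a 0-1 hallucination score.
    In production this would be a trained detection model, not a stub."""
    return 0.0  # stub for illustration

def guarded_reply(draft: str, context: str) -> str:
    # Guardrail pattern: score the draft answer before it ever
    # reaches the user; block or reroute anything too risky.
    if score_hallucination(draft, context) > HALLUCINATION_THRESHOLD:
        return "I'm not confident in that answer; let me connect you with a human agent."
    return draft
```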

Finally, we offer tools to help you iteratively improve your AI, including capabilities for bespoke evaluation and training dataset creation, fine-tuning, and reranking.