Mon Jun 02 / Puneet Anand
This isn't a third-party leaderboard. It's a snapshot of what OpenAI, Google, and Anthropic self-report about their models. These factuality and hallucination metrics are often buried in system cards, so we surfaced them to shed light on a persistent issue: even the most advanced models still struggle with the truth.
To be clear: the accuracy metrics below are NOT based on a third-party (or AIMon’s) performance benchmark. They reflect what the AI labs themselves chose to report. These figures are not widely publicized and are often buried in models’ system-card PDFs, so we decided to do some tech journalism and bring them to you. 🙂
While the public marvels at AI models writing code, essays, and bedtime stories, the responsible builders of these innovative models quietly acknowledge an ongoing issue: hallucination.
OpenAI, Google, and Anthropic have all produced leading models in humanity’s race to artificial general intelligence. But when it comes to factuality, they’re still far from perfect. The data that follows is pulled right from their system cards and disclosures.
Below is a snapshot of hallucination and factuality metrics self-reported by OpenAI, Google, and Anthropic, covering benchmarks like SimpleQA, PersonQA, and FACTS Grounding.
Last Updated: June 1, 2025
Model | Benchmark | Metric Type | Score | Source | Announced |
---|---|---|---|---|---|
OpenAI O1 | PersonQA | Accuracy | 47% | O1 Model Card | December 5, 2024 |
OpenAI O1 | PersonQA | Hallucination Rate | 16% | O1 Model Card | December 5, 2024 |
OpenAI O1 | SimpleQA | Accuracy | 47% | O1 Model Card | December 5, 2024 |
OpenAI O1 | SimpleQA | Hallucination Rate | 44% | O1 Model Card | December 5, 2024 |
OpenAI O3 | PersonQA | Accuracy | 59% | O3 Model Card | April 16, 2025 |
OpenAI O3 | PersonQA | Hallucination Rate | 33% | O3 Model Card | April 16, 2025 |
OpenAI O3 | SimpleQA | Accuracy | 49% | O3 Model Card | April 16, 2025 |
OpenAI O3 | SimpleQA | Hallucination Rate | 51% | O3 Model Card | April 16, 2025 |
OpenAI O4-mini | PersonQA | Accuracy | 36% | O4 Mini Model Card | April 16, 2025 |
OpenAI O4-mini | PersonQA | Hallucination Rate | 48% | O4 Mini Model Card | April 16, 2025 |
OpenAI O4-mini | SimpleQA | Accuracy | 20% | O4 Mini Model Card | April 16, 2025 |
OpenAI O4-mini | SimpleQA | Hallucination Rate | 79% | O4 Mini Model Card | April 16, 2025 |
OpenAI GPT-4o | QA Estimation | Hallucination Rate | ~52% | GPT 4o Model Card | August 8, 2024 |
OpenAI GPT-4.5 | PersonQA | Accuracy | 78% | GPT-4.5 Model Card | February 27, 2025 |
OpenAI GPT-4.5 | PersonQA | Hallucination Rate | 19% | GPT-4.5 Model Card | February 27, 2025 |
OpenAI GPT-4.5 | SimpleQA | Accuracy | 62.5% | OpenAI Announcement | February 27, 2025 |
OpenAI GPT-4.5 | SimpleQA | Hallucination Rate | 37.1% | OpenAI Announcement | February 27, 2025 |
Claude Opus 4* | Internal QA | Hallucination Rate | N/R | Anthropic Opus and Sonnet | N/A |
Claude Sonnet 4* | Internal QA | Hallucination Rate | N/R | Anthropic Opus and Sonnet | N/A |
Gemini 1.5 Flash | SimpleQA | Accuracy | 8.6% | Gemini 2 Flash model card | April 15, 2025 |
Gemini 1.5 Flash | FACTS Grounding | Accuracy | 82.9% | Gemini 2 Flash model card | April 15, 2025 |
Gemini 1.5 Pro | SimpleQA | Accuracy | 24.9% | Gemini 2 Flash model card | April 15, 2025 |
Gemini 1.5 Pro | FACTS Grounding | Accuracy | 80.0% | Gemini 2 Flash model card | April 15, 2025 |
Gemini 2.0 Flash-Lite | SimpleQA | Accuracy | 21.7% | Gemini 2 Flash model card | April 15, 2025 |
Gemini 2.0 Flash-Lite | FACTS Grounding | Accuracy | 83.6% | Gemini 2 Flash model card | April 15, 2025 |
Gemini 2.0 Flash (GA) | SimpleQA | Accuracy | 29.9% | Gemini 2 Flash model card | April 15, 2025 |
Gemini 2.0 Flash (GA) | FACTS Grounding | Accuracy | 84.6% | Gemini 2 Flash model card | April 15, 2025 |
† Hallucination and factuality metrics with sources: self-reported hallucination rates and accuracy scores for leading models from OpenAI, Anthropic, and Google on benchmarks such as SimpleQA, PersonQA, and FACTS Grounding. * N/R = not reported: Anthropic does not publish hallucination rates on these public benchmarks (more on this below).
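A note for readers reconciling the two metric types above: on SimpleQA-style evaluations, a grader labels each answer correct, incorrect, or not attempted, so accuracy and hallucination rate are two slices of the same distribution and need not sum to 100%. Here is a minimal Python sketch of that arithmetic. The grading labels follow OpenAI's published SimpleQA methodology as we understand it; the code itself is illustrative, not any lab's actual evaluation harness.

```python
from collections import Counter

def factuality_metrics(grades: list[str]) -> dict[str, float]:
    """Compute accuracy and hallucination rate from per-question grades.

    Each grade is "correct", "incorrect", or "not_attempted", mirroring
    SimpleQA-style grading. Accuracy is the share of all questions
    answered correctly; hallucination rate is the share answered
    confidently but wrongly. They need not sum to 100%: the remainder
    is questions the model declined to attempt.
    """
    counts = Counter(grades)
    total = len(grades)
    return {
        "accuracy": counts["correct"] / total,
        "hallucination_rate": counts["incorrect"] / total,
        "not_attempted": counts["not_attempted"] / total,
    }

# Example: 47 correct, 44 incorrect, 9 declined out of 100 questions
# roughly reproduces O1's self-reported SimpleQA numbers above.
grades = ["correct"] * 47 + ["incorrect"] * 44 + ["not_attempted"] * 9
print(factuality_metrics(grades))
```

Read this way, the table tells you more than either number alone: O3's 49% accuracy plus 51% hallucination rate on SimpleQA implies it attempted essentially every question, while O1's figures leave roughly 9% of questions declined.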
To its credit, OpenAI reports more metrics than its peers. But transparency comes at a cost: it reveals how tough the hallucination problem remains.
Google’s Gemini models are evolving rapidly, but factuality remains a persistent weak spot.
Claude models are among the most respected for their reasoning ability. But we found that Anthropic didn’t self-report benchmark metrics on established factuality datasets like TruthfulQA.
Unlike OpenAI or Google, Anthropic does not publish hallucination rates from standard evaluations. It does, however, report various metrics related to bias and adherence in its Opus and Sonnet system cards, which are worth reviewing.
We’ve added a task to our internal backlog to test Anthropic’s models on the available public benchmarks and on AIMon’s HDM Bench dataset, and to report the results here.
Factuality breakdowns in AI are not just academic errors; they carry serious consequences for real-world applications, whether that’s a support bot inventing a refund policy or an assistant fabricating a legal citation.
While vendors showcase selective benchmarks, most public benchmarks fail to reflect the unique context, vocabulary, and factual expectations of enterprise domains. Relying solely on public scores like TruthfulQA or SimpleQA creates a false sense of readiness.
Organizations need to:
- Build evaluation datasets that reflect their own domain’s context, vocabulary, and factual expectations
- Measure accuracy and hallucination on that internal data, not just public leaderboards
- Monitor models continuously as vendors ship updates

With internal metrics like these, enterprises can quantify risk, track model progress, and choose vendors that meet their own trust standards.
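As a sketch of what that can look like in practice, the snippet below grades a model against a file of domain question-answer pairs and reports the same accuracy / hallucination-rate split used in the table above. The JSONL format, the `ask_model` callable, and the exact-match `grade_answer` helper are all hypothetical stand-ins; in practice you would plug in your own inference call and a stronger grader such as an LLM judge.

```python
import json
from collections import Counter

def grade_answer(response: str, reference: str) -> str:
    """Naive exact-match grader; swap in an LLM judge for real use."""
    if not response.strip():
        return "not_attempted"
    return "correct" if reference.lower() in response.lower() else "incorrect"

def run_internal_benchmark(ask_model, qa_path: str) -> dict[str, float]:
    """Grade a model over a JSONL file of domain QA pairs
    ({"question": ..., "answer": ...} per line, a hypothetical format)
    and compute accuracy and hallucination rate as in the table above."""
    grades = []
    with open(qa_path) as f:
        for line in f:
            item = json.loads(line)
            grades.append(grade_answer(ask_model(item["question"]), item["answer"]))
    counts = Counter(grades)
    total = len(grades)
    return {
        "accuracy": counts["correct"] / total,
        "hallucination_rate": counts["incorrect"] / total,
    }
```

Even a harness this simple, run against a few hundred questions from your own knowledge base, will tell you more about production readiness than any of the public scores above.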
Backed by Bessemer Venture Partners, Tidal Ventures, and other notable angel investors, AIMon is the one platform enterprises need to drive success with AI. We help you build, deploy, and use AI applications with trust and confidence, serving customers from fast-moving startups to Fortune 200 companies.
Our benchmark-leading ML models support over 20 metrics out of the box and let you build custom metrics using plain English guidelines. With coverage spanning output quality, adversarial robustness, safety, data quality, and business-specific custom metrics, you can apply any metric as a low-latency guardrail, for continuous monitoring, or in offline evaluations.
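To illustrate the guardrail pattern in the abstract (this is a hypothetical shape of the idea, not AIMon’s actual SDK), a guardrail sits between generation and the user: score the draft for hallucination risk against the retrieved context, and fall back rather than ship a likely fabrication.

```python
HALLUCINATION_THRESHOLD = 0.5  # illustrative cutoff; tune per application

def guarded_answer(generate, score, question: str, context: str) -> str:
    """Guardrail pattern: generate a draft, score its hallucination risk
    against the retrieved context, and fall back rather than return a
    likely fabrication. `generate` and `score` are caller-supplied
    stand-ins (e.g., `score` could be a grounding/entailment model
    returning a 0-1 risk); neither is a real SDK call.
    """
    draft = generate(question, context)
    if score(draft, context) > HALLUCINATION_THRESHOLD:
        return "I can't answer that confidently from the available sources."
    return draft
```

The same scoring call can run synchronously as a low-latency gate, asynchronously for monitoring, or in batch for offline evaluations.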
Finally, we offer tools to help you iteratively improve your AI, including capabilities for bespoke evaluation and training dataset creation, fine-tuning, and reranking.