Thu Apr 17 / Bibek Paudel, Alex Lyzhov, Preetam Joshi, Puneet Anand

HDM-2: Advancing Hallucination Evaluation for Enterprise LLM Applications

An open-source 3B parameter model that can perform contextual and common knowledge hallucination checks in language model outputs.

Highlights

  • Hallucination Detection Model 2 (HDM-2) is designed to detect hallucinations in large language model outputs for enterprise applications. We are open sourcing a 3B parameter version of HDM-2 that is available on HuggingFace.

  • Our approach introduces a novel taxonomy that categorizes responses into context-based, common knowledge, enterprise-specific, and innocuous statements.

  • The model provides both an overall hallucination score and fine-grained, token-level annotations. This enables enterprises to proactively detect hallucinations, improve LLM output quality in real time, and maintain full auditability, so they can build and ship high-accuracy, compliant AI applications with fast iteration, reliable outputs, and strong customer trust.

  • We introduce a new open-source benchmark dataset, HDM-BENCH, to evaluate both context-based and common-knowledge hallucinations.

  • Extensive experiments show that HDM-2 outperforms existing methods, such as alternative OSS and proprietary judges, as well as zero-shot and single-shot LLM judges based on the latest LLMs.

Introduction

Large Language Models (LLMs) have introduced transformative capabilities across sectors ranging from customer service to legal analytics. Despite this progress, a well-documented yet unresolved challenge continues to hinder their deployment in enterprise contexts: the tendency to produce outputs that are coherent but factually incorrect, commonly referred to as hallucinations. While hallucination has been extensively studied and widely discussed in both academic and industry settings, it remains an open problem. For example, leading LLM providers still report hallucination rates of around 20% with their latest models.

With HDM-2, we present a new model that detects hallucinations by verifying LLM outputs against both the contextual data passed to the LLM as input, such as product information required to answer customer queries, and widely accepted common-knowledge facts, such as “America was founded in 1776”. Using HDM-2 for real-time evaluation and guardrailing of the outputs produced by the latest LLMs can significantly reduce the resulting hallucination rate, helping uphold high customer satisfaction and minimize the negative impact of harmful or misleading responses.

We have also open-sourced a 3-billion-parameter version of HDM-2 on HuggingFace. In this post, we explain HDM-2’s underlying methodology and highlight the experimental results that underscore its efficacy.

What Sets HDM-2 Apart?

Many existing hallucination detection methods rely on techniques such as self-consistency checks, deep model introspection, or using off-the-shelf LLMs as a judge. While effective in research settings, these approaches are often computationally heavy or impractical for enterprise deployment. HDM-2 is designed from the ground up to address these limitations by combining efficiency with granular detection capabilities.

Fig. 1 A typical LLM response taxonomy consisting of context, common, and enterprise knowledge, along with innocuous and hallucinated statements.

A key contribution of our work on HDM-2 is our unique taxonomy that categorizes LLM responses into four distinct types:

  • Context-Based: Claims that should strictly adhere to the provided context or reference documents.
  • Common Knowledge: Statements that reflect well-known facts, even if they are not explicitly mentioned in the context.
  • Enterprise-Specific: Domain-specific information unique to an organization, where internal correctness is critical. This information is often included during the fine-tuning process of the LLM.
  • Innocuous Statements: Generic or harmless expressions (e.g., “I’m an AI assistant”) that do not pose factual risks but are important to maintain a natural conversation flow.

This categorization clarifies which parts of a typical LLM response require which kind of verification. HDM-2 identifies contextual and common-knowledge hallucinations right out of the box; enterprise-specific knowledge checks and identification of innocuous statements are available in the enterprise version of our model.
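
To make the taxonomy concrete, here is a small, hand-labeled illustration. The sentences and labels below are ours, not HDM-2 output, and the enterprise-specific example is hypothetical.

# Illustrative only: hand-labeled sentences showing the four response categories.
# The enterprise-specific example is hypothetical.
labeled_sentences = [
    {"category": "context-based",       "text": "The New Horizons Bridge Project received a $200M grant."},
    {"category": "common knowledge",    "text": "The United States has over 600,000 bridges."},
    {"category": "enterprise-specific", "text": "Our Premium Support plan includes a 4-hour response SLA."},
    {"category": "innocuous",           "text": "I'm an AI assistant, happy to help with follow-up questions."},
]

for s in labeled_sentences:
    print(f"[{s['category']}] {s['text']}")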

Under the Hood: How HDM-2 Works

Fig. 2 A flowchart describing the decision-making methodology for classifying a hallucination

An Integrated, Multi-Task Architecture

The flowchart in Fig. 2 illustrates the decision-making process for identifying hallucinations in LLM responses. It begins with a user query, optional context documents (typically retrieved from a RAG system such as a vector database), and the LLM-generated response. If contextual information is available, the system checks whether the response aligns with the provided context (Context Grounding Check). If no context is available, or the response fails the grounding check, the system evaluates whether the response aligns with established enterprise or common knowledge (Common Knowledge Check). A response that fails both checks is deemed a hallucination, while one that passes either check is considered valid and labeled “Not a Hallucination.”
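
A minimal sketch of this decision flow follows; the checker callables are stand-ins for illustration, not HDM-2's internal interface.

def classify(response, context, context_grounding_check, common_knowledge_check):
    """Illustrative version of the Fig. 2 decision flow (not HDM-2's actual implementation)."""
    # 1. If context is available, check whether the response is grounded in it.
    if context is not None and context_grounding_check(response, context):
        return "Not a Hallucination"
    # 2. No context, or the grounding check failed: fall back to common knowledge.
    if common_knowledge_check(response):
        return "Not a Hallucination"
    # 3. The response failed both checks.
    return "Hallucination"

# Toy usage with trivial stand-in checkers:
grounded = lambda r, c: r in c   # naive substring "grounding"
known = lambda r: False          # pretend nothing is common knowledge
print(classify("The DOT awarded $1.2B in grants",
               "In Q1 2025, The DOT awarded $1.2B in grants to support infrastructure modernization.",
               grounded, known))  # -> Not a Hallucination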

HDM-2 offers a dual-pronged verification strategy:

  1. Context-Based Verification:
    The system compares the LLM output against the provided context (such as internal documents, support tickets, or curated datasets) to compute an overall hallucination score. It further generates token-level annotations to flag specific regions where the response diverges from the context.

  2. Common Knowledge Validation:
    Leveraging representations embedded during pre-training, this module verifies whether statements conform to widely accepted facts. This mechanism is particularly effective in identifying errors that slip past context verification.

Here are some examples (in the snippets below, hdm_model refers to an already loaded HDM-2 instance and print_results is a small helper that formats its output):

prompt = "Summarize the federal transportation grants awarded in Q1 2025."
context = """In Q1 2025, the Department of Transportation (DOT) awarded $1.2B in grants to support infrastructure modernization.
Key projects included bridge repairs, urban transit system upgrades, and highway safety enhancements.
The largest single grant, $200M, was awarded to the New Horizons Bridge Project.
Public transportation projects received about 40% of total grant funding.
Funding prioritization emphasized sustainability and climate resilience initiatives."""

response = """In Q1 2025, the Department of Transportation awarded $1.2 billion in grants to support major infrastructure upgrades.
The United States has over 600,000 bridges, many of which are more than 50 years old.
For updates on project funding or application guidelines, please visit the DOT's grants portal.
Since 2022, the DOT has increasingly directed grants toward green transit initiatives as part of its Climate Action Plan.
The New Horizons Bridge Project was awarded $500 million, making it the second largest infrastructure grant in DOT history."""

# Detect hallucinations
results = hdm_model.apply(prompt, context, response)
print_results(results)

"""
OUTPUT:
Hallucination severity: 0.9844
Potentially hallucinated sentences:
- For updates on project funding or application guidelines, please visit the DOT's grants portal. (Probability: 0.7734)
- The New Horizons Bridge Project was awarded $500 million, making it the second largest infrastructure grant in DOT history. (Probability: 1.0000)
"""
prompt = "Give me an overview of the hospital's clinical trial enrollments for Q1 2025."
context = """
In Q1 2025, Northbridge Medical Center enrolled 573 patients across four major clinical trials.
The Oncology Research Study (ORION-5) had the highest enrollment with 220 patients.
Cardiology trials, specifically the CardioNext Study, saw 145 patients enrolled.
Neurodegenerative research trials enrolled 88 participants.
Orthopedic trials enrolled 120 participants for regenerative joint therapies.
"""
response = """Northbridge Medical Center enrolled 573 patients across major clinical trials in Q1 2025.
Heart disease remains the leading cause of death globally, according to the World Health Organization.
For more information about our clinical research programs, please contact the Northbridge Medical Center Research Office.
Northbridge has consistently led regional trial enrollments since 2020, particularly in oncology and cardiac research.
In Q1 2025, Northbridge's largest enrollment was in a neurology-focused trial with 500 patients studying advanced orthopedic devices.
"""

# Detect hallucinations
results = hdm_model.apply(prompt, context, response)
print_results(results)

"""
OUTPUT:
Hallucination severity: 0.9961
Potentially hallucinated sentences:
- Northbridge has consistently led regional trial enrollments since 2020, particularly in oncology and cardiac research. (Probability: 0.7930)
- In Q1 2025, Northbridge's largest enrollment was in a neurology-focused trial with 500 patients studying advanced orthopedic devices. (Probability: 1.0000)
"""

Granular Scoring

Fig. 3 Granular scoring with phrase-level scores

HDM-2 outputs a comprehensive hallucination score for the entire response along with detailed word- or sentence-level annotations. This fine-grained approach empowers users to quickly identify and correct problematic segments.
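
For example, a caller could filter the sentence-level annotations against a threshold before surfacing them to reviewers. The sketch below assumes a result structure mirroring the printed output above; the field names (severity, sentences, text, prob) and per-sentence values are assumptions, not a documented schema.

# Hypothetical result structure inferred from the example output above;
# the field names and values are assumptions, not HDM-2's documented schema.
results = {
    "severity": 0.9961,
    "sentences": [
        {"text": "Northbridge Medical Center enrolled 573 patients across major clinical trials in Q1 2025.", "prob": 0.02},
        {"text": "Northbridge has consistently led regional trial enrollments since 2020, particularly in oncology and cardiac research.", "prob": 0.7930},
        {"text": "In Q1 2025, Northbridge's largest enrollment was in a neurology-focused trial with 500 patients studying advanced orthopedic devices.", "prob": 1.0000},
    ],
}

THRESHOLD = 0.5  # illustrative cutoff; tune per application
flagged = [s for s in results["sentences"] if s["prob"] >= THRESHOLD]

print(f"Hallucination severity: {results['severity']:.4f}")
for s in flagged:
    print(f"- {s['text']} (Probability: {s['prob']:.4f})")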

Modular Checker Design

Fig. 4 Modular implementation of checker models

Our modular architecture caters to varying enterprise needs. For instance, the context-based detector can be adapted or frozen according to the organization’s proprietary data, ensuring quick and seamless integration, and the common knowledge check can likewise be adapted to the organization’s needs. Each of these blocks can be enabled or disabled as needed. For enterprises that need to incorporate enterprise-specific knowledge into these checks, we offer a separate version of HDM-2 in our enterprise plan that supports it.
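
As a rough illustration of this modularity, here is a hypothetical configuration sketch; the flag names are illustrative only and are not HDM-2's actual configuration options.

# Hypothetical configuration: which checker modules to run.
# These flag names are illustrative only, not HDM-2's real API.
checker_config = {
    "context_grounding": True,      # verify claims against retrieved context
    "common_knowledge": True,       # verify claims against widely accepted facts
    "enterprise_knowledge": False,  # enterprise version only
    "innocuous_filter": False,      # enterprise version only
}

enabled = [name for name, on in checker_config.items() if on]
print("Enabled checkers:", ", ".join(enabled))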

Experimental Insights

Table: Experimental results comparing Qwen, GPT-4o, GPT-4o-mini, and HDM-2 on the True-False, TruthfulQA, and HDM-BENCH datasets.

The table presents a comparative evaluation of four language models—Qwen, GPT-4o, GPT-4o-mini, and HDM-2—across three datasets designed to measure truthfulness and hallucination tendencies: True-False, TruthfulQA, and HDM-BENCH.

  • TruthfulQA Dataset:
    HDM-2 significantly outperforms all other models across every metric, particularly excelling in precision (78.8), recall (91.1), and F1 score (83.7). These results highlight HDM-2’s superior ability to distinguish truthful from untruthful statements, making it a standout for tasks that require high factual consistency.

  • HDM-BENCH Dataset:
    On this challenging hallucination benchmark, HDM-2 again leads with balanced and consistent performance, achieving the highest scores in precision (74.8), balanced accuracy (71.7), accuracy (74.4), and F1 score (73.6). While GPT-4o and GPT-4o-mini fare reasonably well on precision, their lower recall indicates that they miss a large portion of hallucinated examples, which can translate into significant business risk for enterprises.

  • True-False Dataset:
    HDM-2 and GPT-4o deliver top-tier performance across all metrics on this simple QA dataset. GPT-4o achieves the highest precision (90.6) and overall accuracy (92.1), while HDM-2 closely follows with slightly stronger balance (balanced accuracy of 87.3) and F1 score (86.9). This indicates that both models are highly capable in binary truth classification tasks, though GPT-4o is marginally more precise.

Table: Comparison of HDM-2 with other approaches on the RAGTruth benchmark (QA, Data2Txt, Summarization).

The table presents a comprehensive comparison of various methods on the RAGTruth benchmark across three core tasks: Question Answering (QA), Data-to-Text Generation (Data2Txt), and Summarization, along with an overall average performance.

Among baseline approaches, fine-tuned LLaMA-2-13B stands out with a strong F1 score of 78.7, significantly outperforming prompt-based and self-verification methods. For instance, Prompting with GPT-4 achieves a higher recall (97.9) but at the cost of lower precision (46.9), resulting in a more modest F1 of 63.4. Similarly, SelfCheckGPT and LMvLM show moderate improvements in precision but fall short in balancing it with recall, leading to F1 scores below 60.

In contrast, the HDM models set a new performance bar.

  • HDM-1 achieves an F1 of 78.87, already surpassing all baselines, including fine-tuned models.

  • HDM-2 further improves upon this with an F1 of 85.03, demonstrating superior overall performance driven by both high precision (87.01) and strong recall (83.14).

Overall, HDM-2 demonstrates the most robust and consistent performance across all benchmarks, particularly excelling in scenarios requiring high factual accuracy. These results underscore the importance of continued innovation in mitigating hallucinations, a challenge that remains unresolved despite being extensively studied and discussed. They also show that HDM-2 not only enhances detection performance but does so with computational efficiency, making it well suited for real-time, enterprise-scale applications.

Enterprise Applications

With HDM-2, organizations can deploy LLMs with the confidence that their outputs adhere to strict factual standards. This innovation translates into:

  1. Establishing Guardrails for Inaccurate LLM Responses: AIMon HDM-2 provides hallucination scores at the token, sentence, and passage level in just a few hundred milliseconds, making it easy for your teams to programmatically intercept inaccurate LLM outputs before they escalate into potentially harmful or misleading responses (see the sketch after this list). Enterprises can pinpoint issues faster, adapt to changing input patterns, and maintain high performance over time.
  2. Selecting the most accurate LLM and RAG models: HDM-2 helps your teams define a clear, application-relevant accuracy baseline, ensuring that the chosen agentic, RAG, and LLM components are optimized for the highest accuracy and that changes in performance can be tracked over time.
  3. Proactive Quality and Hallucination Control: By integrating telemetry and feedback mechanisms, AIMon enables early detection and remediation of hallucinations, factual errors, and even poor retrieval. This ensures output reliability and trustworthiness, making it ideal for enterprises where brand reputation and user trust are paramount.
  4. Enterprise-Grade Compliance & Auditability: Compliance is more than a checkbox. AIMon logs every interaction for complete auditability, offering traceable insights into model decisions. This covers input query, response, and the context passed to LLMs along with their quality metrics like hallucination scores with explanations. This is crucial for legal defensibility and audit-readiness in regulated industries.
  5. Feedback-Driven Evolution Loops: AI development doesn’t end at launch. The AIMon platform along with HDM-2 supports feedback loops from real-world data, enabling precise tuning of retrieval logic, response scoring, and model alignment for attaining the highest possible accuracy.
  6. Re-prompting LLMs for better responses: With HDM-2, your teams can go one step further and feed its fine-grained feedback back to the LLM to generate more accurate outputs. This allows them to fix inaccuracies in real time and uphold high customer satisfaction.
  7. Ship with Speed, Iterate with Confidence: All in all, the AIMon platform and HDM-2 let organizations rapidly discover and fix the most glaring accuracy problems in their AI apps, so they can ship reliably and with high confidence.
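
A minimal guardrail sketch along these lines combines interception (item 1) and re-prompting (item 6). Here call_llm is a placeholder for whatever LLM client the application uses, and the result fields are assumptions rather than HDM-2's documented schema.

SEVERITY_THRESHOLD = 0.5  # illustrative cutoff; tune per application

def call_llm(prompt, context):
    """Placeholder for the application's own LLM client."""
    raise NotImplementedError

def guarded_answer(prompt, context, hdm_model, max_retries=1):
    """Intercept likely hallucinations and re-prompt with fine-grained feedback (illustrative)."""
    response = call_llm(prompt, context)
    for attempt in range(max_retries + 1):
        results = hdm_model.apply(prompt, context, response)  # same call as in the examples above
        if results["severity"] < SEVERITY_THRESHOLD:          # result field names are assumptions
            return response
        if attempt == max_retries:
            break  # still flagged after retries: route to human review or a fallback answer
        flagged = [s["text"] for s in results["sentences"] if s["prob"] >= SEVERITY_THRESHOLD]
        feedback = ("Revise the answer using only the provided context. "
                    "These claims were unsupported: " + " | ".join(flagged))
        response = call_llm(prompt + "\n\n" + feedback, context)
    return response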

Conclusion

HDM-2 ushers in a new era of reliable AI for enterprises by combining context verification with robust common knowledge validation. With fine-grained, token-level feedback and a unique taxonomy for categorizing hallucinations, it sets the stage for safer, more trustworthy LLM applications in high-stakes environments.

For additional technical details, visit our GitHub repository and join us in pushing the boundaries of responsible AI.

We welcome your feedback and insights; your input helps shape the future of our enterprise AI solutions. Please reach out to us at info@aimon.ai.

About AIMon

AIMon helps you build more deterministic Generative AI apps. It offers specialized tools for monitoring and improving the quality of outputs from large language models (LLMs). Leveraging proprietary technology, AIMon identifies and helps mitigate issues like hallucinations, instruction deviation, and RAG retrieval problems. These tools are accessible through APIs and SDKs, enabling both offline analysis and real-time monitoring of LLM quality issues.