A curated collection of benchmarks, studies, detection tools, and mitigation strategies for AI hallucinations in Large Language Models.
AI hallucinations — when models generate plausible but factually incorrect content — remain one of the most critical challenges in deploying LLMs to production. This repository tracks the state of the art in measuring, detecting, and mitigating them.
- Key Studies & Papers
- Benchmarks & Datasets
- Detection Tools
- Mitigation Strategies
- Leaderboards
- Production Solutions
- Survey of Hallucination in Natural Language Generation - Comprehensive taxonomy of hallucination types (Ji et al., 2023)
- Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models - Extensive survey covering causes, detection, and mitigation
- A Survey on Hallucination in Large Language Models - Principles, taxonomy, challenges (Huang et al., 2023)
- TruthfulQA: Measuring How Models Mimic Human Falsehoods - Benchmark for truthfulness (Lin et al., 2022)
- FActScore: Fine-grained Atomic Evaluation of Factual Precision - Decomposing generations into atomic facts
- HaluEval: A Large-Scale Hallucination Evaluation Benchmark - 35K samples across QA, dialogue, and summarization
- Hallucinations in Large Multilingual Translation Models - Translation-specific hallucination analysis
- Do Language Models Know What They Don't Know? - Calibration and self-knowledge in LLMs
- Chain-of-Verification Reduces Hallucination - Meta's CoVe approach
- Why Does ChatGPT Fall Short in Providing Truthful Answers? - Failure analysis of ChatGPT on question answering
- How Language Model Hallucinations Can Snowball - Compounding hallucination effects
- Sources of Hallucination by Large Language Models - Memorization and frequency biases behind hallucination on inference tasks
| Benchmark | Focus | Size | Paper |
|---|---|---|---|
| TruthfulQA | General truthfulness | 817 questions | Link |
| HaluEval | Multi-task hallucination | 35K samples | Link |
| FActScore | Factual precision | Bio generations | Link |
| FELM | Factuality in LMs | 847 responses | Link |
| HalluQA | Chinese hallucination | 450 questions | Link |
| PHD | Phrase-level detection | Multi-domain | Link |
| FactCheckBench | Fact-checking pipeline | Multi-domain | Link |
| BAMBOO | Long-form hallucination | Long documents | Link |
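The factual-precision idea behind FActScore (decompose a generation into atomic facts, then measure the share that a knowledge source supports) can be sketched as below. This is a minimal illustration, not the benchmark's implementation: `factual_precision` and the `is_supported` callable are hypothetical names, and the decomposition and support check, which the real metric delegates to models and retrieval, are assumed to happen elsewhere.

```python
from typing import Callable, Iterable


def factual_precision(
    atomic_facts: Iterable[str],
    is_supported: Callable[[str], bool],
) -> float:
    """Share of atomic facts that the support check accepts.

    `is_supported` is a hypothetical stand-in for a real verifier
    (for example retrieval plus an entailment model).
    """
    facts = list(atomic_facts)
    if not facts:
        raise ValueError("need at least one atomic fact")
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)


if __name__ == "__main__":
    # Toy knowledge source: a set of facts treated as verified.
    knowledge = {"fact a", "fact b"}
    score = factual_precision(
        ["fact a", "fact b", "fact c", "fact d"],
        lambda fact: fact in knowledge,
    )
    print(f"factual precision: {score:.2f}")
```

A set lookup is enough to show the arithmetic; in practice the quality of the score depends almost entirely on how facts are split and verified.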
- SelfCheckGPT - Zero-resource black-box hallucination detection
- Chainpoll - LLM-based hallucination detection
- Fiddler Auditor - ML model monitoring including hallucination
- LM-Polygraph - Uncertainty estimation for LLM hallucination detection
- RefChecker - Fine-grained hallucination detection via reference checking
- Vectara HHEM - Open hallucination evaluation model
- Galileo Luna - Real-time hallucination guardrails
- Patronus AI - Automated LLM evaluation platform
- Cleanlab TLM - Trustworthy Language Model with confidence scores
The most effective production strategy is to ground LLM responses in retrieved evidence:
- Force inline citations mapping each claim to source passages
- Use chunk-level attribution so users can verify claims
- Implement citation verification loops that reject unsupported claims
- See CoreProse KB-Incidents for a production citation-first RAG system with 13,000+ indexed passages
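A citation verification loop of the kind listed above might look like the following sketch. Everything here is an assumption made for illustration: the `Claim` type, the function names, and especially the token-overlap test in `is_supported`, which is a crude placeholder for a real entailment or fact-checking model.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Claim:
    text: str
    passage_id: Optional[str]  # id of the cited passage, or None if uncited


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_supported(claim: Claim, passages: Dict[str, str],
                 min_overlap: float = 0.6) -> bool:
    """Hypothetical support check based on token overlap.

    A claim passes only if it cites a known passage and at least
    `min_overlap` of its tokens appear in that passage.
    """
    if claim.passage_id is None or claim.passage_id not in passages:
        return False
    claim_tokens = _tokens(claim.text)
    if not claim_tokens:
        return False
    passage_tokens = _tokens(passages[claim.passage_id])
    return len(claim_tokens & passage_tokens) / len(claim_tokens) >= min_overlap


def verify_citations(claims: List[Claim],
                     passages: Dict[str, str]) -> Tuple[List[Claim], List[Claim]]:
    """Split claims into (accepted, rejected) by the support check."""
    accepted: List[Claim] = []
    rejected: List[Claim] = []
    for claim in claims:
        (accepted if is_supported(claim, passages) else rejected).append(claim)
    return accepted, rejected


if __name__ == "__main__":
    passages = {"p1": "TruthfulQA contains 817 questions."}
    claims = [
        Claim("TruthfulQA contains 817 questions", "p1"),
        Claim("HaluEval has 35K samples", "p1"),  # cites the wrong passage
        Claim("An uncited statement", None),
    ]
    accepted, rejected = verify_citations(claims, passages)
    print(len(accepted), "accepted;", len(rejected), "rejected")
```

Rejected claims would typically be dropped or sent back to the model for regeneration; which of the two is appropriate depends on the application.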
- Chain-of-Verification (CoVe) - Generate → plan verifications → execute → revise
- Self-Consistency - Sample multiple outputs, pick the most consistent
- Retrieval-Augmented Verification - Verify claims against retrieved evidence post-generation
- Constitutional AI - Train models to self-critique and revise
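Two of the techniques above are simple enough to sketch without a specific model provider. The code below assumes a hypothetical `llm` callable that maps a prompt string to a completion string; the prompt wording is illustrative and is not taken from the CoVe paper.

```python
from collections import Counter
from typing import Callable, List


def self_consistent_answer(samples: List[str]) -> str:
    """Self-consistency: return the most frequent answer among samples.

    Answers are compared after trimming whitespace and lowercasing.
    """
    if not samples:
        raise ValueError("need at least one sample")
    normalized = [s.strip().lower() for s in samples]
    return Counter(normalized).most_common(1)[0][0]


def chain_of_verification(question: str, llm: Callable[[str], str]) -> str:
    """CoVe-style loop: generate, plan verifications, execute, revise."""
    # 1. Generate a draft answer.
    draft = llm(f"Answer the question.\nQuestion: {question}")
    # 2. Plan verification questions, one per line.
    plan = llm(
        "List verification questions, one per line, that check this answer.\n"
        f"Question: {question}\nAnswer: {draft}"
    )
    checks = [line.strip() for line in plan.splitlines() if line.strip()]
    # 3. Execute each check without showing the draft, so that errors in
    #    the draft are less likely to be repeated.
    results = [(q, llm(f"Answer concisely.\nQuestion: {q}")) for q in checks]
    findings = "\n".join(f"Q: {q}\nA: {a}" for q, a in results)
    # 4. Revise the draft in light of the verification results.
    return llm(
        "Revise the draft so it agrees with the verification results.\n"
        f"Question: {question}\nDraft: {draft}\n"
        f"Verification results:\n{findings}"
    )


if __name__ == "__main__":
    print(self_consistent_answer(["Paris", "paris ", "Lyon"]))
```

Exact-match voting only works for short answers; free-form outputs usually need a semantic comparison before counting votes.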
- "Only state facts you can cite" - Explicit citation constraints
- "If unsure, say I don't know" - Abstention prompting
- Step-by-step reasoning - Chain-of-thought reduces certain hallucination types
- Few-shot with negative examples - Show the model what hallucination looks like
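The prompting patterns above can be combined in a single template. The helper below is a minimal sketch; `build_grounded_prompt` and its parameters are hypothetical names, and the exact instruction wording is an assumption that should be tuned against your own evaluation set.

```python
from typing import List, Sequence, Tuple


def build_grounded_prompt(
    question: str,
    passages: Sequence[str],
    negative_examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a prompt with citation, abstention, and reasoning rules.

    `negative_examples` holds (bad_answer, reason) pairs that show the
    model what an unsupported answer looks like.
    """
    lines: List[str] = [
        "Only state facts you can cite. Cite passages as [n] after each claim.",
        "If unsure, say \"I don't know\".",
        "Think step by step before giving the final answer.",
        "",
        "Passages:",
    ]
    lines += [f"[{i}] {p}" for i, p in enumerate(passages, start=1)]
    if negative_examples:
        lines += ["", "Examples of unsupported answers to avoid:"]
        for bad_answer, reason in negative_examples:
            lines.append(f"- Bad: {bad_answer} (why: {reason})")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        build_grounded_prompt(
            "How many questions does TruthfulQA contain?",
            ["TruthfulQA contains 817 questions."],
            [("It has 1,000 questions.", "number not found in any passage")],
        )
    )
```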
- RLHF - Reinforcement Learning from Human Feedback
- DPO - Direct Preference Optimization (simpler alternative to RLHF)
- Factuality Fine-tuning - Fine-tuning specifically for factual accuracy
- Knowledge distillation with verified outputs
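To make the DPO entry concrete, the per-pair loss from the DPO paper can be computed from four sequence log-probabilities: the chosen and rejected responses under the policy and under a frozen reference model. The sketch below uses plain floats and an assumed default of `beta=0.1`; real training code would operate on batched tensors.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; tune per task
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen - rejected margin)))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); this form avoids overflow.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # Policy prefers the chosen (for example, factually correct) response
    # more strongly than the reference does, so the loss is below log(2).
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

For factuality fine-tuning, the preference pairs would rank a verified response above a hallucinated one; how those pairs are collected is outside the scope of this sketch.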
- Vectara Hallucination Leaderboard - Ranks LLMs by hallucination rate on summarization
- TruthfulQA Leaderboard (via HELM) - Stanford's holistic LLM evaluation
- Open LLM Leaderboard (HuggingFace) - Includes TruthfulQA scores
- CoreProse - Citation-first knowledge bases built around a zero-hallucination goal: every AI claim must link to a verifiable source passage.
- Vectara - RAG-as-a-service with built-in grounding
- Pinecone - Vector database enabling grounded retrieval
- Contextual AI - Enterprise RAG platform
PRs welcome! Please ensure any added resource includes:
- A working link
- A brief description
- Placement in the relevant category