Search quality evaluation is one of the hardest unsolved problems in AI products. Getting an AI to generate an answer is largely a solved problem. Knowing whether that answer is actually good (factually accurate, appropriately certain, not quietly hallucinating) is not.
This tool evaluates AI-generated answers across five dimensions using Claude as the evaluator. It runs as a CLI, accepts single or batch inputs, and outputs structured quality reports.
Built to model the evaluation framework I'd use working on answer quality at Perplexity, Google, or any AI product where the answer is the product.
Five dimensions per answer:
1. Factual accuracy — Are the specific claims in the answer correct? When ground truth isn't available, the evaluator flags claims as unverifiable rather than assuming accuracy. Weighted 30%.
2. Source faithfulness — Does the answer accurately reflect what the cited sources actually say? Catches answers that contradict sources, extend beyond them, or attribute claims to sources that don't support them. Weighted 20%.
3. Completeness — Does the answer address the full question? An accurate but incomplete answer to a multi-part question fails here. Weighted 20%.
4. Hallucination risk — Does the answer contain specific claims (statistics, attributions, dates, quotes) that have no source grounding? High confidence claims with no evidence are flagged. Weighted 20%.
5. Confidence calibration — Is the answer's certainty language appropriate to the actual certainty of its claims? Overconfidence (stating contested things as fact) and underconfidence (hedging on established facts) both score low. Weighted 10%.
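The weights above sum to 100%, so one natural way to combine the five dimensions is a weighted mean. The sketch below assumes each dimension is scored on a 0 to 1 scale. The `WEIGHTS` keys and the `overall_score` helper are illustrative names, not the tool's actual API, and the tool itself may aggregate differently.

```python
import math

# Weights as listed above. The dictionary keys are hypothetical identifiers.
WEIGHTS = {
    "factual_accuracy": 0.30,
    "source_faithfulness": 0.20,
    "completeness": 0.20,
    "hallucination_risk": 0.20,
    "confidence_calibration": 0.10,
}

# Sanity check: the weights should sum to 1.
assert math.isclose(sum(WEIGHTS.values()), 1.0)


def overall_score(scores: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores, each expected in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {scores[name]}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Per-dimension scores taken from the sample report later in this document.
example = {
    "factual_accuracy": 0.90,
    "source_faithfulness": 0.50,
    "completeness": 0.80,
    "hallucination_risk": 0.95,
    "confidence_calibration": 0.78,
}
print(f"{overall_score(example):.1%}")  # prints 79.8%
```

A plain weighted mean of those scores gives 79.8%, slightly under the 82% overall shown in the sample report, so the tool's overall figure is presumably derived or rounded in some other way.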
python cli.py
The CLI prompts for query, answer, and optional source URLs.
python cli.py \
--query "Who invented Python?" \
--answer "Python was created by Guido van Rossum in 1991." \
--source "https://docs.python.org/3/faq/general.html"
python cli.py --batch examples/batch.json --output results.json
Batch input format:
[
{
"id": "optional-label",
"query": "...",
"answer": "...",
"sources": ["url1", "url2"]
}
]
python cli.py --query "..." --answer "..." --format json | jq .
────────────────────────────────────────────────────────────
Query: What temperature does water boil at?
Overall: High ████████████████░░░░ 82%
Factual accuracy Pass ████████████████░░░░ 90%
All specific claims are accurate. The altitude caveat is correct.
Source faithfulness Pass ██████████░░░░░░░░░░ 50%
No sources provided — scored neutral. Cannot verify against sources.
Completeness Pass ███████████████░░░░░ 80%
Covers the primary answer and an important edge case (altitude).
Hallucination risk Pass ████████████████████ 95%
No ungrounded specific claims. The 93°C figure is standard physics.
Confidence calibration Pass ███████████████░░░░░ 78%
Appropriate certainty throughout. No overconfident claims.
git clone https://github.com/AnjanaG/search-quality-evaluator.git
cd search-quality-evaluator
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Add your Anthropic API key to .env
source .env
python cli.py
Get an API key at console.anthropic.com.
Why these five dimensions?
Each dimension catches a different failure mode:
Factual accuracy and hallucination risk sound similar but are distinct. Factual accuracy asks: is this claim true? Hallucination risk asks: is this claim supported by any source? An answer can be factually correct but still high-hallucination-risk if it's making up numbers that happen to be right. These need to be scored separately.
Source faithfulness is distinct from both — an answer can accurately represent a source that is itself wrong. Source faithfulness only asks whether the answer matches the source, not whether the source is correct.
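To make the separation concrete, here is a toy model in which each claim carries two independent facts: whether it is true, and what the cited sources say about it. Everything in it (`Claim`, `SourceSupport`, `failed_dimensions`) is a hypothetical teaching device rather than code from this repository. In particular, the real evaluator has no `is_true` oracle, and real source faithfulness also covers answers that extend beyond their sources.

```python
from dataclasses import dataclass
from enum import Enum


class SourceSupport(Enum):
    SUPPORTS = "supports"        # a cited source states the claim
    CONTRADICTS = "contradicts"  # a cited source states the opposite
    SILENT = "silent"            # no cited source mentions the claim


@dataclass(frozen=True)
class Claim:
    text: str
    is_true: bool          # ground truth, assumed known only for this illustration
    source: SourceSupport  # how the cited sources relate to the claim


def failed_dimensions(claim: Claim) -> set[str]:
    """Return the dimensions this single claim would fail."""
    failures = set()
    if not claim.is_true:
        failures.add("factual_accuracy")
    if claim.source is SourceSupport.SILENT:
        failures.add("hallucination_risk")
    if claim.source is SourceSupport.CONTRADICTS:
        failures.add("source_faithfulness")
    return failures


# A made-up number that happens to be right: true, but ungrounded.
lucky_guess = Claim("a correct statistic with no source", True, SourceSupport.SILENT)
# An accurate restatement of a source that is itself wrong.
wrong_source = Claim("a false claim the source does make", False, SourceSupport.SUPPORTS)
# A true claim attributed to a source that says otherwise.
misattributed = Claim("a true claim the source contradicts", True, SourceSupport.CONTRADICTS)

for claim in (lucky_guess, wrong_source, misattributed):
    print(claim.text, "->", sorted(failed_dimensions(claim)))
```

Each of the three example claims fails exactly one dimension and passes the other two, which is why a single combined score would hide the failure mode.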
Completeness is the most underrated dimension in answer quality. Users with complex queries often accept incomplete answers because the partial answer is correct. A good evaluator catches this.
Confidence calibration matters because overconfident wrong answers are more dangerous than uncertain wrong answers. Underconfidence is still an error, though of a different kind: an answer that says "scientists are still debating X" when X is settled misleads differently from one that says "X is definitively true" when X is contested.
What ground truth means here
This evaluator uses an LLM as judge. It does not have access to ground truth — it evaluates based on internal consistency, source alignment, and plausibility. This is a known limitation.
A score of 1.0 (100%) does not mean the answer is correct. It means the answer appears internally consistent, well-sourced, complete, and appropriately calibrated based on what the evaluator can see.
Reliability varies by task:
- Detecting hallucinated statistics or quotes (high reliability)
- Identifying overconfident claims on contested topics (high reliability)
- Flagging answers that contradict their sources (high reliability)
- Determining absolute factual accuracy on specialized topics (low reliability)
Limitations
- No access to real-time information. Cannot verify claims that require current knowledge.
- Inherits model biases. The evaluator model (Claude) may have systematic blind spots.
- No semantic similarity scoring. Cannot detect paraphrase-level faithfulness violations.
- Calibration varies by domain. The evaluator is more reliable on general knowledge than specialized technical or legal domains.
What a production version needs
- Human rater baseline to calibrate the LLM-as-judge against human judgment
- Golden dataset of query/answer pairs with known ground truth for continuous regression testing
- Domain-specific prompts for specialized verticals (medical, legal, financial)
- Semantic similarity scoring to detect source faithfulness at the paraphrase level
- Statistical confidence intervals on scores, not point estimates
- Longitudinal tracking to detect model degradation over time
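As one example of the items above, confidence intervals can be estimated without any statistics library by using a percentile bootstrap over a set of overall scores, for instance repeated evaluator runs over a golden dataset. This is a minimal sketch under that assumption: `bootstrap_ci` is a hypothetical helper, the scores are made-up illustrative values, and a production version might prefer a more careful interval method.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`.

    Returns (mean, lower, upper) for a (1 - alpha) interval.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to resample")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return statistics.fmean(scores), lower, upper


# Made-up overall scores, for illustration only.
illustrative_scores = [0.82, 0.79, 0.85, 0.80, 0.78, 0.84, 0.81, 0.83]
mean, lower, upper = bootstrap_ci(illustrative_scores)
print(f"mean={mean:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

Reporting the interval alongside the mean makes it easier to tell a real regression from run-to-run noise when tracking scores over time.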
search-quality-evaluator/
├── evaluator/
│ ├── __init__.py # Public API
│ ├── core.py # Anthropic API calls and report assembly
│ ├── models.py # EvaluationInput, EvaluationReport, BatchReport
│ └── prompts.py # Evaluation system and user prompts
├── cli.py # CLI entrypoint (single + batch)
├── examples/
│ └── batch.json # Example batch input
├── requirements.txt
└── .env.example
Python 3.11+ and the Anthropic SDK. Otherwise standard library only (no framework dependencies).
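For readers who want to generate batch files programmatically, the batch input format shown earlier maps naturally onto a small dataclass. The sketch below borrows the `EvaluationInput` name from the `models.py` entry in the tree above, but the field layout, the `parse_batch` helper, and the choice to treat `id` and `sources` as optional are assumptions. The actual definitions in the repository may differ.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvaluationInput:
    """Hypothetical shape mirroring one entry of the batch input format."""
    query: str
    answer: str
    sources: list[str] = field(default_factory=list)  # assumed optional
    id: Optional[str] = None                          # assumed optional label


def parse_batch(text: str) -> list[EvaluationInput]:
    """Parse and validate a batch JSON document into EvaluationInput items."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("batch input must be a JSON array")
    items = []
    for index, entry in enumerate(data):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index}: expected a JSON object")
        for key in ("query", "answer"):
            value = entry.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"entry {index}: {key!r} must be a non-empty string")
        sources = entry.get("sources", [])
        if not isinstance(sources, list) or not all(isinstance(s, str) for s in sources):
            raise ValueError(f"entry {index}: 'sources' must be a list of strings")
        items.append(
            EvaluationInput(entry["query"], entry["answer"], list(sources), entry.get("id"))
        )
    return items


# Example entry reusing the query from the single-answer command; the id is made up.
sample = json.dumps([{
    "id": "py-origin",
    "query": "Who invented Python?",
    "answer": "Python was created by Guido van Rossum in 1991.",
    "sources": ["https://docs.python.org/3/faq/general.html"],
}])
batch = parse_batch(sample)
print(batch[0].query, len(batch[0].sources))
```

Validating up front means a malformed entry fails before any API calls are made.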