search-quality-evaluator

Search quality evaluation is one of the hardest unsolved problems in AI products. Getting an AI to generate an answer is a solved problem. Knowing whether that answer is actually good — factually accurate, appropriately certain, not quietly hallucinating — is not.

This tool evaluates AI-generated answers across five dimensions using Claude as the evaluator. It runs as a CLI, accepts single or batch inputs, and outputs structured quality reports.

Built to model the evaluation framework I'd use working on answer quality at Perplexity, Google, or any AI product where the answer is the product.

What it evaluates

Five dimensions per answer:

1. Factual accuracy — Are the specific claims in the answer correct? When ground truth isn't available, the evaluator flags claims as unverifiable rather than assuming accuracy. Weighted 30%.

2. Source faithfulness — Does the answer accurately reflect what the cited sources actually say? Catches answers that contradict sources, extend beyond them, or attribute claims to sources that don't support them. Weighted 20%.

3. Completeness — Does the answer address the full question? An accurate but incomplete answer to a multi-part question fails here. Weighted 20%.

4. Hallucination risk — Does the answer contain specific claims (statistics, attributions, dates, quotes) that have no source grounding? High-confidence claims with no supporting evidence are flagged. Weighted 20%.

5. Confidence calibration — Is the answer's certainty language appropriate to the actual certainty of its claims? Overconfidence (stating contested things as fact) and underconfidence (hedging on established facts) both score low. Weighted 10%.
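To make the weighting concrete, here is a minimal sketch of how five per-dimension scores could be combined into one overall number. It assumes each dimension is scored from 0.0 to 1.0 and that the overall score is a plain weighted mean. The dictionary keys and the `overall_score` function are hypothetical names for illustration, and the tool's actual aggregation may differ.

```python
# Weights as listed above. The key names are hypothetical, chosen for this sketch.
WEIGHTS = {
    "factual_accuracy": 0.30,
    "source_faithfulness": 0.20,
    "completeness": 0.20,
    "hallucination_risk": 0.20,
    "confidence_calibration": 0.10,
}


def overall_score(scores: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores, each expected in [0.0, 1.0]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {scores[name]}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up scores, not taken from a real evaluation run.
example = {
    "factual_accuracy": 1.0,
    "source_faithfulness": 0.5,
    "completeness": 1.0,
    "hallucination_risk": 1.0,
    "confidence_calibration": 0.5,
}
print(f"{overall_score(example):.2f}")
```

Because factual accuracy carries the largest weight, a drop there moves the overall score more than the same drop in any other dimension.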

Usage

Single evaluation — interactive

python cli.py

The CLI prompts for query, answer, and optional source URLs.

Single evaluation — flags

python cli.py \
  --query "Who invented Python?" \
  --answer "Python was created by Guido van Rossum in 1991." \
  --source "https://docs.python.org/3/faq/general.html"

Batch evaluation

python cli.py --batch examples/batch.json --output results.json

Batch input format:

[
  {
    "id": "optional-label",
    "query": "...",
    "answer": "...",
    "sources": ["url1", "url2"]
  }
]
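Before running a large batch, it can help to check the file against this format. The sketch below is one way to do that with the standard library. `load_batch` is a hypothetical helper, not part of the CLI. Treating "id" and "sources" as optional is an assumption based on the "optional-label" placeholder and the optional source URLs in interactive mode.

```python
import json
import tempfile
from pathlib import Path


def load_batch(path: str) -> list[dict]:
    """Load a batch file and check each item against the format shown above."""
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(items, list):
        raise ValueError("batch file must contain a JSON array")
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            raise ValueError(f"item {index}: expected a JSON object")
        for field in ("query", "answer"):
            value = item.get(field)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"item {index}: '{field}' must be a non-empty string")
        # Assumption: "id" and "sources" are optional, so fill in defaults.
        item.setdefault("id", f"item-{index}")
        item.setdefault("sources", [])
        if not isinstance(item["sources"], list):
            raise ValueError(f"item {index}: 'sources' must be a list of URLs")
    return items


if __name__ == "__main__":
    sample = [
        {
            "query": "Who invented Python?",
            "answer": "Python was created by Guido van Rossum in 1991.",
        }
    ]
    with tempfile.TemporaryDirectory() as tmp:
        batch_path = Path(tmp) / "batch.json"
        batch_path.write_text(json.dumps(sample), encoding="utf-8")
        print(load_batch(str(batch_path)))
```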

JSON output (pipe-friendly)

python cli.py --query "..." --answer "..." --format json | jq .

Sample output

────────────────────────────────────────────────────────────
Query: What temperature does water boil at?
Overall: High  ████████████████░░░░ 82%

  Factual accuracy              Pass         ████████████████░░░░ 90%
    All specific claims are accurate. The altitude caveat is correct.

  Source faithfulness           Pass         ██████████░░░░░░░░░░ 50%
    No sources provided — scored neutral. Cannot verify against sources.

  Completeness                  Pass         ███████████████░░░░░ 80%
    Covers the primary answer and an important edge case (altitude).

  Hallucination risk            Pass         ████████████████████ 95%
    No ungrounded specific claims. The 93°C figure is standard physics.

  Confidence calibration        Pass         ███████████████░░░░░ 78%
    Appropriate certainty throughout. No overconfident claims.

Setup

git clone https://github.com/AnjanaG/search-quality-evaluator.git
cd search-quality-evaluator
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Add your Anthropic API key to .env
set -a; source .env; set +a  # export the variables in .env so the Python process can read them
python cli.py

Get an API key at console.anthropic.com.

Evaluation methodology

Why these five dimensions?

Each dimension catches a different failure mode:

Factual accuracy and hallucination risk sound similar but are distinct. Factual accuracy asks: is this claim true? Hallucination risk asks: is this claim supported by any source? An answer can be factually correct but still high-hallucination-risk if it's making up numbers that happen to be right. These need to be scored separately.
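The independence of the two questions can be shown with a small sketch. The `Claim` type, its fields, and `failure_mode` are hypothetical and exist only to illustrate the distinction. They are not how the evaluator represents claims internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    is_true: bool      # what factual accuracy asks about
    is_grounded: bool  # what hallucination risk asks about: does any source support it?


def failure_mode(claim: Claim) -> str:
    """Name which dimension, if any, a single claim should count against."""
    if claim.is_true and claim.is_grounded:
        return "ok"
    if claim.is_true:
        return "hallucination risk only"  # correct, but nothing backs it up
    if claim.is_grounded:
        return "factual accuracy only"    # supported by a source that is itself wrong
    return "both"


# Placeholder claims covering all four combinations.
claims = [
    Claim("cited and correct", is_true=True, is_grounded=True),
    Claim("uncited number that happens to be right", is_true=True, is_grounded=False),
    Claim("cited, but the source is wrong", is_true=False, is_grounded=True),
    Claim("uncited and wrong", is_true=False, is_grounded=False),
]
for claim in claims:
    print(f"{claim.text}: {failure_mode(claim)}")
```

A single combined score would merge the second and third cases, which call for different fixes: the second needs a citation and the third needs a better source.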

Source faithfulness is distinct from both — an answer can accurately represent a source that is itself wrong. Source faithfulness only asks whether the answer matches the source, not whether the source is correct.

Completeness is the most underrated dimension in answer quality. Users with complex queries often accept incomplete answers because the partial answer is correct. A good evaluator catches this.

Confidence calibration matters because overconfident wrong answers are more dangerous than uncertain wrong answers. An answer that says "scientists are still debating X" when X is settled is a different kind of error than "X is definitively true" when X is contested.

What ground truth means here

This evaluator uses an LLM as judge. It does not have access to ground truth — it evaluates based on internal consistency, source alignment, and plausibility. This is a known limitation.

A score of 1.0 does not mean the answer is correct. It means the answer appears internally consistent, well-sourced, complete, and appropriately calibrated based on what the evaluator can see.
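As a rough illustration of the LLM-as-judge pattern, the sketch below builds the text a judge model could be given and validates the structured reply. The prompt wording, the dimension key names, and both function names are assumptions made for this example. They are not the contents of `evaluator/prompts.py`, and no API call is made here. Note that the judge in this sketch sees only source URLs, not fetched page content, which is also an assumption.

```python
import json

# Hypothetical key names for the five dimensions.
DIMENSIONS = (
    "factual_accuracy",
    "source_faithfulness",
    "completeness",
    "hallucination_risk",
    "confidence_calibration",
)


def build_judge_prompt(query: str, answer: str, sources: list[str]) -> str:
    """Assemble the text the judge model would see (illustrative wording)."""
    source_lines = "\n".join(f"- {url}" for url in sources) or "(no sources provided)"
    return (
        "Score the answer on each dimension from 0.0 to 1.0.\n"
        "If a claim cannot be checked, call it unverifiable "
        "instead of assuming it is correct.\n"
        f"Dimensions: {', '.join(DIMENSIONS)}\n"
        "Reply with a single JSON object mapping each dimension to its score.\n\n"
        f"Query: {query}\n"
        f"Answer: {answer}\n"
        f"Sources:\n{source_lines}\n"
    )


def parse_judge_reply(reply: str) -> dict[str, float]:
    """Validate the judge's JSON reply and return one score per dimension."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError("judge reply is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    scores: dict[str, float] = {}
    for name in DIMENSIONS:
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"missing or non-numeric score for {name}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for {name} is outside [0, 1]: {value}")
        scores[name] = float(value)
    return scores
```

Strict parsing matters because the judge is itself a language model: a malformed or partial reply should fail loudly rather than be silently scored.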

Reliability varies by task:

Detecting hallucinated statistics or quotes (high reliability)
Identifying overconfident claims on contested topics (high reliability)
Flagging answers that contradict their sources (high reliability)
Determining absolute factual accuracy on specialized topics (low reliability)

Limitations

No access to real-time information. Cannot verify claims that require current knowledge.
Inherits model biases. The evaluator model (Claude) may have systematic blind spots.
No semantic similarity scoring. Cannot detect paraphrase-level faithfulness violations.
Calibration varies by domain. The evaluator is more reliable on general knowledge than specialized technical or legal domains.

What a production version needs

Human rater baseline to calibrate the LLM-as-judge against human judgment
Golden dataset of query/answer pairs with known ground truth for continuous regression testing
Domain-specific prompts for specialized verticals (medical, legal, financial)
Semantic similarity scoring to detect source faithfulness at the paraphrase level
Statistical confidence intervals on scores, not point estimates
Longitudinal tracking to detect model degradation over time
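For the confidence-interval item, one possible approach is a percentile bootstrap over repeated evaluations of the same answer. The sketch below assumes you have collected several overall scores for one answer, for example from repeated runs. `bootstrap_ci` and the sample numbers are hypothetical, and the current tool does not do this.

```python
import random
import statistics


def bootstrap_ci(
    scores: list[float],
    confidence: float = 0.95,
    resamples: int = 2000,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean of `scores`."""
    if not scores:
        raise ValueError("need at least one score")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lo_index = int(tail * resamples)
    hi_index = min(resamples - 1, int((1.0 - tail) * resamples))
    return means[lo_index], means[hi_index]


# Made-up overall scores from six hypothetical runs on the same answer.
runs = [0.70, 0.80, 0.90, 0.85, 0.75, 0.80]
low, high = bootstrap_ci(runs)
print(f"mean={statistics.fmean(runs):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

Reporting the interval alongside the mean makes it clear when two answers' scores are too close to rank reliably.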

Project structure

search-quality-evaluator/
├── evaluator/
│   ├── __init__.py        # Public API
│   ├── core.py            # Anthropic API calls and report assembly
│   ├── models.py          # EvaluationInput, EvaluationReport, BatchReport
│   └── prompts.py         # Evaluation system and user prompts
├── cli.py                 # CLI entrypoint (single + batch)
├── examples/
│   └── batch.json         # Example batch input
├── requirements.txt
└── .env.example

Stack

Python 3.11+ and the Anthropic SDK. Everything else uses the standard library (no framework dependencies).
