AnjanaG/search-quality-evaluator

search-quality-evaluator

Search quality evaluation is one of the hardest unsolved problems in AI products. Getting an AI to generate an answer is a solved problem. Knowing whether that answer is actually good — factually accurate, appropriately certain, not quietly hallucinating — is not.

This tool evaluates AI-generated answers across five dimensions using Claude as the evaluator. It runs as a CLI, accepts single or batch inputs, and outputs structured quality reports.

Built to model the evaluation framework I'd use working on answer quality at Perplexity, Google, or any AI product where the answer is the product.


What it evaluates

Five dimensions per answer:

1. Factual accuracy — Are the specific claims in the answer correct? When ground truth isn't available, the evaluator flags claims as unverifiable rather than assuming accuracy. Weighted 30%.

2. Source faithfulness — Does the answer accurately reflect what the cited sources actually say? Catches answers that contradict sources, extend beyond them, or attribute claims to sources that don't support them. Weighted 20%.

3. Completeness — Does the answer address the full question? An accurate but incomplete answer to a multi-part question fails here. Weighted 20%.

4. Hallucination risk — Does the answer contain specific claims (statistics, attributions, dates, quotes) that have no source grounding? High confidence claims with no evidence are flagged. Weighted 20%.

5. Confidence calibration — Is the answer's certainty language appropriate to the actual certainty of its claims? Overconfidence (stating contested things as fact) and underconfidence (hedging on established facts) both score low. Weighted 10%.
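The weights above sum to 100%, so one natural way to combine them is a weighted mean. The following is a minimal sketch of that idea, not the code in `evaluator/core.py`: the key names, the `overall_score` function, and the 0 to 1 score scale are assumptions made for illustration, and the example scores are invented.

```python
# Weights as stated in the list above (they sum to 1.0).
# The key names are hypothetical; the repository may name them differently.
WEIGHTS = {
    "factual_accuracy": 0.30,
    "source_faithfulness": 0.20,
    "completeness": 0.20,
    "hallucination_risk": 0.20,
    "confidence_calibration": 0.10,
}


def overall_score(scores: dict[str, float]) -> float:
    """Return the weighted mean of per-dimension scores, each in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Invented scores, for illustration only.
example = {
    "factual_accuracy": 1.0,
    "source_faithfulness": 0.5,
    "completeness": 0.8,
    "hallucination_risk": 0.9,
    "confidence_calibration": 0.7,
}
print(f"{overall_score(example):.0%}")
```

Because factual accuracy carries the largest weight, a low accuracy score pulls the overall figure down more than an equally low calibration score would.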


Usage

Single evaluation — interactive

python cli.py

The CLI prompts for query, answer, and optional source URLs.

Single evaluation — flags

python cli.py \
  --query "Who invented Python?" \
  --answer "Python was created by Guido van Rossum in 1991." \
  --source "https://docs.python.org/3/faq/general.html"

Batch evaluation

python cli.py --batch examples/batch.json --output results.json

Batch input format:

[
  {
    "id": "optional-label",
    "query": "...",
    "answer": "...",
    "sources": ["url1", "url2"]
  }
]

JSON output (pipe-friendly)

python cli.py --query "..." --answer "..." --format json | jq .

Sample output

────────────────────────────────────────────────────────────
Query: What temperature does water boil at?
Overall: High  ████████████████░░░░ 82%

  Factual accuracy              Pass         ████████████████░░░░ 90%
    All specific claims are accurate. The altitude caveat is correct.

  Source faithfulness           Pass         ██████████░░░░░░░░░░ 50%
    No sources provided — scored neutral. Cannot verify against sources.

  Completeness                  Pass         ███████████████░░░░░ 80%
    Covers the primary answer and an important edge case (altitude).

  Hallucination risk            Pass         ████████████████████ 95%
    No ungrounded specific claims. The 93°C figure is standard physics.

  Confidence calibration        Pass         ███████████████░░░░░ 78%
    Appropriate certainty throughout. No overconfident claims.

Setup

git clone https://github.com/AnjanaG/search-quality-evaluator.git
cd search-quality-evaluator
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Add your Anthropic API key to .env
set -a; source .env; set +a   # export the variables so the Python process can read them
python cli.py

Get an API key at console.anthropic.com.


Evaluation methodology

Why these five dimensions?

Each dimension catches a different failure mode:

Factual accuracy and hallucination risk sound similar but are distinct. Factual accuracy asks: is this claim true? Hallucination risk asks: is this claim supported by any source? An answer can be factually correct but still high-hallucination-risk if it's making up numbers that happen to be right. These need to be scored separately.

Source faithfulness is distinct from both — an answer can accurately represent a source that is itself wrong. Source faithfulness only asks whether the answer matches the source, not whether the source is correct.
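The separation between these dimensions can be made concrete by treating "is it true?" and "is it grounded in a source?" as two independent flags on a claim. This is an illustrative sketch only; the `Claim` type and `describe` function are hypothetical and are not part of the evaluator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical model of a single claim in an answer."""
    text: str
    is_true: bool      # what factual accuracy asks about
    is_grounded: bool  # what hallucination risk asks about (source support)


def describe(claim: Claim) -> str:
    """Name the situation each combination of the two flags corresponds to."""
    if claim.is_true and claim.is_grounded:
        return "accurate and grounded"
    if claim.is_true:
        return "accurate but ungrounded: still a hallucination risk"
    if claim.is_grounded:
        return "faithful to a source that is itself wrong"
    return "inaccurate and ungrounded"


lucky_guess = Claim("A statistic that happens to be right", is_true=True, is_grounded=False)
print(describe(lucky_guess))
```

All four combinations can occur, which is why a single combined score would hide the difference between them.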

Completeness is the most underrated dimension in answer quality. Users with complex queries often accept incomplete answers because the partial answer is correct. A good evaluator catches this.

Confidence calibration matters because overconfident wrong answers are more dangerous than uncertain wrong answers. An answer that says "scientists are still debating X" when X is settled makes a different kind of error from one that says "X is definitively true" when X is contested.

What ground truth means here

This evaluator uses an LLM as judge. It does not have access to ground truth — it evaluates based on internal consistency, source alignment, and plausibility. This is a known limitation.

A score of 1.0 (100%) does not mean the answer is correct. It means the answer appears internally consistent, well-sourced, complete, and appropriately calibrated based on what the evaluator can see.
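To show why the scores reflect only what the judge can see, here is a rough sketch of the LLM-as-judge loop: build a prompt from the query, answer, and sources, then validate whatever the model returns. Everything in it is an assumption for illustration. The prompt wording, the JSON reply shape, and the names `build_judge_prompt`, `parse_judge_reply`, and `judge` are hypothetical and do not describe `evaluator/prompts.py` or `evaluator/core.py`. The model call is passed in as a plain function so the sketch runs without an API key.

```python
import json
from typing import Callable

# Hypothetical key names for the five dimensions.
DIMENSIONS = (
    "factual_accuracy",
    "source_faithfulness",
    "completeness",
    "hallucination_risk",
    "confidence_calibration",
)


def build_judge_prompt(query: str, answer: str, sources: list[str]) -> str:
    """Assemble the text the judge sees; nothing else informs its scores."""
    source_text = "\n".join(sources) if sources else "(no sources provided)"
    return (
        "Score the answer from 0 to 1 on each dimension and reply with JSON "
        f"using exactly these keys: {', '.join(DIMENSIONS)}.\n\n"
        f"Query: {query}\nAnswer: {answer}\nSources:\n{source_text}"
    )


def parse_judge_reply(reply: str) -> dict[str, float]:
    """Validate the judge's JSON reply; raise ValueError if it is malformed."""
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    scores = {}
    for name in DIMENSIONS:
        value = data.get(name)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name}: expected a number")
        if not 0 <= value <= 1:
            raise ValueError(f"{name}: score must be between 0 and 1")
        scores[name] = float(value)
    return scores


def judge(query: str, answer: str, sources: list[str],
          call_model: Callable[[str], str]) -> dict[str, float]:
    """Run one evaluation using any function that maps a prompt to a reply."""
    return parse_judge_reply(call_model(build_judge_prompt(query, answer, sources)))


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call, returning a fixed neutral reply."""
    return json.dumps({name: 0.5 for name in DIMENSIONS})


print(judge("What temperature does water boil at?", "100°C at sea level.", [], fake_model))
```

The judge never receives a reference answer, so a well-formed reply with high scores says nothing about whether the claims are true.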

Reliability varies by task:

  • Detecting hallucinated statistics or quotes (high reliability)
  • Identifying overconfident claims on contested topics (high reliability)
  • Flagging answers that contradict their sources (high reliability)
  • Determining absolute factual accuracy on specialized topics (low reliability)

Limitations

  • No access to real-time information. Cannot verify claims that require current knowledge.
  • Inherits model biases. The evaluator model (Claude) may have systematic blind spots.
  • No semantic similarity scoring. Cannot detect paraphrase-level faithfulness violations.
  • Calibration varies by domain. The evaluator is more reliable on general knowledge than specialized technical or legal domains.

What a production version needs

  • Human rater baseline to calibrate the LLM-as-judge against human judgment
  • Golden dataset of query/answer pairs with known ground truth for continuous regression testing
  • Domain-specific prompts for specialized verticals (medical, legal, financial)
  • Semantic similarity scoring to detect source faithfulness at the paraphrase level
  • Statistical confidence intervals on scores, not point estimates
  • Longitudinal tracking to detect model degradation over time
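As an example of the confidence-interval item above, a percentile bootstrap over a set of overall scores is one simple option. This is a sketch under stated assumptions, not part of the current tool: `bootstrap_ci` is a hypothetical helper, the scores are invented, and a production version might prefer a different interval method.

```python
import random
import statistics


def bootstrap_ci(scores: list[float], confidence: float = 0.95,
                 resamples: int = 2000, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of a list of scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(resamples)
    )
    alpha = (1 - confidence) / 2
    lower = means[int(alpha * resamples)]
    upper = means[min(resamples - 1, int((1 - alpha) * resamples))]
    return lower, upper


# Invented overall scores from a hypothetical batch run.
batch_scores = [0.82, 0.75, 0.91, 0.60, 0.88, 0.79, 0.95, 0.70]
low, high = bootstrap_ci(batch_scores)
print(f"mean {statistics.fmean(batch_scores):.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

With only a handful of answers the interval is wide, which is the point: a single point estimate overstates how much a small batch can tell you.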

Project structure

search-quality-evaluator/
├── evaluator/
│   ├── __init__.py        # Public API
│   ├── core.py            # Anthropic API calls and report assembly
│   ├── models.py          # EvaluationInput, EvaluationReport, BatchReport
│   └── prompts.py         # Evaluation system and user prompts
├── cli.py                 # CLI entrypoint (single + batch)
├── examples/
│   └── batch.json         # Example batch input
├── requirements.txt
└── .env.example

Stack

Python 3.11+ and the Anthropic SDK; otherwise standard library only (no framework dependencies)

About

CLI tool that evaluates AI-generated answers across five quality dimensions
