veritail includes a standalone autocomplete (type-ahead) evaluation mode. It runs deterministic quality checks and LLM-based semantic evaluation against your suggestion engine. You can run autocomplete evaluation on its own, combine it with search evaluation in a single run, or compare two autocomplete configurations side by side.
Provide a CSV or JSON file with a prefix column. Prefix types are automatically inferred from character count, or you can provide an optional type column to override:
| Character count | Inferred type |
|---|---|
| 1-2 | short_prefix |
| 3-9 | mid_prefix |
| 10+ | long_prefix |
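
The inference rule in the table above can be sketched in a few lines of Python. This is only an illustration: `infer_prefix_type` is a hypothetical helper, not part of veritail's API, and it assumes the count is taken over the raw string (spaces included).

```python
def infer_prefix_type(prefix: str) -> str:
    """Map a prefix to a type using the character-count table.

    Assumption: the count is len(prefix) on the raw string, spaces included.
    veritail's own normalization may differ. An empty prefix is not covered
    by the table and falls into short_prefix here.
    """
    n = len(prefix)
    if n <= 2:
        return "short_prefix"
    if n <= 9:
        return "mid_prefix"
    return "long_prefix"


for p in ["he", "hea", "headph", "headphones w"]:
    print(f"{p!r:16} -> {infer_prefix_type(p)}")
```
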
prefix
he
hea
headph
headphones w

Add a suggest function to your adapter module. It receives a prefix string and returns an AutocompleteResponse or a bare list[str]:
# adapter.py
from veritail import AutocompleteResponse


def suggest(prefix: str) -> AutocompleteResponse:
    # my_autocomplete_api is a placeholder for your own suggestion client
    results = my_autocomplete_api.suggest(prefix)
    # Or simply: return results (a bare list[str] is also accepted)
    return AutocompleteResponse(suggestions=results)

If your adapter only defines suggest() without a search() function, search evaluation is skipped automatically.
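
For a dry run before wiring up a real backend, a stub adapter that returns a bare list[str] may be enough. The sketch below is an assumption-laden stand-in: `CATALOG` and the matching rule are invented for illustration, and only the `suggest(prefix)` signature comes from the description above.

```python
# adapter_stub.py -- hypothetical stand-in for a real suggestion backend
CATALOG = [
    "headphones",
    "headphones wireless",
    "headphone stand",
    "heated blanket",
    "hex key set",
]


def suggest(prefix: str) -> list[str]:
    """Return up to 5 catalog entries that start with the prefix."""
    needle = prefix.strip().lower()
    if not needle:
        return []
    # Simple starts-with match; a real engine would rank and fuzzy-match.
    return [item for item in CATALOG if item.startswith(needle)][:5]


print(suggest("headphones w"))
```

Because the stub has no search() function, pointing --adapter at it would exercise only the autocomplete path.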
veritail run \
--autocomplete prefixes.csv \
--adapter adapter.py \
--llm-model gpt-4o \
--open

All checks below are deterministic and run alongside the LLM evaluation. No additional configuration is needed.
| Check | What it catches |
|---|---|
| empty_suggestions | Prefix returned zero suggestions |
| duplicate_suggestion | Exact duplicate suggestions (case-insensitive) |
| prefix_coherence | Suggestion neither starts with the prefix nor shares a token with it |
| offensive_content | Suggestion contains a blocked term (requires blocklist) |
| encoding_issues | HTML entities, control characters, or leading/trailing whitespace |
| length_anomaly | Suggestion shorter than 2 characters or longer than 80 characters |
| latency | Adapter response time exceeds threshold (default 200 ms) |
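
To make the table concrete, here is a rough sketch of how four of these checks could be expressed. It is not veritail's implementation: the function names are hypothetical, and details such as whole-token matching for coherence and the entity regex are assumptions.

```python
import re

# Assumed patterns; veritail's actual rules may be broader or narrower.
HTML_ENTITY = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def find_duplicates(suggestions: list[str]) -> list[str]:
    """Return suggestions that repeat an earlier one, ignoring case."""
    seen: set[str] = set()
    dupes = []
    for s in suggestions:
        key = s.casefold()
        if key in seen:
            dupes.append(s)
        seen.add(key)
    return dupes


def is_coherent(prefix: str, suggestion: str) -> bool:
    """True if the suggestion starts with the prefix or shares a whole token."""
    p, s = prefix.lower().strip(), suggestion.lower().strip()
    if s.startswith(p):
        return True
    return bool(set(p.split()) & set(s.split()))


def has_encoding_issue(suggestion: str) -> bool:
    """True for HTML entities, control characters, or surrounding whitespace."""
    return (
        HTML_ENTITY.search(suggestion) is not None
        or CONTROL_CHARS.search(suggestion) is not None
        or suggestion != suggestion.strip()
    )


def is_length_anomaly(suggestion: str, min_len: int = 2, max_len: int = 80) -> bool:
    """True if the suggestion is shorter than min_len or longer than max_len."""
    return len(suggestion) < min_len or len(suggestion) > max_len
```
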
Pass two adapters to run an A/B comparison:
veritail run \
--autocomplete prefixes.csv \
--adapter bm25_search_adapter.py --config-name bm25-baseline \
--adapter semantic_search_adapter.py --config-name semantic-v2

In addition to per-adapter checks, comparison mode adds two cross-configuration checks:
- suggestion_overlap -- Jaccard index of normalized suggestions between configurations
- rank_agreement -- Spearman rank correlation for shared suggestions
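
Both metrics are standard and can be sketched without dependencies. The version below is an illustration, not veritail's code: the normalization (lowercase, collapsed whitespace), the re-ranking within the shared subset, and the handling of empty or tiny inputs are all assumptions.

```python
from typing import Optional


def normalize(s: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace.
    return " ".join(s.lower().split())


def dedupe(suggestions: list[str]) -> list[str]:
    """Normalize and drop repeats, keeping first-seen order."""
    return list(dict.fromkeys(normalize(s) for s in suggestions))


def jaccard(a: list[str], b: list[str]) -> float:
    """|A intersect B| / |A union B| over normalized suggestions."""
    set_a, set_b = set(dedupe(a)), set(dedupe(b))
    union = set_a | set_b
    # Two empty lists are treated as identical here (a convention choice).
    return len(set_a & set_b) / len(union) if union else 1.0


def spearman_shared(a: list[str], b: list[str]) -> Optional[float]:
    """Spearman correlation over suggestions present in both lists.

    Ranks are re-assigned within the shared subset, so there are no ties and
    the closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)) applies.
    Returns None when fewer than two suggestions are shared.
    """
    list_a, list_b = dedupe(a), dedupe(b)
    shared = set(list_a) & set(list_b)
    n = len(shared)
    if n < 2:
        return None
    rank_a = {s: i for i, s in enumerate(x for x in list_a if x in shared)}
    rank_b = {s: i for i, s in enumerate(x for x in list_b if x in shared)}
    d2 = sum((rank_a[s] - rank_b[s]) ** 2 for s in shared)
    return 1 - 6 * d2 / (n * (n * n - 1))


baseline = ["Headphones", "headphones wireless", "headset"]
candidate = ["headset", "headphones", "heated blanket"]
print(jaccard(baseline, candidate), spearman_shared(baseline, candidate))
```

In this example two of four distinct suggestions are shared, and the two shared items appear in opposite order, so the overlap is moderate while the rank agreement is fully negative.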
Add domain-specific checks with --autocomplete-checks. Each check_* function receives (prefix: str, suggestions: list[str]) and returns list[CheckResult]:
# my_autocomplete_checks.py
from veritail.types import CheckResult
def check_brand_prefix(prefix: str, suggestions: list[str]) -> list[CheckResult]:
    """Flag suggestions that don't preserve a known brand prefix."""
    checks = []
    for s in suggestions:
        if prefix.lower() in ("sony", "sam") and not s.lower().startswith(prefix.lower()):
            checks.append(
                CheckResult(
                    check_name="brand_prefix",
                    query=prefix,
                    passed=False,
                    detail=f"'{s}' does not start with brand prefix '{prefix}'",
                )
            )
    return checks

veritail run \
--autocomplete prefixes.csv \
--adapter adapter.py \
--autocomplete-checks my_autocomplete_checks.py

Custom check results appear alongside built-in checks in reports.
In addition to deterministic checks, veritail automatically runs LLM-based semantic evaluation for single-adapter autocomplete runs. This evaluates whether suggestions are semantically relevant to the user's intent, diverse across shopping intents, and appropriate for the store's vertical.
# Autocomplete-only with LLM evaluation
veritail run \
--autocomplete prefixes.csv \
--adapter adapter.py \
--llm-model gpt-4o
# With vertical and custom instructions
veritail run \
--autocomplete prefixes.csv \
--adapter adapter.py \
--llm-model gpt-4o \
--vertical home-improvement \
--instructions "Big-box home improvement retailer"
# Combined with search evaluation
veritail run \
--queries queries.csv \
--autocomplete prefixes.csv \
--adapter adapter.py \
--llm-model gpt-4o
# Batch mode (50% cheaper, slower)
veritail run \
--autocomplete prefixes.csv \
--adapter adapter.py \
--llm-model gpt-4o --batch

Each prefix with non-empty suggestions receives one LLM call that scores:
- Relevance (0-3) -- Do suggestions match the likely shopping intent?
  - 3: All suggestions are plausible completions a real shopper would want
  - 2: Most suggestions make sense, one or two are tangential
  - 1: Several suggestions miss the likely intent
  - 0: Suggestions do not match any reasonable interpretation of the prefix
- Diversity (0-3) -- Do suggestions cover different categories and use cases?
  - 3: Suggestions span multiple product categories, brands, or use cases
  - 2: Some variety, but clustering around one category
  - 1: Most suggestions are near-duplicates or cover one narrow intent
  - 0: All suggestions are essentially the same item or intent
- Flagged suggestions -- Individual suggestions that are unrelated to the prefix, offensive or inappropriate, duplicative, or from the wrong product domain
LLM evaluation makes one call per prefix, so a run with 100 prefixes makes 100 calls. Prefixes with empty suggestions are skipped. Use --batch for a 50% cost reduction with cloud providers (see Batch Mode and Resume).
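
The call-count arithmetic can be estimated ahead of a run. The sketch below uses hypothetical helpers, and `cost_per_call` is a placeholder figure you would supply yourself, since actual pricing depends on the provider, model, and prompt size.

```python
def estimate_llm_calls(suggestions_by_prefix: dict[str, list[str]]) -> int:
    """One call per prefix that returned at least one suggestion."""
    return sum(1 for suggestions in suggestions_by_prefix.values() if suggestions)


def estimate_cost(calls: int, cost_per_call: float, batch: bool = False) -> float:
    """Rough cost estimate; cost_per_call is a user-supplied assumption.

    Batch mode is modeled as a flat 50% reduction, per the note above.
    """
    return calls * cost_per_call * (0.5 if batch else 1.0)


sample = {
    "he": ["headphones", "heated blanket"],
    "hea": ["headphones"],
    "zzq": [],  # empty suggestions: skipped, no LLM call
}
calls = estimate_llm_calls(sample)
print(calls, estimate_cost(calls, 0.01), estimate_cost(calls, 0.01, batch=True))
```
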
- Check pass/fail summary table
- Failed checks detail
- LLM Suggestion Quality summary (average relevance, average diversity, flagged count)
- Flagged suggestions detail with reasoning
- Lowest relevance scores
- Per-prefix drill-down
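
As a rough illustration of how per-prefix scores roll up into the summary figures listed above, consider the sketch below. The record shape and field names are hypothetical and may not match veritail's actual output schema.

```python
from statistics import mean

# Hypothetical per-prefix judgments; field names are illustrative only.
judgments = [
    {"prefix": "hea", "relevance": 3, "diversity": 2, "flagged": []},
    {"prefix": "headph", "relevance": 2, "diversity": 1, "flagged": ["head gasket"]},
    {"prefix": "headphones w", "relevance": 1, "diversity": 0, "flagged": []},
]


def summarize(rows: list[dict], lowest_n: int = 2) -> dict:
    """Average the 0-3 scores, count flagged suggestions, list weakest prefixes."""
    if not rows:
        return {
            "avg_relevance": None,
            "avg_diversity": None,
            "flagged_count": 0,
            "lowest_relevance": [],
        }
    by_relevance = sorted(rows, key=lambda r: r["relevance"])
    return {
        "avg_relevance": mean(r["relevance"] for r in rows),
        "avg_diversity": mean(r["diversity"] for r in rows),
        "flagged_count": sum(len(r["flagged"]) for r in rows),
        "lowest_relevance": [r["prefix"] for r in by_relevance[:lowest_n]],
    }


print(summarize(judgments))
```
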
Standalone report with per-prefix detail, including LLM scores and flagged suggestions. Open automatically with --open.
suggestion-judgments.jsonl -- Written alongside other evaluation artifacts in the experiment output directory
- CLI Reference -- Full list of veritail run options including --autocomplete, --autocomplete-checks, and --sample
- Batch Mode and Resume -- Reducing cost with batch API and resuming interrupted runs
- Supported LLM Providers -- Cloud and local model options
- Custom Checks -- Writing custom check modules for search evaluation