veritail works with cloud LLM APIs and any OpenAI-compatible local model server. This page covers provider setup, local model configuration, and model quality guidance.
- Python >= 3.9
- An LLM provider -- one of:
- OpenAI API key (included with base install)
- Anthropic API key (
pip install veritail[anthropic]) - Google Gemini API key (
pip install veritail[gemini]) - A running OpenAI-compatible local model server (no extra install needed -- see Local models below)
| Provider | Example --llm-model |
API key env var | Install |
|---|---|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, o3-mini |
OPENAI_API_KEY |
included |
| Anthropic (Claude) | claude-sonnet-4-5, claude-haiku-4-5 |
ANTHROPIC_API_KEY |
pip install veritail[anthropic] |
| Google Gemini | gemini-2.5-flash, gemini-2.5-pro |
GEMINI_API_KEY or GOOGLE_API_KEY |
pip install veritail[gemini] |
Cloud models provide the highest evaluation quality and are recommended for production use.
veritail selects the provider based on the model name passed to --llm-model:
- Names starting with
claudeuse the Anthropic API. - Names starting with
geminiuse the Google Gemini API. - All other names use the OpenAI API (this also covers OpenAI-compatible local servers).
OpenAI, Anthropic, and Gemini all support batch evaluation via veritail run --batch. Batch mode submits all judgments in a single API call and polls for results, which can reduce costs and rate-limit pressure. For OpenAI, batch mode is only available when using the default OpenAI endpoint (not when --llm-base-url is set).
veritail connects to any server that exposes the OpenAI chat completions API (POST /v1/chat/completions). Pass --llm-base-url to point at a local endpoint:
# Ollama
ollama pull qwen3:14b
veritail run \
--queries queries.csv \
--adapter my_adapter.py \
--llm-model qwen3:14b \
--llm-base-url http://localhost:11434/v1 \
--llm-api-key not-needed
# vLLM
veritail run \
--queries queries.csv \
--adapter my_adapter.py \
--llm-model meta-llama/Llama-4-Scout \
--llm-base-url http://localhost:8000/v1 \
--llm-api-key not-needed
# LM Studio
veritail run \
--queries queries.csv \
--adapter my_adapter.py \
--llm-model local-model \
--llm-base-url http://localhost:1234/v1 \
--llm-api-key lm-studio--llm-base-url and --llm-api-key also work with veritail generate-queries.
You can set environment variables instead of passing CLI flags:
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=not-needed
veritail run --queries queries.csv --adapter my_adapter.py --llm-model qwen3:14b| Server | Default port | Docs |
|---|---|---|
| Ollama | 11434 |
OpenAI compatibility |
| vLLM | 8000 |
OpenAI-compatible server |
| LM Studio | 1234 |
API docs |
| LocalAI | 8080 |
Features |
| llama.cpp server | 8080 |
Server docs |
| SGLang | varies | Docs |
veritail computes aggregate IR metrics (NDCG, MRR, MAP) from LLM relevance scores. The reliability of these metrics depends on the LLM's ability to follow instructions and produce consistent judgments.
| Model tier | Examples | Metric reliability |
|---|---|---|
| Frontier cloud models | Claude Sonnet/Opus, GPT-4o, GPT-o3 | High -- recommended for production evaluation |
| Large local models (70B+) | Llama 4 Maverick, Qwen 3 72B, DeepSeek V3 | Good -- comparable to cloud models with sufficient hardware |
| Mid-size local models (14B-30B) | Qwen 3 14B/30B, Phi-4 14B, Mistral 7x8B | Adequate -- some scoring noise; suitable for rapid iteration |
| Small local models (<=8B) | Llama 3.2 3B, Phi-4-mini, Gemma 3 4B | Noisy -- scores may be inconsistent and affect metric reliability |
For reliable metrics that can inform production search decisions, we recommend frontier cloud models or 70B+ parameter local models. Smaller models are useful for fast, low-cost iteration during development but their scores should be interpreted with caution.
- Backends -- storage backend options
- Development -- contributing and running tests