A public benchmark for evaluating the data processing, reconciliation, and forensic reasoning capabilities of AI agents in a financial context.
- data/: The synthetic dataset representing a real-world acquiring setup.
  - pave_settlements.csv
  - pave_transactions.csv
  - visa_interchange_rates.csv
- questions.json: 24 forensic accounting tasks of increasing complexity.
- METHODOLOGY.md: Details on the dataset, tasks, and evaluation metrics.
- LEADERBOARD.md: Current state-of-the-art performance on this benchmark.
- PAPER_DRAFT.md: Technical paper on Adversarial Verification and the benchmark methodology.
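Before pointing an agent at the dataset, it can help to check which columns each file actually exposes. The following is a minimal sketch, assuming the three CSVs sit under data/ and each starts with a header row. The helper name `preview_csv` is hypothetical and not part of this repository, and no column names are assumed.

```python
import csv
from itertools import islice
from pathlib import Path


def preview_csv(path, n_rows=3):
    """Return (column_names, first n_rows rows as dicts) for a CSV with a header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(islice(reader, n_rows))
        # fieldnames is filled in from the header line of the file
        columns = list(reader.fieldnames or [])
    return columns, rows


if __name__ == "__main__":
    # Assumption: the script is run from the repository root, next to data/.
    for name in (
        "pave_settlements.csv",
        "pave_transactions.csv",
        "visa_interchange_rates.csv",
    ):
        path = Path("data") / name
        if path.exists():
            columns, rows = preview_csv(path)
            print(name, columns, f"({len(rows)} rows previewed)")
```

Only the standard library is used here so the check runs before any dependencies are installed.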
We provide a ready-to-use runner script that supports OpenAI, Anthropic, Gemini, and other OpenAI-compatible APIs (such as DeepSeek, Grok, and Kimi).
- Install Dependencies:
pip install -r requirements.txt
- Run the Benchmark: Use the runner script to automatically prompt your model and save the predictions.
For OpenAI / OpenClaw:
export OPENAI_API_KEY="sk-..."
python run_benchmark.py --provider openai --model openclaw
For Anthropic / Claude 4.6:
export ANTHROPIC_API_KEY="sk-..."
python run_benchmark.py --provider anthropic --model claude-4-6-adaptive-thinking
For Gemini:
export GEMINI_API_KEY="..."
python run_benchmark.py --provider gemini --model gemini-3.1-pro
For other OpenAI-compatible models (e.g., DeepSeek, Grok):
python run_benchmark.py --provider openai --model deepseek-chat --api-key "sk-..." --base-url "https://api.deepseek.com/v1"
- Evaluate the Results: You can evaluate locally if you have the ground truth, or use our Hosted Evaluation API for official scoring.
Local Evaluation:
python evaluate.py --pred predictions.json
Hosted Evaluation (Official):
# Request an API key to use the hosted scorer
curl -X POST https://api.pavebenchmark.com/v1/benchmark/evaluate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d @predictions.json
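For harnesses that submit from Python rather than the shell, the same request can be sketched with the standard library. It mirrors the curl call above (same URL, same two headers, file contents as the body). The function names `build_eval_request` and `submit` are hypothetical. The response format is not documented here, so decoding it as JSON is an assumption.

```python
import json
import urllib.request

EVAL_URL = "https://api.pavebenchmark.com/v1/benchmark/evaluate"


def build_eval_request(predictions_path, api_key):
    """Build the POST request that the curl command sends; does not send it."""
    with open(predictions_path, "rb") as handle:
        body = handle.read()
    json.loads(body)  # fail early if the file is not valid JSON
    return urllib.request.Request(
        EVAL_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def submit(predictions_path, api_key):
    """Send the request. Assumption: the API responds with a JSON body."""
    request = build_eval_request(predictions_path, api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it lets you inspect the headers and payload locally without spending an official evaluation call.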
If you prefer to write your own harness, your agent must output a predictions.json file. The expected format for each question is detailed inside questions.json.
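A custom harness can be as small as the sketch below. It assumes, purely for illustration, that questions.json holds a list of objects with an `id` field and that predictions.json maps each id to an answer. Neither assumption is confirmed here, so check questions.json for the real schema first. `answer_question` is a hypothetical stand-in for your agent.

```python
import json


def answer_question(question):
    """Hypothetical stand-in for your agent; replace with a real model call."""
    return "TODO"


def run_harness(questions_path="questions.json", output_path="predictions.json"):
    """Answer every question and write the results to output_path."""
    with open(questions_path, encoding="utf-8") as handle:
        questions = json.load(handle)

    # Assumption: each question is a dict carrying an "id" key.
    predictions = {q["id"]: answer_question(q) for q in questions}

    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(predictions, handle, indent=2)
    return predictions
```

If the real format differs (for example, a list of records instead of a mapping), only the line that builds `predictions` needs to change.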
We welcome contributions to expand the dataset or add new forensic scenarios. Please submit a PR or open an issue to discuss proposed changes.