Pave Forensic Accounting Benchmark

A public benchmark for evaluating the data processing, reconciliation, and forensic reasoning capabilities of AI agents in a financial context.

Repository Contents

- data/: The synthetic dataset representing a real-world acquiring setup.
  - pave_settlements.csv
  - pave_transactions.csv
  - visa_interchange_rates.csv
- questions.json: 24 forensic accounting tasks of increasing complexity.
- METHODOLOGY.md: Details on the dataset, tasks, and evaluation metrics.
- LEADERBOARD.md: Current state-of-the-art performance on this benchmark.
- PAPER_DRAFT.md: Technical paper on Adversarial Verification and the benchmark methodology.
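
Before pointing an agent at the data, it can help to check what each CSV actually contains. The sketch below is not part of the repository; the helper names `summarize_csv` and `summarize_data_dir` are hypothetical, and it assumes only that the files under data/ are UTF-8 CSVs with a header row. It makes no assumption about column names.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, row_count) for one CSV file.

    ASSUMPTION: the first row of the file is a header row.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])          # empty list if the file is empty
        row_count = sum(1 for _ in reader)  # remaining rows are data rows
    return header, row_count


def summarize_data_dir(data_dir="data"):
    """Map each CSV file name in data_dir to its (column_names, row_count)."""
    return {
        csv_path.name: summarize_csv(csv_path)
        for csv_path in sorted(Path(data_dir).glob("*.csv"))
    }
```

Calling `summarize_data_dir("data")` from the repository root should return one entry per CSV file listed above.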

Quickstart: Running the Benchmark

The runner script, run_benchmark.py, supports OpenAI, Anthropic, and Gemini, as well as other OpenAI-compatible APIs (such as Deepseek, Grok, and Kimi).

Install Dependencies:
```
pip install -r requirements.txt
```

Run the Benchmark: Use the runner script to automatically prompt your model and save the predictions.

For OpenAI / OpenClaw:

```
export OPENAI_API_KEY="sk-..."
python run_benchmark.py --provider openai --model openclaw
```

For Anthropic / Claude 4.6:

```
export ANTHROPIC_API_KEY="sk-..."
python run_benchmark.py --provider anthropic --model claude-4-6-adaptive-thinking
```

For Gemini:

```
export GEMINI_API_KEY="..."
python run_benchmark.py --provider gemini --model gemini-3.1-pro
```

For other OpenAI-compatible models (e.g., Deepseek, Grok):

```
python run_benchmark.py --provider openai --model deepseek-chat --api-key "sk-..." --base-url "https://api.deepseek.com/v1"
```

Evaluate the Results: You can evaluate locally if you have the ground truth, or use our Hosted Evaluation API for official scoring.

Local Evaluation:

```
python evaluate.py --pred predictions.json
```

Hosted Evaluation (Official):

```
# Request an API key to use the hosted scorer
curl -X POST https://api.pavebenchmark.com/v1/benchmark/evaluate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d @predictions.json
```
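
For pipelines that submit from Python rather than the shell, the same request can be built with the standard library. This is a minimal sketch, not an official client: the function names are hypothetical, the endpoint and headers are copied from the curl command above, and the assumption that the scorer replies with a JSON body is ours.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
EVALUATE_URL = "https://api.pavebenchmark.com/v1/benchmark/evaluate"


def build_evaluation_request(predictions_path, api_key):
    """Build (but do not send) the POST request for the hosted scorer."""
    with open(predictions_path, "rb") as handle:
        body = handle.read()
    json.loads(body)  # fail early if the file is not valid JSON
    return urllib.request.Request(
        EVALUATE_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def submit_predictions(predictions_path, api_key, timeout=60):
    """Send the request and decode the reply.

    ASSUMPTION: the hosted scorer responds with a JSON document.
    """
    request = build_evaluation_request(predictions_path, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect the headers and payload without touching the network.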

Manual Evaluation

If you prefer to write your own harness, your agent must output a predictions.json file. The expected format for each question is detailed inside questions.json.
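
A custom harness can be as small as the loop sketched below. Treat the data shapes as placeholders: the sketch assumes questions.json is a list of objects that each carry an "id" field, and that predictions.json maps each id to an answer. Neither assumption is confirmed here, so check questions.json for the actual format and adjust accordingly. `run_harness` and `answer_fn` are hypothetical names.

```python
import json


def run_harness(questions_path, output_path, answer_fn):
    """Ask answer_fn for every question and write the results to output_path.

    ASSUMPTION: questions.json is a list of dicts, each with an "id" key.
    ASSUMPTION: predictions.json is a mapping of question id to answer.
    The real schema is described in questions.json and may differ.
    """
    with open(questions_path, encoding="utf-8") as handle:
        questions = json.load(handle)

    predictions = {}
    for question in questions:
        # answer_fn is where your agent is called; it receives the full
        # question object and returns a JSON-serializable answer.
        predictions[question["id"]] = answer_fn(question)

    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(predictions, handle, indent=2)
    return predictions
```

A call such as `run_harness("questions.json", "predictions.json", my_agent)` would then produce a file to pass to the evaluation step, where `my_agent` is your own callable.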

Contributing

We welcome contributions to expand the dataset or add new forensic scenarios. Please submit a PR or open an issue to discuss proposed changes.
