Finance-Benchmark/pave-benchmark

Pave Forensic Accounting Benchmark

A public benchmark for evaluating the data processing, reconciliation, and forensic reasoning capabilities of AI agents in a financial context.

Repository Contents

  • data/: A synthetic dataset modeled on a real-world acquiring setup.
    • pave_settlements.csv
    • pave_transactions.csv
    • visa_interchange_rates.csv
  • questions.json: 24 forensic accounting tasks of increasing complexity.
  • METHODOLOGY.md: Details on the dataset, tasks, and evaluation metrics.
  • LEADERBOARD.md: Current state-of-the-art performance on this benchmark.
  • PAPER_DRAFT.md: Technical paper on Adversarial Verification and the benchmark methodology.
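Before running a model against the tasks, it can help to confirm that the three data files are present and to see which columns they contain. The following is a minimal sketch, not part of the repository: the helper names `summarize_csv` and `summarize_dataset` are hypothetical, and the code assumes each CSV starts with a header row, which this README does not state.

```python
import csv
from pathlib import Path

# File names as listed under data/ in this README.
DATA_FILES = [
    "pave_settlements.csv",
    "pave_transactions.csv",
    "visa_interchange_rates.csv",
]


def summarize_csv(path):
    """Return (column_names, row_count), assuming the first row is a header."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])          # empty file -> no columns
        row_count = sum(1 for _ in reader)  # remaining rows are data rows
    return header, row_count


def summarize_dataset(data_dir="data"):
    """Map each expected file name to its summary, or None if it is missing."""
    summary = {}
    for name in DATA_FILES:
        path = Path(data_dir) / name
        summary[name] = summarize_csv(path) if path.exists() else None
    return summary
```

Calling `summarize_dataset()` from the repository root returns a dictionary that can be printed to check row counts and column names before any reconciliation logic is written.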

Quickstart: Running the Benchmark

We provide a ready-to-use runner script that supports OpenAI, Anthropic, Gemini, and other OpenAI-compatible APIs (such as DeepSeek, Grok, and Kimi).

  1. Install Dependencies:

    pip install -r requirements.txt
  2. Run the Benchmark: Use the runner script to automatically prompt your model and save the predictions.

    For OpenAI / OpenClaw:

    export OPENAI_API_KEY="sk-..."
    python run_benchmark.py --provider openai --model openclaw

    For Anthropic / Claude 4.6:

    export ANTHROPIC_API_KEY="sk-..."
    python run_benchmark.py --provider anthropic --model claude-4-6-adaptive-thinking

    For Gemini:

    export GEMINI_API_KEY="..."
    python run_benchmark.py --provider gemini --model gemini-3.1-pro

    For other OpenAI-compatible models (e.g., DeepSeek, Grok):

    python run_benchmark.py --provider openai --model deepseek-chat --api-key "sk-..." --base-url "https://api.deepseek.com/v1"
  3. Evaluate the Results: You can evaluate locally if you have the ground truth, or use our Hosted Evaluation API for official scoring.

    Local Evaluation:

    python evaluate.py --pred predictions.json

    Hosted Evaluation (Official):

    # Request an API key to use the hosted scorer
    curl -X POST https://api.pavebenchmark.com/v1/benchmark/evaluate \
      -H "Authorization: Bearer YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d @predictions.json
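For harnesses that already run in Python, the hosted-evaluation request can be sent without shelling out to curl. The sketch below uses only the standard library and mirrors the URL and headers of the curl command above. The function names are hypothetical, and the shape of the scorer's response is not documented here, so the body is returned as raw text.

```python
import json
import urllib.request

# Endpoint taken from the curl example in this README.
EVALUATE_URL = "https://api.pavebenchmark.com/v1/benchmark/evaluate"


def build_evaluation_request(predictions_path, api_key, url=EVALUATE_URL):
    """Build (but do not send) a POST request carrying predictions.json."""
    with open(predictions_path, "rb") as handle:
        body = handle.read()
    json.loads(body)  # fail early if the file is not valid JSON
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def submit_predictions(predictions_path, api_key):
    """Send the request and return the raw response text (format not assumed)."""
    request = build_evaluation_request(predictions_path, api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

Keeping request construction separate from sending makes it possible to inspect the headers and payload locally before spending a call against the hosted scorer.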

Manual Evaluation

If you prefer to write your own harness, your agent must output a predictions.json file. The expected format for each question is detailed inside questions.json.
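A custom harness could be structured roughly as follows. This is a sketch under stated assumptions, not the official format: it assumes questions.json is a JSON list of objects with an `id` field and that predictions.json maps each question id to an answer. Both are guesses for illustration, so check questions.json for the actual schema and adjust `id_key` and the output shape accordingly. The `answer_fn` callback stands in for whatever agent is being evaluated.

```python
import json


def load_questions(path="questions.json"):
    """Load the task list; assumed (not confirmed) to be a JSON list of objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def run_harness(questions, answer_fn, id_key="id"):
    """Call answer_fn on each question and collect answers keyed by question id.

    id_key is a hypothetical field name; change it to match questions.json.
    """
    predictions = {}
    for question in questions:
        predictions[str(question[id_key])] = answer_fn(question)
    return predictions


def save_predictions(predictions, path="predictions.json"):
    """Write the predictions as JSON for evaluate.py or the hosted scorer."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(predictions, handle, indent=2)
```

A typical run would chain the three calls: `save_predictions(run_harness(load_questions(), my_agent))`, where `my_agent` is a function that takes one question object and returns its answer.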

Contributing

We welcome contributions to expand the dataset or add new forensic scenarios. Please submit a PR or open an issue to discuss proposed changes.
