Contributing

Thanks for contributing to Awesome Agentic Evaluation.

What To Add

Add resources that are directly relevant to agentic evaluation, not generic LLM evaluation alone.

Strong additions usually fit at least one of these categories:

benchmarking
environment simulation
trajectory or process evaluation
observability and tracing
benchmark rigor and methodology
production testing and regression workflows

Curation Rules

Prefer primary sources: official repository, official paper, or official documentation.
Keep each entry to one link and one concise sentence whenever possible.
Mark archived, outdated, or historically important resources clearly.
Do not add generic "awesome AI" projects that are not evaluation-centric.
Place entries in the most specific section that fits.
Keep entries alphabetized within a section when practical.
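
As one way to check the alphabetization rule before opening a pull request, here is a minimal sketch in Python. It assumes entries follow the bracketed-name style shown under Suggested Entry Format; the helper names `entry_name` and `out_of_order` are hypothetical and not part of this repository.

```python
import re

# Assumption: an entry name sits in square brackets, optionally wrapped in ** for bold.
NAME_RE = re.compile(r"\[\*{0,2}([^*\]]+)\*{0,2}\]")


def entry_name(line):
    """Return the bracketed entry name from a list line, or None if there is none."""
    match = NAME_RE.search(line)
    return match.group(1).strip() if match else None


def out_of_order(lines):
    """Return names that sort before the entry directly above them (case-insensitive)."""
    names = [name for name in map(entry_name, lines) if name]
    return [
        current
        for previous, current in zip(names, names[1:])
        if current.casefold() < previous.casefold()
    ]
```

Passing the lines of a single section to `out_of_order` returns an empty list when the section is already sorted. The comparison ignores case, which is an assumption; the rule itself only asks for alphabetical order when practical.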

Suggested Entry Format

- [**Project / Paper Name**](link) - One-line explanation of what it evaluates and why it matters.
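
The format above can be checked mechanically. The following is a rough sketch, assuming the pattern is exactly a dash, a bold linked name, a second dash, and a one-line description; the regular expression and the `parse_entry` helper are illustrative assumptions, not tooling that ships with this repository.

```python
import re

# Assumed shape: - [**Name**](link) - One-line explanation.
ENTRY_RE = re.compile(
    r"^- \[\*\*(?P<name>[^*\]]+)\*\*\]\((?P<link>[^)\s]+)\) - (?P<desc>\S.*)$"
)


def parse_entry(line):
    """Return (name, link, description) if the line matches the format, else None."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    return match.group("name"), match.group("link"), match.group("desc")
```

A line that fails to parse is not necessarily wrong, since the format is only suggested, but a `None` result is a useful prompt to look at the entry again.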

Pull Request Checklist

I added a primary source link.
I placed the item in the most specific section available.
I kept the description concise and factual.
I checked for obvious duplicates.
I noted archival or historical status when relevant.

Scope Notes

This repository focuses on evaluations where agents interact with tools, environments, users, or production systems. Static model-only benchmarks are usually out of scope unless they are directly reused for agent evaluation.
