Thanks for contributing to Awesome Agentic Evaluation.
Add resources that bear directly on agentic evaluation, not ones that cover only generic LLM evaluation.
Strong additions usually fit at least one of these categories:
- benchmarking
- environment simulation
- trajectory or process evaluation
- observability and tracing
- benchmark rigor and methodology
- production testing and regression workflows

Follow these guidelines when adding an entry:
- Prefer primary sources: official repository, official paper, or official documentation.
- Keep each entry to one link and one concise sentence whenever possible.
- Mark archived, outdated, or historically important resources clearly.
- Do not add generic "awesome AI" projects that are not evaluation-centric.
- Place entries in the most specific section that fits.
- Keep entries alphabetized within a section when practical.

Use this format for each entry:
- [**Project / Paper Name**](link) - One-line explanation of what it evaluates and why it matters.

Before submitting, confirm the following:
- I added a primary source link.
- I placed the item in the most specific section available.
- I kept the description concise and factual.
- I checked for obvious duplicates.
- I noted archival or historical status when relevant.
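
The entry format and the alphabetization guideline above can be checked locally before opening a pull request. The following is a minimal sketch, not part of this repository's tooling: the names `ENTRY_RE`, `check_entry`, and `is_alphabetized` are hypothetical, and the regular expression assumes entries follow the one-line format exactly. Because entries should have one link and one sentence only "whenever possible", treat the results as hints rather than hard failures.

```python
import re

# Hypothetical pattern for the entry format shown above:
#   - [**Name**](link) - One-line description.
ENTRY_RE = re.compile(
    r"^- \[\*\*(?P<name>[^*\]]+)\*\*\]\((?P<link>[^)\s]+)\) - (?P<desc>.+)$"
)


def check_entry(line):
    """Return a list of problems found in one entry line (empty if none)."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return ["does not match '- [**Name**](link) - description'"]
    problems = []
    if not match.group("desc").endswith("."):
        problems.append("description should end with a period")
    # More than one Markdown link target in the line suggests extra links.
    if line.count("](") > 1:
        problems.append("entry should contain one link")
    return problems


def is_alphabetized(lines):
    """Check that well-formed entries are sorted by name, ignoring case."""
    names = []
    for line in lines:
        match = ENTRY_RE.match(line.strip())
        if match is not None:
            names.append(match.group("name").casefold())
    return names == sorted(names)


if __name__ == "__main__":
    # Placeholder entries; the names and URLs are invented for illustration.
    section = [
        "- [**Alpha Bench**](https://example.com/alpha) - Benchmarks tool-using agents.",
        "- [**Beta Trace**](https://example.com/beta) - Traces agent trajectories.",
    ]
    for entry in section:
        print(entry, check_entry(entry))
    print("alphabetized:", is_alphabetized(section))
```

The checks cover only what can be verified mechanically. Whether a link is a primary source, or whether an item sits in the most specific section, still needs a human review.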

This repository focuses on evaluations where agents interact with tools, environments, users, or production systems. Static model-only benchmarks are usually out of scope unless they are directly reused for agent evaluation.