This repository treats each skill as production AI behavior, not just documentation. Changes must preserve output contracts, trigger reliability, and evaluation coverage.
- `SKILL.md` is the behavioral contract.
- References and examples support the contract but must not contradict it.
- Output formats are deterministic and intentionally strict.
- Personas must come from `.github/skills/agile-story-writer/references/personas.md`.
- Changes to prompts, examples, or evals require review because small wording changes can shift model behavior.
| Change | Examples | Required checks |
|---|---|---|
| `feat(skill)` | New invoke mode, new trigger pattern, new persona, new output field | `npm run quality`, live promptfoo eval for affected skill |
| `fix(skill)` | Correct refusal behavior, tighten AC rules, repair examples | `npm run quality`, live promptfoo eval for affected skill |
| `docs(skill)` | README, runbook, deployment guidance | `npm run quality` |
| `chore(quality)` | Tooling, workflow, eval harness | `npm run quality`; live eval when eval semantics change |
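
The table above can be read as a lookup from change type to required checks. The following is a minimal sketch, assuming the change type appears as a `type(scope):` prefix on the commit or PR subject; the function name `required_checks` is hypothetical and not part of this repository's tooling.

```python
import re

# Required checks per change type, transcribed from the table above.
REQUIRED_CHECKS = {
    "feat(skill)": ["npm run quality", "live promptfoo eval for affected skill"],
    "fix(skill)": ["npm run quality", "live promptfoo eval for affected skill"],
    "docs(skill)": ["npm run quality"],
    "chore(quality)": ["npm run quality", "live eval when eval semantics change"],
}


def required_checks(subject: str) -> list[str]:
    """Return the checks for a subject such as 'fix(skill): tighten AC rules'.

    Assumption: the subject starts with one of the change types in the table.
    """
    match = re.match(r"^(\w+\(\w+\))", subject.strip())
    if match is None or match.group(1) not in REQUIRED_CHECKS:
        raise ValueError(f"unrecognized change type in subject: {subject!r}")
    return REQUIRED_CHECKS[match.group(1)]
```

Raising on an unrecognized prefix, rather than returning an empty list, keeps an untyped change from silently skipping every check.
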
Each skill must have 8-12 automated tests in evals/*.yaml.
Every eval file should include:
- Happy path
- Missing or vague input
- Anti-pattern refusal
- Persona specificity
- Scope boundary validation
- Format preservation
- One skill-specific edge case
- One regression test from a real failure
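
The count and coverage rules above could be checked mechanically. This is a sketch only: it assumes the eval file has already been parsed into a list of dicts and that each test carries a `category` tag, which is a hypothetical convention and not something the eval format is known to define.

```python
# Hypothetical category tags, one per required coverage item listed above.
REQUIRED_CATEGORIES = [
    "happy-path",
    "missing-input",
    "anti-pattern-refusal",
    "persona-specificity",
    "scope-boundary",
    "format-preservation",
    "edge-case",
    "regression",
]


def check_eval_suite(tests: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the suite passes.

    Assumption: each test dict has a 'category' key using the tags above.
    """
    problems = []
    if not 8 <= len(tests) <= 12:
        problems.append(f"expected 8-12 tests, found {len(tests)}")
    present = {test.get("category") for test in tests}
    for category in REQUIRED_CATEGORIES:
        if category not in present:
            problems.append(f"missing category: {category}")
    return problems
```
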
Use deterministic assertions for required structure. Use llm-rubric only for quality
judgment that keyword checks cannot verify.
CI selects EVAL_MODEL from configured secrets. If EVAL_MODEL is unset, it chooses a
default from the first available provider key.
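
The fallback order described above might look like the sketch below. The provider key names and default model identifiers are placeholders invented for illustration; the actual names live in the CI configuration and are not stated here.

```python
# Hypothetical provider keys and default models, in priority order.
# Replace these placeholders with whatever the CI workflow actually uses.
PROVIDER_DEFAULTS = [
    ("PROVIDER_A_API_KEY", "provider-a:default-model"),
    ("PROVIDER_B_API_KEY", "provider-b:default-model"),
]


def select_eval_model(env: dict) -> str:
    """Pick the eval model: explicit EVAL_MODEL wins, else first provider key found."""
    explicit = env.get("EVAL_MODEL")
    if explicit:
        return explicit
    for key_name, default_model in PROVIDER_DEFAULTS:
        if env.get(key_name):
            return default_model
    raise RuntimeError("EVAL_MODEL is unset and no provider key is configured")
```

In a workflow this would typically be called as `select_eval_model(dict(os.environ))`; passing the environment in as a plain dict keeps the function easy to test.
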
Run live evals when:
- `SKILL.md` changes
- eval criteria change
- examples that steer behavior change
- supported model changes
- a drift incident is suspected
Record model, date, and failure summary in PR notes when live evals are run.
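
As a rough illustration of the triggers above, a pre-merge helper could flag when a live eval is due. The path conventions here (an `evals` directory holding `.yaml` files, an `examples` directory) are assumptions, and model changes or drift incidents have to be passed in by hand because they are not visible in a diff.

```python
from pathlib import PurePosixPath


def needs_live_eval(changed_paths, model_changed=False, drift_suspected=False):
    """Return True when any live-eval trigger listed above applies.

    Assumption: eval files live under an 'evals' directory and steering
    examples under an 'examples' directory (hypothetical layout).
    """
    if model_changed or drift_suspected:
        return True
    for raw in changed_paths:
        path = PurePosixPath(raw)
        if path.name == "SKILL.md":
            return True
        if "evals" in path.parts and path.suffix == ".yaml":
            return True
        if "examples" in path.parts:
            return True
    return False
```

This treats every change under an examples directory as behavior-steering, which errs toward running more live evals rather than fewer.
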
Do not blindly compress full SKILL.md files. Compression can remove trigger words,
weaken refusal rules, or change the exact output contract.
Safe targets:
- `AGENTS.md`
- internal notes
- duplicate prose in docs
- long examples after eval coverage exists
Unsafe targets:
- box output templates
- acceptance criteria rules
- refusal tables
- invoke-mode trigger lists
- persona names
After any compression, run npm run quality and live promptfoo evals for affected skills.
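
Before running the evals, a quick string check can catch the most obvious damage. The sketch below assumes a hand-maintained list of protected terms (trigger words, persona names, template markers) for each skill; the function name and the idea of such a list are illustrative, not existing tooling.

```python
def dropped_terms(original: str, compressed: str, protected_terms: list[str]) -> list[str]:
    """Return protected terms present in the original text but missing after compression."""
    return [
        term
        for term in protected_terms
        if term in original and term not in compressed
    ]
```

An empty result does not prove the compression is safe, since refusal rules can be weakened without removing any single term, so this complements the evals rather than replacing them.
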
- Does the change preserve deterministic box formats?
- Does it avoid generic "as a user" language?
- Does it keep `SKILL.md`, examples, rubrics, and evals aligned?
- Does every behavioral change add or update an eval?
- Does `skills.json` still match skill paths and versions?
- Does `applyTo` still describe only trigger phrases for the owning skill?
- Does `npm run quality` pass?
- Were live promptfoo evals run when behavior changed?
Run full live evals:
- before each release
- after provider model upgrades
- after significant skill rewrites
- quarterly for baseline drift detection
Compare failures against previous outputs, then tighten instructions or eval rubrics.
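
One simple way to compare runs is to diff the sets of failing test names. This sketch assumes each run can be reduced to a list of failing test identifiers; how those are extracted from eval output is left open.

```python
def compare_failures(previous: list[str], current: list[str]) -> dict:
    """Classify failing tests as new, fixed, or persistent between two eval runs."""
    prev, curr = set(previous), set(current)
    return {
        "new": sorted(curr - prev),         # failing now, passing before
        "fixed": sorted(prev - curr),       # failing before, passing now
        "persistent": sorted(curr & prev),  # failing in both runs
    }
```

New failures after a provider model upgrade are the likeliest sign of drift, while persistent ones point at instructions or rubrics that still need tightening.
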