| diataxis_type | explanation |
|---|---|
| title | Lifecycle |
| description | The full autoresearch lifecycle from eval readiness through improvement to completion |
The full autoresearch lifecycle from start to finish. Each phase builds on the previous one.
Before the improvement loop can run, the skill needs evals.
Check: Does {skill-path}/evals/evals.json exist with 3+ eval cases?
- Yes → Proceed to Phase 2
- No → Run eval-doctor first:
/autoresearch --eval-doctor path/to/my-skill
The eval-doctor reads the skill, generates eval cases with realistic prompts and discriminating expectations, and writes evals/evals.json. Optionally creates evals/trigger-eval.json for description testing.
Output: A skill directory with well-formed evals ready for the improvement loop.
Why eval quality comes first: If evals are weak or miscalibrated, the improvement loop optimizes against a moving or misleading target. This is the "moving goalposts" problem — when both the artifact and the measurement change together, score improvements become meaningless. Locking down good evals before entering the loop ensures every score change reflects a real change in skill quality.
The first step of the improvement loop establishes where the skill currently stands.
- Create workspace:
{skill-name}-autoresearch/directory - Snapshot v0: Immutable copy of the original skill
- Copy candidate: Mutable working copy
- Run all evals: Execute each eval prompt with the skill, grade the outputs
- Compute score: Mean pass_rate across all grading results
- Log: Record baseline in results.tsv
Output: A workspace with baseline snapshot, initial score, and grading details for every eval case.
Key insight: The baseline score tells you how much room there is for improvement. A 0.90 baseline has less room than a 0.50.
Why an immutable baseline matters: The v0 snapshot is never modified. This gives you a fixed reference point for regression detection — the convergence report diffs against v0, so you always know exactly what changed. Without a stable baseline, you couldn't tell whether a score drop came from a bad improvement or from the reference itself shifting.
The core loop. For each iteration:
The improver agent:
- Reads grading.json files from the previous iteration
- Identifies failed expectations and groups them by pattern
- Reviews results.tsv for score history (avoids repeating reverted approaches)
- Modifies the candidate skill: SKILL.md, scripts, references
- Reports a changelog of what changed and why
All evals run against the modified candidate, producing new grading.json files and a new composite score.
- Score improved (
score_i > best_score) → KEEP- Snapshot the candidate to
v{i}/ - Update best version and best score
- Snapshot the candidate to
- Score didn't improve (
score_i ≤ best_score) → REVERT- Restore candidate from best snapshot
- Candidate returns to the last known-good state
Append to results.tsv: iteration number, timestamp, score, best score, action, changelog.
best_score ≥ 1.0→ Perfect. Stop.- Last 3 actions all "reverted" → Stuck. Stop.
- Otherwise → Continue to next iteration.
Output: A results.tsv tracking every iteration, snapshots of kept versions, and full eval results per iteration.
Why keep/discard beats always-forward: A naive loop would always keep changes and hope for the best. The keep/revert mechanism prevents score regression — if an iteration makes things worse, the candidate snaps back to the last known-good state. This means the best_score monotonically increases, and the improver never compounds a bad change with another bad change on top of it.
After the loop finishes (by reaching max iterations, perfect score, or stuck condition):
-
Convergence report: The convergence reporter agent analyzes results.tsv, generates a diff between v0 and the best version, lists remaining weaknesses, and recommends an action.
-
User decision: "Apply the best version (v{N}, score {S}) to the original skill? [y/n]"
- Apply: Best snapshot replaces original skill files
- Decline: Changes stay in workspace for manual review
Output: A convergence report and an explicit user decision.
Why human confirmation is required: The loop can improve a skill's eval scores, but scores don't capture everything — readability, tone, approach, and alignment with the user's intent are judgment calls. Automated systems shouldn't modify production artifacts without approval. The explicit apply/decline step ensures a human validates that the changes are actually desirable, not just higher-scoring.
After applying changes, the skill's behavior may have changed enough that its trigger description no longer matches. Use skill-creator to re-optimize:
/skill-creator optimize-description path/to/my-skillThis is optional but recommended when the skill's SKILL.md changed significantly.
Output: Updated description in SKILL.md frontmatter.
For sustained improvement, alternate between fixing evals and improving the skill:
┌──────────────┐ ┌──────────────┐
│ EVAL DOCTOR │────▶│ IMPROVEMENT │──┐
│ (fix evals) │ │ LOOP │ │
└──────────────┘ └──────────────┘ │
▲ │
│ ┌──────────────┐ │
└────│ REVIEW │◀──────────┘
│ (evals ok?) │
└──────────────┘
Each cycle through this meta-loop:
- Eval-doctor tightens evals (score may drop as expectations get harder)
- Improvement loop raises the skill to meet the new bar
- Review whether evals are still appropriate
Stop when the score on tough evals meets your quality target (typically 0.85+).
- The Autoresearch Pattern — Design philosophy
- Algorithm — Formal specification
- Expected Results — What to expect at each phase