diataxis_type	explanation
title	Convergence and Scoring
description	How composite scores are calculated and when the improvement loop converges

Convergence and Scoring

Composite Score Calculation

The score for each iteration is the mean pass_rate across all eval cases:

score = mean([grading.summary.pass_rate for each eval case])

Where each pass_rate is:

pass_rate = passed_expectations / total_expectations

Example

Three eval cases in an iteration:

Eval	Passed	Total	pass_rate
eval-1	4	6	0.667
eval-2	3	4	0.750
eval-3	5	5	1.000

Composite score: (0.667 + 0.750 + 1.000) / 3 = 0.806

Why Mean?

The mean gives equal weight to each eval case regardless of how many expectations it has. This prevents an eval with 10 expectations from dominating one with 3. Each eval case tests a distinct scenario; they should contribute equally to the overall score.

Convergence Detection

The loop tracks convergence through the keep/revert sequence and the best_score trajectory.

Perfect Convergence

best_score reaches 1.0

All expectations pass in all evals. The loop stops immediately.

Plateau

best_score stops increasing but loop hasn't hit abort condition

The score stabilized — the improver can still find changes but none beat the current best. The loop runs to max_iterations and the convergence report classifies this as "plateau."

Stuck

3 consecutive reverts

The improver tried 3 different approaches and none improved the score. This triggers an early abort. The convergence report classifies this as "stuck."

Stuck doesn't mean the skill can't be improved — it means the improver, given its current instructions and the current eval signal, can't find a better approach. Possible next steps:

Run eval-doctor to recalibrate evals
Manually edit the skill to address structural issues
Run the loop again (the improver is non-deterministic, may try new approaches)

Rapid Improvement

Most iterations kept, score rose quickly

The skill had clear issues that the improver identified and fixed. Common when the skill has missing sections, wrong output formats, or incomplete instructions.

Score Non-Determinism

LLM execution is non-deterministic. Running the same skill on the same eval twice may produce different outputs and different scores. This means:

A score of 0.78 and 0.80 are effectively the same
Small score drops (0.02-0.05) after a change may be noise, not regression
The keep/discard threshold is strict (>, not >=) to avoid accepting noise as improvement

Autoresearch accepts this by using a simple comparison: score_i > best_score. A tie is treated as no improvement, which biases toward keeping proven changes rather than accepting uncertain ones.

Implications

Don't read too much into single-iteration score changes of < 0.05
The overall trajectory (across multiple iterations) is more meaningful than any single data point
If scores bounce randomly with no trend, the evals may have subjective expectations that different grader runs evaluate inconsistently

Snapshot Rollback Safety

The keep/discard mechanism provides safety through snapshots:

Before improvement: The candidate is in a known-good state (best version so far)
After improvement: The improver modifies the candidate freely
After evaluation: If the score improved, the new state is snapshotted. If not, the candidate is restored from the best snapshot.

This means:

The candidate can never degrade below the best-known version
Bad experiments are free — they cost compute time but can't damage the artifact
Every snapshot is immutable after creation — you can always go back

The restore() function in snapshot.py does an exact restore: it copies all files from the snapshot AND removes files in the candidate that aren't in the snapshot. This prevents accumulated cruft from failed experiments.

Algorithm — Formal loop specification
The Autoresearch Pattern — Design philosophy
Expected Results — Typical score trajectories

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Convergence and Scoring

Composite Score Calculation

Example

Why Mean?

Convergence Detection

Perfect Convergence

Plateau

Stuck

Rapid Improvement

Score Non-Determinism

Implications

Snapshot Rollback Safety

Related

FilesExpand file tree

convergence-and-scoring.md

Latest commit

History

convergence-and-scoring.md

File metadata and controls

Convergence and Scoring

Composite Score Calculation

Example

Why Mean?

Convergence Detection

Perfect Convergence

Plateau

Stuck

Rapid Improvement

Score Non-Determinism

Implications

Snapshot Rollback Safety

Related