feat(workflows): add standing-eval-suite (labelled-regression eval harness) #2052
narutomugens-byte wants to merge 2 commits into
EXPERIMENTAL companion to agentic-eval-gate. Where the gate judges one diff
once with no ground truth, this scores a curated, version-controlled set of
LABELLED cases against a 5-dimension rubric (correctness, completeness,
maintainability, safety, verification), aggregates deterministically, compares
to an optional committed baseline, and gates — so quality is measured over time
("evals are the new tests" + the quality flywheel).
- .archon/workflows/experimental/standing-eval-suite.yaml — 4-node DAG:
load-suite -> score-cases (judge, medium tier) -> aggregate (deterministic
gate) -> verdict (small tier).
- .archon/scripts/eval-{load-suite,aggregate}.ts — the deterministic nodes as
NAMED scripts (inline multi-line script: nodes silently no-op on Windows; the
bun -e argv truncates at the first newline).
- .archon/evals/seed/ — 5-dim rubric (suite.json) + 3 labelled cases + README.
- Suite selected via EVAL_SUITE (default: seed). Scorecard written to artifacts;
trend appended to .archon/state/eval-history.jsonl (gitignored).
Smoke (seed, foreground): gate FAIL, overall 3.1 < 3.5 — floors tripped on
verification/completeness, missing-tests + borderline-refactor flagged. The
untested-but-correct case is correctly blocked.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
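A minimal sketch of what the deterministic aggregate step described above might look like, assuming a weighted mean across dimensions plus the two thresholds this PR names (`overall_min`, `dim_floor`). The type names, the `aggregate` function, and the sample weights and scores are hypothetical; the actual logic lives in `.archon/scripts/eval-aggregate.ts` and may differ.

```typescript
// Hypothetical shapes: the real suite.json / results.json schemas may differ.
type Scores = Record<string, number>;

interface Manifest {
  dimensions: string[];
  weights: Record<string, number>;
  thresholds: { overall_min: number; dim_floor: number };
}

interface Scorecard {
  meanByDim: Scores;
  overall: number;
  dimFloorFailures: string[];
  gate: "PASS" | "FAIL";
}

// Average each dimension across cases, then combine the means with the suite weights.
function aggregate(manifest: Manifest, cases: Scores[]): Scorecard {
  if (cases.length === 0) throw new Error("no scored cases");

  const meanByDim: Scores = {};
  for (const d of manifest.dimensions) {
    const total = cases.reduce((sum, c) => {
      const s = c[d];
      // Fail loudly rather than treating a missing score as zero.
      if (typeof s !== "number") throw new Error(`case is missing a score for "${d}"`);
      return sum + s;
    }, 0);
    meanByDim[d] = total / cases.length;
  }

  const weightSum = manifest.dimensions.reduce((s, d) => s + manifest.weights[d], 0);
  const overall =
    manifest.dimensions.reduce((s, d) => s + meanByDim[d] * manifest.weights[d], 0) / weightSum;

  // A dimension whose mean sits below the floor fails the gate on its own.
  const dimFloorFailures = manifest.dimensions.filter(
    (d) => meanByDim[d] < manifest.thresholds.dim_floor,
  );
  const gate =
    overall >= manifest.thresholds.overall_min && dimFloorFailures.length === 0 ? "PASS" : "FAIL";

  return { meanByDim, overall, dimFloorFailures, gate };
}
```

Because the function is pure (no clock, no randomness, no model call), the same `results.json` always yields the same verdict, which is the property the commit message relies on when it calls this node a deterministic gate.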
📝 Walkthrough
This PR adds a standing eval-suite harness, its seed suite configuration and cases, workflow wiring, and documentation, plus a separate golden eval suite with baseline data and three evaluation cases.
Changes
- Standing Eval Suite
- Golden Eval Suite
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 3
♻️ Duplicate comments (1)
.archon/scripts/eval-aggregate.ts (1)
16-16: 🔒 Security & Privacy | 🟡 Minor | ⚡ Quick win
Same unsanitized `EVAL_SUITE` path issue as `eval-load-suite.ts`.
See companion comment in `eval-load-suite.ts:19-22`; the same env-var-to-path concern applies here.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.archon/scripts/eval-aggregate.ts at line 16, The eval suite path in evalAggregate is built directly from the EVAL_SUITE value, so sanitize or validate suite before passing it to join and restrict it to an allowed suite name/path fragment. Update the logic in evalAggregate to mirror the safer handling used in evalLoadSuite so untrusted env input cannot escape the intended .archon/evals directory.
🧹 Nitpick comments (1)
.archon/evals/README.md (1)
6-11: 📐 Maintainability & Code Quality | 🔵 Trivial
Add language specifier to directory tree code block.
The fenced code block is missing a language tag, triggering MD040. Use `text` (or `bash` if you prefer executable syntax) to silence the warning.
+```text
.archon/evals//
suite.json # rubric dimensions + weights + thresholds (JSON — bun reads it dep-free)
cases/*.yaml # one labelled case each (YAML — the AI judge reads these natively)
baseline.json # OPTIONAL, COMMITTED: blessed mean_by_dim from an accepted run
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.archon/evals/README.md around lines 6 - 11, The README fenced directory tree block is missing a language tag and triggers MD040; update the code fence in the evals documentation to use a language specifier such as text (or bash if you want executable-style syntax). Make the change in the markdown snippet that documents the suite layout so the fenced block is properly annotated and the warning is silenced.
Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.archon/scripts/eval-aggregate.ts:
- Line 23: The `eval-aggregate.ts` gate currently trusts `suite.json` too much
while `results.json` is validated, so make `manifest` validation fail loudly
before any scoring or threshold checks run. In `main` (where `manifest` is
parsed) and the logic that uses `manifest.dimensions`, `weights`, and
`thresholds`, add explicit validation that required fields exist and are
well-formed, especially ensuring every declared dimension has a corresponding
weight and both `thresholds.dim_floor` and `thresholds.overall_min` are present.
If validation fails, throw or exit with a clear error instead of falling back to
`weights[d] || 0` or comparing against `undefined`.
In @.archon/scripts/eval-load-suite.ts:
- Around line 19-22: The `EVAL_SUITE` value is used directly in path
construction without validation, so update `eval-load-suite` to sanitize or
whitelist `suite` before passing it to `join(...)` and only allow a plain suite
identifier. Apply the same protection in `eval-aggregate` as well, using the
relevant suite-handling logic there, so both entry points reject path separators
or `..` traversal and only resolve within `.archon/evals/`.
In @.archon/workflows/experimental/standing-eval-suite.yaml:
- Around line 16-18: The `score-cases` workflow is over-permissioned: its
`allowed_tools` grants `Bash` and unscoped `Write` even though it only needs to
read `queue.txt`/case files and emit `results.json`. Update the `score-cases`
configuration in the standing eval suite so it follows the read-only design and
least-privilege intent, keeping only the minimal tool access required; use the
`score-cases` block and its `allowed_tools` list as the place to fix this.
---
Duplicate comments:
In @.archon/scripts/eval-aggregate.ts:
- Line 16: The eval suite path in evalAggregate is built directly from the
EVAL_SUITE value, so sanitize or validate suite before passing it to join and
restrict it to an allowed suite name/path fragment. Update the logic in
evalAggregate to mirror the safer handling used in evalLoadSuite so untrusted
env input cannot escape the intended .archon/evals directory.
---
Nitpick comments:
In @.archon/evals/README.md:
- Around line 6-11: The README fenced directory tree block is missing a language
tag and triggers MD040; update the code fence in the evals documentation to use
a language specifier such as text (or bash if you want executable-style syntax).
Make the change in the markdown snippet that documents the suite layout so the
fenced block is properly annotated and the warning is silenced.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 52436635-a821-4dca-870e-bd5820a60dfa
📒 Files selected for processing (8)
- .archon/evals/README.md
- .archon/evals/seed/cases/borderline-refactor.yaml
- .archon/evals/seed/cases/good-input-validation.yaml
- .archon/evals/seed/cases/missing-tests.yaml
- .archon/evals/seed/suite.json
- .archon/scripts/eval-aggregate.ts
- .archon/scripts/eval-load-suite.ts
- .archon/workflows/experimental/standing-eval-suite.yaml
  process.exit(1);
}

const manifest = JSON.parse(readFileSync(join(dir, 'suite.json'), 'utf8'));
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
`suite.json` (manifest) is trusted without validation, unlike `results.json`.
The script fails loudly on a malformed `results.json` but applies no equivalent rigor to `manifest.dimensions`/`weights`/`thresholds`. Concretely:
- Line 59: `weights[d] || 0` silently zero-weights a dimension if `weights` is missing that key in `suite.json`, skewing `overall` without any warning.
- Lines 67 and 83-88: if `thresholds.dim_floor`/`overall_min` are absent, comparisons against `undefined` are always `false`. `dim_floor_failures` would never trigger (false negative), while the `overall_min` check would always fail (false positive), producing an inconsistent, hard-to-diagnose gate result for a misconfigured suite.
Given this script is the authoritative gate, a misconfigured `suite.json` should fail loudly the same way a malformed `results.json` does.
🛡️ Proposed fix
 const manifest = JSON.parse(readFileSync(join(dir, 'suite.json'), 'utf8'));
+for (const key of ['dimensions', 'weights', 'thresholds']) {
+  if (!manifest[key]) {
+    console.error(`suite.json is missing required field "${key}" — gate FAILS.`);
+    process.exit(1);
+  }
+}
+const dims0: string[] = manifest.dimensions;
+for (const d of dims0) {
+  if (typeof manifest.weights[d] !== 'number') {
+    console.error(`suite.json weights is missing dimension "${d}" — gate FAILS.`);
+    process.exit(1);
+  }
+}
+for (const k of ['overall_min', 'dim_floor', 'case_min', 'regression_tolerance']) {
+  if (typeof manifest.thresholds[k] !== 'number') {
+    console.error(`suite.json thresholds is missing "${k}" — gate FAILS.`);
+    process.exit(1);
+  }
+}
Also applies to: 41-43, 59-59, 67-67, 82-88
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.archon/scripts/eval-aggregate.ts at line 23, The `eval-aggregate.ts` gate
currently trusts `suite.json` too much while `results.json` is validated, so
make `manifest` validation fail loudly before any scoring or threshold checks
run. In `main` (where `manifest` is parsed) and the logic that uses
`manifest.dimensions`, `weights`, and `thresholds`, add explicit validation that
required fields exist and are well-formed, especially ensuring every declared
dimension has a corresponding weight and both `thresholds.dim_floor` and
`thresholds.overall_min` are present. If validation fails, throw or exit with a
clear error instead of falling back to `weights[d] || 0` or comparing against
`undefined`.
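As a rough illustration of the validation this comment asks for, the sketch below collects every problem in a parsed manifest instead of exiting on the first one. The `validateManifest` function and its return shape are hypothetical, and the required threshold names are taken from the proposed fix above; the real `suite.json` schema may require more or fewer fields.

```typescript
// Hypothetical validator: returns a list of problems rather than exiting,
// so the calling script decides how to fail. Field names follow the review comment.
const REQUIRED_THRESHOLDS = ["overall_min", "dim_floor", "case_min", "regression_tolerance"];

function validateManifest(manifest: unknown): string[] {
  if (typeof manifest !== "object" || manifest === null) {
    return ["suite.json must be a JSON object"];
  }
  const problems: string[] = [];
  const m = manifest as Record<string, unknown>;

  const dims = m.dimensions;
  if (!Array.isArray(dims) || dims.length === 0) {
    problems.push('"dimensions" must be a non-empty array');
  }

  // Every declared dimension needs a numeric weight; no silent `|| 0` fallback.
  const weights = (m.weights ?? {}) as Record<string, unknown>;
  if (Array.isArray(dims)) {
    for (const d of dims) {
      const w = weights[String(d)];
      if (typeof w !== "number" || !Number.isFinite(w)) {
        problems.push(`"weights" has no numeric entry for dimension "${d}"`);
      }
    }
  }

  // Every threshold must be a number so no gate check compares against undefined.
  const thresholds = (m.thresholds ?? {}) as Record<string, unknown>;
  for (const k of REQUIRED_THRESHOLDS) {
    const t = thresholds[k];
    if (typeof t !== "number" || !Number.isFinite(t)) {
      problems.push(`"thresholds.${k}" must be a number`);
    }
  }
  return problems;
}
```

A caller could print each problem and exit non-zero when the list is non-empty, which keeps the "fail loudly" behaviour while reporting all misconfigurations in one run.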
const suite = process.env.EVAL_SUITE || 'seed';
const dir = join(process.cwd(), '.archon', 'evals', suite);
const casesDir = join(dir, 'cases');
const manifestPath = join(dir, 'suite.json');
🔒 Security & Privacy | 🟡 Minor | ⚡ Quick win
Unvalidated `EVAL_SUITE` env var used directly in path construction.
`suite` flows straight into `join(process.cwd(), '.archon', 'evals', suite)` with no check that it's a plain identifier. A value containing `..` or `/` would let the script read manifests/cases outside `.archon/evals/`. Same pattern exists independently in `eval-aggregate.ts` (line 16).
🛡️ Proposed fix
-const suite = process.env.EVAL_SUITE || 'seed';
+const suite = process.env.EVAL_SUITE || 'seed';
+if (!/^[\w-]+$/.test(suite)) {
+  console.error(`Invalid EVAL_SUITE "${suite}": must be alphanumeric/dash/underscore only`);
+  process.exit(1);
+}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
const suite = process.env.EVAL_SUITE || 'seed';
if (!/^[\w-]+$/.test(suite)) {
  console.error(`Invalid EVAL_SUITE "${suite}": must be alphanumeric/dash/underscore only`);
  process.exit(1);
}
const dir = join(process.cwd(), '.archon', 'evals', suite);
const casesDir = join(dir, 'cases');
const manifestPath = join(dir, 'suite.json');
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.archon/scripts/eval-load-suite.ts around lines 19 - 22, The `EVAL_SUITE`
value is used directly in path construction without validation, so update
`eval-load-suite` to sanitize or whitelist `suite` before passing it to
`join(...)` and only allow a plain suite identifier. Apply the same protection
in `eval-aggregate` as well, using the relevant suite-handling logic there, so
both entry points reject path separators or `..` traversal and only resolve
within `.archon/evals/`.
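One way both scripts could share the same protection is a small helper, sketched here under the assumption that a suite name is a plain identifier. `resolveSuiteDir` is a hypothetical name, not something in the PR; the identifier regex is the one from the proposed fix, and the extra containment check is an added belt-and-braces step.

```typescript
import { resolve, sep } from "node:path";

// Hypothetical shared helper for eval-load-suite.ts and eval-aggregate.ts.
// Resolves a suite name to its directory, rejecting anything that is not a
// plain identifier or that would land outside .archon/evals/.
function resolveSuiteDir(rawSuite: string | undefined, cwd: string = process.cwd()): string {
  const suite = rawSuite || "seed";

  // Same pattern as the proposed fix: letters, digits, underscore, dash only.
  if (!/^[\w-]+$/.test(suite)) {
    throw new Error(`Invalid EVAL_SUITE "${suite}": must be alphanumeric/dash/underscore only`);
  }

  const root = resolve(cwd, ".archon", "evals");
  const dir = resolve(root, suite);

  // Defence in depth: the resolved path must still sit under the evals root.
  if (!dir.startsWith(root + sep)) {
    throw new Error(`EVAL_SUITE "${suite}" resolves outside ${root}`);
  }
  return dir;
}
```

Each script would then call `resolveSuiteDir(process.env.EVAL_SUITE)` once and derive `casesDir` and `manifestPath` from the result, so the check cannot drift between the two entry points.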
v1 SCOPE (this file): scores PRE-SUPPLIED candidates carried in each case
(`candidate:`). Read-only — never edits the repo. Deterministic aggregation;
the LLM judge only scores 1-5 per dimension.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
score-cases grants Bash (and unscoped Write) beyond what the task needs.
The prompt only requires reading queue.txt/case files and writing one results.json, yet allowed_tools includes Bash. This runs against the live checkout (worktree.enabled: false), so this contradicts the stated "Read-only — never edits the repo" design (Lines 16-18) and exceeds least privilege.
🛡️ Proposed fix
- allowed_tools: [Read, Write, Bash]
+ allowed_tools: [Read, Write]
Also applies to: 52-96
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.archon/workflows/experimental/standing-eval-suite.yaml around lines 16 -
18, The `score-cases` workflow is over-permissioned: its `allowed_tools` grants
`Bash` and unscoped `Write` even though it only needs to read `queue.txt`/case
files and emit `results.json`. Update the `score-cases` configuration in the
standing eval suite so it follows the read-only design and least-privilege
intent, keeping only the minimal tool access required; use the `score-cases`
block and its `allowed_tools` list as the place to fix this.
A second dataset for standing-eval-suite that acts as the standing QUALITY BAR (stricter thresholds: overall_min 4.0, dim_floor 3.5, case_min 3). Three strong, well-tested cases (validated config loader, safe user lookup, retry-with-backoff).
baseline.json is blessed from a PASS run (all dims 5.0), so the regression check (any dim dropping > 0.5 below baseline -> gate FAIL) is now exercised.
Verified deterministically: feeding a result with safety 5.0->4.0 yields gate FAIL with regressions: [safety], even though all absolute thresholds still pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
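The regression rule in this commit message (any dimension dropping more than 0.5 below its baseline mean fails the gate) can be sketched as follows. The `findRegressions` name and the `MeanByDim` type are hypothetical, and treating a dimension that is absent from the current run as regressed is an assumption of this sketch, not confirmed behaviour of `eval-aggregate.ts`.

```typescript
// Hypothetical shape: dimension name -> mean score across the suite's cases.
type MeanByDim = Record<string, number>;

// Returns the dimensions whose current mean fell more than `tolerance`
// below the blessed baseline mean.
function findRegressions(baseline: MeanByDim, current: MeanByDim, tolerance: number): string[] {
  return Object.keys(baseline).filter((dim) => {
    const now = current[dim];
    // Assumption: a dimension missing from the current run counts as regressed.
    if (typeof now !== "number") return true;
    return baseline[dim] - now > tolerance;
  });
}

// The scenario from the commit message: all dims blessed at 5.0, safety drops to 4.0.
const baseline: MeanByDim = {
  correctness: 5, completeness: 5, maintainability: 5, safety: 5, verification: 5,
};
const current: MeanByDim = { ...baseline, safety: 4 };
const regressions = findRegressions(baseline, current, 0.5);
console.log(regressions); // prints [ 'safety' ]
```

A safety mean of 4.0 still clears the golden suite's absolute thresholds (`dim_floor` 3.5, `overall_min` 4.0), so only the comparison against the baseline catches the drop.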
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.archon/evals/golden/cases/safe-user-lookup.yaml:
- Around line 16-26: The getUser snippet is incomplete because it uses
db.query<User>(...) and the User type without importing either dependency, so
make the embedded candidate code self-contained by adding the missing imports
for db and User in get-user.ts. Keep the function name getUser and the existing
query logic unchanged, but ensure every referenced symbol is explicitly imported
so the golden case is compile-ready and consistent with the other examples.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e04e634e-40c9-4b35-b8e6-418b9d939328
📒 Files selected for processing (5)
- .archon/evals/golden/baseline.json
- .archon/evals/golden/cases/retry-with-backoff.yaml
- .archon/evals/golden/cases/safe-user-lookup.yaml
- .archon/evals/golden/cases/validated-config-loader.yaml
- .archon/evals/golden/suite.json
✅ Files skipped from review due to trivial changes (1)
- .archon/evals/golden/baseline.json
candidate: |
  // src/users/get-user.ts
  export async function getUser(id: string): Promise<User | null> {
    if (!id || !id.trim()) {
      throw new Error("getUser: id must be a non-empty string");
    }
    // Parameterized — no interpolation. A missing row yields undefined -> null.
    // Any DB/connection error throws out of db.query and propagates to the caller.
    const row = await db.query<User>("SELECT * FROM users WHERE id = $1", [id]);
    return row ?? null;
  }
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Candidate snippet references db and User without importing them.
Unlike the other two golden cases (validated-config-loader.yaml, retry-with-backoff.yaml), which import every dependency they use, this candidate uses db.query<User>(...) and the User type with no corresponding import statement. Since this case is meant to be the blessed "GOOD — correct, complete" reference, the embedded code itself should be complete/compilable to set a consistent quality bar across golden cases.
🩹 Proposed fix
 candidate: |
   // src/users/get-user.ts
+  import { db } from "../db";
+  import type { User } from "./types";
+
   export async function getUser(id: string): Promise<User | null> {
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
candidate: |
  // src/users/get-user.ts
  import { db } from "../db";
  import type { User } from "./types";

  export async function getUser(id: string): Promise<User | null> {
    if (!id || !id.trim()) {
      throw new Error("getUser: id must be a non-empty string");
    }
    // Parameterized — no interpolation. A missing row yields undefined -> null.
    // Any DB/connection error throws out of db.query and propagates to the caller.
    const row = await db.query<User>("SELECT * FROM users WHERE id = $1", [id]);
    return row ?? null;
  }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.archon/evals/golden/cases/safe-user-lookup.yaml around lines 16 - 26, The
getUser snippet is incomplete because it uses db.query<User>(...) and the User
type without importing either dependency, so make the embedded candidate code
self-contained by adding the missing imports for db and User in get-user.ts.
Keep the function name getUser and the existing query logic unchanged, but
ensure every referenced symbol is explicitly imported so the golden case is
compile-ready and consistent with the other examples.
Calibration check: is the judge measuring real quality?
Before treating a blessed baseline as a real bar, I validated that the
1. Gradation: an independent annotator scored the 6 shipped cases across all 5 dimensions; the judge landed within ±1 on 30/30 dim-cells (max diff 1).
2. Substance vs. theater (blinded): 3 engineered hard-negatives whose
The rationales confirm genuine detection, not pattern-matching: it flagged the slugify tests as checks that "pass for any string-returning function," and traced the median formula to "returns undefined for empty arrays … passes only by coincidence on a pre-sorted odd-length input." Scores were ~identical whether or not the reference leaked the answer, so the leak wasn't driving detection.
Conclusion: the judge reads substance. One observed limitation: it awards a flat 5 to genuinely-good code, so the all-5
Method note for future case authors: the seed/golden
Summary
- `agentic-eval-gate` judges a single diff once, with no ground truth; there is no standing way to measure quality over time against labelled examples.
- `standing-eval-suite`: a 4-node workflow that scores labelled cases on a 5-dimension rubric, aggregates deterministically, optionally compares to a committed baseline, and gates; two named scripts; and two datasets, `seed` (demo: includes deliberately-failing cases) and `golden` (a standing quality bar with a blessed `baseline.json`).
- No `packages/` touched; no engine/schema/bundled-defaults changes. Experimental workflow only (lives in `.archon/workflows/experimental/`).
- `agentic-eval-gate` and all existing workflows untouched.
UX Journey
After
Architecture Diagram
After
Connection inventory:
Label Snapshot
- risk: low
- size: M
- workflows
- workflows:experimental
Change Metadata
- feature
- workflows
Linked Issue
agentic-eval-gate)
Validation Evidence (required)
This PR adds NO TypeScript under `packages/` and no bundled defaults; it is workflow YAML + named scripts + a dataset. The standard `bun run validate` (type-check/lint/test/bundled checks) targets the engine and is not exercised by these files. Validation here is workflow-validation + an end-to-end smoke:
- `missing-tests` case (correct code, zero tests) is correctly blocked, proving the verification dimension bites, and the golden baseline proves regression detection.
- `bun run validate`: no engine/TS/bundled files changed.
Security Impact (required)
- `mutates_checkout: false`, no worktree); judge node restricted to `[Read, Write, Bash]`, verdict to `[]`.
- `$ARTIFACTS_DIR` and the gitignored `.archon/state/`.
Compatibility / Migration
- `EVAL_SUITE` env var selects a suite; default `seed`).
Human Verification (required)
- `seed` run end-to-end; confirmed `queue.txt`/`results.json`/`scorecard.json` written and a trend line appended to `.archon/state/eval-history.jsonl`.
- `results.json` → aggregate fails loudly (gate FAIL, not silent pass); a case missing a dimension score → fails loudly. Discovered and worked around an Archon-on-Windows bug where inline multi-line `script:` nodes silently no-op (`bun -e` argv truncates at the first newline), hence named scripts.
Side Effects / Blast Radius (required)
- `.archon/scripts/` namespace (prefixed `eval-` to avoid collision).
Rollback Plan (required)
- `.archon/workflows/experimental/standing-eval-suite.yaml`, `.archon/scripts/eval-load-suite.ts`, `.archon/scripts/eval-aggregate.ts`, `.archon/evals/`): no other code references them.
- `archon validate workflows` would flag a broken YAML; a run that can't find its dataset fails at `load-suite`.
Risks and Mitigations
- `.archon/evals/README.md`.
🤖 Generated with Claude Code
Summary by CodeRabbit