feat(workflows): add standing-eval-suite (labelled-regression eval harness) #2052

Open
narutomugens-byte wants to merge 2 commits into coleam00:dev from narutomugens-byte:feat/standing-eval-suite
@narutomugens-byte narutomugens-byte commented Jun 30, 2026

Summary

  • Problem: agentic-eval-gate judges a single diff once, with no ground truth — there is no standing way to measure quality over time against labelled examples.
  • Why it matters: the vibe-coding SDLC's "evals are the new tests" + quality flywheel needs a curated, version-controlled suite that re-scores on every run and catches regressions.
  • What changed: adds standing-eval-suite — a 4-node workflow that scores labelled cases on a 5-dimension rubric, aggregates deterministically, optionally compares to a committed baseline, and gates; two named scripts; and two datasets — seed (demo: includes deliberately-failing cases) and golden (a standing quality bar with a blessed baseline.json).
  • What did NOT change (scope boundary): no TS in packages/ touched; no engine/schema/bundled-defaults changes. Experimental workflow only (lives in .archon/workflows/experimental/). agentic-eval-gate and all existing workflows untouched.

UX Journey

After

Operator                         Archon (standing-eval-suite)
────────                         ────────────────────────────
EVAL_SUITE=seed                  load-suite (script)   → writes case queue
  archon workflow run ────────▶  score-cases (medium)  → judges each case 1-5 on 5 dims
  standing-eval-suite            aggregate (script)    → weighted means, floors, baseline diff → scorecard
                                 verdict (small)        → human-readable PASS/FAIL
sees gate decision ◀──────────  scorecard.json (artifacts) + trend line (.archon/state)
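
The aggregate step in the journey above can be sketched roughly as follows. This is a minimal illustration, not the code in eval-aggregate.ts: the type shapes, the function names, and the rule that a case fails when its own weighted score falls below `case_min` are all assumptions. Only the threshold names (`overall_min`, `dim_floor`, `case_min`) come from this PR.

```typescript
// Hypothetical shapes; the real eval-aggregate.ts may differ.
type Scores = Record<string, number>; // dimension -> score (1..5)

interface Thresholds {
  overall_min: number;
  dim_floor: number;
  case_min: number;
}

interface Scorecard {
  meanByDim: Scores;
  overall: number;
  dimFloorFailures: string[];
  caseFailures: string[];
  gate: "PASS" | "FAIL";
}

// Weighted mean over the dimensions named in `weights`; throws on a missing score.
function weightedMean(scores: Scores, weights: Scores): number {
  let total = 0;
  let weightSum = 0;
  for (const [dim, w] of Object.entries(weights)) {
    const s = scores[dim];
    if (typeof s !== "number") throw new Error(`missing score for "${dim}"`);
    total += s * w;
    weightSum += w;
  }
  if (weightSum <= 0) throw new Error("weights must sum to a positive number");
  return total / weightSum;
}

function aggregate(cases: Record<string, Scores>, weights: Scores, t: Thresholds): Scorecard {
  const ids = Object.keys(cases);
  if (ids.length === 0) throw new Error("no scored cases");

  // Per-dimension mean across all cases.
  const meanByDim: Scores = {};
  for (const dim of Object.keys(weights)) {
    let sum = 0;
    for (const id of ids) {
      const s = cases[id][dim];
      if (typeof s !== "number") throw new Error(`case "${id}" is missing "${dim}"`);
      sum += s;
    }
    meanByDim[dim] = sum / ids.length;
  }

  const overall = weightedMean(meanByDim, weights);
  const dimFloorFailures = Object.keys(weights).filter((d) => meanByDim[d] < t.dim_floor);
  // Assumption: a case fails when its own weighted score is below case_min.
  const caseFailures = ids.filter((id) => weightedMean(cases[id], weights) < t.case_min);
  const pass =
    overall >= t.overall_min && dimFloorFailures.length === 0 && caseFailures.length === 0;
  return { meanByDim, overall, dimFloorFailures, caseFailures, gate: pass ? "PASS" : "FAIL" };
}
```

Because every input is a plain number and there is no model call, the same results file always yields the same gate decision.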

Architecture Diagram

After

.archon/evals/<suite>/           .archon/workflows/experimental/        .archon/scripts/
  suite.json        ───────────▶  standing-eval-suite.yaml  ──run──▶  eval-load-suite.ts [+]
  cases/*.yaml      ──read by──▶    (load → score → aggregate → verdict)  eval-aggregate.ts [+]
  baseline.json (opt)
                                  outputs: $ARTIFACTS_DIR/eval/scorecard.json [+]
                                           .archon/state/eval-history.jsonl  [+] (gitignored)

Connection inventory:

| From | To | Status | Notes |
| --- | --- | --- | --- |
| standing-eval-suite.yaml | eval-load-suite.ts | new | named script (bun run) |
| standing-eval-suite.yaml | eval-aggregate.ts | new | named script (bun run) |
| eval-load-suite.ts | .archon/evals/ | new | reads dataset (cwd-relative) |
| eval-aggregate.ts | .archon/state/eval-history.jsonl | new | appends trend (gitignored) |
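
For the last row, appending a trend line to a JSONL file might look like the sketch below. The `TrendEntry` fields and both function names are hypothetical; the actual record written by eval-aggregate.ts is not shown in this PR page. The demo writes to a temporary directory rather than `.archon/state/`.

```typescript
import { appendFileSync, mkdirSync, mkdtempSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical record shape: one JSON object per run, one run per line.
interface TrendEntry {
  suite: string;
  gate: "PASS" | "FAIL";
  overall: number;
  at: string; // ISO timestamp
}

// Append one line; create the parent directory on first use.
function appendTrend(historyPath: string, entry: TrendEntry): void {
  mkdirSync(dirname(historyPath), { recursive: true });
  appendFileSync(historyPath, JSON.stringify(entry) + "\n", "utf8");
}

// Read the history back, skipping blank lines.
function readTrend(historyPath: string): TrendEntry[] {
  return readFileSync(historyPath, "utf8")
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as TrendEntry);
}

// Demo against a throwaway directory.
const demoPath = join(mkdtempSync(join(tmpdir(), "eval-history-")), "state", "eval-history.jsonl");
appendTrend(demoPath, { suite: "seed", gate: "FAIL", overall: 3.1, at: new Date().toISOString() });
appendTrend(demoPath, { suite: "golden", gate: "PASS", overall: 5.0, at: new Date().toISOString() });
console.log(readTrend(demoPath).map((e) => `${e.suite}:${e.gate}`).join(" "));
```

Append-only JSONL keeps each run independent: a truncated or corrupt line affects one entry rather than the whole history.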

Label Snapshot

  • Risk: risk: low
  • Size: size: M
  • Scope: workflows
  • Module: workflows:experimental

Change Metadata

  • Change type: feature
  • Primary scope: workflows

Linked Issue

  • Related # (none — extends the agentic-engineering eval work alongside agentic-eval-gate)

Validation Evidence (required)

This PR adds NO TypeScript under packages/ and no bundled defaults — it is workflow YAML + named scripts + a dataset. The standard bun run validate (type-check/lint/test/bundled checks) targets the engine and is not exercised by these files. Validation here is workflow-validation + an end-to-end smoke:

bun run cli validate workflows standing-eval-suite
#   standing-eval-suite   ok        (1 valid, 0 with errors)

EVAL_SUITE=seed bun run cli workflow run standing-eval-suite --no-worktree
#   gate FAIL, overall 3.1 < 3.5
#   dim_floor_failures: [completeness (2.667), verification (2.333)]
#   case_failures: [missing-tests, borderline-refactor]

EVAL_SUITE=golden bun run cli workflow run standing-eval-suite --no-worktree
#   gate PASS, overall 5.0 — all 3 strong cases scored 5/5; baseline blessed from this run

# Regression path (deterministic, no AI) — feed eval-aggregate a result with safety 5.0->4.0:
#   gate FAIL, baseline_used: true, regressions: [{dim: safety, baseline: 5, current: 4}]
#   (absolute thresholds all still pass — only the >0.5 baseline drop trips it)
  • Evidence provided: workflow-validation + both foreground runs (seed FAIL, golden PASS) + a deterministic regression-path check. The missing-tests case (correct code, zero tests) is correctly blocked — proving the verification dimension bites — and the golden baseline proves regression detection.
  • Intentionally skipped: full bun run validate — no engine/TS/bundled files changed.
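
The regression path described above (a dimension dropping more than the tolerance below its baseline mean) can be sketched as below. The function and type names are hypothetical; the strict `>` comparison follows the "only the >0.5 baseline drop trips it" wording, and the output fields mirror the `{dim, baseline, current}` record shown in the evidence.

```typescript
interface Regression {
  dim: string;
  baseline: number;
  current: number;
}

// Report every dimension whose current mean dropped more than `tolerance` below baseline.
function findRegressions(
  baseline: Record<string, number>,
  current: Record<string, number>,
  tolerance: number,
): Regression[] {
  const out: Regression[] = [];
  for (const [dim, base] of Object.entries(baseline)) {
    const cur = current[dim];
    if (typeof cur !== "number") throw new Error(`current run has no mean for "${dim}"`);
    if (base - cur > tolerance) out.push({ dim, baseline: base, current: cur });
  }
  return out;
}

// The safety 5.0 -> 4.0 case from the evidence, with a tolerance of 0.5.
const demoRegressions = findRegressions(
  { correctness: 5, safety: 5 },
  { correctness: 5, safety: 4 },
  0.5,
);
console.log(JSON.stringify(demoRegressions));
```

A drop of exactly the tolerance does not count as a regression under this reading.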

Security Impact (required)

  • New permissions/capabilities? No — read-only workflow (mutates_checkout: false, no worktree); judge node restricted to [Read, Write, Bash], verdict to [].
  • New external network calls? No.
  • Secrets/tokens handling changed? No.
  • File system access scope changed? No — scripts read the cwd dataset and write only to $ARTIFACTS_DIR and the gitignored .archon/state/.

Compatibility / Migration

  • Backward compatible? Yes — purely additive; no existing workflow or schema changed.
  • Config/env changes? No (optional EVAL_SUITE env var selects a suite; default seed).
  • Database migration needed? No.

Human Verification (required)

  • Verified scenarios: foreground seed run end-to-end; confirmed queue.txt/results.json/scorecard.json written and a trend line appended to .archon/state/eval-history.jsonl.
  • Edge cases checked: empty/missing results.json → aggregate fails loudly (gate FAIL, not silent pass); a case missing a dimension score → fails loudly. Discovered and worked around an Archon-on-Windows bug where inline multi-line script: nodes silently no-op (bun -e argv truncates at the first newline) — hence named scripts.
  • Also verified: golden suite gate PASS (overall 5.0) and the regression-vs-baseline path (deterministic check above) — a single dimension dropping past tolerance flips the gate.
  • What was not verified: behavior on a very large suite (single-judge-pass scope; loop-based per-case fan-out is the documented v2).

Side Effects / Blast Radius (required)

  • Affected subsystems/workflows: none existing — new experimental workflow + two new scripts + a new dataset dir.
  • Potential unintended effects: none on other workflows; the new named scripts share the .archon/scripts/ namespace (prefixed eval- to avoid collision).
  • Guardrails/monitoring: aggregate fails fast on any malformed/empty judge output; deterministic gate is authoritative (the small-model verdict only narrates).

Rollback Plan (required)

  • Fast rollback: delete the four added paths (.archon/workflows/experimental/standing-eval-suite.yaml, .archon/scripts/eval-load-suite.ts, .archon/scripts/eval-aggregate.ts, .archon/evals/) — no other code references them.
  • Feature flags/toggles: none needed; it only runs when explicitly invoked.
  • Observable failure symptoms: archon validate workflows would flag a broken YAML; a run that can't find its dataset fails at load-suite.

Risks and Mitigations

  • Risk: LLM-judge scoring is non-deterministic run-to-run.
    • Mitigation: deterministic aggregation + thresholds are authoritative; v2 adds N-vote judging. Documented in .archon/evals/README.md.
  • Risk: single judge pass may drop cases on a very large suite.
    • Mitigation: scoped to small curated suites for v1; aggregate fails loudly if a case is unscored; loop-based fan-out is the documented v2 path.
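
The "fails loudly if a case is unscored" mitigation amounts to comparing the case queue against the judge's output before any scoring. A minimal sketch, assuming the queue is a list of case ids and the results are keyed by case id (both shapes, and the function name, are hypothetical):

```typescript
// Throw if any queued case id has no entry in the judge's results.
function assertAllScored(queue: string[], results: Record<string, unknown>): void {
  const missing = queue.filter(
    (id) => !Object.prototype.hasOwnProperty.call(results, id),
  );
  if (missing.length > 0) {
    throw new Error(`unscored cases: ${missing.join(", ")}`);
  }
}
```

Using an own-property check avoids treating inherited keys such as `toString` as scored cases.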

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added support for a standing evaluation suite workflow with deterministic suite loading, automated case scoring, and pass/fail aggregation with historical scorecards.
    • Added new seed eval cases focused on input validation, safe error handling, missing-test enforcement, and refactor edge cases.
    • Added a golden eval suite and baseline updates for consistent scoring.
  • Documentation
    • Added guidance for selecting and running standing eval suites, including directory structure, case formats, scoring/threshold rules, baseline blessing, and roadmap notes.

…rness)

EXPERIMENTAL companion to agentic-eval-gate. Where the gate judges one diff
once with no ground truth, this scores a curated, version-controlled set of
LABELLED cases against a 5-dimension rubric (correctness, completeness,
maintainability, safety, verification), aggregates deterministically, compares
to an optional committed baseline, and gates — so quality is measured over time
("evals are the new tests" + the quality flywheel).

- .archon/workflows/experimental/standing-eval-suite.yaml — 4-node DAG:
  load-suite -> score-cases (judge, medium tier) -> aggregate (deterministic
  gate) -> verdict (small tier).
- .archon/scripts/eval-{load-suite,aggregate}.ts — the deterministic nodes as
  NAMED scripts (inline multi-line script: nodes silently no-op on Windows; the
  bun -e argv truncates at the first newline).
- .archon/evals/seed/ — 5-dim rubric (suite.json) + 3 labelled cases + README.
- Suite selected via EVAL_SUITE (default: seed). Scorecard written to artifacts;
  trend appended to .archon/state/eval-history.jsonl (gitignored).

Smoke (seed, foreground): gate FAIL, overall 3.1 < 3.5 — floors tripped on
verification/completeness, missing-tests + borderline-refactor flagged. The
untested-but-correct case is correctly blocked.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
coderabbitai Bot commented Jun 30, 2026

📝 Walkthrough

This PR adds a standing eval-suite harness, its seed suite configuration and cases, workflow wiring, and documentation, plus a separate golden eval suite with baseline data and three evaluation cases.

Changes

Standing Eval Suite

| Layer | File(s) | Summary |
| --- | --- | --- |
| Suite data | .archon/evals/seed/suite.json, .archon/evals/seed/cases/*.yaml | Adds the seed suite manifest and three seed eval cases for refactor, input validation, and missing-tests scenarios. |
| Suite loading | .archon/scripts/eval-load-suite.ts | Validates suite inputs, builds the case queue, and prints a JSON summary. |
| Aggregation and gate | .archon/scripts/eval-aggregate.ts | Computes per-dimension means, weighted overall score, gate status, scorecard output, and eval history. |
| Workflow wiring | .archon/workflows/experimental/standing-eval-suite.yaml | Defines the load-suite, score-cases, aggregate, and verdict workflow nodes. |
| Documentation | .archon/evals/README.md | Documents suite layout, execution, scoring rules, baseline blessing, and roadmap notes. |

Golden Eval Suite

| Layer | File(s) | Summary |
| --- | --- | --- |
| Golden suite data | .archon/evals/golden/suite.json, .archon/evals/golden/baseline.json, .archon/evals/golden/cases/*.yaml | Adds the golden suite manifest, baseline means, and cases for retry, safe user lookup, and validated config loading. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit hopped through eval lore, 🐰
with queues and gates and cases galore.
One suite to load, one suite to score,
a golden path and docs to explore.
Hop, hop — the checks now run much more!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly names the new standing-eval-suite workflow and its labeled-regression eval harness scope. |
| Description check | ✅ Passed | The PR description covers most required sections, including summary, diagrams, labels, validation, risks, and rollback, with only some template detail missing. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
.archon/scripts/eval-aggregate.ts (1)

16-16: 🔒 Security & Privacy | 🟡 Minor | ⚡ Quick win

Same unsanitized EVAL_SUITE path issue as eval-load-suite.ts.

See companion comment in eval-load-suite.ts:19-22; the same env-var-to-path concern applies here.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.archon/scripts/eval-aggregate.ts at line 16, The eval suite path in
evalAggregate is built directly from the EVAL_SUITE value, so sanitize or
validate suite before passing it to join and restrict it to an allowed suite
name/path fragment. Update the logic in evalAggregate to mirror the safer
handling used in evalLoadSuite so untrusted env input cannot escape the intended
.archon/evals directory.
🧹 Nitpick comments (1)
.archon/evals/README.md (1)

6-11: 📐 Maintainability & Code Quality | 🔵 Trivial

Add language specifier to directory tree code block.

The fenced code block is missing a language tag, triggering MD040. Use text (or bash if you prefer executable syntax) to silence the warning.

+```text
.archon/evals/<suite>/
suite.json # rubric dimensions + weights + thresholds (JSON — bun reads it dep-free)
cases/*.yaml # one labelled case each (YAML — the AI judge reads these natively)
baseline.json # OPTIONAL, COMMITTED: blessed mean_by_dim from an accepted run

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.archon/evals/README.md around lines 6 - 11, The README fenced directory
tree block is missing a language tag and triggers MD040; update the code fence
in the evals documentation to use a language specifier such as text (or bash if
you want executable-style syntax). Make the change in the markdown snippet that
documents the suite layout so the fenced block is properly annotated and the
warning is silenced.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.archon/scripts/eval-aggregate.ts:
- Line 23: The `eval-aggregate.ts` gate currently trusts `suite.json` too much
while `results.json` is validated, so make `manifest` validation fail loudly
before any scoring or threshold checks run. In `main` (where `manifest` is
parsed) and the logic that uses `manifest.dimensions`, `weights`, and
`thresholds`, add explicit validation that required fields exist and are
well-formed, especially ensuring every declared dimension has a corresponding
weight and both `thresholds.dim_floor` and `thresholds.overall_min` are present.
If validation fails, throw or exit with a clear error instead of falling back to
`weights[d] || 0` or comparing against `undefined`.

In @.archon/scripts/eval-load-suite.ts:
- Around line 19-22: The `EVAL_SUITE` value is used directly in path
construction without validation, so update `eval-load-suite` to sanitize or
whitelist `suite` before passing it to `join(...)` and only allow a plain suite
identifier. Apply the same protection in `eval-aggregate` as well, using the
relevant suite-handling logic there, so both entry points reject path separators
or `..` traversal and only resolve within `.archon/evals/`.

In @.archon/workflows/experimental/standing-eval-suite.yaml:
- Around line 16-18: The `score-cases` workflow is over-permissioned: its
`allowed_tools` grants `Bash` and unscoped `Write` even though it only needs to
read `queue.txt`/case files and emit `results.json`. Update the `score-cases`
configuration in the standing eval suite so it follows the read-only design and
least-privilege intent, keeping only the minimal tool access required; use the
`score-cases` block and its `allowed_tools` list as the place to fix this.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 52436635-a821-4dca-870e-bd5820a60dfa

📥 Commits

Reviewing files that changed from the base of the PR and between 59bbd00 and 9b86554.

📒 Files selected for processing (8)
  • .archon/evals/README.md
  • .archon/evals/seed/cases/borderline-refactor.yaml
  • .archon/evals/seed/cases/good-input-validation.yaml
  • .archon/evals/seed/cases/missing-tests.yaml
  • .archon/evals/seed/suite.json
  • .archon/scripts/eval-aggregate.ts
  • .archon/scripts/eval-load-suite.ts
  • .archon/workflows/experimental/standing-eval-suite.yaml

process.exit(1);
}

const manifest = JSON.parse(readFileSync(join(dir, 'suite.json'), 'utf8'));

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

suite.json (manifest) is trusted without validation, unlike results.json.

The script fails loudly on a malformed results.json but applies no equivalent rigor to manifest.dimensions/weights/thresholds. Concretely:

  • Line 59: weights[d] || 0 silently zero-weights a dimension if weights is missing that key in suite.json, skewing overall without any warning.
  • Lines 67 and 83-88: if thresholds.dim_floor/overall_min are absent, comparisons against undefined are always false. As a result, dim_floor_failures would never trigger (false negative), while the overall_min check would always fail (false positive), producing an inconsistent, hard-to-diagnose gate result for a misconfigured suite.

Given this script is the authoritative gate, a misconfigured suite.json should fail loudly the same way a malformed results.json does.

🛡️ Proposed fix
 const manifest = JSON.parse(readFileSync(join(dir, 'suite.json'), 'utf8'));
+for (const key of ['dimensions', 'weights', 'thresholds']) {
+  if (!manifest[key]) {
+    console.error(`suite.json is missing required field "${key}" — gate FAILS.`);
+    process.exit(1);
+  }
+}
+const dims0: string[] = manifest.dimensions;
+for (const d of dims0) {
+  if (typeof manifest.weights[d] !== 'number') {
+    console.error(`suite.json weights is missing dimension "${d}" — gate FAILS.`);
+    process.exit(1);
+  }
+}
+for (const k of ['overall_min', 'dim_floor', 'case_min', 'regression_tolerance']) {
+  if (typeof manifest.thresholds[k] !== 'number') {
+    console.error(`suite.json thresholds is missing "${k}" — gate FAILS.`);
+    process.exit(1);
+  }
+}

Also applies to: 41-43, 59-59, 67-67, 82-88
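
The "comparisons against undefined are always false" point can be checked in isolation. The sketch below assumes the floor check has the form `mean < dim_floor` and the overall check the form `overall >= overall_min`, as the comment above implies; the manifest literal and the `requireNumber` helper are hypothetical.

```typescript
// JSON.parse returns `any`, so a missing threshold is not a compile-time error.
const badManifest = JSON.parse(
  '{"dimensions":["safety"],"weights":{"safety":1},"thresholds":{}}',
);

// Relational operators convert `undefined` to NaN, and any comparison with NaN is false.
const floorTripped: boolean = 2.0 < badManifest.thresholds.dim_floor; // floor check never fires
const overallOk: boolean = 4.9 >= badManifest.thresholds.overall_min; // overall check never passes

// Hypothetical helper: read a required numeric field or stop with a clear message.
function requireNumber(obj: Record<string, unknown>, key: string, where: string): number {
  const v = obj[key];
  if (typeof v !== "number" || Number.isNaN(v)) {
    throw new Error(`${where} is missing numeric "${key}"`);
  }
  return v;
}

console.log({ floorTripped, overallOk });
```

Reading each threshold through a helper like this turns the silent mismatch into an immediate, named failure.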

Comment on lines +19 to +22
const suite = process.env.EVAL_SUITE || 'seed';
const dir = join(process.cwd(), '.archon', 'evals', suite);
const casesDir = join(dir, 'cases');
const manifestPath = join(dir, 'suite.json');

🔒 Security & Privacy | 🟡 Minor | ⚡ Quick win

Unvalidated EVAL_SUITE env var used directly in path construction.

suite flows straight into join(process.cwd(), '.archon', 'evals', suite) with no check that it's a plain identifier. A value containing ../ would let the script read manifests/cases outside .archon/evals/. Same pattern exists independently in eval-aggregate.ts (line 16).

🛡️ Proposed fix
-const suite = process.env.EVAL_SUITE || 'seed';
+const suite = process.env.EVAL_SUITE || 'seed';
+if (!/^[\w-]+$/.test(suite)) {
+  console.error(`Invalid EVAL_SUITE "${suite}": must be alphanumeric/dash/underscore only`);
+  process.exit(1);
+}
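
Since the same check is needed in both scripts, one option is a small shared helper. The sketch below reuses the regex from the proposed fix and adds a containment check on the resolved path as a second line of defense; the helper name `resolveSuiteDir` and the idea of sharing it between scripts are assumptions, not part of the PR.

```typescript
import { resolve, sep } from "node:path";

// Hypothetical helper: validate the suite name, then confirm the resolved
// directory still sits under <root>/.archon/evals/.
function resolveSuiteDir(root: string, suite: string): string {
  if (!/^[\w-]+$/.test(suite)) {
    throw new Error(`Invalid EVAL_SUITE "${suite}": must be alphanumeric/dash/underscore only`);
  }
  const base = resolve(root, ".archon", "evals");
  const dir = resolve(base, suite);
  // Redundant with the regex today, but keeps the guarantee if the regex is ever relaxed.
  if (!dir.startsWith(base + sep)) {
    throw new Error(`EVAL_SUITE "${suite}" resolves outside ${base}`);
  }
  return dir;
}
```

A caller would pass `process.cwd()` as `root` and `process.env.EVAL_SUITE || 'seed'` as `suite`.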
Comment on lines +16 to +18
v1 SCOPE (this file): scores PRE-SUPPLIED candidates carried in each case
(`candidate:`). Read-only — never edits the repo. Deterministic aggregation;
the LLM judge only scores 1-5 per dimension.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

score-cases grants Bash (and unscoped Write) beyond what the task needs.

The prompt only requires reading queue.txt/case files and writing one results.json, yet allowed_tools includes Bash. This runs against the live checkout (worktree.enabled: false), so this contradicts the stated "Read-only — never edits the repo" design (Lines 16-18) and exceeds least privilege.

🛡️ Proposed fix
-    allowed_tools: [Read, Write, Bash]
+    allowed_tools: [Read, Write]

Also applies to: 52-96

A second dataset for standing-eval-suite that acts as the standing QUALITY BAR
(stricter thresholds: overall_min 4.0, dim_floor 3.5, case_min 3). Three strong,
well-tested cases (validated config loader, safe user lookup, retry-with-backoff).

baseline.json is blessed from a PASS run (all dims 5.0), so the regression check
(any dim dropping > 0.5 below baseline -> gate FAIL) is now exercised. Verified
deterministically: feeding a result with safety 5.0->4.0 yields gate FAIL with
regressions: [safety], even though all absolute thresholds still pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.archon/evals/golden/cases/safe-user-lookup.yaml:
- Around line 16-26: The getUser snippet is incomplete because it uses
db.query<User>(...) and the User type without importing either dependency, so
make the embedded candidate code self-contained by adding the missing imports
for db and User in get-user.ts. Keep the function name getUser and the existing
query logic unchanged, but ensure every referenced symbol is explicitly imported
so the golden case is compile-ready and consistent with the other examples.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e04e634e-40c9-4b35-b8e6-418b9d939328

📥 Commits

Reviewing files that changed from the base of the PR and between 9b86554 and e7df69b.

📒 Files selected for processing (5)
  • .archon/evals/golden/baseline.json
  • .archon/evals/golden/cases/retry-with-backoff.yaml
  • .archon/evals/golden/cases/safe-user-lookup.yaml
  • .archon/evals/golden/cases/validated-config-loader.yaml
  • .archon/evals/golden/suite.json
✅ Files skipped from review due to trivial changes (1)
  • .archon/evals/golden/baseline.json

Comment on lines +16 to +26
candidate: |
  // src/users/get-user.ts
  export async function getUser(id: string): Promise<User | null> {
    if (!id || !id.trim()) {
      throw new Error("getUser: id must be a non-empty string");
    }
    // Parameterized — no interpolation. A missing row yields undefined -> null.
    // Any DB/connection error throws out of db.query and propagates to the caller.
    const row = await db.query<User>("SELECT * FROM users WHERE id = $1", [id]);
    return row ?? null;
  }

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Candidate snippet references db and User without importing them.

Unlike the other two golden cases (validated-config-loader.yaml, retry-with-backoff.yaml), which import every dependency they use, this candidate uses db.query<User>(...) and the User type with no corresponding import statement. Since this case is meant to be the blessed "GOOD — correct, complete" reference, the embedded code itself should be complete/compilable to set a consistent quality bar across golden cases.

🩹 Proposed fix
 candidate: |
   // src/users/get-user.ts
+  import { db } from "../db";
+  import type { User } from "./types";
+
   export async function getUser(id: string): Promise<User | null> {
@narutomugens-byte

Author

Calibration check — is the judge measuring real quality?

Before treating a blessed baseline as a real bar, I validated that the medium-tier judge measures quality rather than surface features. Two probes, against labels committed before any judge run:

1. Gradation — an independent annotator scored the 6 shipped cases across all 5 dimensions; the judge landed within ±1 on 30/30 dim-cells (max diff 1).

2. Substance vs. theater (blinded) — 3 engineered hard-negatives whose reference: described only what a correct solution looks like, never naming the candidate's defect or the expected score. The judge independently floored every trap:

| hard-negative | defect hidden behind… | trap dim | score |
| --- | --- | --- | --- |
| vacuous tests | a present test file asserting only toBeDefined() / typeof === "string" | verification | 1 |
| SQL injection | input guard + green tests + clean structure | safety | 1 |
| happy-path median | a passing odd-length test + a confident "works for any length" comment | correctness | 1 |

The rationales confirm genuine detection, not pattern-matching — it flagged the slugify tests as checks that "pass for any string-returning function," and traced the median formula to "returns undefined for empty arrays … passes only by coincidence on a pre-sorted odd-length input." Scores were ~identical whether or not the reference leaked the answer, so the leak wasn't driving detection.
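
For anyone repeating the gradation probe (item 1), the ±1 agreement count reduces to a cell-by-cell comparison of judge scores against annotator labels. A minimal sketch, with hypothetical names and a caseId -> dimension -> score layout assumed for both inputs:

```typescript
type CaseScores = Record<string, Record<string, number>>; // caseId -> dimension -> score

// Count how many (case, dimension) cells the judge scored within `tolerance` of the label.
function agreement(judge: CaseScores, annotator: CaseScores, tolerance = 1) {
  let cells = 0;
  let within = 0;
  let maxDiff = 0;
  for (const [id, dims] of Object.entries(annotator)) {
    for (const [dim, expected] of Object.entries(dims)) {
      const got = judge[id]?.[dim];
      if (typeof got !== "number") throw new Error(`judge has no score for ${id}/${dim}`);
      const diff = Math.abs(got - expected);
      cells += 1;
      if (diff <= tolerance) within += 1;
      maxDiff = Math.max(maxDiff, diff);
    }
  }
  return { cells, within, maxDiff };
}
```

Six cases on five dimensions give 30 cells, so the reported result corresponds to `{ cells: 30, within: 30, maxDiff: 1 }`.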

Conclusion: the judge reads substance. One observed limitation — it awards a flat 5 to genuinely-good code, so the all-5 baseline.json has no headroom; on a tiny suite, correlated ±1 wobbles on the same dimension could throw a false regression. Documented next step (out of scope for this PR): median-of-3 N-vote on score-cases to collapse that wobble.
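
As a rough illustration of the median-of-3 idea (not implemented in this PR), each dimension's final score would be the median of that dimension across independent judge passes. Function names and the per-pass shape below are hypothetical.

```typescript
// Median of a non-empty list; averages the two middle values for even lengths.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Combine N judge passes for one case into a single score per dimension.
function medianVote(votes: Record<string, number>[]): Record<string, number> {
  if (votes.length === 0) throw new Error("no votes");
  const out: Record<string, number> = {};
  for (const dim of Object.keys(votes[0])) {
    out[dim] = median(
      votes.map((v) => {
        const s = v[dim];
        if (typeof s !== "number") throw new Error(`a vote is missing "${dim}"`);
        return s;
      }),
    );
  }
  return out;
}
```

With three passes, a single ±1 outlier on a dimension is discarded, which is the wobble this step is meant to absorb.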

Method note for future case authors: the seed/golden reference: fields currently hint the expected verdict ("expect mid scores", "should score low on verification"), so a standard run partly tests reading-comprehension; the blinded hard-negatives above are what isolate independent detection.
