feat(workflows): add standing-eval-suite (labelled-regression eval harness) by narutomugens-byte · Pull Request #2052 · coleam00/Archon

narutomugens-byte · 2026-06-30T21:08:46Z

Summary

Problem: agentic-eval-gate judges a single diff once, with no ground truth — there is no standing way to measure quality over time against labelled examples.
Why it matters: the vibe-coding SDLC's "evals are the new tests" + quality flywheel needs a curated, version-controlled suite that re-scores on every run and catches regressions.
What changed: adds standing-eval-suite — a 4-node workflow that scores labelled cases on a 5-dimension rubric, aggregates deterministically, optionally compares to a committed baseline, and gates; two named scripts; and two datasets — seed (demo: includes deliberately-failing cases) and golden (a standing quality bar with a blessed baseline.json).
What did NOT change (scope boundary): no TS in packages/ touched; no engine/schema/bundled-defaults changes. Experimental workflow only (lives in .archon/workflows/experimental/). agentic-eval-gate and all existing workflows untouched.
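
As a rough illustration of the deterministic aggregation described above, the sketch below computes per-dimension means, a weighted overall score, and the floor checks. The type names, field shapes, example thresholds, and the rule that a case fails when any single dimension scores below case_min are assumptions made for illustration; the actual eval-aggregate.ts may be structured differently.

```typescript
// Illustrative sketch only: field names and the case-failure rule are assumptions,
// not the actual schema used by eval-aggregate.ts.
interface SuiteManifest {
  dimensions: string[];
  weights: Record<string, number>;
  thresholds: { overall_min: number; dim_floor: number; case_min: number };
}

interface CaseResult {
  id: string;
  scores: Record<string, number>; // one 1-5 score per dimension
}

interface Scorecard {
  meanByDim: Record<string, number>;
  overall: number;
  dimFloorFailures: string[];
  caseFailures: string[];
  gate: "PASS" | "FAIL";
}

function aggregate(manifest: SuiteManifest, results: CaseResult[]): Scorecard {
  const { dimensions, weights, thresholds } = manifest;
  if (results.length === 0) {
    throw new Error("no scored cases: refusing to pass an empty run");
  }

  // Mean score per dimension across all cases; a missing score is a hard error.
  const meanByDim: Record<string, number> = {};
  for (const dim of dimensions) {
    let sum = 0;
    for (const r of results) {
      const score = r.scores[dim];
      if (typeof score !== "number") {
        throw new Error(`case "${r.id}" has no score for "${dim}"`);
      }
      sum += score;
    }
    meanByDim[dim] = sum / results.length;
  }

  // Weighted mean of the per-dimension means, normalised by the total weight.
  let weighted = 0;
  let totalWeight = 0;
  for (const dim of dimensions) {
    const w = weights[dim];
    if (typeof w !== "number") throw new Error(`no weight for "${dim}"`);
    weighted += meanByDim[dim] * w;
    totalWeight += w;
  }
  const overall = weighted / totalWeight;

  const dimFloorFailures = dimensions.filter((d) => meanByDim[d] < thresholds.dim_floor);
  // Assumed rule: a case fails if any single dimension is below case_min.
  const caseFailures = results
    .filter((r) => dimensions.some((d) => r.scores[d] < thresholds.case_min))
    .map((r) => r.id);

  const pass =
    overall >= thresholds.overall_min &&
    dimFloorFailures.length === 0 &&
    caseFailures.length === 0;

  return { meanByDim, overall, dimFloorFailures, caseFailures, gate: pass ? "PASS" : "FAIL" };
}

// Made-up two-dimension example, not data from the seed or golden suites.
const demoScorecard = aggregate(
  {
    dimensions: ["correctness", "verification"],
    weights: { correctness: 0.5, verification: 0.5 },
    thresholds: { overall_min: 3.5, dim_floor: 3.5, case_min: 2 },
  },
  [
    { id: "case-a", scores: { correctness: 5, verification: 5 } },
    { id: "case-b", scores: { correctness: 3, verification: 1 } },
  ],
);
console.log(demoScorecard);
```

In this made-up run the overall score meets overall_min, yet the gate still fails because one dimension mean is under the floor and one case is under case_min. That is the point of layering floors on top of a single average.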

UX Journey

After

Operator                         Archon (standing-eval-suite)
────────                         ────────────────────────────
EVAL_SUITE=seed                  load-suite (script)   → writes case queue
  archon workflow run ────────▶  score-cases (medium)  → judges each case 1-5 on 5 dims
  standing-eval-suite            aggregate (script)    → weighted means, floors, baseline diff → scorecard
                                 verdict (small)        → human-readable PASS/FAIL
sees gate decision ◀──────────  scorecard.json (artifacts) + trend line (.archon/state)

Architecture Diagram

After

.archon/evals/<suite>/           .archon/workflows/experimental/        .archon/scripts/
  suite.json        ───────────▶  standing-eval-suite.yaml  ──run──▶  eval-load-suite.ts [+]
  cases/*.yaml      ──read by──▶    (load → score → aggregate → verdict)  eval-aggregate.ts [+]
  baseline.json (opt)
                                  outputs: $ARTIFACTS_DIR/eval/scorecard.json [+]
                                           .archon/state/eval-history.jsonl  [+] (gitignored)

Connection inventory:

From	To	Status	Notes
standing-eval-suite.yaml	eval-load-suite.ts	new	named script (bun run)
standing-eval-suite.yaml	eval-aggregate.ts	new	named script (bun run)
eval-load-suite.ts	.archon/evals/	new	reads dataset (cwd-relative)
eval-aggregate.ts	.archon/state/eval-history.jsonl	new	appends trend (gitignored)

Label Snapshot

Risk: risk: low
Size: size: M
Scope: workflows
Module: workflows:experimental

Change Metadata

Change type: feature
Primary scope: workflows

Linked Issue

Related # (none — extends the agentic-engineering eval work alongside agentic-eval-gate)

Validation Evidence (required)

This PR adds NO TypeScript under packages/ and no bundled defaults — it is workflow YAML + named scripts + a dataset. The standard bun run validate (type-check/lint/test/bundled checks) targets the engine and is not exercised by these files. Validation here is workflow-validation + an end-to-end smoke:

bun run cli validate workflows standing-eval-suite
#   standing-eval-suite   ok        (1 valid, 0 with errors)

EVAL_SUITE=seed bun run cli workflow run standing-eval-suite --no-worktree
#   gate FAIL, overall 3.1 < 3.5
#   dim_floor_failures: [completeness (2.667), verification (2.333)]
#   case_failures: [missing-tests, borderline-refactor]

EVAL_SUITE=golden bun run cli workflow run standing-eval-suite --no-worktree
#   gate PASS, overall 5.0 — all 3 strong cases scored 5/5; baseline blessed from this run

# Regression path (deterministic, no AI) — feed eval-aggregate a result with safety 5.0->4.0:
#   gate FAIL, baseline_used: true, regressions: [{dim: safety, baseline: 5, current: 4}]
#   (absolute thresholds all still pass — only the >0.5 baseline drop trips it)

Evidence provided: workflow-validation + both foreground runs (seed FAIL, golden PASS) + a deterministic regression-path check. The missing-tests case (correct code, zero tests) is correctly blocked — proving the verification dimension bites — and the golden baseline proves regression detection.
Intentionally skipped: full bun run validate — no engine/TS/bundled files changed.
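
The regression path above (safety 5.0 -> 4.0 tripping the gate while every absolute threshold still passes) can be sketched as a comparison of current per-dimension means against the blessed baseline. The function and type names here are hypothetical, and the strict "greater than tolerance" comparison is inferred from the ">0.5 baseline drop" wording rather than read from eval-aggregate.ts.

```typescript
// Illustrative sketch; names are assumptions, not the real eval-aggregate.ts API.
interface Regression {
  dim: string;
  baseline: number;
  current: number;
}

// Flags every dimension whose current mean fell more than `tolerance`
// below the blessed baseline mean.
function findRegressions(
  baseline: Record<string, number>,
  current: Record<string, number>,
  tolerance: number,
): Regression[] {
  const found: Regression[] = [];
  for (const [dim, base] of Object.entries(baseline)) {
    const cur = current[dim];
    if (typeof cur !== "number") {
      throw new Error(`current run has no mean for "${dim}"`);
    }
    if (base - cur > tolerance) {
      found.push({ dim, baseline: base, current: cur });
    }
  }
  return found;
}

// Mirrors the scenario described in the validation notes: all dims blessed at 5, safety drops to 4.
const blessed = { correctness: 5, completeness: 5, maintainability: 5, safety: 5, verification: 5 };
const currentRun = { ...blessed, safety: 4 };
const regressions = findRegressions(blessed, currentRun, 0.5);
console.log(regressions);
```

Under this reading, a drop of exactly 0.5 is tolerated and only a larger drop is reported.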

Security Impact (required)

New permissions/capabilities? No — read-only workflow (mutates_checkout: false, no worktree); judge node restricted to [Read, Write, Bash], verdict to [].
New external network calls? No.
Secrets/tokens handling changed? No.
File system access scope changed? No — scripts read the cwd dataset and write only to $ARTIFACTS_DIR and the gitignored .archon/state/.

Compatibility / Migration

Backward compatible? Yes — purely additive; no existing workflow or schema changed.
Config/env changes? No (optional EVAL_SUITE env var selects a suite; default seed).
Database migration needed? No.

Human Verification (required)

Verified scenarios: foreground seed run end-to-end; confirmed queue.txt/results.json/scorecard.json written and a trend line appended to .archon/state/eval-history.jsonl.
Edge cases checked: empty/missing results.json → aggregate fails loudly (gate FAIL, not silent pass); a case missing a dimension score → fails loudly. Discovered and worked around an Archon-on-Windows bug where inline multi-line script: nodes silently no-op (bun -e argv truncates at the first newline) — hence named scripts.
Also verified: golden suite gate PASS (overall 5.0) and the regression-vs-baseline path (deterministic check above) — a single dimension dropping past tolerance flips the gate.
What was not verified: behavior on a very large suite (single-judge-pass scope; loop-based per-case fan-out is the documented v2).

Side Effects / Blast Radius (required)

Affected subsystems/workflows: none existing — new experimental workflow + two new scripts + a new dataset dir.
Potential unintended effects: none on other workflows; the new named scripts share the .archon/scripts/ namespace (prefixed eval- to avoid collision).
Guardrails/monitoring: aggregate fails fast on any malformed/empty judge output; deterministic gate is authoritative (the small-model verdict only narrates).

Rollback Plan (required)

Fast rollback: delete the four added paths (.archon/workflows/experimental/standing-eval-suite.yaml, .archon/scripts/eval-load-suite.ts, .archon/scripts/eval-aggregate.ts, .archon/evals/) — no other code references them.
Feature flags/toggles: none needed; it only runs when explicitly invoked.
Observable failure symptoms: archon validate workflows would flag a broken YAML; a run that can't find its dataset fails at load-suite.

Risks and Mitigations

Risk: LLM-judge scoring is non-deterministic run-to-run.
- Mitigation: deterministic aggregation + thresholds are authoritative; v2 adds N-vote judging. Documented in .archon/evals/README.md.
Risk: single judge pass may drop cases on a very large suite.
- Mitigation: scoped to small curated suites for v1; aggregate fails loudly if a case is unscored; loop-based fan-out is the documented v2 path.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added support for a standing evaluation suite workflow with deterministic suite loading, automated case scoring, and pass/fail aggregation with historical scorecards.
- Added new seed eval cases focused on input validation, safe error handling, missing-test enforcement, and refactor edge cases.
- Added a golden eval suite and baseline updates for consistent scoring.
Documentation
- Added guidance for selecting and running standing eval suites, including directory structure, case formats, scoring/threshold rules, baseline blessing, and roadmap notes.

…rness)

EXPERIMENTAL companion to agentic-eval-gate. Where the gate judges one diff once with no ground truth, this scores a curated, version-controlled set of LABELLED cases against a 5-dimension rubric (correctness, completeness, maintainability, safety, verification), aggregates deterministically, compares to an optional committed baseline, and gates, so quality is measured over time ("evals are the new tests" + the quality flywheel).

- .archon/workflows/experimental/standing-eval-suite.yaml: a 4-node DAG, load-suite -> score-cases (judge, medium tier) -> aggregate (deterministic gate) -> verdict (small tier).
- .archon/scripts/eval-{load-suite,aggregate}.ts: the deterministic nodes as NAMED scripts (inline multi-line script: nodes silently no-op on Windows; the bun -e argv truncates at the first newline).
- .archon/evals/seed/: 5-dim rubric (suite.json) + 3 labelled cases + README.
- Suite selected via EVAL_SUITE (default: seed). Scorecard written to artifacts; trend appended to .archon/state/eval-history.jsonl (gitignored).

Smoke (seed, foreground): gate FAIL, overall 3.1 < 3.5; floors tripped on verification/completeness, and missing-tests + borderline-refactor flagged. The untested-but-correct case is correctly blocked.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-30T21:09:02Z

📝 Walkthrough

This PR adds a standing eval-suite harness, its seed suite configuration and cases, workflow wiring, and documentation, plus a separate golden eval suite with baseline data and three evaluation cases.

Changes

Standing Eval Suite

Layer / File(s)	Summary
Suite data `.archon/evals/seed/suite.json`, `.archon/evals/seed/cases/*.yaml`	Adds the seed suite manifest and three seed eval cases for refactor, input validation, and missing-tests scenarios.
Suite loading `.archon/scripts/eval-load-suite.ts`	Validates suite inputs, builds the case queue, and prints a JSON summary.
Aggregation and gate `.archon/scripts/eval-aggregate.ts`	Computes per-dimension means, weighted overall score, gate status, scorecard output, and eval history.
Workflow wiring `.archon/workflows/experimental/standing-eval-suite.yaml`	Defines the load-suite, score-cases, aggregate, and verdict workflow nodes.
Documentation `.archon/evals/README.md`	Documents suite layout, execution, scoring rules, baseline blessing, and roadmap notes.

Golden Eval Suite

Layer / File(s)	Summary
Golden suite data `.archon/evals/golden/suite.json`, `.archon/evals/golden/baseline.json`, `.archon/evals/golden/cases/*.yaml`	Adds the golden suite manifest, baseline means, and cases for retry, safe user lookup, and validated config loading.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit hopped through eval lore, 🐰
with queues and gates and cases galore.
One suite to load, one suite to score,
a golden path and docs to explore.
Hop, hop — the checks now run much more!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly names the new standing-eval-suite workflow and its labeled-regression eval harness scope.
Description check	✅ Passed	The PR description covers most required sections, including summary, diagrams, labels, validation, risks, and rollback, with only some template detail missing.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (1)

.archon/scripts/eval-aggregate.ts (1)
16-16: 🔒 Security & Privacy | 🟡 Minor | ⚡ Quick win

Same unsanitized EVAL_SUITE path issue as eval-load-suite.ts.

See companion comment in eval-load-suite.ts:19-22; the same env-var-to-path concern applies here.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.archon/scripts/eval-aggregate.ts at line 16, The eval suite path in
evalAggregate is built directly from the EVAL_SUITE value, so sanitize or
validate suite before passing it to join and restrict it to an allowed suite
name/path fragment. Update the logic in evalAggregate to mirror the safer
handling used in evalLoadSuite so untrusted env input cannot escape the intended
.archon/evals directory.

🧹 Nitpick comments (1)

.archon/evals/README.md (1)
6-11: 📐 Maintainability & Code Quality | 🔵 Trivial

Add language specifier to directory tree code block.

The fenced code block is missing a language tag, triggering MD040. Use text (or bash if you prefer executable syntax) to silence the warning.
+```text
 .archon/evals/<suite>/
   suite.json      # rubric dimensions + weights + thresholds (JSON — bun reads it dep-free)
   cases/*.yaml    # one labelled case each (YAML — the AI judge reads these natively)
   baseline.json   # OPTIONAL, COMMITTED: blessed mean_by_dim from an accepted run
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.archon/evals/README.md around lines 6 - 11, The README fenced directory
tree block is missing a language tag and triggers MD040; update the code fence
in the evals documentation to use a language specifier such as text (or bash if
you want executable-style syntax). Make the change in the markdown snippet that
documents the suite layout so the fenced block is properly annotated and the
warning is silenced.
Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.archon/scripts/eval-aggregate.ts:
- Line 23: The `eval-aggregate.ts` gate currently trusts `suite.json` too much
while `results.json` is validated, so make `manifest` validation fail loudly
before any scoring or threshold checks run. In `main` (where `manifest` is
parsed) and the logic that uses `manifest.dimensions`, `weights`, and
`thresholds`, add explicit validation that required fields exist and are
well-formed, especially ensuring every declared dimension has a corresponding
weight and both `thresholds.dim_floor` and `thresholds.overall_min` are present.
If validation fails, throw or exit with a clear error instead of falling back to
`weights[d] || 0` or comparing against `undefined`.

In @.archon/scripts/eval-load-suite.ts:
- Around line 19-22: The `EVAL_SUITE` value is used directly in path
construction without validation, so update `eval-load-suite` to sanitize or
whitelist `suite` before passing it to `join(...)` and only allow a plain suite
identifier. Apply the same protection in `eval-aggregate` as well, using the
relevant suite-handling logic there, so both entry points reject path separators
or `..` traversal and only resolve within `.archon/evals/`.

In @.archon/workflows/experimental/standing-eval-suite.yaml:
- Around line 16-18: The `score-cases` workflow is over-permissioned: its
`allowed_tools` grants `Bash` and unscoped `Write` even though it only needs to
read `queue.txt`/case files and emit `results.json`. Update the `score-cases`
configuration in the standing eval suite so it follows the read-only design and
least-privilege intent, keeping only the minimal tool access required; use the
`score-cases` block and its `allowed_tools` list as the place to fix this.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 52436635-a821-4dca-870e-bd5820a60dfa

📥 Commits

Reviewing files that changed from the base of the PR and between 59bbd00 and 9b86554.

📒 Files selected for processing (8)

.archon/evals/README.md
.archon/evals/seed/cases/borderline-refactor.yaml
.archon/evals/seed/cases/good-input-validation.yaml
.archon/evals/seed/cases/missing-tests.yaml
.archon/evals/seed/suite.json
.archon/scripts/eval-aggregate.ts
.archon/scripts/eval-load-suite.ts
.archon/workflows/experimental/standing-eval-suite.yaml

coderabbitai · 2026-06-30T21:14:19Z

+  process.exit(1);
+}
+
+const manifest = JSON.parse(readFileSync(join(dir, 'suite.json'), 'utf8'));


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

suite.json (manifest) is trusted without validation, unlike results.json.

The script fails loudly on a malformed results.json but applies no equivalent rigor to manifest.dimensions/weights/thresholds. Concretely:

Line 59: weights[d] || 0 silently zero-weights a dimension if weights is missing that key in suite.json, skewing overall without any warning.

Lines 67 and 83-88: if thresholds.dim_floor/overall_min are absent, comparisons against undefined are always false — dim_floor_failures would never trigger (false negative), while the overall_min check would always fail (false positive) — producing an inconsistent, hard-to-diagnose gate result for a misconfigured suite.

Given this script is the authoritative gate, a misconfigured suite.json should fail loudly the same way a malformed results.json does.
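
The asymmetry described above follows from how JavaScript compares a number with undefined: the undefined operand is coerced to NaN and every relational comparison yields false. The snippet below is a minimal demonstration. It assumes the script tests floors with `<` and the overall minimum with `>=`, which is an inference from the comment rather than a quote from eval-aggregate.ts, and the score values are only examples.

```typescript
// A suite.json whose thresholds block is present but empty (hypothetical misconfiguration).
const parsed: { thresholds: Record<string, number> } = JSON.parse('{"thresholds": {}}');
const dimFloor = parsed.thresholds["dim_floor"];     // undefined at runtime
const overallMin = parsed.thresholds["overall_min"]; // undefined at runtime

// Any relational comparison with undefined coerces it to NaN and yields false.
const verificationMean = 2.333;
const overallScore = 5.0;
const flaggedBelowFloor = verificationMean < dimFloor; // false: a weak dimension is never flagged
const meetsOverallMin = overallScore >= overallMin;    // false: even a perfect score fails

console.log({ flaggedBelowFloor, meetsOverallMin });
```

Because both comparisons are false for opposite reasons, validating the thresholds up front is the only way to get a clear error instead of a confusing gate result.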

🛡️ Proposed fix

 const manifest = JSON.parse(readFileSync(join(dir, 'suite.json'), 'utf8'));
+for (const key of ['dimensions', 'weights', 'thresholds']) {
+  if (!manifest[key]) {
+    console.error(`suite.json is missing required field "${key}" — gate FAILS.`);
+    process.exit(1);
+  }
+}
+const dims0: string[] = manifest.dimensions;
+for (const d of dims0) {
+  if (typeof manifest.weights[d] !== 'number') {
+    console.error(`suite.json weights is missing dimension "${d}" — gate FAILS.`);
+    process.exit(1);
+  }
+}
+for (const k of ['overall_min', 'dim_floor', 'case_min', 'regression_tolerance']) {
+  if (typeof manifest.thresholds[k] !== 'number') {
+    console.error(`suite.json thresholds is missing "${k}" — gate FAILS.`);
+    process.exit(1);
+  }
+}

Also applies to: 41-43, 59-59, 67-67, 82-88

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.archon/scripts/eval-aggregate.ts at line 23, The `eval-aggregate.ts` gate currently trusts `suite.json` too much while `results.json` is validated, so make `manifest` validation fail loudly before any scoring or threshold checks run. In `main` (where `manifest` is parsed) and the logic that uses `manifest.dimensions`, `weights`, and `thresholds`, add explicit validation that required fields exist and are well-formed, especially ensuring every declared dimension has a corresponding weight and both `thresholds.dim_floor` and `thresholds.overall_min` are present. If validation fails, throw or exit with a clear error instead of falling back to `weights[d] || 0` or comparing against `undefined`.

coderabbitai · 2026-06-30T21:14:19Z

+const suite = process.env.EVAL_SUITE || 'seed';
+const dir = join(process.cwd(), '.archon', 'evals', suite);
+const casesDir = join(dir, 'cases');
+const manifestPath = join(dir, 'suite.json');


🔒 Security & Privacy | 🟡 Minor | ⚡ Quick win

Unvalidated EVAL_SUITE env var used directly in path construction.

suite flows straight into join(process.cwd(), '.archon', 'evals', suite) with no check that it's a plain identifier. A value containing ..// would let the script read manifests/cases outside .archon/evals/. Same pattern exists independently in eval-aggregate.ts (line 16).

🛡️ Proposed fix

-const suite = process.env.EVAL_SUITE || 'seed';
+const suite = process.env.EVAL_SUITE || 'seed';
+if (!/^[\w-]+$/.test(suite)) {
+  console.error(`Invalid EVAL_SUITE "${suite}": must be alphanumeric/dash/underscore only`);
+  process.exit(1);
+}
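
To make the effect of that pattern concrete, the sketch below runs the same regular expression over a handful of inputs. The helper name isValidSuiteName and every input other than seed and golden are hypothetical and used only for illustration.

```typescript
// Same pattern as the proposed fix: letters, digits, underscore, and dash only.
const SUITE_NAME = /^[\w-]+$/;

// Hypothetical helper wrapping the check.
function isValidSuiteName(name: string): boolean {
  return SUITE_NAME.test(name);
}

// Hypothetical inputs; only "seed" and "golden" are suite names from this PR.
const candidates = ["seed", "golden", "my-suite_2", "../secrets", "seed/../../etc", "a\\b", ""];
for (const name of candidates) {
  console.log(JSON.stringify(name), isValidSuiteName(name) ? "accepted" : "rejected");
}
```

Anything containing a dot, slash, or backslash is rejected, so a value that passes can only name a direct child of .archon/evals/. The empty string is also rejected by the pattern, although the `|| 'seed'` default means an unset variable never reaches the check.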

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-const suite = process.env.EVAL_SUITE || 'seed';
-const dir = join(process.cwd(), '.archon', 'evals', suite);
-const casesDir = join(dir, 'cases');
-const manifestPath = join(dir, 'suite.json');
+const suite = process.env.EVAL_SUITE || 'seed';
+if (!/^[\w-]+$/.test(suite)) {
+  console.error(`Invalid EVAL_SUITE "${suite}": must be alphanumeric/dash/underscore only`);
+  process.exit(1);
+}
+const dir = join(process.cwd(), '.archon', 'evals', suite);
+const casesDir = join(dir, 'cases');
+const manifestPath = join(dir, 'suite.json');

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.archon/scripts/eval-load-suite.ts around lines 19 - 22, The `EVAL_SUITE` value is used directly in path construction without validation, so update `eval-load-suite` to sanitize or whitelist `suite` before passing it to `join(...)` and only allow a plain suite identifier. Apply the same protection in `eval-aggregate` as well, using the relevant suite-handling logic there, so both entry points reject path separators or `..` traversal and only resolve within `.archon/evals/`.

coderabbitai · 2026-06-30T21:14:20Z

+  v1 SCOPE (this file): scores PRE-SUPPLIED candidates carried in each case
+  (`candidate:`). Read-only — never edits the repo. Deterministic aggregation;
+  the LLM judge only scores 1-5 per dimension.


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

score-cases grants Bash (and unscoped Write) beyond what the task needs.

The prompt only requires reading queue.txt/case files and writing one results.json, yet allowed_tools includes Bash. This runs against the live checkout (worktree.enabled: false), so this contradicts the stated "Read-only — never edits the repo" design (Lines 16-18) and exceeds least privilege.

🛡️ Proposed fix

- allowed_tools: [Read, Write, Bash]
+ allowed_tools: [Read, Write]

Also applies to: 52-96

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.archon/workflows/experimental/standing-eval-suite.yaml around lines 16 - 18, The `score-cases` workflow is over-permissioned: its `allowed_tools` grants `Bash` and unscoped `Write` even though it only needs to read `queue.txt`/case files and emit `results.json`. Update the `score-cases` configuration in the standing eval suite so it follows the read-only design and least-privilege intent, keeping only the minimal tool access required; use the `score-cases` block and its `allowed_tools` list as the place to fix this.

A second dataset for standing-eval-suite that acts as the standing QUALITY BAR (stricter thresholds: overall_min 4.0, dim_floor 3.5, case_min 3). Three strong, well-tested cases (validated config loader, safe user lookup, retry-with-backoff). baseline.json is blessed from a PASS run (all dims 5.0), so the regression check (any dim dropping > 0.5 below baseline -> gate FAIL) is now exercised. Verified deterministically: feeding a result with safety 5.0->4.0 yields gate FAIL with regressions: [safety], even though all absolute thresholds still pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.archon/evals/golden/cases/safe-user-lookup.yaml:
- Around line 16-26: The getUser snippet is incomplete because it uses
db.query<User>(...) and the User type without importing either dependency, so
make the embedded candidate code self-contained by adding the missing imports
for db and User in get-user.ts. Keep the function name getUser and the existing
query logic unchanged, but ensure every referenced symbol is explicitly imported
so the golden case is compile-ready and consistent with the other examples.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e04e634e-40c9-4b35-b8e6-418b9d939328

📥 Commits

Reviewing files that changed from the base of the PR and between 9b86554 and e7df69b.

📒 Files selected for processing (5)

.archon/evals/golden/baseline.json
.archon/evals/golden/cases/retry-with-backoff.yaml
.archon/evals/golden/cases/safe-user-lookup.yaml
.archon/evals/golden/cases/validated-config-loader.yaml
.archon/evals/golden/suite.json

✅ Files skipped from review due to trivial changes (1)

.archon/evals/golden/baseline.json

coderabbitai · 2026-06-30T21:56:16Z

+candidate: |
+  // src/users/get-user.ts
+  export async function getUser(id: string): Promise<User | null> {
+    if (!id || !id.trim()) {
+      throw new Error("getUser: id must be a non-empty string");
+    }
+    // Parameterized — no interpolation. A missing row yields undefined -> null.
+    // Any DB/connection error throws out of db.query and propagates to the caller.
+    const row = await db.query<User>("SELECT * FROM users WHERE id = $1", [id]);
+    return row ?? null;
+  }


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Candidate snippet references db and User without importing them.

Unlike the other two golden cases (validated-config-loader.yaml, retry-with-backoff.yaml), which import every dependency they use, this candidate uses db.query<User>(...) and the User type with no corresponding import statement. Since this case is meant to be the blessed "GOOD — correct, complete" reference, the embedded code itself should be complete/compilable to set a consistent quality bar across golden cases.

🩹 Proposed fix

 candidate: |
   // src/users/get-user.ts
+  import { db } from "../db";
+  import type { User } from "./types";
+
   export async function getUser(id: string): Promise<User | null> {

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-candidate: |
-  // src/users/get-user.ts
-  export async function getUser(id: string): Promise<User | null> {
-    if (!id || !id.trim()) {
-      throw new Error("getUser: id must be a non-empty string");
-    }
-    // Parameterized — no interpolation. A missing row yields undefined -> null.
-    // Any DB/connection error throws out of db.query and propagates to the caller.
-    const row = await db.query<User>("SELECT * FROM users WHERE id = $1", [id]);
-    return row ?? null;
-  }
+candidate: |
+  // src/users/get-user.ts
+  import { db } from "../db";
+  import type { User } from "./types";
+  export async function getUser(id: string): Promise<User | null> {
+    if (!id || !id.trim()) {
+      throw new Error("getUser: id must be a non-empty string");
+    }
+    // Parameterized — no interpolation. A missing row yields undefined -> null.
+    // Any DB/connection error throws out of db.query and propagates to the caller.
+    const row = await db.query<User>("SELECT * FROM users WHERE id = $1", [id]);
+    return row ?? null;
+  }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.archon/evals/golden/cases/safe-user-lookup.yaml around lines 16 - 26, The getUser snippet is incomplete because it uses db.query<User>(...) and the User type without importing either dependency, so make the embedded candidate code self-contained by adding the missing imports for db and User in get-user.ts. Keep the function name getUser and the existing query logic unchanged, but ensure every referenced symbol is explicitly imported so the golden case is compile-ready and consistent with the other examples.

narutomugens-byte · 2026-07-01T00:09:30Z

Calibration check — is the judge measuring real quality?

Before treating a blessed baseline as a real bar, I validated that the medium-tier judge measures quality rather than surface features. Two probes, against labels committed before any judge run:

1. Gradation — an independent annotator scored the 6 shipped cases across all 5 dimensions; the judge landed within ±1 on 30/30 dim-cells (max diff 1).

2. Substance vs. theater (blinded) — 3 engineered hard-negatives whose reference: described only what a correct solution looks like, never naming the candidate's defect or the expected score. The judge independently floored every trap:

hard-negative	defect hidden behind…	trap dim	score
vacuous tests	a present test file asserting only `toBeDefined()` / `typeof === "string"`	verification	1
SQL injection	input guard + green tests + clean structure	safety	1
happy-path median	a passing odd-length test + a confident "works for any length" comment	correctness	1

The rationales confirm genuine detection, not pattern-matching — it flagged the slugify tests as checks that "pass for any string-returning function," and traced the median formula to "returns undefined for empty arrays … passes only by coincidence on a pre-sorted odd-length input." Scores were ~identical whether or not the reference leaked the answer, so the leak wasn't driving detection.
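
For future case authors who want to repeat the gradation probe in item 1, the agreement count can be computed mechanically from two score grids. The sketch below is one hedged way to do it; the grid shape, the function name, and the sample scores are all invented for illustration and are not the data behind the 30/30 figure.

```typescript
// caseId -> dimension -> score. Shape is an assumption for this sketch.
type ScoreGrid = Record<string, Record<string, number>>;

// Counts how many (case, dimension) cells the judge scored within `tolerance`
// of the pre-committed label, and records the largest absolute difference.
function agreement(labels: ScoreGrid, judge: ScoreGrid, tolerance: number) {
  let cells = 0;
  let within = 0;
  let maxDiff = 0;
  for (const [caseId, dims] of Object.entries(labels)) {
    for (const [dim, expected] of Object.entries(dims)) {
      const actual = judge[caseId]?.[dim];
      if (typeof actual !== "number") {
        throw new Error(`judge has no score for ${caseId}/${dim}`);
      }
      const diff = Math.abs(actual - expected);
      cells += 1;
      if (diff <= tolerance) within += 1;
      maxDiff = Math.max(maxDiff, diff);
    }
  }
  return { cells, within, maxDiff };
}

// Invented sample grids, two cases by two dimensions.
const labelGrid: ScoreGrid = {
  "case-a": { correctness: 5, verification: 2 },
  "case-b": { correctness: 3, verification: 4 },
};
const judgeGrid: ScoreGrid = {
  "case-a": { correctness: 5, verification: 1 },
  "case-b": { correctness: 4, verification: 4 },
};
const agreementSummary = agreement(labelGrid, judgeGrid, 1);
console.log(agreementSummary);
```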

Conclusion: the judge reads substance. One observed limitation — it awards a flat 5 to genuinely-good code, so the all-5 baseline.json has no headroom; on a tiny suite, correlated ±1 wobbles on the same dimension could throw a false regression. Documented next step (out of scope for this PR): median-of-3 N-vote on score-cases to collapse that wobble.
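
A minimal sketch of what median-of-3 voting could look like, assuming each judge pass yields one score per dimension for a case. This is not part of the PR; the function names and the sample votes are hypothetical.

```typescript
type DimScores = Record<string, number>;

// Median of a non-empty list; averages the two middle values for even lengths.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of an empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Collapses several judge passes for one case into a single score per dimension.
function medianVote(votes: DimScores[], dimensions: string[]): DimScores {
  const result: DimScores = {};
  for (const dim of dimensions) {
    const scores = votes.map((vote) => {
      const s = vote[dim];
      if (typeof s !== "number") throw new Error(`a vote is missing "${dim}"`);
      return s;
    });
    result[dim] = median(scores);
  }
  return result;
}

// Hypothetical votes: each pass wobbles by one point on a different dimension.
const judgeVotes: DimScores[] = [
  { safety: 5, verification: 5 },
  { safety: 4, verification: 5 },
  { safety: 5, verification: 4 },
];
const votedScores = medianVote(judgeVotes, ["safety", "verification"]);
console.log(votedScores);
```

With three passes, a single one-point wobble on a dimension is outvoted by the other two, so it never reaches the baseline comparison.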

Method note for future case authors: the seed/golden reference: fields currently hint the expected verdict ("expect mid scores", "should score low on verification"), so a standard run partly tests reading-comprehension; the blinded hard-negatives above are what isolate independent detection.

narutomugens-byte wants to merge 2 commits into coleam00:dev from narutomugens-byte:feat/standing-eval-suite
