feat(workflows): Kimi cross-model harness + eval gate (MiniMax parity) #2049
narutomugens-byte wants to merge 2 commits into
…iniMax parity)
Bring the home-scoped Kimi build harness into .archon/workflows/, following
the committed MiniMax precedent. Code already supports Kimi via the Pi
`kimi-coding` vendor — this is workflow-YAML only.
- experimental/opus-plan-kimi-build.yaml: cross-model PIV build loop (Opus
plans/reviews via the `large` tier, Kimi implements/fixes via
pi + kimi-coding/kimi-for-coding). Default isolation off `dev`.
- experimental/archon-fix-github-issue-kimi.yaml: faithful mirror of
archon-fix-github-issue-minimax.yaml, all nodes on Kimi.
- test-workflows/e2e-kimi-smoke.yaml: connectivity/capability smoke mirroring
e2e-minimax-smoke.yaml, asserts Pi session-log routing to kimi-coding.
- experimental/agentic-eval-gate.yaml: standalone output+trajectory pre-merge
gate ("set the bar at the eval, not the demo"); pairs with the build harness.
All four pass `archon validate workflows`. The two non-blocking shell-quoting
warnings are inherited verbatim from the MiniMax sources (parity, not new).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Walkthrough
Four new Archon workflow YAML files are added:
Changes
- Agentic Evaluation Gate
- Kimi fix-github-issue workflow
- Opus-plan / Kimi-build cross-model PIV loop
- e2e Kimi smoke test
Estimated code review effort: 3 (Moderate), ~25 minutes
…e-check/lint
`have()` already wraps `$1` in quotes when grepping package.json, so calling `have '"type-check"'` / `have '"lint"'` searched for `""type-check""` (doubled quotes) and never matched. As a result, type-check and lint always reported "skipped", blinding the trajectory eval to the very checks it must weigh.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
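The double-quoting bug described in this commit can be reproduced in isolation. The sketch below is a hypothetical reconstruction: the `have()` body, the `PKG_JSON` variable, and the sample package.json are assumptions for illustration, and the real helper in `agentic-eval-gate.yaml` may differ.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the workflow's have() helper: it wraps its
# argument in double quotes before grepping, as the commit message describes.
have() {
  grep -q "\"$1\"" "$PKG_JSON"
}

# Sample package.json (assumed contents, for illustration only).
PKG_JSON="$(mktemp)"
printf '%s\n' '{ "scripts": { "type-check": "tsc --noEmit", "lint": "eslint ." } }' > "$PKG_JSON"

# Buggy call: the caller quotes again, so grep searches for ""type-check"".
if have '"type-check"'; then BUGGY=found; else BUGGY=skipped; fi

# Fixed call: pass the bare script name and let have() add the quotes.
if have 'type-check'; then FIXED=found; else FIXED=skipped; fi

echo "buggy=$BUGGY fixed=$FIXED"
rm -f "$PKG_JSON"
```

With the doubled quotes the pattern never matches, so the check reports "skipped" even though the script exists.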
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.archon/workflows/experimental/agentic-eval-gate.yaml:
- Around line 66-74: The fallback in the eval gate logic is too aggressive: when
origin/$BASE exists but HEAD has no commits ahead of it, the current MODE=last
path in the shell block still evaluates the previous commit instead of treating
the branch as having no changes. Update the decision flow in the same
conditional chain so the branch only selects MODE=base when there are commits
ahead of origin/$BASE, and otherwise reports no changes to evaluate instead of
falling back to the last commit; keep the fix localized around the STATUS_PORC,
git rev-parse/git log check, and the MODE/ LABEL assignments.
- Around line 66-95: The worktree path in the diff-gating logic drops newly
untracked files because `run_diff()` uses `git diff ... HEAD`, so additions can
be reported as no changes. Update the `MODE=worktree` branch and the `run_diff`
helper in `agentic-eval-gate.yaml` so worktree mode includes untracked paths
alongside tracked changes, and make `--name-only` report those files too. Keep
the existing `MODE`, `LABEL`, and `run_diff` structure, but ensure the worktree
comparison captures all files visible from `git status --porcelain`.
In @.archon/workflows/experimental/archon-fix-github-issue-kimi.yaml:
- Around line 74-83: The `fetch-issue` step is interpolating model output into
the bash script before sanitization, which can allow command execution; update
the workflow so `extract-issue-number.output` is treated as data only and never
rendered directly into shell evaluation. In `fetch-issue`, read the output
through a safe quoting/escaping path or environment variable, then perform the
numeric extraction and validation before calling `gh issue view`, keeping the
hardening inside this step and preserving the existing `ISSUE_NUM` check.
In @.archon/workflows/experimental/opus-plan-kimi-build.yaml:
- Around line 154-163: Pass the original plan context into both the
fresh-context fix and re-review nodes so they can see the plan-derived
verification steps; update the prompt wiring around the reviewer output handling
in opus-plan-kimi-build.yaml to include the original plan’s “Done
means”/acceptance commands alongside $review.output and $fix.output. Keep the
existing NO FIXES NEEDED / ship guard intact, but ensure the fresh-context nodes
can reliably rerun the intended build/lint/test commands and validate against
the original criteria.
In @.archon/workflows/test-workflows/e2e-kimi-smoke.yaml:
- Around line 104-126: The session validation in the Kimi smoke workflow is too
broad because it scans any recent Pi session and can be satisfied by unrelated
concurrent runs. Update the logic around the recent_sessions/matched check to
capture a baseline before the hello step starts, then only consider session
files created or modified after that marker. Use the existing session-log grep
flow to verify provider=kimi-coding and modelId=kimi-for-coding, but restrict it
to files correlated with this run rather than any session touched in the last 10
minutes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 17febcd7-186b-42fe-bad0-413d5d9b1645
📒 Files selected for processing (4)
- .archon/workflows/experimental/agentic-eval-gate.yaml
- .archon/workflows/experimental/archon-fix-github-issue-kimi.yaml
- .archon/workflows/experimental/opus-plan-kimi-build.yaml
- .archon/workflows/test-workflows/e2e-kimi-smoke.yaml
if [ -n "$STATUS_PORC" ]; then
  MODE=worktree; LABEL="(uncommitted working-tree changes vs HEAD)"
elif git rev-parse --verify --quiet "origin/$BASE" >/dev/null && [ -n "$(git log --oneline "origin/$BASE..HEAD" 2>/dev/null)" ]; then
  MODE=base; LABEL="(this branch vs origin/$BASE)"
elif [ -n "$HAS_PARENT" ]; then
  MODE=last; LABEL="(fallback: last commit)"
else
  MODE=root; LABEL="(root commit: full initial tree vs empty tree)"
fi
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Don't fall back to HEAD~1..HEAD when the branch is clean and not ahead of base.
If origin/$BASE exists and HEAD has no commits ahead of it, Lines 68-71 still switch to MODE=last. That makes the gate evaluate the previous commit instead of reporting “no changes to evaluate,” so a clean synced branch can fail on historical work unrelated to the current diff.
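One way to express the requested decision flow is sketched below. It is not the workflow's actual code: `pick_mode` and the extra `none` mode are hypothetical names, and the inputs are passed as arguments so the branch logic can be checked without a repository.

```shell
#!/usr/bin/env bash
# pick_mode STATUS_PORC BASE_EXISTS AHEAD HAS_PARENT
# Hypothetical helper: "none" is an assumed extra mode meaning
# "nothing to evaluate"; the other mode names come from the workflow.
pick_mode() {
  if [ -n "$1" ]; then
    echo worktree
  elif [ "$2" = yes ]; then
    # Base is known: the branch is either ahead of it or has nothing to evaluate.
    if [ "$3" -gt 0 ]; then echo base; else echo none; fi
  elif [ -n "$4" ]; then
    echo last   # base unknown: fall back to the last commit
  else
    echo root
  fi
}

# In the workflow the inputs might be derived roughly like this (assumption):
#   STATUS_PORC="$(git status --porcelain)"
#   git rev-parse --verify --quiet "origin/$BASE" >/dev/null && BASE_EXISTS=yes
#   AHEAD="$(git rev-list --count "origin/$BASE..HEAD")"

CLEAN_SYNCED="$(pick_mode "" yes 0 yes)"
AHEAD_BRANCH="$(pick_mode "" yes 2 yes)"
NO_BASE="$(pick_mode "" no 0 yes)"
echo "$CLEAN_SYNCED $AHEAD_BRANCH $NO_BASE"
```

The key difference from the reviewed chain is that a known base with zero commits ahead no longer falls through to `MODE=last`.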
if [ -n "$STATUS_PORC" ]; then
  MODE=worktree; LABEL="(uncommitted working-tree changes vs HEAD)"
elif git rev-parse --verify --quiet "origin/$BASE" >/dev/null && [ -n "$(git log --oneline "origin/$BASE..HEAD" 2>/dev/null)" ]; then
  MODE=base; LABEL="(this branch vs origin/$BASE)"
elif [ -n "$HAS_PARENT" ]; then
  MODE=last; LABEL="(fallback: last commit)"
else
  MODE=root; LABEL="(root commit: full initial tree vs empty tree)"
fi

run_diff() {
  case "$MODE" in
    worktree) git diff "$@" HEAD ;;
    base) git diff "$@" "origin/$BASE...HEAD" ;;
    last) git diff "$@" HEAD~1..HEAD ;;
    root) git diff "$@" "$EMPTY_TREE" HEAD ;;
  esac
}

echo "=== DIFF ==="
echo "$LABEL"
DIFF_OUT="$(run_diff)"
if [ -n "$DIFF_OUT" ]; then
  echo "$DIFF_OUT"
else
  echo "NO CHANGES DETECTED - working tree clean and nothing to compare against."
  echo "Evaluators: state there is nothing to evaluate; do not invent findings."
fi
echo "=== CHANGED FILES ==="
run_diff --name-only
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Include untracked files in worktree mode.
git status --porcelain will enter MODE=worktree for newly added files, but git diff HEAD and git diff --name-only HEAD omit untracked paths. A diff that only adds new files can therefore show up as “NO CHANGES DETECTED” or miss files the evaluators need to inspect.
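A minimal sketch of a file listing that also covers untracked paths follows. It assumes `git` is on the PATH; `changed_files_worktree` is a hypothetical helper name, not part of the workflow, and the temporary repository exists only to show the difference.

```shell
#!/usr/bin/env bash
set -eu
# Hypothetical helper: tracked changes vs HEAD plus untracked paths.
changed_files_worktree() {
  {
    git diff --name-only HEAD                  # tracked: staged + unstaged vs HEAD
    git ls-files --others --exclude-standard   # untracked, honoring .gitignore
  } | sort -u
}

# Throwaway repository for the demonstration.
REPO="$(mktemp -d)"
cd "$REPO"
git init -q
echo tracked > a.txt
git add a.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m "add a"
echo change >> a.txt   # tracked modification
echo new > b.txt       # untracked addition

PLAIN="$(git diff --name-only HEAD)"
FULL="$(changed_files_worktree | tr '\n' ' ')"
echo "plain: $PLAIN"
echo "full:  $FULL"
```

Here the plain `git diff --name-only HEAD` lists only `a.txt`, while the combined listing also reports `b.txt`. To show the contents of an untracked file as a diff, one option is `git diff --no-index -- /dev/null "$file"`, which exits non-zero when differences exist, so its status must not be treated as a failure.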
- id: fetch-issue
  bash: |
    # Strip quotes, whitespace, markdown backticks from AI output
    ISSUE_NUM=$(echo "$extract-issue-number.output" | tr -d "'\"\`\n " | grep -oE '[0-9]+' | head -1)
    if [ -z "$ISSUE_NUM" ]; then
      echo "Failed to extract issue number from: $extract-issue-number.output" >&2
      exit 1
    fi
    gh issue view "$ISSUE_NUM" --json title,body,labels,comments,state,url,author
  depends_on: [extract-issue-number]
🔒 Security & Privacy | 🔴 Critical | ⚡ Quick win
Avoid interpolating model output directly into bash.
Line 77 sanitizes after shell evaluation. If $extract-issue-number.output is rendered into the script, output like $(...) or backticks can execute before tr/grep, turning a malformed model response into runner command execution.
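The data-only handling the finding asks for can be illustrated with an environment variable. This is a sketch under assumptions: `RAW_OUTPUT` and `MARKER` are hypothetical names, and whether Archon can hand a node's output to a bash step through the environment is not confirmed by this PR.

```shell
#!/usr/bin/env bash
# Stand-in for untrusted model output; the marker file would appear only if
# the embedded command substitution were ever executed.
MARKER="${TMPDIR:-/tmp}/archon-demo-marker.$$"
rm -f "$MARKER"
export RAW_OUTPUT="42 \$(touch $MARKER)"

# The value is expanded as data: the shell does not re-parse the result of a
# parameter expansion as script text, so $(...) inside it stays inert.
ISSUE_NUM="$(printf '%s\n' "$RAW_OUTPUT" | grep -oE '[0-9]+' | head -1)"

# Validate before the value reaches any command such as gh issue view.
case "$ISSUE_NUM" in
  ''|*[!0-9]*) echo "invalid issue number" >&2; exit 1 ;;
esac

if [ -e "$MARKER" ]; then EXECUTED=yes; else EXECUTED=no; fi
echo "issue=$ISSUE_NUM executed=$EXECUTED"
```

The contrast with the reviewed step is that there the text is substituted into the script source before bash parses it, which is what allows `$(...)` to run.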
🛡️ Proposed hardening
   - id: fetch-issue
     bash: |
       # Strip quotes, whitespace, markdown backticks from AI output
-      ISSUE_NUM=$(echo "$extract-issue-number.output" | tr -d "'\"\`\n " | grep -oE '[0-9]+' | head -1)
+      ISSUE_RAW=$(cat <<'__ARCHON_ISSUE_NUMBER_OUTPUT__'
+$extract-issue-number.output
+__ARCHON_ISSUE_NUMBER_OUTPUT__
+      )
+      ISSUE_NUM=$(printf '%s\n' "$ISSUE_RAW" | tr -d "'\"\`\n " | grep -oE '[0-9]+' | head -1)
       if [ -z "$ISSUE_NUM" ]; then
-        echo "Failed to extract issue number from: $extract-issue-number.output" >&2
+        printf 'Failed to extract issue number from: %s\n' "$ISSUE_RAW" >&2
         exit 1
       fi
       gh issue view "$ISSUE_NUM" --json title,body,labels,comments,state,url,author
## Reviewer output (findings + fix plan + verdict)

$review.output

## Instructions

1. If the Fix Plan section is `NO FIXES NEEDED` (or the verdict is `ship` with no
   fix items), make NO changes and say so — stop here.
2. Otherwise apply each fix-plan item precisely. Read each file before editing.
3. Re-run any build/lint/test command and confirm it passes after your edits.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Pass the original plan into the fresh-context fix and re-review nodes.
Both nodes reference plan-derived verification, but context: fresh means they only see $review.output / $fix.output. If the review does not restate the plan’s “Done means” commands, Kimi cannot reliably rerun them, and Opus cannot validate the final state against the original acceptance criteria.
Proposed prompt context fix
## Reviewer output (findings + fix plan + verdict)
$review.output
+
+ ## Original Plan
+
+ $plan.output
## Instructions
@@
## Your earlier review (findings + fix plan)
$review.output
## What the implementer reported doing
$fix.output
+
+ ## Original Plan
+
+ $plan.output
Also applies to: 178-191
recent_sessions=$(find "$HOME/.pi/agent/sessions" -name '*.jsonl' -mmin -10 -print 2>/dev/null)
if [ -z "$recent_sessions" ]; then
  echo "FAIL: no Pi session jsonl modified in the last 10 minutes"
  exit 1
fi

matched=""
while IFS= read -r session; do
  # Two separate greps for order-independence — JSON field ordering
  # isn't part of Pi's contract, so a single regex with `.*` between
  # the two fields would silently false-FAIL if Pi ever reorders.
  if grep -q '"provider":"kimi-coding"' "$session" \
    && grep -q '"modelId":"kimi-for-coding"' "$session"; then
    matched="$session"
    break
  fi
done <<< "$recent_sessions"

if [ -n "$matched" ]; then
  echo "PASS: Pi session log confirms provider=kimi-coding, modelId=kimi-for-coding"
  echo " session: $matched"
else
  echo "FAIL: no recent Pi session log confirmed kimi routing — possible misroute"
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Correlate the session log to this run, not any recent Kimi call.
This check passes on any Kimi session touched in the last 10 minutes, so an unrelated or concurrent kimi-coding/kimi-for-coding run can make the smoke test PASS even if these nodes were misrouted. Capture a baseline before hello starts (timestamp or file list) and only inspect session files created/updated after that marker.
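The baseline idea can be tried outside the workflow. The sketch below uses throwaway directories instead of `$HOME/.pi/agent/sessions`; the file names and JSON lines are made up for illustration, and only the `find -newer` filter plus the two-grep check mirror the review text.

```shell
#!/usr/bin/env bash
set -eu
SESSIONS="$(mktemp -d)"   # stand-in for the Pi sessions directory
BASELINE="$(mktemp)"
LIST="$(mktemp)"

# A Kimi session from an unrelated earlier run, backdated to 2020-01-01.
printf '%s\n' '{"provider":"kimi-coding","modelId":"kimi-for-coding"}' > "$SESSIONS/old.jsonl"
touch -t 202001010000 "$SESSIONS/old.jsonl"

# Baseline marker taken "before the run" (backdated to 2020-01-02 so the
# demo does not depend on sleeping between steps).
touch -t 202001020000 "$BASELINE"

# Session written by "this run"; field order deliberately reversed.
printf '%s\n' '{"modelId":"kimi-for-coding","provider":"kimi-coding"}' > "$SESSIONS/new.jsonl"

# Only files modified after the baseline are candidates.
find "$SESSIONS" -type f -name '*.jsonl' -newer "$BASELINE" -print > "$LIST"

matched=""
while IFS= read -r session; do
  if grep -q '"provider":"kimi-coding"' "$session" \
    && grep -q '"modelId":"kimi-for-coding"' "$session"; then
    matched="$session"
    break
  fi
done < "$LIST"

CANDIDATES="$(wc -l < "$LIST" | tr -d ' ')"
echo "candidates=$CANDIDATES matched=${matched##*/}"
```

The older session still contains both fields, but it is excluded because its modification time precedes the marker. A concurrent Kimi run that writes after the marker would still match, so this narrows the window rather than proving correlation.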
Suggested direction
+  - id: mark-session-baseline
+    bash: |
+      baseline="$(mktemp)"
+      touch "$baseline"
+      echo "{\"baseline\":\"$baseline\"}"
+    output_format:
+      type: object
+      properties:
+        baseline:
+          type: string
+      required: [baseline]
+
   - id: hello
+    depends_on: [mark-session-baseline]
     prompt: 'What is 2+2? Answer with just the number, nothing else.'
     allowed_tools: []
     effort: low
     idle_timeout: 60000
   ...
-  - id: assert
-    depends_on: [hello, identify, json]
+  - id: assert
+    depends_on: [mark-session-baseline, hello, identify, json]
     bash: |
+      baseline="$mark-session-baseline.output.baseline"
       ...
-      recent_sessions=$(find "$HOME/.pi/agent/sessions" -name '*.jsonl' -mmin -10 -print 2>/dev/null)
+      recent_sessions=$(find "$HOME/.pi/agent/sessions" -type f -name '*.jsonl' -newer "$baseline" -print 2>/dev/null)
Summary
- …`kimi-for-coding` via the Pi provider) had no first-class workflow coverage in the repo, while MiniMax shipped a full committed variant set. Cross-model Kimi work lived only as volatile, home-scoped personal workflows.
- …`agentic-eval-gate` verification gate.
- `kimi-coding` is already a registered Pi vendor (`pi-vendor-map.generated.ts`), so this is YAML-only. The pre-existing `"$node.output"` double-quote validator warning on the two Kimi bash nodes is matched to the MiniMax precedent on purpose and left for a separate family-wide cleanup.
UX Journey
Before
After
Architecture Diagram
Before
After
Connection inventory:
- …`kimi-coding`)
- `large` tier → Claude Opus
- `kimi-coding/kimi-for-coding`
Label Snapshot
- risk: low
- size: M
- workflows
- workflows:definitions
Change Metadata
- feature
- workflows
Linked Issue
Validation Evidence (required)
- …warning on the two Kimi bash nodes (`"$node.output"` double-quoting) is identical to the committed MiniMax counterparts (`archon-fix-github-issue-minimax`, `e2e-minimax-smoke`), kept for faithful parity.
- `bun run validate` (type-check/lint/test/check:bundled) skipped: this PR adds only repo `.archon/workflows/` YAML, touches no TypeScript, no bundled defaults, and no DB schema, so those gates are unaffected.
Security Impact (required)
- No
- No (uses the already-registered Pi `kimi-coding` vendor and the operator's existing local Pi/Claude credentials)
- No (workflows document the existing `archon ai key set kimi-coding` flow; no new secret paths)
- No
Compatibility / Migration
- Yes (purely additive: four new workflow files)
- No
- No
Human Verification (required)
- …`archon validate workflows` (0 errors). DAG shapes, `depends_on` edges, `when:` gates, and provider/model routing reviewed against the MiniMax precedent file-by-file.
- `kimi-coding` is a registered Pi vendor in `pi-vendor-map.generated.ts`; confirmed the validator warnings match the MiniMax variants; confirmed the eval gate avoids the `$BASE_BRANCH` eager-resolution trap (resolves base in-shell).
- …operator's `kimi-coding` Pi auth; the smoke workflow exists precisely to do this on demand).
Side Effects / Blast Radius (required)
- …`/workflow list`).
- `archon validate workflows` in CI; `e2e-kimi-smoke` for runtime routing proof.
Rollback Plan (required)
- `git revert e479b57b`, or delete the four YAML files: no state, no migration.
- …`KIMI_API_KEY` in `~/.archon/.env` overriding Pi auth (documented in each file's header).
Risks and Mitigations
- `output_format` nodes rely on best-effort JSON extraction, which can be flakier than Claude/Codex.
- `"$node.output"` double-quote warning carried over from the MiniMax precedent.
Summary by CodeRabbit
New Features
Tests