
feat(workflows): Kimi cross-model harness + eval gate (MiniMax parity) #2049

Open
narutomugens-byte wants to merge 2 commits into coleam00:dev from narutomugens-byte:archon/thread-f8b06149

narutomugens-byte commented Jun 30, 2026

Summary

  • Problem: Kimi (kimi-for-coding via the Pi provider) had no first-class workflow coverage in the repo, while MiniMax shipped a full committed variant set. Cross-model Kimi work lived only as volatile, home-scoped personal workflows.
  • Why it matters: Brings Kimi to parity with the existing MiniMax precedent so the cheapest coding backend is usable from repo workflows (Opus plans/reviews, Kimi writes), plus lands the whitepaper-derived pre-merge eval gate the build harness pairs with.
  • What changed: Adds four experimental workflow definitions — a PIV build harness, a GitHub-issue-fix variant, an e2e connectivity smoke, and the agentic-eval-gate verification gate.
  • What did NOT change (scope boundary): No code, no bundled defaults, no DB schema. kimi-coding is already a registered Pi vendor (pi-vendor-map.generated.ts), so this is YAML-only. The pre-existing "$node.output" double-quote validator warning on the two Kimi bash nodes is matched to the MiniMax precedent on purpose and left for a separate family-wide cleanup.

UX Journey

Before

Operator wants Kimi to do a build/fix in a repo workflow
  → no repo workflow exists; only home-scoped personal YAMLs (volatile, per-machine)
  → MiniMax has the full set; Kimi does not  → no parity
  → no committed pre-merge eval gate to pair with a build loop

After

Operator                Archon                       Provider
────────                ──────                       ────────
run opus-plan-kimi-build ─▶ plan        (Opus/large) ─▶ writes file-level plan (no edits)
                           implement   (Pi/Kimi)     ─▶ *applies the edits*
                           review      (Opus/large)  ─▶ findings + fix-plan (no edits)
                           fix         (Pi/Kimi)     ─▶ applies fix-plan exactly
                           re-review   (Opus/large)  ─▶ ship / needs-work verdict

run agentic-eval-gate   ─▶ get-diff/run-checks (bash) ─▶ output-eval + trajectory-eval
                           ─▶ verdict: PASS only if output=pass AND trajectory=sound
run e2e-kimi-smoke      ─▶ hello/identify/json (Kimi) ─▶ assert routing via Pi session jsonl
run archon-fix-github-issue-kimi ─▶ classify→smoke-validate→…→PR→review→self-fix (all Kimi)

Architecture Diagram

Before

.archon/workflows/
├── maintainer/        repo-triage-minimax, maintainer-standup-minimax
├── experimental/      archon-fix-github-issue-minimax
└── test-workflows/    e2e-minimax-smoke, minimax-{isolate,seq,smoke}
                       (no Kimi equivalents; no committed eval gate)

After

.archon/workflows/
├── experimental/   [+] opus-plan-kimi-build.yaml         (PIV build harness)
│                   [+] archon-fix-github-issue-kimi.yaml (issue-fix variant)
│                   [+] agentic-eval-gate.yaml            (whitepaper pre-merge gate)
└── test-workflows/ [+] e2e-kimi-smoke.yaml               (connectivity smoke)
                        === routes to Pi provider → kimi-coding/kimi-for-coding
                        === eval gate routes to claude (large/medium/small tiers)

Connection inventory:

| From | To | Status | Notes |
|---|---|---|---|
| Kimi workflows | Pi provider (kimi-coding) | new | vendor already registered; no code wiring |
| opus-plan-kimi-build (plan/review) | large tier → Claude Opus | new | reasoning nodes |
| opus-plan-kimi-build (implement/fix) | Pi kimi-coding/kimi-for-coding | new | code-writing nodes |
| agentic-eval-gate | Claude (medium/small tiers) | new | read-only judge; cheap-model routing |

Label Snapshot

  • Risk: risk: low
  • Size: size: M
  • Scope: workflows
  • Module: workflows:definitions

Change Metadata

  • Change type: feature
  • Primary scope: workflows

Linked Issue

  • Closes #
  • Related # (MiniMax parity precedent)

Validation Evidence (required)

bun run cli validate workflows opus-plan-kimi-build --json
# → valid:true, errors:0, warnings:0
bun run cli validate workflows archon-fix-github-issue-kimi --json
# → valid:true, errors:0, warnings:1  (matches MiniMax precedent)
bun run cli validate workflows e2e-kimi-smoke --json
# → valid:true, errors:0, warnings:1  (matches MiniMax precedent)
bun run cli validate workflows agentic-eval-gate --json
# → valid:true, errors:0, warnings:0
  • Evidence provided: workflow validation output above (all four valid, 0 errors). The single warning on the two Kimi bash nodes ("$node.output" double-quoting) is identical to the committed MiniMax counterparts (archon-fix-github-issue-minimax, e2e-minimax-smoke) and is kept for faithful parity.
  • If any command is intentionally skipped: full bun run validate (type-check/lint/test/check:bundled) was skipped. This PR adds only repo .archon/workflows/ YAML and touches no TypeScript, no bundled defaults, and no DB schema, so those gates are unaffected.

Security Impact (required)

  • New permissions/capabilities? No
  • New external network calls? No (uses the already-registered Pi kimi-coding vendor and the operator's existing local Pi/Claude credentials)
  • Secrets/tokens handling changed? No (workflows document the existing archon ai key set kimi-coding flow; no new secret paths)
  • File system access scope changed? No

Compatibility / Migration

  • Backward compatible? Yes (purely additive — four new workflow files)
  • Config/env changes? No
  • Database migration needed? No

Human Verification (required)

  • Verified scenarios: all four workflows pass archon validate workflows (0 errors). DAG shapes, depends_on edges, when: gates, and provider/model routing reviewed against the MiniMax precedent file-by-file.
  • Edge cases checked: confirmed kimi-coding is a registered Pi vendor in pi-vendor-map.generated.ts; confirmed the validator warnings match the MiniMax variants; confirmed the eval gate avoids the $BASE_BRANCH eager-resolution trap (resolves base in-shell).
  • What was not verified: live end-to-end runs against a real Kimi credential (requires the operator's kimi-coding Pi auth; the smoke workflow exists precisely to do this on demand).

Side Effects / Blast Radius (required)

  • Affected subsystems/workflows: workflow discovery only (four new entries in /workflow list).
  • Potential unintended effects: none for existing workflows (additive, distinct names).
  • Guardrails/monitoring: archon validate workflows in CI; e2e-kimi-smoke for runtime routing proof.

Rollback Plan (required)

  • Fast rollback: git revert e479b57b, or delete the four YAML files — no state, no migration.
  • Feature flags/toggles: none needed (workflows are opt-in by invocation).
  • Observable failure symptoms: a Kimi workflow 401s → stale KIMI_API_KEY in ~/.archon/.env overriding Pi auth (documented in each file's header).
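
For the 401 symptom above, a quick local check is to see whether the env file defines the key at all. This is a sketch only: the default path is the one named in this PR, `has_kimi_key_override` is a hypothetical helper name, and the check cannot tell whether the key is actually stale.

```shell
#!/usr/bin/env bash
# Succeed if the env file defines KIMI_API_KEY, with or without `export`.
# Commented-out lines (starting with #) do not match.
has_kimi_key_override() {
  local env_file="${1:-$HOME/.archon/.env}"
  [ -f "$env_file" ] &&
    grep -Eq '^[[:space:]]*(export[[:space:]]+)?KIMI_API_KEY=' "$env_file"
}

# Usage:
# if has_kimi_key_override; then
#   echo "KIMI_API_KEY is set in ~/.archon/.env and may override Pi auth"
# fi
```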

Risks and Mitigations

  • Risk: Pi has no native structured-output mode → output_format nodes rely on best-effort JSON extraction, which can be flakier than Claude/Codex.
    • Mitigation: documented in the workflow headers; schema is still validated with up-to-3 re-asks; smoke test exercises the JSON path explicitly.
  • Risk: pre-existing "$node.output" double-quote warning carried over from the MiniMax precedent.
    • Mitigation: intentionally matched for parity; flagged here for a separate family-wide cleanup so MiniMax and Kimi are fixed together.

Summary by CodeRabbit

  • New Features

    • Added several new experimental workflows for agentic change verification, issue-fixing, and build planning/review.
    • Introduced safer validation steps that check changes against specs, claims, and available local checks before approving results.
  • Tests

    • Added an end-to-end smoke test for the Kimi provider, including connectivity, structured output, and session verification checks.

…iniMax parity)

Bring the home-scoped Kimi build harness into .archon/workflows/, following
the committed MiniMax precedent. Code already supports Kimi via the Pi
`kimi-coding` vendor — this is workflow-YAML only.

- experimental/opus-plan-kimi-build.yaml: cross-model PIV build loop (Opus
  plans/reviews via the `large` tier, Kimi implements/fixes via
  pi + kimi-coding/kimi-for-coding). Default isolation off `dev`.
- experimental/archon-fix-github-issue-kimi.yaml: faithful mirror of
  archon-fix-github-issue-minimax.yaml, all nodes on Kimi.
- test-workflows/e2e-kimi-smoke.yaml: connectivity/capability smoke mirroring
  e2e-minimax-smoke.yaml, asserts Pi session-log routing to kimi-coding.
- experimental/agentic-eval-gate.yaml: standalone output+trajectory pre-merge
  gate ("set the bar at the eval, not the demo"); pairs with the build harness.

All four pass `archon validate workflows`. The two non-blocking shell-quoting
warnings are inherited verbatim from the MiniMax sources (parity, not new).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Jun 30, 2026

Walkthrough

Four new Archon workflow YAML files are added: agentic-eval-gate (a PASS/FAIL verification gate using diff analysis and two model evaluators), archon-fix-github-issue-kimi (a Kimi/Pi-provider fix pipeline with smoke-validate and conditional review agents), opus-plan-kimi-build (a five-node cross-model Opus↔Kimi PIV loop), and e2e-kimi-smoke (a smoke test validating Pi/Kimi provider routing via session JSONL files).

Changes

Agentic Evaluation Gate

| Layer | File(s) | Summary |
|---|---|---|
| Workflow metadata and get-diff node | .archon/workflows/experimental/agentic-eval-gate.yaml | Defines workflow-level metadata with read-only config and the get-diff node that computes a resilient diff across worktree/base/last/root modes. |
| run-checks, output-eval, and trajectory-eval nodes | .archon/workflows/experimental/agentic-eval-gate.yaml | run-checks conditionally runs type-check/lint (not tests) via bun or npm; output-eval judges diff against spec; trajectory-eval judges whether verification steps were actually performed. |
| verdict node | .archon/workflows/experimental/agentic-eval-gate.yaml | Synthesizes both evaluations into a single structured PASS/FAIL decision; PASS requires output verdict "pass" and soundness "sound". |
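
The PASS rule in the verdict row reduces to a two-input conjunction. The sketch below only illustrates that rule; in the workflow the decision is produced by a model node, and `gate_verdict` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Combine the two evaluator results into one gate decision.
# PASS requires output verdict "pass" AND trajectory soundness "sound";
# anything else, including missing values, is FAIL.
gate_verdict() {
  local output_verdict="$1" soundness="$2"
  if [ "$output_verdict" = "pass" ] && [ "$soundness" = "sound" ]; then
    echo "PASS"
  else
    echo "FAIL"
  fi
}

# Usage:
# gate_verdict pass sound
# gate_verdict pass unsound
```

Treating every unrecognized value as FAIL keeps the gate closed when an evaluator returns malformed output.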

Kimi fix-github-issue workflow

| Layer | File(s) | Summary |
|---|---|---|
| Workflow metadata, provider config, and issue fetch/classify | .archon/workflows/experimental/archon-fix-github-issue-kimi.yaml | Sets pi provider and kimi-coding/kimi-for-coding model; defines extract-issue-number, fetch-issue, and classify nodes for structured issue ingestion. |
| smoke-validate and conditional routing | .archon/workflows/experimental/archon-fix-github-issue-kimi.yaml | smoke-validate checks cited file/line/symbol claims against the codebase; web-research is gated on external research need or inaccurate claims; routing goes to investigate (bugs) or plan (non-bugs). |
| bridge-artifacts, implement, validate, create-pr | .archon/workflows/experimental/archon-fix-github-issue-kimi.yaml | Ensures investigation.md exists by copying plan.md if missing; runs implement and validate; creates a draft PR with template discovery and strict artifact-commit rules. |
| Review orchestration through report | .archon/workflows/experimental/archon-fix-github-issue-kimi.yaml | review-scope/review-classify gate conditional agents (error-handling, test-coverage, comment-quality, docs-impact); code-review always runs; synthesize, self-fix, simplify, and report follow sequentially. |

Opus-plan / Kimi-build cross-model PIV loop

| Layer | File(s) | Summary |
|---|---|---|
| Workflow metadata and plan node | .archon/workflows/experimental/opus-plan-kimi-build.yaml | Defines PIV metadata with mutates_checkout: true and the Opus plan node that produces a concrete file-level plan without editing files. |
| implement, review, fix, re-review nodes | .archon/workflows/experimental/opus-plan-kimi-build.yaml | Kimi implement applies the plan; Opus review produces severity-ranked findings and a fix-plan; Kimi fix applies minimal-diff fixes; Opus re-review outputs a final ship/needs-work verdict. |

e2e Kimi smoke test

| Layer | File(s) | Summary |
|---|---|---|
| Smoke workflow: metadata, test nodes, and assert | .archon/workflows/test-workflows/e2e-kimi-smoke.yaml | Defines pi provider/kimi model, hello/identify/json nodes, and an assert node that validates output content and confirms Pi session routing via .jsonl files modified in the last 10 minutes. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • coleam00/Archon#1438: The opus-plan-kimi-build.yaml workflow sets mutates_checkout: true, directly relying on the mutates_checkout field semantics introduced by this PR in the executor/loader.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title is concise and accurately summarizes the new Kimi workflow harness and eval gate. |
| Description check | ✅ Passed | The description covers the required template sections and is mostly complete, including diagrams, validation, and risks. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

…e-check/lint

have() already wraps $1 in quotes when grepping package.json, so calling
have '"type-check"' / have '"lint"' searched for ""type-check"" (doubled
quotes) and never matched — type-check and lint always reported "skipped",
blinding the trajectory eval to the very checks it must weigh.
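
The failure mode described in this commit message can be reproduced in isolation. The `have()` below is a hypothetical reconstruction (the real helper is not shown on this page); it assumes only what the message states, namely that the helper adds the JSON quotes around `$1` itself.

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction: have() wraps its argument in double quotes
# before grepping the package file.
have() {
  grep -q "\"$1\"" "${PKG:-package.json}"
}

demo_dir="$(mktemp -d)"
PKG="$demo_dir/package.json"
printf '%s\n' '{ "scripts": { "type-check": "tsc --noEmit", "lint": "eslint ." } }' > "$PKG"

# Buggy call: the argument is already quoted, so grep looks for ""type-check"".
if have '"type-check"'; then buggy="found"; else buggy="skipped"; fi

# Fixed call: pass the bare script name and let have() add the quotes.
if have 'type-check'; then fixed="found"; else fixed="skipped"; fi

echo "buggy=$buggy fixed=$fixed"   # prints: buggy=skipped fixed=found
rm -rf "$demo_dir"
```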

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

narutomugens-byte marked this pull request as ready for review June 30, 2026 15:34

coderabbitai Bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.archon/workflows/experimental/agentic-eval-gate.yaml:
- Around line 66-74: The fallback in the eval gate logic is too aggressive: when
origin/$BASE exists but HEAD has no commits ahead of it, the current MODE=last
path in the shell block still evaluates the previous commit instead of treating
the branch as having no changes. Update the decision flow in the same
conditional chain so the branch only selects MODE=base when there are commits
ahead of origin/$BASE, and otherwise reports no changes to evaluate instead of
falling back to the last commit; keep the fix localized around the STATUS_PORC,
git rev-parse/git log check, and the MODE/LABEL assignments.
- Around line 66-95: The worktree path in the diff-gating logic drops newly
untracked files because `run_diff()` uses `git diff ... HEAD`, so additions can
be reported as no changes. Update the `MODE=worktree` branch and the `run_diff`
helper in `agentic-eval-gate.yaml` so worktree mode includes untracked paths
alongside tracked changes, and make `--name-only` report those files too. Keep
the existing `MODE`, `LABEL`, and `run_diff` structure, but ensure the worktree
comparison captures all files visible from `git status --porcelain`.

In @.archon/workflows/experimental/archon-fix-github-issue-kimi.yaml:
- Around line 74-83: The `fetch-issue` step is interpolating model output into
the bash script before sanitization, which can allow command execution; update
the workflow so `extract-issue-number.output` is treated as data only and never
rendered directly into shell evaluation. In `fetch-issue`, read the output
through a safe quoting/escaping path or environment variable, then perform the
numeric extraction and validation before calling `gh issue view`, keeping the
hardening inside this step and preserving the existing `ISSUE_NUM` check.

In @.archon/workflows/experimental/opus-plan-kimi-build.yaml:
- Around line 154-163: Pass the original plan context into both the
fresh-context fix and re-review nodes so they can see the plan-derived
verification steps; update the prompt wiring around the reviewer output handling
in opus-plan-kimi-build.yaml to include the original plan’s “Done
means”/acceptance commands alongside $review.output and $fix.output. Keep the
existing NO FIXES NEEDED / ship guard intact, but ensure the fresh-context nodes
can reliably rerun the intended build/lint/test commands and validate against
the original criteria.

In @.archon/workflows/test-workflows/e2e-kimi-smoke.yaml:
- Around line 104-126: The session validation in the Kimi smoke workflow is too
broad because it scans any recent Pi session and can be satisfied by unrelated
concurrent runs. Update the logic around the recent_sessions/matched check to
capture a baseline before the hello step starts, then only consider session
files created or modified after that marker. Use the existing session-log grep
flow to verify provider=kimi-coding and modelId=kimi-for-coding, but restrict it
to files correlated with this run rather than any session touched in the last 10
minutes.
📥 Commits

Reviewing files that changed from the base of the PR and between 59bbd00 and 043185a.

📒 Files selected for processing (4)
  • .archon/workflows/experimental/agentic-eval-gate.yaml
  • .archon/workflows/experimental/archon-fix-github-issue-kimi.yaml
  • .archon/workflows/experimental/opus-plan-kimi-build.yaml
  • .archon/workflows/test-workflows/e2e-kimi-smoke.yaml

Comment on lines +66 to +74
if [ -n "$STATUS_PORC" ]; then
  MODE=worktree; LABEL="(uncommitted working-tree changes vs HEAD)"
elif git rev-parse --verify --quiet "origin/$BASE" >/dev/null && [ -n "$(git log --oneline "origin/$BASE..HEAD" 2>/dev/null)" ]; then
  MODE=base; LABEL="(this branch vs origin/$BASE)"
elif [ -n "$HAS_PARENT" ]; then
  MODE=last; LABEL="(fallback: last commit)"
else
  MODE=root; LABEL="(root commit: full initial tree vs empty tree)"
fi

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don't fall back to HEAD~1..HEAD when the branch is clean and not ahead of base.

If origin/$BASE exists and HEAD has no commits ahead of it, Lines 68-71 still switch to MODE=last. That makes the gate evaluate the previous commit instead of reporting “no changes to evaluate,” so a clean synced branch can fail on historical work unrelated to the current diff.
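
One possible shape for the corrected decision chain, written as a pure function so the branches can be read without a repository. This is a sketch under assumptions: `pick_mode` and the `none` mode are hypothetical names, and the real workflow would still need to map `none` to its "nothing to evaluate" message.

```shell
#!/usr/bin/env bash
# Pick the diff mode from four flags (non-empty means true):
#   $1 dirty        `git status --porcelain` printed something
#   $2 base_exists  origin/$BASE resolves
#   $3 ahead        origin/$BASE..HEAD contains commits
#   $4 has_parent   HEAD has a parent commit
pick_mode() {
  local dirty="$1" base_exists="$2" ahead="$3" has_parent="$4"
  if [ -n "$dirty" ]; then
    echo worktree
  elif [ -n "$base_exists" ] && [ -n "$ahead" ]; then
    echo base
  elif [ -n "$base_exists" ]; then
    echo none   # clean and in sync with base: nothing to evaluate
  elif [ -n "$has_parent" ]; then
    echo last   # fallback only when no base ref is available
  else
    echo root
  fi
}

# Usage: a clean branch that is level with origin/$BASE
# pick_mode "" yes "" yes
```

The only behavioral change from the quoted block is the third branch, which stops a clean, synced branch from falling through to the last-commit fallback.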

Comment on lines +66 to +95
if [ -n "$STATUS_PORC" ]; then
  MODE=worktree; LABEL="(uncommitted working-tree changes vs HEAD)"
elif git rev-parse --verify --quiet "origin/$BASE" >/dev/null && [ -n "$(git log --oneline "origin/$BASE..HEAD" 2>/dev/null)" ]; then
  MODE=base; LABEL="(this branch vs origin/$BASE)"
elif [ -n "$HAS_PARENT" ]; then
  MODE=last; LABEL="(fallback: last commit)"
else
  MODE=root; LABEL="(root commit: full initial tree vs empty tree)"
fi

run_diff() {
  case "$MODE" in
    worktree) git diff "$@" HEAD ;;
    base)     git diff "$@" "origin/$BASE...HEAD" ;;
    last)     git diff "$@" HEAD~1..HEAD ;;
    root)     git diff "$@" "$EMPTY_TREE" HEAD ;;
  esac
}

echo "=== DIFF ==="
echo "$LABEL"
DIFF_OUT="$(run_diff)"
if [ -n "$DIFF_OUT" ]; then
  echo "$DIFF_OUT"
else
  echo "NO CHANGES DETECTED - working tree clean and nothing to compare against."
  echo "Evaluators: state there is nothing to evaluate; do not invent findings."
fi
echo "=== CHANGED FILES ==="
run_diff --name-only

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Include untracked files in worktree mode.

git status --porcelain will enter MODE=worktree for newly added files, but git diff HEAD and git diff --name-only HEAD omit untracked paths. A diff that only adds new files can therefore show up as “NO CHANGES DETECTED” or miss files the evaluators need to inspect.
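
A read-only way to cover untracked paths is to combine `git diff` with `git ls-files --others --exclude-standard`. The sketch below is one option, not the committed fix; the function names are hypothetical, and it assumes the script runs from the repository root so both commands report paths relative to the same directory.

```shell
#!/usr/bin/env bash
# List changed paths in worktree mode: tracked changes vs HEAD plus
# untracked files that are not ignored.
worktree_changed_files() {
  {
    git diff --name-only HEAD
    git ls-files --others --exclude-standard
  } | sort -u
}

# Print a full diff, rendering each untracked file as an addition.
# `git diff --no-index` exits 1 when its inputs differ, hence `|| true`.
worktree_diff() {
  git diff HEAD
  git ls-files --others --exclude-standard -z |
    while IFS= read -r -d '' path; do
      git diff --no-index -- /dev/null "$path" || true
    done
}

# Usage (from the repository root):
# worktree_changed_files
# worktree_diff
```

Unlike `git add -N`, this approach leaves the index untouched, which fits a gate that is meant to be read-only.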

Comment on lines +74 to +83
- id: fetch-issue
  bash: |
    # Strip quotes, whitespace, markdown backticks from AI output
    ISSUE_NUM=$(echo "$extract-issue-number.output" | tr -d "'\"\`\n " | grep -oE '[0-9]+' | head -1)
    if [ -z "$ISSUE_NUM" ]; then
      echo "Failed to extract issue number from: $extract-issue-number.output" >&2
      exit 1
    fi
    gh issue view "$ISSUE_NUM" --json title,body,labels,comments,state,url,author
  depends_on: [extract-issue-number]

🔒 Security & Privacy | 🔴 Critical | ⚡ Quick win

Avoid interpolating model output directly into bash.

Line 77 sanitizes after shell evaluation. If $extract-issue-number.output is rendered into the script, output like $(...) or backticks can execute before tr/grep, turning a malformed model response into runner command execution.

🛡️ Proposed hardening
   - id: fetch-issue
     bash: |
       # Strip quotes, whitespace, markdown backticks from AI output
-      ISSUE_NUM=$(echo "$extract-issue-number.output" | tr -d "'\"\`\n " | grep -oE '[0-9]+' | head -1)
+      ISSUE_RAW=$(cat <<'__ARCHON_ISSUE_NUMBER_OUTPUT__'
+$extract-issue-number.output
+__ARCHON_ISSUE_NUMBER_OUTPUT__
+      )
+      ISSUE_NUM=$(printf '%s\n' "$ISSUE_RAW" | tr -d "'\"\`\n " | grep -oE '[0-9]+' | head -1)
       if [ -z "$ISSUE_NUM" ]; then
-        echo "Failed to extract issue number from: $extract-issue-number.output" >&2
+        printf 'Failed to extract issue number from: %s\n' "$ISSUE_RAW" >&2
         exit 1
       fi
       gh issue view "$ISSUE_NUM" --json title,body,labels,comments,state,url,author
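
The difference between the two forms can be seen with a small stand-in for the substitution step. Everything here is illustrative: `render_and_run` and the `NODE_OUTPUT` token are hypothetical, and Archon's real substitution mechanism is not shown on this page.

```shell
#!/usr/bin/env bash
# Stand-in for the engine: replace the literal token NODE_OUTPUT in a script
# template with untrusted text, then run the resulting script in bash.
render_and_run() {
  local template="$1" payload="$2" script
  script=${template//NODE_OUTPUT/$payload}
  printf '%s\n' "$script" | bash
}

payload='$(echo INJECTED) 42'

# Unsafe: the payload lands inside double quotes, so $(...) is executed.
unsafe_template='echo "NODE_OUTPUT"'

# Safer: the payload lands inside a quoted heredoc, so it stays literal.
safe_template=$(cat <<'TEMPLATE'
cat <<'__EOF__'
NODE_OUTPUT
__EOF__
TEMPLATE
)

unsafe_out="$(render_and_run "$unsafe_template" "$payload")"
safe_out="$(render_and_run "$safe_template" "$payload")"

echo "unsafe: $unsafe_out"   # prints: unsafe: INJECTED 42
echo "safe:   $safe_out"     # prints: safe:   $(echo INJECTED) 42
```

A quoted heredoc is not a complete defense: output containing a line equal to the delimiter would end the heredoc early. Passing the value through an environment variable avoids that, if the engine supports it, which this page does not confirm.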

Comment on lines +154 to +163
## Reviewer output (findings + fix plan + verdict)

$review.output

## Instructions

1. If the Fix Plan section is `NO FIXES NEEDED` (or the verdict is `ship` with no
fix items), make NO changes and say so — stop here.
2. Otherwise apply each fix-plan item precisely. Read each file before editing.
3. Re-run any build/lint/test command and confirm it passes after your edits.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Pass the original plan into the fresh-context fix and re-review nodes.

Both nodes reference plan-derived verification, but context: fresh means they only see $review.output / $fix.output. If the review does not restate the plan’s “Done means” commands, Kimi cannot reliably rerun them, and Opus cannot validate the final state against the original acceptance criteria.

Proposed prompt context fix
       ## Reviewer output (findings + fix plan + verdict)
 
       $review.output
+
+      ## Original Plan
+
+      $plan.output
 
       ## Instructions
@@
       ## Your earlier review (findings + fix plan)
 
       $review.output
 
       ## What the implementer reported doing
 
       $fix.output
+
+      ## Original Plan
+
+      $plan.output

Also applies to: 178-191

Comment on lines +104 to +126
recent_sessions=$(find "$HOME/.pi/agent/sessions" -name '*.jsonl' -mmin -10 -print 2>/dev/null)
if [ -z "$recent_sessions" ]; then
  echo "FAIL: no Pi session jsonl modified in the last 10 minutes"
  exit 1
fi

matched=""
while IFS= read -r session; do
  # Two separate greps for order-independence — JSON field ordering
  # isn't part of Pi's contract, so a single regex with `.*` between
  # the two fields would silently false-FAIL if Pi ever reorders.
  if grep -q '"provider":"kimi-coding"' "$session" \
    && grep -q '"modelId":"kimi-for-coding"' "$session"; then
    matched="$session"
    break
  fi
done <<< "$recent_sessions"

if [ -n "$matched" ]; then
  echo "PASS: Pi session log confirms provider=kimi-coding, modelId=kimi-for-coding"
  echo " session: $matched"
else
  echo "FAIL: no recent Pi session log confirmed kimi routing — possible misroute"

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Correlate the session log to this run, not any recent Kimi call.

This check passes on any Kimi session touched in the last 10 minutes, so an unrelated or concurrent kimi-coding/kimi-for-coding run can make the smoke test PASS even if these nodes were misrouted. Capture a baseline before hello starts (timestamp or file list) and only inspect session files created/updated after that marker.

Suggested direction
+  - id: mark-session-baseline
+    bash: |
+      baseline="$(mktemp)"
+      touch "$baseline"
+      echo "{\"baseline\":\"$baseline\"}"
+    output_format:
+      type: object
+      properties:
+        baseline:
+          type: string
+      required: [baseline]
+
   - id: hello
+    depends_on: [mark-session-baseline]
     prompt: 'What is 2+2? Answer with just the number, nothing else.'
     allowed_tools: []
     effort: low
     idle_timeout: 60000
...
-  - id: assert
-    depends_on: [hello, identify, json]
+  - id: assert
+    depends_on: [mark-session-baseline, hello, identify, json]
     bash: |
+      baseline="$mark-session-baseline.output.baseline"
...
-      recent_sessions=$(find "$HOME/.pi/agent/sessions" -name '*.jsonl' -mmin -10 -print 2>/dev/null)
+      recent_sessions=$(find "$HOME/.pi/agent/sessions" -type f -name '*.jsonl' -newer "$baseline" -print 2>/dev/null)
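
The baseline idea can be exercised outside the workflow. The sketch below is a standalone illustration with a hypothetical function name; it takes the sessions directory and the marker file as arguments instead of reading node outputs.

```shell
#!/usr/bin/env bash
# Print the first session log under $1 that is newer than the marker file $2
# and mentions both the expected provider and model. Returns 1 if none match.
find_kimi_session() {
  local sessions_dir="$1" baseline="$2" session
  while IFS= read -r session; do
    # Separate greps keep the check independent of JSON field order.
    if grep -q '"provider":"kimi-coding"' "$session" &&
       grep -q '"modelId":"kimi-for-coding"' "$session"; then
      printf '%s\n' "$session"
      return 0
    fi
  done < <(find "$sessions_dir" -type f -name '*.jsonl' -newer "$baseline" -print 2>/dev/null)
  return 1
}

# Usage:
# baseline="$(mktemp)"    # create the marker before the first Kimi node runs
# ... run the nodes ...
# find_kimi_session "$HOME/.pi/agent/sessions" "$baseline" || echo "FAIL"
```

A marker file narrows the window but does not remove it: a concurrent Kimi run that writes its session after the marker is created would still match. Matching on a run-specific identifier inside the session log would be stricter, if Pi exposes one.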