ci(benchmarks): add manual v4.1 Gemini pilot workflow by Davincc77 · Pull Request #87 · Davincc77/klickdskill

Davincc77 · 2026-05-28T20:54:42Z

Summary

Adds a manual-only GitHub Actions workflow that runs the controlled x.klickd v4.1 benchmark pilot against Gemini.

- Trigger: `workflow_dispatch` only; no `push` / `pull_request` triggers.
- Provider: locked to `gemini`; any other value fails input validation.
- Hard caps: `users <= 10`, `concurrency <= 2`. Both are enforced in the workflow itself, in addition to the runner's own `--users` cap.
- Secret handling: `GEMINI_API_KEY` is set as an env var only on the single execute step. A preflight step fails fast (exit code 2) if the secret is missing or empty. The key is never echoed.
- Steps: validate inputs -> secret preflight -> install minimal deps (`google-genai` only when executing) -> generate fixtures -> dry-run -> pilot plan-only -> pilot execute -> locate run dir -> audit -> job summary -> upload artifacts.
- Artifacts uploaded: fixtures `manifest.json`, dry-run and pilot `planned_run.json`, `raw_outputs.jsonl`, `errors.jsonl`, `metrics_summary.json`, `run_manifest.json`, `audit_report.{md,json}`, and any `*.log` from the run dir. No secrets are included.
- No publish / no tag / no release / no Zenodo / no npm / no PyPI.
- The full-run path is intentionally not wired (the runner itself refuses `full` even with `XKLICKD_BENCHMARK_FULL_APPROVED=1`).
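
The secret preflight can be sketched in Python as below. This is only an illustration of the behavior described above, not the workflow's actual step (the YAML is not shown here); the function name `secret_preflight` is hypothetical. It prints only the variable's name, never its value, and returns exit code 2 when the key is unset or empty, mirroring the summary.

```python
from typing import Mapping

# Exit code the PR summary gives for a missing or empty secret.
EXIT_MISSING_SECRET = 2


def secret_preflight(env: Mapping[str, str], name: str = "GEMINI_API_KEY") -> int:
    """Return 0 if `name` is set to a non-empty value, otherwise 2.

    Hypothetical helper: only the variable's name is printed, never its
    value, so the key cannot leak into the job log.
    """
    if not env.get(name):  # catches both "unset" and "set to the empty string"
        # "::error::" is a GitHub Actions workflow command that annotates the log.
        print(f"::error::{name} is missing or empty")
        return EXIT_MISSING_SECRET
    print(f"{name} is set")
    return 0
```

In a step, this could be invoked as `sys.exit(secret_preflight(os.environ))` so that a missing secret stops the job before any provider call.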

Inputs

| input | default | bounds |
|---|---|---|
| `users` | `10` | 1..10 (hard-capped) |
| `concurrency` | `2` | 1..2 (hard-capped) |
| `seed` | `4242` | integer |
| `sessions_per_user` | `10` | integer (matches approved Test B in `BENCHMARK_PROTOCOL.md`) |
| `provider` | `gemini` | must equal `gemini` |
| `execute` | `true` | `true` / `false` |
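
A minimal sketch of the input validation, using the defaults and bounds from the table. `workflow_dispatch` inputs are handled here as strings. The names `validate_inputs` and `PilotInputs` are hypothetical, and rejecting out-of-range values (rather than silently clamping them to the cap) is an assumption about how "hard-capped" is enforced.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Hard caps from the Inputs table.
MAX_USERS = 10
MAX_CONCURRENCY = 2


@dataclass(frozen=True)
class PilotInputs:
    users: int
    concurrency: int
    seed: int
    sessions_per_user: int
    provider: str
    execute: bool


def _parse_int(name: str, raw: str, lo: Optional[int] = None, hi: Optional[int] = None) -> int:
    """Parse an integer input and reject values outside [lo, hi] (assumed: reject, not clamp)."""
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None
    if lo is not None and value < lo:
        raise ValueError(f"{name}={value} is below the minimum {lo}")
    if hi is not None and value > hi:
        raise ValueError(f"{name}={value} exceeds the hard cap {hi}")
    return value


def validate_inputs(raw: Mapping[str, str]) -> PilotInputs:
    """Hypothetical validator: apply table defaults, then enforce caps and the provider lock."""
    provider = raw.get("provider", "gemini")
    if provider != "gemini":
        raise ValueError(f"provider must be 'gemini', got {provider!r}")
    execute_raw = raw.get("execute", "true")
    if execute_raw not in ("true", "false"):
        raise ValueError(f"execute must be 'true' or 'false', got {execute_raw!r}")
    return PilotInputs(
        users=_parse_int("users", raw.get("users", "10"), 1, MAX_USERS),
        concurrency=_parse_int("concurrency", raw.get("concurrency", "2"), 1, MAX_CONCURRENCY),
        seed=_parse_int("seed", raw.get("seed", "4242")),
        sessions_per_user=_parse_int("sessions_per_user", raw.get("sessions_per_user", "10")),
        provider=provider,
        execute=execute_raw == "true",
    )
```

Failing loudly on `users=11` keeps a mistyped dispatch from quietly running a different experiment than the one requested.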

When `execute=false`, the workflow stops after the pilot plan step (no provider call, no audit step), which is useful for dry-runs from the Actions UI.
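
To make the `execute` gating concrete, here is a small sketch that takes the step order from the summary and truncates it after the plan-only step when `execute=false`. It is a literal reading of "stops after the pilot plan step"; the PR does not say whether the preflight or upload steps are also skipped in plan-only mode, so treat that part as unverified. `planned_steps` is a hypothetical helper; in the actual YAML this kind of gating would typically be expressed with `if:` conditions on the execute-only steps.

```python
from typing import List

# Step order as listed in the PR summary.
STEPS = (
    "validate inputs",
    "secret preflight",
    "install minimal deps",
    "generate fixtures",
    "dry-run",
    "pilot plan-only",
    "pilot execute",
    "locate run dir",
    "audit",
    "job summary",
    "upload artifacts",
)
LAST_PLAN_ONLY_STEP = "pilot plan-only"


def planned_steps(execute: bool) -> List[str]:
    """Return the steps that run for a given `execute` input (hypothetical helper)."""
    if execute:
        return list(STEPS)
    # execute=false: cut the list right after the plan-only step, so neither
    # the provider call ("pilot execute") nor the "audit" step is reached.
    cutoff = STEPS.index(LAST_PLAN_ONLY_STEP) + 1
    return list(STEPS[:cutoff])
```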

Testing

- YAML parses (`python3 -c "import yaml; yaml.safe_load(...)"`).
- Runner CLI flags used in the workflow match `python3 benchmarks/v4.1/runner/runner.py pilot --help`.
- Fixture generator flags match `--seed --users --sessions-per-user --out`.
- Manual dispatch with `execute=false` to verify the plan-only path.
- Manual dispatch with `execute=true` once `GEMINI_API_KEY` is set as a repo secret; verify artifacts and the audit report.
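
The two flag checks above could be automated by scanning `--help` output for the long options the workflow uses. This is a sketch under stated assumptions: `fixture_parser` is a hypothetical `argparse` stand-in built from the four flags named above (the real generator script is not shown here), and `help_flags` / `missing_flags` are illustrative helpers. For the runner, the help text could be captured with `subprocess.run(["python3", "benchmarks/v4.1/runner/runner.py", "pilot", "--help"], capture_output=True, text=True).stdout` and passed to `missing_flags` the same way.

```python
import argparse
import re
from typing import Iterable, Set


def help_flags(help_text: str) -> Set[str]:
    """Extract every long option (e.g. --sessions-per-user) mentioned in --help output."""
    return set(re.findall(r"--[a-z0-9][a-z0-9-]*", help_text))


def missing_flags(help_text: str, used: Iterable[str]) -> Set[str]:
    """Flags the workflow passes that the CLI's help text does not document."""
    return set(used) - help_flags(help_text)


def fixture_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the fixture generator's CLI, using only the
    # four flags listed in the Testing section.
    p = argparse.ArgumentParser(prog="generate_fixtures")
    p.add_argument("--seed", type=int, required=True)
    p.add_argument("--users", type=int, required=True)
    p.add_argument("--sessions-per-user", type=int, required=True)
    p.add_argument("--out", required=True)
    return p
```

An empty result from `missing_flags` means every flag in the workflow is at least advertised by the CLI; it does not check types or value ranges.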

This workflow does not run on its own; the parent will merge it and dispatch it manually.

🤖 Generated with Claude Code

workflow_dispatch-only job that runs the controlled v4.1 benchmark pilot against Gemini. Inputs are validated and hard-capped at 10 users / concurrency 2; provider is locked to gemini; no full-run path is wired. GEMINI_API_KEY is only injected on the execute step and is never echoed; a preflight fails fast when the secret is missing. Steps: validate inputs -> secret preflight -> install minimal deps -> generate fixtures -> dry-run -> pilot plan -> pilot execute -> audit -> upload artifacts (manifest, plan, raw outputs, errors, metrics, audit, logs). No publish / no tag / no release / no Zenodo / no npm / no PyPI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Davincc77 merged commit 0706ff1 into main on May 28, 2026.
3 checks passed

Davincc77 deleted the ci/benchmark-v41-pilot-workflow branch on May 28, 2026 at 20:58.

Davincc77 merged 1 commit into main from ci/benchmark-v41-pilot-workflow.