ci(benchmarks): add manual v4.1 Gemini pilot workflow #87

Merged: Davincc77 merged 1 commit into main from ci/benchmark-v41-pilot-workflow on May 28, 2026

@Davincc77 (Owner)

Summary

Adds a manual-only GitHub Actions workflow that runs the controlled
x.klickd v4.1 benchmark pilot against Gemini.

  • Trigger: workflow_dispatch only — no push / pull_request.
  • Provider: locked to gemini; any other value fails input validation.
  • Hard caps: users <= 10, concurrency <= 2. Both enforced in the
    workflow itself (in addition to the runner's own --users cap).
  • Secret handling: GEMINI_API_KEY is set as an environment variable only
    on the single execute step. A preflight step fails fast (exit 2) if the
    secret is missing or empty. The key is never echoed.
  • Steps: validate inputs -> secret preflight -> install minimal deps
    (google-genai only when executing) -> generate fixtures -> dry-run ->
    pilot plan-only -> pilot execute -> locate run dir -> audit -> job
    summary -> upload artifacts.
  • Artifacts uploaded: fixtures manifest.json, dry-run + pilot
    planned_run.json, raw_outputs.jsonl, errors.jsonl,
    metrics_summary.json, run_manifest.json, audit_report.{md,json},
    any *.log from the run dir. No secrets included.
  • No publish / no tag / no release / no Zenodo / no npm / no PyPI.
    The full-run path is intentionally not wired (the runner itself
    refuses full even with XKLICKD_BENCHMARK_FULL_APPROVED=1).
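
The secret preflight described in the list above could look roughly like the following. This is a minimal sketch under stated assumptions, not the workflow's actual step: the function name `preflight_secret` and passing the environment in as a mapping are hypothetical choices made for illustration. Only the `GEMINI_API_KEY` name, the fail-fast behaviour, and exit code 2 come from the summary.

```python
import sys


def preflight_secret(env, name="GEMINI_API_KEY"):
    """Return 0 if the secret is present and non-empty, otherwise 2.

    Only the variable *name* is ever printed, never its value.
    `env` is any mapping, e.g. os.environ (hypothetical interface).
    """
    value = env.get(name, "")
    if not value.strip():
        # "::error::" is the GitHub Actions workflow command for an error annotation.
        print(f"::error::{name} is missing or empty", file=sys.stderr)
        return 2
    print(f"{name} is set (value not shown)")
    return 0
```

In a workflow step this would typically be wired up as `sys.exit(preflight_secret(os.environ))`, so that a missing secret stops the job before any provider call is attempted.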

Inputs

input              default  bounds
users              10       1..10 (hard-capped)
concurrency        2        1..2 (hard-capped)
seed               4242     integer
sessions_per_user  10       integer (matches approved Test B in BENCHMARK_PROTOCOL.md)
provider           gemini   must equal gemini
execute            true     true / false
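
The bounds in the table above can be expressed as a small validation routine. The sketch below is illustrative only: the helper names (`validate_inputs`, `_parse_int`) and the choice to collect all errors before failing are assumptions, and the real workflow may well perform these checks in shell instead. The defaults and limits are taken from the table; inputs are treated as strings on the assumption that they arrive that way from `workflow_dispatch`.

```python
MAX_USERS = 10          # hard cap from the inputs table
MAX_CONCURRENCY = 2     # hard cap from the inputs table
ALLOWED_PROVIDER = "gemini"


def _parse_int(name, raw, errors):
    """Parse `raw` as an int; record an error and return None on failure."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        errors.append(f"{name} must be an integer, got {raw!r}")
        return None


def validate_inputs(inputs):
    """Return a list of error strings; an empty list means the inputs pass."""
    errors = []

    users = _parse_int("users", inputs.get("users", "10"), errors)
    if users is not None and not 1 <= users <= MAX_USERS:
        errors.append(f"users must be in 1..{MAX_USERS}, got {users}")

    concurrency = _parse_int("concurrency", inputs.get("concurrency", "2"), errors)
    if concurrency is not None and not 1 <= concurrency <= MAX_CONCURRENCY:
        errors.append(f"concurrency must be in 1..{MAX_CONCURRENCY}, got {concurrency}")

    # seed and sessions_per_user only need to be integers.
    _parse_int("seed", inputs.get("seed", "4242"), errors)
    _parse_int("sessions_per_user", inputs.get("sessions_per_user", "10"), errors)

    provider = inputs.get("provider", "gemini")
    if provider != ALLOWED_PROVIDER:
        errors.append(f"provider must equal {ALLOWED_PROVIDER!r}, got {provider!r}")

    if inputs.get("execute", "true") not in ("true", "false"):
        errors.append("execute must be 'true' or 'false'")

    return errors
```

For example, `validate_inputs({"users": "11"})` yields a single error about the users cap, while an empty mapping (all defaults) yields no errors.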

When execute=false, the workflow stops after the pilot plan step (no
provider call, no audit step) — useful for dry-runs from the Actions UI.
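
The effect of the execute flag on the step sequence can be sketched as follows. The step names are copied from the summary, but the truncation logic is an assumption: the description only says the workflow stops after the pilot plan step, so whether the secret preflight is skipped, or a job summary is still written, when execute=false is not something this sketch can confirm. `steps_to_run` is a hypothetical helper name.

```python
# Step order as listed in the PR summary.
STEPS = [
    "validate inputs",
    "secret preflight",
    "install minimal deps",
    "generate fixtures",
    "dry-run",
    "pilot plan-only",
    "pilot execute",
    "locate run dir",
    "audit",
    "job summary",
    "upload artifacts",
]


def steps_to_run(execute):
    """Return the steps that would run for a given execute flag.

    Assumption: with execute=False everything after "pilot plan-only"
    is skipped, so no provider call and no audit step take place.
    """
    if execute:
        return list(STEPS)
    cutoff = STEPS.index("pilot plan-only")
    return STEPS[: cutoff + 1]
```

With `execute=False` the last step returned is "pilot plan-only", which matches the plan-only path exercised by a manual dispatch from the Actions UI.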

Testing

  • YAML parses (python3 -c "import yaml; yaml.safe_load(...)").
  • Runner CLI flags used in the workflow match
    python3 benchmarks/v4.1/runner/runner.py pilot --help.
  • Fixture generator flags match
    --seed --users --sessions-per-user --out.
  • Manual dispatch with execute=false to verify plan-only path.
  • Manual dispatch with execute=true once GEMINI_API_KEY is set
    as a repo secret; verify artifacts and audit report.

This workflow does not run on its own — parent will merge and dispatch.

🤖 Generated with Claude Code

workflow_dispatch-only job that runs the controlled v4.1 benchmark pilot
against Gemini. Inputs are validated and hard-capped at 10 users /
concurrency 2; provider is locked to gemini; no full-run path is wired.
GEMINI_API_KEY is only injected on the execute step and is never echoed;
a preflight fails fast when the secret is missing.

Steps: validate inputs -> secret preflight -> install minimal deps ->
generate fixtures -> dry-run -> pilot plan -> pilot execute -> audit ->
upload artifacts (manifest, plan, raw outputs, errors, metrics, audit,
logs). No publish / no tag / no release / no Zenodo / no npm / no PyPI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Davincc77 merged commit 0706ff1 into main on May 28, 2026
3 checks passed
@Davincc77 deleted the ci/benchmark-v41-pilot-workflow branch on May 28, 2026 20:58