ci(benchmarks): add manual v4.1 Gemini pilot workflow #87
Merged
workflow_dispatch-only job that runs the controlled v4.1 benchmark pilot against Gemini. Inputs are validated and hard-capped at 10 users / concurrency 2; provider is locked to gemini; no full-run path is wired. GEMINI_API_KEY is only injected on the execute step and is never echoed; a preflight fails fast when the secret is missing. Steps: validate inputs -> secret preflight -> install minimal deps -> generate fixtures -> dry-run -> pilot plan -> pilot execute -> audit -> upload artifacts (manifest, plan, raw outputs, errors, metrics, audit, logs). No publish / no tag / no release / no Zenodo / no npm / no PyPI. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Adds a manual-only GitHub Actions workflow that runs the controlled
x.klickd v4.1 benchmark pilot against Gemini.
- `workflow_dispatch` only; no `push` / `pull_request`.
- Provider is locked to `gemini`; any other value fails input validation.
- Inputs are hard-capped at 10 users / concurrency 2 in the workflow itself (in addition to the runner's own `--users` cap).
- `GEMINI_API_KEY` is only set as an `env` on the single execute step. A preflight step fails fast (exit 2) if the secret is missing or empty. The key is never echoed.
- Steps: validate inputs -> secret preflight -> install minimal deps (`google-genai` only when executing) -> generate fixtures -> dry-run -> pilot plan-only -> pilot execute -> locate run dir -> audit -> job summary -> upload artifacts.
- Artifacts: `manifest.json`, dry-run + pilot `planned_run.json`, `raw_outputs.jsonl`, `errors.jsonl`, `metrics_summary.json`, `run_manifest.json`, `audit_report.{md,json}`, any `*.log` from the run dir. No secrets included.

The full-run path is intentionally not wired (the runner itself refuses `full` even with `XKLICKD_BENCHMARK_FULL_APPROVED=1`).

Inputs
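| Input | Default | Allowed / notes |
| --- | --- | --- |
| `users` | `10` | max 10 |
| `concurrency` | `2` | max 2 |
| `seed` | `4242` | |
| `sessions_per_user` | `10` | ([…] `BENCHMARK_PROTOCOL.md`) |
| `provider` | `gemini` | `gemini` |
| `execute` | `true` | `true` / `false` |

When `execute=false`, the workflow stops after the pilot plan step (no provider call, no audit step), which is useful for dry-runs from the Actions UI.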
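
For reviewers who want to reason about the guardrails without opening the workflow file, the input validation and secret preflight described in this PR could look roughly like the sketch below. This is an illustration only, not the code in the workflow: the function names `validate_inputs` and `secret_preflight` are hypothetical, and the lower bound of 1 for `users` and `concurrency` is an assumption (the PR only states the upper caps). The caps, the `gemini` lock, and exit code 2 come from the description above.

```python
import sys

# Caps and provider lock as stated in this PR.
MAX_USERS = 10
MAX_CONCURRENCY = 2
ALLOWED_PROVIDER = "gemini"


def validate_inputs(users: int, concurrency: int, provider: str) -> list:
    """Return a list of problems; an empty list means the inputs pass.

    Hypothetical helper. The lower bound of 1 is an assumption.
    """
    problems = []
    if not 1 <= users <= MAX_USERS:
        problems.append(f"users must be 1..{MAX_USERS}, got {users}")
    if not 1 <= concurrency <= MAX_CONCURRENCY:
        problems.append(
            f"concurrency must be 1..{MAX_CONCURRENCY}, got {concurrency}"
        )
    if provider != ALLOWED_PROVIDER:
        problems.append(f"provider must be {ALLOWED_PROVIDER!r}, got {provider!r}")
    return problems


def secret_preflight(env) -> int:
    """Return 0 if GEMINI_API_KEY is set and non-empty, otherwise 2.

    Hypothetical helper. Only the presence of the key is checked;
    its value is never printed or logged.
    """
    key = env.get("GEMINI_API_KEY", "")
    if not key.strip():
        print("GEMINI_API_KEY is missing or empty", file=sys.stderr)
        return 2
    return 0
```

In a workflow step this would typically be called as `sys.exit(secret_preflight(os.environ))`, so a missing secret stops the job before any provider call is attempted.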
Testing
- […] (`python3 -c "import yaml; yaml.safe_load(...)"`).
- […] `python3 benchmarks/v4.1/runner/runner.py pilot --help`.
- […] `--seed --users --sessions-per-user --out`.
- […] `execute=false` to verify plan-only path.
- […] `execute=true` once `GEMINI_API_KEY` is set as a repo secret; verify artifacts and audit report.
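
The "verify artifacts" step in the last item can be partly automated once an artifact bundle has been downloaded. The sketch below is one possible check, not part of this PR: the helper name `missing_artifacts` is hypothetical, and because the directory layout inside the uploaded artifact is not described here, it searches recursively by file name. The file names themselves are taken from the artifact list in the Summary.

```python
from pathlib import Path

# File names from the artifact list in this PR.
# Where they sit inside the downloaded bundle is an assumption,
# so the check below searches the whole tree by name.
EXPECTED_FILES = [
    "manifest.json",
    "planned_run.json",
    "raw_outputs.jsonl",
    "errors.jsonl",
    "metrics_summary.json",
    "run_manifest.json",
    "audit_report.md",
    "audit_report.json",
]


def missing_artifacts(root: Path, expected=EXPECTED_FILES) -> list:
    """Return the expected file names that appear nowhere under root."""
    missing = []
    for name in expected:
        # rglob matches the name at any depth; directories are ignored.
        if not any(p.is_file() for p in root.rglob(name)):
            missing.append(name)
    return missing
```

An empty return value only shows that the files exist; the contents of `audit_report.md` and `errors.jsonl` still need a human read.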
This workflow does not run on its own — parent will merge and dispatch.
🤖 Generated with Claude Code