
feat(benchmarks): bundle-based Test B real-project design (1,800/wave; 9,000 full) #90

Merged
Davincc77 merged 1 commit into main from feat/benchmark-v41-test-b-bundles on May 29, 2026


Conversation

@Davincc77

Owner

Summary

  • Implements the final approved Test B "real project" design for v4.1: 5 representative project bundles × 150 sessions × 12 conditions = 9,000 outputs for the full design, run as one 1,800-output long pilot per bundle (150 sessions × 12 conditions).
  • No real LLM calls. No publish / no tag / no release / no Zenodo / no npm / no PyPI.
  • The full 5-bundle design is intentionally not launchable as a single run; it must be dispatched as 5 separate waves of the long pilot, each separated by manual review of the prior wave's audit report.
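The output counts in the summary follow directly from the design dimensions. A minimal sketch of that arithmetic, where the constant and function names are illustrative and not taken from the repository:

```python
# Design dimensions as stated in the summary; names are illustrative only.
BUNDLES = 5
SESSIONS_PER_BUNDLE = 150
CONDITIONS = 12


def outputs_per_wave(sessions: int = SESSIONS_PER_BUNDLE,
                     conditions: int = CONDITIONS) -> int:
    """One wave covers one bundle: every session runs under every condition."""
    return sessions * conditions


def outputs_full_design(bundles: int = BUNDLES) -> int:
    """The full design is one wave per bundle."""
    return bundles * outputs_per_wave()


print(outputs_per_wave())     # 1800
print(outputs_full_design())  # 9000
```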

What changed

  • benchmarks/v4.1/fixtures/bundles.py — deterministic bundle generator. 5 bundles, 10 phases × 15 sessions, role / language / contradiction anchors, JSONL + bundle_manifest.json with per-file SHA-256.
  • benchmarks/v4.1/prompts/test_b_bundles.py — 12 condition builders. Same user probe + generation config across all 12 conditions; only the prepended memory block differs.
  • benchmarks/v4.1/runner/executor_b_bundles.py — bundle pilot executor, reuses retry / backoff / batching / JSONL / resumability primitives.
  • benchmarks/v4.1/runner/runner.py — new pilot-test-b-bundles subcommand. Hard caps: --bundles ≤ 1, --concurrency ∈ [1, 2], --sessions-per-bundle ≤ 150. --full-design intentionally refused.
  • benchmarks/v4.1/runner/audit_b_bundles.py — robust auditor. Hard checks: condition balance, bundle/phase/session/role coverage, hash completeness, secret scan, forbidden claim phrases, missing timestamps. Soft per-condition cost curves and session-depth token growth bins.
  • .github/workflows/benchmark-v41-pilot-testb-bundles.yml — manual-only workflow. Provider locked to gemini. All inputs validated and capped before secret access. execute=false by default.
  • benchmarks/v4.1/tests/test_pilot_test_b_bundles.py — 16 tests (mock provider only). Covers full=9000 specs, pilot=1800 specs, 5 bundles, 150 sessions, 12 conditions, all 10 phases, prompt determinism, runner caps (including --full-design refusal), plan-only and execute paths, audit PASS and audit-FAIL on forbidden claim phrases.
  • README.md and BENCHMARK_PROTOCOL.md updated with the final design, scientific rationale, cost/throughput caution, and the exact dispatch command for the long pilot.
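For readers unfamiliar with per-file hashing, the sketch below shows one way a `bundle_manifest.json` with SHA-256 digests could be produced. It is not the code in `fixtures/bundles.py`: the helper names and the manifest layout (a single `files` mapping) are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Hash a file in chunks so large JSONL files are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bundle_dir: Path) -> dict:
    """Write bundle_manifest.json next to the JSONL files (hypothetical layout)."""
    # Sorting file names and JSON keys keeps the manifest byte-stable
    # across runs, which is what makes it usable as a determinism check.
    files = sorted(bundle_dir.glob("*.jsonl"))
    manifest = {"files": {p.name: sha256_of_file(p) for p in files}}
    target = bundle_dir / "bundle_manifest.json"
    target.write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n",
                      encoding="utf-8")
    return manifest
```

An auditor can then recompute each digest and compare it with the manifest entry to detect missing or altered files.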

12 conditions (in audit order)

`no_memory`, `prompt_history`, `manual_context_repetition`, `project_docs_only`, `xklickd_static_bundle`, `xklickd_compressed_bundle`, `xklickd_cross_session_resume`, `xklickd_cross_language`, `xklickd_cross_agent`, `xklickd_human_veto`, `xklickd_contradiction_handling`, `xklickd_ci_weakening_resistance`.
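The invariant behind these conditions (identical user probe, only the prepended memory block varies) can be sketched as follows. The condition names come from the list above; `memory_block_for` is a hypothetical stand-in with placeholder text, and treating `no_memory` as an empty block is an assumption.

```python
CONDITIONS = (
    "no_memory",
    "prompt_history",
    "manual_context_repetition",
    "project_docs_only",
    "xklickd_static_bundle",
    "xklickd_compressed_bundle",
    "xklickd_cross_session_resume",
    "xklickd_cross_language",
    "xklickd_cross_agent",
    "xklickd_human_veto",
    "xklickd_contradiction_handling",
    "xklickd_ci_weakening_resistance",
)


def memory_block_for(condition: str) -> str:
    """Hypothetical stand-in; real builders would derive this from a bundle."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")
    if condition == "no_memory":
        return ""  # assumption: the baseline prepends nothing
    return f"[memory block for {condition}]\n\n"


def build_prompt(condition: str, user_probe: str) -> str:
    """Prepend the condition-specific block; the probe itself is never altered."""
    return memory_block_for(condition) + user_probe


probe = "What did we decide about the retry policy?"  # illustrative probe
prompts = {c: build_prompt(c, probe) for c in CONDITIONS}
```

With this shape, checking that all prompts are distinct and that each one ends with the same probe is a two-line assertion.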

Exact dispatch — long pilot (1,800 outputs, plan-only first)

```bash
gh workflow run benchmark-v41-pilot-testb-bundles.yml \
-f bundle_index=0 \
-f sessions_per_bundle=150 \
-f concurrency=2 \
-f seed=4242 \
-f provider=gemini \
-f execute=false
```

To actually call Gemini after human review, dispatch again with `execute=true`. To run the full design, repeat the dispatch with `bundle_index` set to 1, 2, 3, and 4 in turn, manually reviewing each wave's audit report before launching the next.
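As a hedged illustration of the five-wave sequence, the sketch below only builds and prints the `gh workflow run` command for each wave; it does not execute anything. The field names and values mirror the dispatch command above. The helper name is hypothetical, and reusing the same seed and settings for every wave is an assumption.

```python
import shlex

WORKFLOW = "benchmark-v41-pilot-testb-bundles.yml"


def dispatch_command(bundle_index: int, execute: bool = False) -> list[str]:
    """Build the argv for one wave; nothing is run here."""
    if not 0 <= bundle_index <= 4:
        raise ValueError("bundle_index must be in 0..4 (five bundles)")
    fields = {
        "bundle_index": bundle_index,
        "sessions_per_bundle": 150,
        "concurrency": 2,
        "seed": 4242,  # assumption: the same seed is reused for each wave
        "provider": "gemini",
        "execute": "true" if execute else "false",
    }
    argv = ["gh", "workflow", "run", WORKFLOW]
    for key, value in fields.items():
        argv += ["-f", f"{key}={value}"]
    return argv


# Print the plan-only command for each wave; a human reviews the audit
# report of wave N before dispatching wave N + 1.
for wave in range(5):
    print(shlex.join(dispatch_command(wave)))
```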

Test plan

  • `python3 -m pytest benchmarks/v4.1/tests/` — 91 passed (16 new + 75 existing)
  • Bundle generator smoke test: 5 × 150 = 750 sessions, full design = 9000 outputs, long pilot = 1800 outputs
  • All 12 conditions produce distinct prompts; user probe is byte-identical across conditions
  • Runner refuses `--full-design`, `--bundles > 1`, `--concurrency > 2`, `--sessions-per-bundle > 150`
  • Plan-only path emits `expected_outputs = 1800` without provider call
  • Audit passes on mock-provider output; fails on injected forbidden claim phrases
  • Manual workflow dispatch with `execute=false` (intentionally not run by this PR)
  • Real Gemini long pilot (intentionally not run by this PR)
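The runner caps exercised in the test plan could be enforced with a small `argparse` wrapper along these lines. This is a sketch under assumptions, not the code in `runner/runner.py`: the function name and the default values are hypothetical, and only the flag names and limits come from this PR.

```python
import argparse


def parse_pilot_args(argv: list[str]) -> argparse.Namespace:
    """Parse pilot flags and refuse anything beyond the hard caps."""
    parser = argparse.ArgumentParser(prog="pilot-test-b-bundles")
    parser.add_argument("--bundles", type=int, default=1)
    parser.add_argument("--concurrency", type=int, default=1)
    parser.add_argument("--sessions-per-bundle", type=int, default=150)
    parser.add_argument("--full-design", action="store_true")
    args = parser.parse_args(argv)

    # parser.error() prints a usage message and exits with status 2.
    if args.full_design:
        parser.error("--full-design is refused; dispatch one bundle per wave")
    if args.bundles > 1:
        parser.error("--bundles must be <= 1")
    if not 1 <= args.concurrency <= 2:
        parser.error("--concurrency must be 1 or 2")
    if args.sessions_per_bundle > 150:
        parser.error("--sessions-per-bundle must be <= 150")
    return args


args = parse_pilot_args(["--concurrency", "2"])
print(args.sessions_per_bundle)  # 150
```

Validating before any provider client is constructed keeps a mistyped flag from ever reaching a paid API call.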

🤖 Generated with Claude Code

Implements the final approved Test B benchmark for v4.1: 5 representative
project bundles x 150 sessions x 12 conditions = 9,000 outputs full
design, with a 1,800-output long pilot per bundle.

- fixtures/bundles.py: deterministic generator for 5 bundles, 10 phases
  of 15 sessions each, role/language/contradiction anchors per fact,
  JSONL + bundle_manifest.json with SHA-256 per file.
- prompts/test_b_bundles.py: 12 condition builders. Same user probe and
  generation config across conditions; only the prepended memory block
  differs.
- runner/executor_b_bundles.py: bundle pilot executor. Re-uses retry/
  backoff/batching/JSONL primitives so the mock provider drives tests
  with no network.
- runner/runner.py: new pilot-test-b-bundles subcommand. Hard caps:
  bundles<=1, concurrency<=2, sessions_per_bundle<=150. --full-design
  is intentionally refused; the full 5-bundle design is launched as
  five separate waves.
- runner/audit_b_bundles.py: robust auditor. Hard checks for condition
  balance, bundle/phase/session/role coverage, hash completeness,
  secret scan, forbidden claim phrases, and missing timestamps. Soft
  per-condition cost curves and session-depth token growth bins.
- .github/workflows/benchmark-v41-pilot-testb-bundles.yml: manual-only
  workflow. Provider locked to gemini. Validates inputs, hard-caps
  bundle_index/concurrency/sessions/retry/backoff/sleep before secret
  access. execute=false by default.
- tests/test_pilot_test_b_bundles.py: 16 tests covering full=9000 and
  pilot=1800 spec counts, 5 bundles, 150 sessions, 12 conditions, all
  phases, prompt determinism, runner caps including --full-design
  refusal, plan-only and execute paths with mock provider, audit pass,
  and audit failure on forbidden claim phrases. Mock provider only;
  no network calls.
- README.md and BENCHMARK_PROTOCOL.md updated with the final design,
  scientific rationale, cost/throughput caution, and the exact gh
  workflow run dispatch command for the long pilot.

No publish / no tag / no release / no Zenodo / no npm / no PyPI. No
real LLM calls are made by tests or by the runner under default flags.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Davincc77 merged commit 9448382 into main on May 29, 2026
3 checks passed
@Davincc77 deleted the feat/benchmark-v41-test-b-bundles branch on May 29, 2026 05:34