
feat(benchmarks): add controlled v4.1 Test B pilot #89

Merged
Davincc77 merged 1 commit into main from feat/benchmark-v41-pilot-testb
May 28, 2026

@Davincc77

Owner

Summary

Adds Test B (memory-level) controlled pilot to the v4.1 benchmark harness, mirroring the existing Test A pilot infrastructure. Test B was previously only present in fixtures; this PR wires it through to execution + audit, and adds a sibling workflow with the same conservative safety envelope.

  • Harness extension: new prompts/test_b.py (four condition blocks: no_memory, prompt_history, xklickd_compressed, and mem0 / mem0_skipped), new runner/executor_b.py (reuses Test A retry/backoff/JSONL primitives), new runner/audit_b.py (parameterised condition audit), and a new pilot-test-b runner subcommand. The xklickd_compressed condition is built using the RFC-010 reference runtime that already lives in this repo.
  • CI workflow: .github/workflows/benchmark-v41-pilot-testb.yml is workflow_dispatch only with hard caps — users ≤ 10, concurrency ≤ 2, provider = gemini (only), retry_max ≤ 8, retry_backoff ≤ 10s, retry_backoff_max ≤ 30s. Generates fixtures with 500 users and sessions_per_user=15, runs dry-run, plan-only Test B, executes Test B (10 users × 15 sessions × 4 conditions = 600 calls), audits, and uploads artefacts (raw_outputs.jsonl, errors.jsonl, metrics_summary.json, run_manifest.json, audit_report.{md,json}, planned manifests, fixtures manifest).
  • Safety envelope preserved: no full run path is wired; the workflow refuses anything other than provider=gemini; GEMINI_API_KEY is only injected on the single execute step and never echoed; no publish / no tag / no release / no Zenodo / no npm / no PyPI.
  • No Mem0 compatibility claim: both run_manifest.json and the audit explicitly assert compatibility_claim: false; the audit hard-fails if any future change flips that. The mem0 condition is only enumerated when mem0_present() returns True (and even then is a deterministic placeholder block — Mem0 is never called at benchmark time).
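
For reviewers who want to sanity-check the call count, the condition enumeration and per-(session, condition) expansion described above can be sketched as follows. This is a minimal illustration, not the code in prompts/test_b.py or runner/executor_b.py: the function names `enumerate_conditions` and `expand_calls` are hypothetical, and only the condition names and the 10 × 15 × 4 arithmetic come from this PR.

```python
from collections import Counter
from itertools import product

# Condition names taken from the PR summary; the fourth one depends on Mem0.
BASE_CONDITIONS = ["no_memory", "prompt_history", "xklickd_compressed"]


def enumerate_conditions(mem0_available: bool) -> list[str]:
    """Return the four Test B conditions (hypothetical helper)."""
    fourth = "mem0" if mem0_available else "mem0_skipped"
    return BASE_CONDITIONS + [fourth]


def expand_calls(users: int, sessions_per_user: int, conditions: list[str]):
    """Yield one (user, session, condition) tuple per planned provider call."""
    yield from product(range(users), range(sessions_per_user), conditions)


calls = list(expand_calls(10, 15, enumerate_conditions(mem0_available=False)))
per_condition = Counter(condition for _, _, condition in calls)

print(len(calls))                  # 600
print(per_condition["no_memory"])  # 150
```

Every (user, session) pair appears once per condition, so the balance of 150 calls per condition follows directly from the expansion.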

Test plan

  • python3 -m pytest benchmarks/v4.1/tests/ — 75/75 pass (9 new in test_pilot_test_b.py, 66 pre-existing).
  • python3 benchmarks/v4.1/runner/runner.py pilot-test-b --help renders.
  • Plan-only smoke: python3 benchmarks/v4.1/runner/runner.py pilot-test-b --fixtures /tmp/fxt --users 10 --provider gemini --concurrency 2 writes planned_run.json without calling any provider.
  • Execute smoke (mock): --execute --provider mock against 10-user / 15-sessions fixtures produced 600 ok / 0 errors with balanced conditions (150 each).
  • Audit smoke: audit_b.py reports PASS, no secret hits, no Mem0 compatibility claim.
  • All new tests inject MockProvider; no network call, no GEMINI_API_KEY required. One test installs a tripwire provider that fails the test if it is ever invoked under --execute=false.
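
The tripwire idea in the last bullet can be illustrated with a small sketch. The names `TripwireProvider`, `complete`, and `run_pilot` are assumptions made for this example and are not the identifiers used in test_pilot_test_b.py; the point is only that a plan-only run must finish without the provider ever being touched.

```python
class TripwireProvider:
    """Hypothetical provider that fails loudly if it is ever invoked."""

    def complete(self, prompt: str) -> str:
        raise AssertionError("provider was called during a plan-only run")


def run_pilot(prompts: list[str], provider, execute: bool) -> dict:
    """Plan every call; only touch the provider when execute is True."""
    result = {"planned_calls": len(prompts), "executed": 0}
    if not execute:
        # Plan-only: return before any provider method is reached.
        return result
    for prompt in prompts:
        provider.complete(prompt)
        result["executed"] += 1
    return result


plan = run_pilot(["probe-1", "probe-2"], TripwireProvider(), execute=False)
print(plan)  # {'planned_calls': 2, 'executed': 0}
```

If the early return were removed, the same call would raise `AssertionError`, which is what makes the provider useful as a regression guard.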

Exact dispatch command (do not run unsupervised)

gh workflow run benchmark-v41-pilot-testb.yml \
  -f users=10 \
  -f concurrency=2 \
  -f seed=4242 \
  -f sessions_per_user=15 \
  -f provider=gemini \
  -f execute=true \
  -f retry_max=5 \
  -f retry_backoff=2 \
  -f retry_backoff_max=30

Set -f execute=false first if you want a CI dry-run / plan-only that does not touch the secret.
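
As a rough cross-check that the dispatch inputs above stay inside the stated hard caps, the bounds can be expressed as a small validator. This is a sketch under assumptions: `CI_CAPS` and `validate_inputs` are hypothetical names, the workflow itself may enforce the caps differently, and only the numeric limits and the gemini-only rule are taken from this PR.

```python
# Upper bounds quoted in this PR for the CI workflow.
CI_CAPS = {
    "users": 10,
    "concurrency": 2,
    "retry_max": 8,
    "retry_backoff": 10,      # seconds
    "retry_backoff_max": 30,  # seconds
}


def validate_inputs(inputs: dict) -> list[str]:
    """Return a list of violations; an empty list means the inputs are in bounds."""
    problems = []
    if inputs.get("provider") != "gemini":
        problems.append("provider must be gemini")
    for name, cap in CI_CAPS.items():
        if inputs[name] > cap:
            problems.append(f"{name}={inputs[name]} exceeds cap {cap}")
    return problems


dispatch = {
    "users": 10,
    "concurrency": 2,
    "provider": "gemini",
    "retry_max": 5,
    "retry_backoff": 2,
    "retry_backoff_max": 30,
}
print(validate_inputs(dispatch))  # []
```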

🤖 Generated with Claude Code

Extend the v4.1 benchmark harness with Test B (memory-level) pilot
execution and add a sibling workflow that mirrors the controlled
Test A pilot.

Harness:
- prompts/test_b.py: deterministic condition blocks for the four Test B
  conditions (no_memory, prompt_history, xklickd_compressed, and either
  mem0 when present or mem0_skipped). Fairness preserved: identical
  user probe + generation config per (user, session); only the context
  block differs.
- runner/executor_b.py: per-(session, condition) call expansion reusing
  the Test A retry/backoff/JSONL primitives. Same on-disk artefacts:
  raw_outputs.jsonl, errors.jsonl, metrics_summary.json, run_manifest.json.
- runner/runner.py: new pilot-test-b subcommand. Same caps as pilot
  (<= 10 users, concurrency <= 8 in runner / <= 2 in CI, retry caps).
  Plan-only when --execute is absent or no LLM key configured.
- runner/audit_b.py: condition-balance + secret-scan + hash + model
  consistency audit parameterised by the manifest's conditions list.
  Asserts the manifest never carries a Mem0 compatibility claim.
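
A minimal sketch of the audit checks listed above (condition balance, secret scan, and the Mem0 compatibility assertion), assuming a simplified record shape. The `audit` function, the `condition` and `output` record fields, and the `secret_markers` parameter are hypothetical and do not describe the actual interface of runner/audit_b.py; hash and model-consistency checks are omitted.

```python
from collections import Counter


def audit(manifest: dict, records: list[dict], secret_markers: list[str]) -> list[str]:
    """Return audit failures; an empty list corresponds to PASS."""
    failures = []

    # The manifest must explicitly assert compatibility_claim: false.
    if manifest.get("compatibility_claim") is not False:
        failures.append("manifest must assert compatibility_claim: false")

    # Conditions are parameterised by the manifest's own conditions list.
    counts = Counter(record["condition"] for record in records)
    if set(counts) != set(manifest["conditions"]):
        failures.append("conditions in outputs differ from manifest")
    if len(set(counts.values())) > 1:
        failures.append("conditions are not balanced")

    # Naive secret scan: no known marker may appear in any output.
    if any(m in record["output"] for record in records for m in secret_markers):
        failures.append("secret marker found in outputs")

    return failures


conditions = ["no_memory", "prompt_history", "xklickd_compressed", "mem0_skipped"]
manifest = {"conditions": conditions, "compatibility_claim": False}
records = [{"condition": c, "output": "ok"} for c in conditions for _ in range(3)]

print(audit(manifest, records, secret_markers=["EXAMPLE_SECRET"]))  # []
```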

CI:
- .github/workflows/benchmark-v41-pilot-testb.yml: workflow_dispatch only.
  Hard caps users<=10, concurrency<=2, provider=gemini, retry_max<=8,
  retry_backoff<=10s, retry_backoff_max<=30s. Generates fixtures with
  500 users and sessions_per_user=15, runs dry-run, plan-only Test B,
  executes Test B, audits, and uploads artefacts. No publish / no tag /
  no release / no Zenodo / no npm / no PyPI. GEMINI_API_KEY only set on
  the execute step; never echoed.
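
To make the interplay of retry_max, retry_backoff, and retry_backoff_max concrete, here is one possible delay schedule. The doubling rule is an assumption: this PR does not state which backoff curve the shared Test A retry primitives use, and `backoff_delays` is a hypothetical helper written only for illustration.

```python
def backoff_delays(retry_max: int, retry_backoff: float, retry_backoff_max: float) -> list[float]:
    """Delay before each retry, doubling from retry_backoff and capped at retry_backoff_max.

    Assumes exponential doubling; the real harness may use a different schedule.
    """
    return [
        min(retry_backoff * 2 ** attempt, retry_backoff_max)
        for attempt in range(retry_max)
    ]


# Values from the dispatch command in this PR.
print(backoff_delays(retry_max=5, retry_backoff=2, retry_backoff_max=30))
# [2, 4, 8, 16, 30]
```

Under this assumed schedule the cap of 30 seconds only takes effect on the fifth retry, so the worst-case wait per call stays at 60 seconds.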

Tests:
- benchmarks/v4.1/tests/test_pilot_test_b.py: 9 tests covering writes,
  condition balance, mem0 present/absent paths, concurrency / user-cap
  validation, plan-only safety (provider that asserts on call), audit
  PASS on the mock-provider run, prompt determinism + per-condition
  distinctness. No real provider calls; MockProvider only.

All 75 v4.1 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Davincc77 Davincc77 merged commit 0787881 into main May 28, 2026
3 checks passed
@Davincc77 Davincc77 deleted the feat/benchmark-v41-pilot-testb branch May 28, 2026 22:03
