feat(benchmarks): add controlled v4.1 Test B pilot #89
Merged
Extend the v4.1 benchmark harness with Test B (memory-level) pilot execution and add a sibling workflow that mirrors the controlled Test A pilot.

Harness:
- prompts/test_b.py: deterministic condition blocks for the four Test B conditions (no_memory, prompt_history, xklickd_compressed, and either mem0 when present or mem0_skipped). Fairness preserved: identical user probe + generation config per (user, session); only the context block differs.
- runner/executor_b.py: per-(session, condition) call expansion reusing the Test A retry/backoff/JSONL primitives. Same on-disk artefacts: raw_outputs.jsonl, errors.jsonl, metrics_summary.json, run_manifest.json.
- runner/runner.py: new pilot-test-b subcommand. Same caps as pilot (<= 10 users, concurrency <= 8 in runner / <= 2 in CI, retry caps). Plan-only when --execute is absent or no LLM key configured.
- runner/audit_b.py: condition-balance + secret-scan + hash + model consistency audit parameterised by the manifest's conditions list. Asserts the manifest never carries a Mem0 compatibility claim.

CI:
- .github/workflows/benchmark-v41-pilot-testb.yml: workflow_dispatch only. Hard caps users<=10, concurrency<=2, provider=gemini, retry_max<=8, retry_backoff<=10s, retry_backoff_max<=30s. Generates fixtures with 500 users and sessions_per_user=15, runs dry-run, plan-only Test B, executes Test B, audits, and uploads artefacts. No publish / no tag / no release / no Zenodo / no npm / no PyPI. GEMINI_API_KEY only set on the execute step; never echoed.

Tests:
- benchmarks/v4.1/tests/test_pilot_test_b.py: 9 tests covering writes, condition balance, mem0 present/absent paths, concurrency / user-cap validation, plan-only safety (provider that asserts on call), audit PASS on the mock-provider run, prompt determinism + per-condition distinctness. No real provider calls; MockProvider only.

All 75 v4.1 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
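As a rough illustration of the per-(session, condition) call expansion and the fairness rule described in the commit message, the sketch below expands every (user, session) pair into one call per condition, keeping the probe identical and varying only the context block. The condition names are taken from this PR; `Call`, `build_context`, and `expand_calls` are hypothetical names chosen for the example and are not the actual contents of runner/executor_b.py or prompts/test_b.py.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product

# Condition names come from the PR text; mem0_skipped assumes Mem0 is absent.
CONDITIONS = ("no_memory", "prompt_history", "xklickd_compressed", "mem0_skipped")


@dataclass(frozen=True)
class Call:
    user_id: int
    session_id: int
    condition: str
    probe: str          # identical across conditions for a given (user, session)
    context_block: str  # the only field that varies by condition


def build_context(condition: str, user_id: int, session_id: int) -> str:
    """Hypothetical stand-in for a deterministic per-condition context block."""
    if condition == "no_memory":
        return ""
    return f"[{condition}] context for user={user_id} session={session_id}"


def expand_calls(users: int, sessions_per_user: int, conditions=CONDITIONS) -> list:
    """Expand each (user, session) pair into one call per condition, in a fixed order."""
    calls = []
    for user_id, session_id in product(range(users), range(sessions_per_user)):
        probe = f"probe for user={user_id} session={session_id}"
        for condition in conditions:
            context = build_context(condition, user_id, session_id)
            calls.append(Call(user_id, session_id, condition, probe, context))
    return calls


calls = expand_calls(users=10, sessions_per_user=15)
per_condition = Counter(call.condition for call in calls)
print(len(calls), dict(per_condition))
```

With 10 users, 15 sessions per user, and 4 conditions, this yields 10 × 15 × 4 = 600 calls, 150 per condition, which matches the balanced totals reported for the mock-provider run.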
Summary
Adds Test B (memory-level) controlled pilot to the v4.1 benchmark harness, mirroring the existing Test A pilot infrastructure. Test B was previously only present in fixtures; this PR wires it through to execution + audit, and adds a sibling workflow with the same conservative safety envelope.
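To make the "conservative safety envelope" concrete, here is a minimal sketch of how the CI hard caps stated in this PR (users ≤ 10, concurrency ≤ 2, retry_max ≤ 8, retry_backoff ≤ 10s, retry_backoff_max ≤ 30s, provider gemini only) could be enforced before any call is made. The cap values are from the PR; `CI_CAPS`, `ALLOWED_PROVIDERS`, and `validate_ci_inputs` are hypothetical names, and the real workflow may enforce these limits differently (for example in shell steps).

```python
# Cap values as stated in this PR; all names below are hypothetical.
CI_CAPS = {
    "users": 10,
    "concurrency": 2,
    "retry_max": 8,
    "retry_backoff": 10.0,      # seconds
    "retry_backoff_max": 30.0,  # seconds
}
ALLOWED_PROVIDERS = {"gemini"}


def validate_ci_inputs(provider: str, **values: float) -> None:
    """Raise ValueError if the provider is not allowed or any input is missing,
    negative, or above its cap."""
    if provider not in ALLOWED_PROVIDERS:
        raise ValueError(f"provider {provider!r} is not allowed in CI")
    for name, cap in CI_CAPS.items():
        if name not in values:
            raise ValueError(f"missing required input: {name}")
        value = values[name]
        if value < 0 or value > cap:
            raise ValueError(f"{name}={value} is outside the allowed range 0..{cap}")


# Inputs at the caps are accepted; anything above is rejected before execution.
validate_ci_inputs(
    "gemini", users=10, concurrency=2, retry_max=8,
    retry_backoff=10.0, retry_backoff_max=30.0,
)
```

Failing fast on out-of-range inputs keeps a mistyped dispatch parameter from turning into unplanned provider calls.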
- prompts/test_b.py (four condition blocks: no_memory, prompt_history, xklickd_compressed, and mem0 / mem0_skipped), new runner/executor_b.py (reuses Test A retry/backoff/JSONL primitives), new runner/audit_b.py (parameterised condition audit), and a new pilot-test-b runner subcommand. The xklickd_compressed condition is built using the RFC-010 reference runtime that already lives in this repo.
- .github/workflows/benchmark-v41-pilot-testb.yml is workflow_dispatch only with hard caps: users ≤ 10, concurrency ≤ 2, provider = gemini (only), retry_max ≤ 8, retry_backoff ≤ 10s, retry_backoff_max ≤ 30s. Generates fixtures with 500 users and sessions_per_user=15, runs dry-run, plan-only Test B, executes Test B (10 users × 15 sessions × 4 conditions = 600 calls), audits, and uploads artefacts (raw_outputs.jsonl, errors.jsonl, metrics_summary.json, run_manifest.json, audit_report.{md,json}, planned manifests, fixtures manifest).
- provider=gemini; GEMINI_API_KEY is only injected on the single execute step and never echoed; no publish / no tag / no release / no Zenodo / no npm / no PyPI.
- run_manifest.json and the audit explicitly assert compatibility_claim: false; the audit hard-fails if any future change flips that. The mem0 condition is only enumerated when mem0_present() returns True (and even then is a deterministic placeholder block; Mem0 is never called at benchmark time).

Test plan
- python3 -m pytest benchmarks/v4.1/tests/: 75/75 pass (9 new in test_pilot_test_b.py, 66 pre-existing).
- python3 benchmarks/v4.1/runner/runner.py pilot-test-b --help renders.
- python3 benchmarks/v4.1/runner/runner.py pilot-test-b --fixtures /tmp/fxt --users 10 --provider gemini --concurrency 2 writes planned_run.json without calling any provider.
- --execute --provider mock against 10-user / 15-session fixtures produced 600 ok / 0 errors with balanced conditions (150 each).
- audit_b.py reports PASS, no secret hits, no Mem0 compatibility claim.
- Tests use MockProvider only; no network call, no GEMINI_API_KEY required. One test installs a tripwire provider that fails the test if it is ever invoked under --execute=false.

Exact dispatch command (do not run unsupervised)
Set -f execute=false first if you want a CI dry-run / plan-only that does not touch the secret.

🤖 Generated with Claude Code