feat(benchmarks): add controlled v4.1 Test B pilot by Davincc77 · Pull Request #89 · Davincc77/klickdskill

Davincc77 · 2026-05-28T21:57:03Z

Summary

Adds Test B (memory-level) controlled pilot to the v4.1 benchmark harness, mirroring the existing Test A pilot infrastructure. Test B was previously only present in fixtures; this PR wires it through to execution + audit, and adds a sibling workflow with the same conservative safety envelope.

Harness extension: new prompts/test_b.py (four condition blocks: no_memory, prompt_history, xklickd_compressed, and mem0 / mem0_skipped), new runner/executor_b.py (reuses Test A retry/backoff/JSONL primitives), new runner/audit_b.py (parameterised condition audit), and a new pilot-test-b runner subcommand. The xklickd_compressed condition is built using the RFC-010 reference runtime that already lives in this repo.
CI workflow: .github/workflows/benchmark-v41-pilot-testb.yml is workflow_dispatch only with hard caps — users ≤ 10, concurrency ≤ 2, provider = gemini (only), retry_max ≤ 8, retry_backoff ≤ 10s, retry_backoff_max ≤ 30s. Generates fixtures with 500 users and sessions_per_user=15, runs dry-run, plan-only Test B, executes Test B (10 users × 15 sessions × 4 conditions = 600 calls), audits, and uploads artefacts (raw_outputs.jsonl, errors.jsonl, metrics_summary.json, run_manifest.json, audit_report.{md,json}, planned manifests, fixtures manifest).
Safety envelope preserved: no full run path is wired; the workflow refuses anything other than provider=gemini; GEMINI_API_KEY is only injected on the single execute step and never echoed; no publish / no tag / no release / no Zenodo / no npm / no PyPI.
No Mem0 compatibility claim: both run_manifest.json and the audit explicitly assert compatibility_claim: false; the audit hard-fails if any future change flips that. The mem0 condition is only enumerated when mem0_present() returns True (and even then is a deterministic placeholder block — Mem0 is never called at benchmark time).
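The call count quoted above (10 users × 15 sessions × 4 conditions = 600) can be checked with a minimal sketch of the per-(user, session, condition) expansion. The function names `select_conditions` and `expand_calls` are hypothetical and are not taken from `executor_b.py`; only the condition names come from this PR.

```python
from collections import Counter
from itertools import product

# The three fixed Test B conditions named in the PR description.
BASE_CONDITIONS = ("no_memory", "prompt_history", "xklickd_compressed")


def select_conditions(mem0_available: bool) -> list[str]:
    """Return four conditions; the fourth depends on whether Mem0 is present."""
    fourth = "mem0" if mem0_available else "mem0_skipped"
    return [*BASE_CONDITIONS, fourth]


def expand_calls(users: int, sessions_per_user: int, conditions: list[str]):
    """Produce one planned call per (user, session, condition) triple."""
    return list(product(range(users), range(sessions_per_user), conditions))


calls = expand_calls(10, 15, select_conditions(mem0_available=False))
per_condition = Counter(condition for _, _, condition in calls)
print(len(calls))           # total planned calls
print(dict(per_condition))  # calls per condition
```

Under these assumptions every condition receives the same number of calls, which is the balance property the audit checks.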

Test plan

python3 -m pytest benchmarks/v4.1/tests/ — 75/75 pass (9 new in test_pilot_test_b.py, 66 pre-existing).
python3 benchmarks/v4.1/runner/runner.py pilot-test-b --help renders.
Plan-only smoke: python3 benchmarks/v4.1/runner/runner.py pilot-test-b --fixtures /tmp/fxt --users 10 --provider gemini --concurrency 2 writes planned_run.json without calling any provider.
Execute smoke (mock): --execute --provider mock against 10-user / 15-session fixtures produced 600 ok / 0 errors with balanced conditions (150 each).
Audit smoke: audit_b.py reports PASS, no secret hits, no Mem0 compatibility claim.
All new tests inject MockProvider; no network call, no GEMINI_API_KEY required. One test installs a tripwire provider that fails the test if it is ever invoked under --execute=false.
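The tripwire idea in the last item can be illustrated roughly as follows. This is a sketch under assumptions: `TripwireProvider`, `run_pilot`, and the `generate` method are hypothetical stand-ins, not the actual classes or signatures in `test_pilot_test_b.py`.

```python
class TripwireProvider:
    """Hypothetical provider that fails loudly if it is ever called."""

    def generate(self, prompt: str) -> str:
        raise AssertionError("provider invoked during a plan-only run")


def run_pilot(prompts: list[str], provider, execute: bool) -> dict:
    """Plan the run; only touch the provider when execute is True."""
    if not execute:
        # Plan-only path: record what would run, call nothing.
        return {"mode": "plan_only", "planned_calls": len(prompts), "outputs": []}
    outputs = [provider.generate(p) for p in prompts]
    return {"mode": "execute", "planned_calls": len(prompts), "outputs": outputs}


# A plan-only run must complete without tripping the wire.
result = run_pilot(["probe-1", "probe-2"], TripwireProvider(), execute=False)
print(result["mode"], result["planned_calls"])
```

Injecting a provider that raises on use turns "plan-only never calls a provider" from a convention into something a test can enforce.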

Exact dispatch command (do not run unsupervised)

gh workflow run benchmark-v41-pilot-testb.yml \
  -f users=10 \
  -f concurrency=2 \
  -f seed=4242 \
  -f sessions_per_user=15 \
  -f provider=gemini \
  -f execute=true \
  -f retry_max=5 \
  -f retry_backoff=2 \
  -f retry_backoff_max=30

Set -f execute=false first if you want a CI dry-run / plan-only that does not touch the secret.
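As a rough illustration of how the dispatch inputs relate to the hard caps listed in the summary, the sketch below checks the exact values from the command above. The `validate_dispatch` helper is hypothetical; the real workflow enforces its caps in its own way, which this PR description does not show.

```python
# Upper bounds quoted in the PR summary.
CAPS = {
    "users": 10,
    "concurrency": 2,
    "retry_max": 8,
    "retry_backoff": 10,      # seconds
    "retry_backoff_max": 30,  # seconds
}
ALLOWED_PROVIDER = "gemini"


def validate_dispatch(inputs: dict) -> list[str]:
    """Return a list of cap violations; an empty list means the inputs pass."""
    errors = []
    if inputs.get("provider") != ALLOWED_PROVIDER:
        errors.append(f"provider must be {ALLOWED_PROVIDER!r}")
    for name, cap in CAPS.items():
        value = int(inputs[name])
        if value > cap:
            errors.append(f"{name}={value} exceeds cap {cap}")
    return errors


# Values copied from the dispatch command.
dispatch = {
    "users": 10, "concurrency": 2, "seed": 4242, "sessions_per_user": 15,
    "provider": "gemini", "execute": True,
    "retry_max": 5, "retry_backoff": 2, "retry_backoff_max": 30,
}
print(validate_dispatch(dispatch))
```

The values in the command sit at or below every cap, so the check reports no violations for them.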

🤖 Generated with Claude Code

Extend the v4.1 benchmark harness with Test B (memory-level) pilot execution and add a sibling workflow that mirrors the controlled Test A pilot.

Harness:
- prompts/test_b.py: deterministic condition blocks for the four Test B conditions (no_memory, prompt_history, xklickd_compressed, and either mem0 when present or mem0_skipped). Fairness preserved: identical user probe + generation config per (user, session); only the context block differs.
- runner/executor_b.py: per-(session, condition) call expansion reusing the Test A retry/backoff/JSONL primitives. Same on-disk artefacts: raw_outputs.jsonl, errors.jsonl, metrics_summary.json, run_manifest.json.
- runner/runner.py: new pilot-test-b subcommand. Same caps as pilot (<= 10 users, concurrency <= 8 in runner / <= 2 in CI, retry caps). Plan-only when --execute is absent or no LLM key configured.
- runner/audit_b.py: condition-balance + secret-scan + hash + model consistency audit parameterised by the manifest's conditions list. Asserts the manifest never carries a Mem0 compatibility claim.

CI:
- .github/workflows/benchmark-v41-pilot-testb.yml: workflow_dispatch only. Hard caps users<=10, concurrency<=2, provider=gemini, retry_max<=8, retry_backoff<=10s, retry_backoff_max<=30s. Generates fixtures with 500 users and sessions_per_user=15, runs dry-run, plan-only Test B, executes Test B, audits, and uploads artefacts. No publish / no tag / no release / no Zenodo / no npm / no PyPI. GEMINI_API_KEY only set on the execute step; never echoed.

Tests:
- benchmarks/v4.1/tests/test_pilot_test_b.py: 9 tests covering writes, condition balance, mem0 present/absent paths, concurrency / user-cap validation, plan-only safety (provider that asserts on call), audit PASS on the mock-provider run, prompt determinism + per-condition distinctness. No real provider calls; MockProvider only. All 75 v4.1 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
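Two of the audit checks named in this commit message, condition balance and the Mem0 compatibility-claim assertion, might look roughly like the sketch below. The `audit` function and the record layout are assumptions for illustration only; `audit_b.py` also covers secret scanning, hashes, and model consistency, which are omitted here.

```python
from collections import Counter


def audit(manifest: dict, records: list[dict]) -> list[str]:
    """Return audit failures; an empty list corresponds to PASS."""
    failures = []

    # The manifest must explicitly state compatibility_claim: false.
    if manifest.get("compatibility_claim") is not False:
        failures.append("compatibility_claim must be explicitly false")

    # Conditions observed in the outputs must match the manifest's list.
    counts = Counter(r["condition"] for r in records)
    if set(counts) != set(manifest["conditions"]):
        failures.append("observed conditions differ from manifest conditions")

    # Every condition should have the same number of records.
    if len(set(counts.values())) > 1:
        failures.append(f"unbalanced conditions: {dict(counts)}")

    return failures


conditions = ["no_memory", "prompt_history", "xklickd_compressed", "mem0_skipped"]
manifest = {"conditions": conditions, "compatibility_claim": False}
records = [{"condition": c} for c in conditions for _ in range(150)]
print(audit(manifest, records))
```

Reading the expected conditions from the manifest, rather than hard-coding them, is what lets one audit handle both the `mem0` and `mem0_skipped` variants.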

Davincc77 merged commit 0787881 into main May 28, 2026
3 checks passed

Davincc77 deleted the feat/benchmark-v41-pilot-testb branch May 28, 2026 22:03
