The procedural-recall benchmark scores how well Remnic injects a stored
procedure into the recall context when the user's prompt looks like
task initiation. It is the source of the lift number that justifies
flipping `procedural.enabled` to `true` by default (issue #567).

Do not commit real user prompts. The repository is public. The fixture under
`packages/bench/src/benchmarks/remnic/procedural-recall/real-scenarios.ts` is synthetic and hand-authored. If you add scenarios, keep them synthetic.


- The fixture in `real-scenarios.ts` carries 20 synthetic scenarios grouped across four categories:
  - `exact-re-run` (5): prompt matches a stored procedure near-verbatim.
  - `parameter-variation` (5): same verb + goal, different nouns (service name, environment, ticket id).
  - `decomposition` (5): prompt kicks off a multi-step runbook.
  - `distractor-rejection` (5): prompt looks task-shaped but should NOT recall (past tense, summary request, unrelated domain, courtesy).
- The ablation harness from #586 seeds a temp `StorageManager` per scenario with the scenario's procedure, then runs `buildProcedureRecallSection` twice: once with `procedural.enabled=true` and once with `procedural.enabled=false`.
- Binary correctness (did recall produce a non-null section when `expectMatch=true` and stay null when `expectMatch=false`) is averaged across scenarios. `lift = onScore - offScore`.
- A seeded mulberry32 RNG drives the bootstrap confidence interval, so regenerated artifacts are byte-stable on the same fixture.

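
The scoring steps above can be sketched roughly as follows. This is a minimal illustration, not the harness's code: the `ScenarioOutcome` shape, the `summarize` helper, and the paired percentile bootstrap are assumptions, since the text only says that a seeded mulberry32 RNG drives the confidence interval. The `mulberry32` function is the commonly circulated form of that generator.

```typescript
// Hypothetical per-scenario outcome; field names are illustrative only.
interface ScenarioOutcome {
  expectMatch: boolean; // label from the fixture
  onRecalled: boolean;  // section was non-null with procedural.enabled=true
  offRecalled: boolean; // section was non-null with procedural.enabled=false
}

// Binary correctness: recall fired exactly when the label says it should.
const correct = (expectMatch: boolean, recalled: boolean): number =>
  recalled === expectMatch ? 1 : 0;

const mean = (xs: number[]): number =>
  xs.reduce((sum, x) => sum + x, 0) / xs.length;

// mulberry32: a small seeded 32-bit PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Assumed method: percentile bootstrap over per-scenario (on - off) differences.
function summarize(outcomes: ScenarioOutcome[], seed: number, resamples = 1000) {
  const on = outcomes.map((o) => correct(o.expectMatch, o.onRecalled));
  const off = outcomes.map((o) => correct(o.expectMatch, o.offRecalled));
  const diffs = on.map((v, i) => v - off[i]);
  const rng = mulberry32(seed);
  const lifts: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < diffs.length; i++) {
      sum += diffs[Math.floor(rng() * diffs.length)];
    }
    lifts.push(sum / diffs.length);
  }
  lifts.sort((x, y) => x - y);
  const ci95: [number, number] = [
    lifts[Math.floor(0.025 * (resamples - 1))],
    lifts[Math.ceil(0.975 * (resamples - 1))],
  ];
  const onScore = mean(on);
  const offScore = mean(off);
  return { onScore, offScore, lift: onScore - offScore, ci95 };
}

// Toy input: 10 task rows recalled only when ON, 5 task rows never recalled,
// 5 distractors never recalled (which is correct for a distractor).
const row = (expectMatch: boolean, onRecalled: boolean): ScenarioOutcome =>
  ({ expectMatch, onRecalled, offRecalled: false });
const sample: ScenarioOutcome[] = [
  ...Array.from({ length: 10 }, () => row(true, true)),
  ...Array.from({ length: 5 }, () => row(true, false)),
  ...Array.from({ length: 5 }, () => row(false, false)),
];
const summary = summarize(sample, 0x72656d6e);
console.log(summary.onScore, summary.offScore, summary.lift);
```

Because the RNG is seeded, two calls to `summarize` with the same outcomes and seed return the same interval, which is the property that keeps a regenerated artifact byte-stable.
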

The committed baseline lives at
`packages/bench/baselines/procedural-recall-baseline.json`.

| Metric | Value |
|---|---|
| Scenarios | 20 |
| `onScore` | 0.75 |
| `offScore` | 0.25 |
| `lift` | 0.50 (50 points) |
| Seed | 0x72656d6e |

Interpretation: with procedural recall ON, Remnic correctly labels 15 of the 20 scenarios (all 5 distractors + 10 of the 15 task-initiation rows). With procedural recall OFF, only the 5 distractor rows score: the gate returns null for everything else, which is the correct behavior on distractors and the wrong behavior on the task rows. The difference is the published lift.

The 5 task-initiation rows that ON still misses are parameter-variation / decomposition cases where the synthetic procedure's vocabulary diverges enough from the prompt that the current token-overlap + intent-classifier composite does not clear the 0.04 threshold. They are the upside an LLM-scored procedure matcher would capture; we keep them in the fixture so a future scoring improvement shows up as lift.

The baseline is deterministic. Regenerate it with:

    pnpm --filter @remnic/bench run build
    tsx packages/bench/scripts/generate-procedural-recall-baseline.ts
    git add packages/bench/baselines/procedural-recall-baseline.json
    git commit -m "bench(procedural): refresh baseline"

The `procedural-recall-baseline.json matches a fresh deterministic run`
unit test asserts that a live ablation run reproduces every field in the
committed baseline. If scenarios change, update the baseline in the same
commit.

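
As a rough illustration of what an anti-drift check of this kind does, the sketch below compares a committed JSON value with a freshly generated one and reports every path that differs. The `diffFields` helper and the field names in the example objects are hypothetical. The actual unit test may compare the files differently.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

const isObject = (v: Json): v is { [key: string]: Json } =>
  typeof v === "object" && v !== null && !Array.isArray(v);

// Returns the paths at which two JSON values differ; empty means no drift.
function diffFields(committed: Json, fresh: Json, path = ""): string[] {
  if (Array.isArray(committed) && Array.isArray(fresh)) {
    const out: string[] = [];
    const len = Math.max(committed.length, fresh.length);
    for (let i = 0; i < len; i++) {
      const child = `${path}[${i}]`;
      if (i >= committed.length || i >= fresh.length) out.push(child);
      else out.push(...diffFields(committed[i], fresh[i], child));
    }
    return out;
  }
  if (isObject(committed) && isObject(fresh)) {
    const out: string[] = [];
    const keys = Array.from(
      new Set([...Object.keys(committed), ...Object.keys(fresh)]),
    );
    for (const key of keys) {
      const child = path ? `${path}.${key}` : key;
      if (!(key in committed) || !(key in fresh)) out.push(child);
      else out.push(...diffFields(committed[key], fresh[key], child));
    }
    return out;
  }
  // Primitives (or mismatched kinds) must be strictly equal.
  return committed === fresh ? [] : [path || "(root)"];
}

// Illustrative objects only; the real baseline schema is not shown in this document.
const committedBaseline: Json = { scenarios: 20, onScore: 0.75, offScore: 0.25, lift: 0.5 };
const driftedRun: Json = { scenarios: 20, onScore: 0.7, offScore: 0.25, lift: 0.45 };
console.log(diffFields(committedBaseline, driftedRun));
```

Reporting paths rather than a single boolean makes a failing run easier to read, since the message names the fields that moved.
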

Use the CLI from PR 1 (#586):

    remnic bench procedural-ablation \
      --fixture ./my-fixture.json \
      --out /tmp/my-ablation.json \
      --seed 42

`my-fixture.json` must be either a bare array of scenarios or
`{ "scenarios": [...] }`. Each scenario requires `id`, `prompt`,
`procedurePreamble`, `procedureSteps`, `procedureTags`, and `expectMatch`.

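
The two accepted fixture shapes can be sketched as below. The field names come from the paragraph above, but the field types (strings, string arrays, a boolean) and the `parseFixture` helper are assumptions for illustration. The CLI's own validation may be stricter or differ in its error messages.

```typescript
// Field names are from the docs; the TYPES here are assumed, not confirmed.
interface ProceduralScenario {
  id: string;
  prompt: string;
  procedurePreamble: string;
  procedureSteps: string[];
  procedureTags: string[];
  expectMatch: boolean;
}

const REQUIRED_FIELDS = [
  "id",
  "prompt",
  "procedurePreamble",
  "procedureSteps",
  "procedureTags",
  "expectMatch",
] as const;

// Accepts a bare array or { "scenarios": [...] } and checks required keys.
function parseFixture(raw: unknown): ProceduralScenario[] {
  const rows = Array.isArray(raw)
    ? raw
    : (raw as { scenarios?: unknown } | null | undefined)?.scenarios;
  if (!Array.isArray(rows)) {
    throw new Error('fixture must be an array or { "scenarios": [...] }');
  }
  rows.forEach((item: unknown, index: number) => {
    for (const field of REQUIRED_FIELDS) {
      if (item === null || typeof item !== "object" || !(field in item)) {
        throw new Error(`scenario #${index} is missing "${field}"`);
      }
    }
  });
  return rows as ProceduralScenario[];
}

// A synthetic, hand-written scenario (not taken from the real fixture).
const exampleScenario: ProceduralScenario = {
  id: "example-1",
  prompt: "Deploy the billing service to staging",
  procedurePreamble: "When deploying a service to staging",
  procedureSteps: ["Run the build", "Apply the staging manifest"],
  procedureTags: ["deploy", "staging"],
  expectMatch: true,
};

console.log(parseFixture([exampleScenario]).length);
console.log(parseFixture({ scenarios: [exampleScenario] }).length);
```
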

The committed number is produced by the deterministic, LLM-free
composite scorer (`buildProcedureRecallSection`). To validate the
fixture's `expectMatch` labels by hand against a real model:

- Install `openai` and export `OPENAI_API_KEY` in your shell.
- For each scenario with `expectMatch: true`, hand-pipe the prompt + the stored procedure body into gpt-4o-mini (via the OpenAI Responses API, per CLAUDE.md) with an instruction like: "Given this procedure and this user turn, answer yes/no: should the assistant start executing the procedure right now?" The expected answer is "yes".
- For `expectMatch: false` scenarios, the model should answer "no".
- A fixture is accepted once gpt-4o-mini agrees with the `expectMatch` label on ≥ 90% of rows. Record disagreements in your PR description.

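
The bookkeeping for this audit might look like the sketch below. It deliberately leaves out the model call: `answer` is assumed to be the raw text reply you collected from gpt-4o-mini, and `OracleRow`, `buildOraclePrompt`, `saidYes`, and `auditLabels` are hypothetical helper names, not part of Remnic. Treating only replies that start with "yes" as a yes is also an assumption.

```typescript
// One audited row: the fixture label plus the model's raw reply.
interface OracleRow {
  id: string;
  expectMatch: boolean;
  answer: string;
}

const ORACLE_INSTRUCTION =
  "Given this procedure and this user turn, answer yes/no: " +
  "should the assistant start executing the procedure right now?";

// Assembles the text to send to the model for one scenario.
function buildOraclePrompt(procedure: string, userTurn: string): string {
  return `${ORACLE_INSTRUCTION}\n\nProcedure:\n${procedure}\n\nUser turn:\n${userTurn}`;
}

// A reply counts as "yes" only if it starts with "yes" (case-insensitive).
const saidYes = (answer: string): boolean => /^\s*yes\b/i.test(answer);

// Computes the agreement rate and lists the rows to record in the PR description.
function auditLabels(rows: OracleRow[], threshold = 0.9) {
  const disagreements = rows
    .filter((r) => saidYes(r.answer) !== r.expectMatch)
    .map((r) => r.id);
  const agreement =
    rows.length === 0 ? 0 : (rows.length - disagreements.length) / rows.length;
  return { agreement, accepted: agreement >= threshold, disagreements };
}

// Made-up replies for illustration: 9 agreements and 1 disagreement.
const audited: OracleRow[] = [
  ...Array.from({ length: 5 }, (_, i) => ({ id: `task-${i}`, expectMatch: true, answer: "Yes." })),
  ...Array.from({ length: 4 }, (_, i) => ({ id: `distractor-${i}`, expectMatch: false, answer: "no" })),
  { id: "distractor-4", expectMatch: false, answer: "Yes, start now." },
];
console.log(auditLabels(audited));
```

Keeping the model call outside these helpers means the audit arithmetic can be checked offline, while the paid API call stays a manual step.
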
This step is not automated and must not be run in CI — costs and rate limits would creep into every green build. Keep the deterministic path as the canonical baseline; the LLM oracle is a manual audit only.

- ≥ 20 synthetic scenarios spanning all four categories.
- Committed `procedural-recall-baseline.json` with a recorded lift.
- Unit test asserts `lift >= 3 points` on a fresh deterministic run.
- Unit test asserts baseline JSON matches a fresh run (anti-drift).
- Deterministic seed: no LLM calls in the test path.

Downstream slices:

- PR 3/5: only raises floor thresholds; does not flip the default.
- PR 4/5: flips the `procedural.enabled` default to `true` across both plugin manifests and `parseConfig`.
- PR 5/5: adds the `remnic procedural stats` CLI + HTTP + MCP surface.