fix(embedding): deterministic ONNX inference — bit-reproducible recall (#310) (#317)

tcconnally · tcconnally · claude · web-flow · commit 9391cd53ca40 · 2026-06-30T12:59:53.000-05:00
The embedding-backed dense/hybrid/auto recall metrics were not reproducible run-to-run: two full-500 LongMemEval runs of the SAME binary/command gave 84.6% vs 85.0% recall@1 (signatures 9babb85 vs 2477b51). fts5 was always exact; only the modes that depend on the bundled ONNX model drifted, because multi-threaded ORT reduces in nondeterministic order → tiny FP differences → borderline cosine ranks flip on ~2-3 of 500 questions.

Pin the bundled (and file-backed) ONNX session to a single intra-op thread and enable deterministic compute, so the same input yields a byte-identical embedding every run. The model is tiny (MiniLM-L6, short inputs) and results are LRU-cached, so the single-thread cost is negligible.

Validated: two full-500 runs of the rebuilt binary now produce IDENTICAL signatures (9babb85 == 9babb85). Recall gate still PASS (no quality regression).

README determinism wording updated to reflect that all modes are now reproducible run-to-run (reverting the caveat #309 had added).

Closes #310.

Co-authored-by: tcconnally <hermes@perseus.observer>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
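
The floating-point effect behind the drift can be seen in isolation. The following is a minimal sketch in plain Rust using only the standard library; it does not touch the ORT code path, and `sum_in_order` and the sample values are hypothetical, chosen for illustration. It shows that adding the same four `f32` values in two different orders gives two different results, while repeating one fixed order gives the same bits each time.

```rust
/// Add `values` in the sequence given by `order` (a list of indices).
/// Hypothetical helper: it stands in for a reduction whose accumulation
/// order depends on how work was split across threads.
fn sum_in_order(values: &[f32], order: &[usize]) -> f32 {
    order.iter().fold(0.0_f32, |acc, &i| acc + values[i])
}

fn main() {
    // Illustrative values only. At magnitude 1e8 adjacent f32 values are
    // 8 apart, so adding 1.0 to 1e8 rounds back to 1e8.
    let values = [1.0e8_f32, 1.0, -1.0e8, 1.0];

    // ((1e8 + 1) - 1e8) + 1: the first 1.0 is lost to rounding.
    let a = sum_in_order(&values, &[0, 1, 2, 3]);
    // ((1e8 - 1e8) + 1) + 1: the large terms cancel first, nothing is lost.
    let b = sum_in_order(&values, &[0, 2, 1, 3]);

    println!("order A: {a} (bits {:#010x})", a.to_bits());
    println!("order B: {b} (bits {:#010x})", b.to_bits());
    assert_ne!(a.to_bits(), b.to_bits());

    // A fixed order reproduces the same bits on every call.
    let again = sum_in_order(&values, &[0, 1, 2, 3]);
    assert_eq!(a.to_bits(), again.to_bits());
}
```

In a real model the differences are far smaller than in this exaggerated case, but a near-tied pair of cosine scores only needs a last-bit change to swap ranks. Fixing the reduction order, here by using one intra-op thread, removes that source of variation.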
diff --git a/benchmark/longmemeval/README.md b/benchmark/longmemeval/README.md
@@ -38,11 +38,12 @@ curl -L https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/m
 python benchmark/longmemeval/run.py --data longmemeval_s_cleaned.json
 ```
 
-Output: a signed `report.json` plus a console table. The run is offline. `fts5`
-is bit-identical run-to-run and the RRF fusion step is deterministic (#247); the
-embedding-backed `dense`/`hybrid`/`auto` metrics vary by ~0.3% across runs
-because the ONNX backend's float math is not bit-reproducible (#310). Treat the
-hybrid headline as a representative number ±~0.3%, not a byte-exact signature.
+Output: a signed `report.json` plus a console table. The run is offline and the
+metrics are **deterministic run-to-run** across every mode: `fts5` and the RRF
+fusion step always were (#247), and the embedding-backed `dense`/`hybrid`/`auto`
+modes are now too — the bundled ONNX backend is pinned to single-threaded,
+deterministic inference (#310), so the same input yields a byte-identical
+embedding (and therefore a byte-identical signature) on every run.
 
 ## Method
 
@@ -85,7 +86,8 @@ step. `auto` == `hybrid` to the digit, confirming the default equals the ceiling
 (Standalone dense, measured separately, is 77.0% / 93.8% — so fusing the keyword arm
 adds ~8 points of recall@1 over dense alone.) **#309** raised the keyword arm to equal
 weight in the RRF fusion (it had been under-weighted at 0.5), lifting the default from
-82.2% / 0.884 MRR to the numbers above. The headline carries ~0.3% run-to-run noise (#310).
+82.2% / 0.884 MRR to the numbers above. These numbers are now reproducible run-to-run
+across all modes (deterministic embeddings, #310).
 
 By question type (default/auto recall@1 / recall@5):
 
@@ -102,8 +104,8 @@ Equal-weight fusion (#309) improved 5 of the 6 types vs the old 0.5 weight; the
 `single-session-preference` set (n=30) traded down (63.3→56.7 recall@1) as the net
 across all 500 rose. Reproduce the default experience:
 `python benchmark/longmemeval/run.py --data longmemeval_s_cleaned.json --skip-explicit-embed --modes auto fts5`
-(one representative signature `9babb85...`; `fts5` is exact, the hybrid number moves
-~0.3% run-to-run per #310). Drop the flags to also measure the explicit dense/hybrid modes.
+(signature `9babb85...`, byte-identical run-to-run now that embeddings are deterministic,
+#310). Drop the flags to also measure the explicit dense/hybrid modes.
 <!-- RESULTS-END -->
 
 ## Stage 2: QA accuracy (answer generation + LongMemEval's official judge)
diff --git a/src/embedding.rs b/src/embedding.rs
@@ -240,7 +240,16 @@ fn cached_ort_model(
     let (session, tokenizer) = if config.bundled {
         // #237: load the compiled-in model + tokenizer straight from memory — no
         // file on disk, no network.
-        let session = Session::builder()?.commit_from_memory(BUNDLED_MODEL)?;
+        // #310: pin a single intra-op thread + deterministic compute so the
+        // embedding is bit-reproducible run-to-run. Multi-threaded ORT reduces
+        // in nondeterministic order, producing tiny FP differences that flip
+        // near-tied cosine ranks (≈0.3% of LongMemEval at scale). The model is
+        // tiny (MiniLM-L6, short inputs) and results are LRU-cached, so the
+        // single-thread cost is negligible.
+        let session = Session::builder()?
+            .with_intra_threads(1)?
+            .with_deterministic_compute(true)?
+            .commit_from_memory(BUNDLED_MODEL)?;
         let tokenizer = tokenizers::Tokenizer::from_bytes(BUNDLED_TOKENIZER)
             .map_err(|e| format!("failed to load bundled tokenizer: {}", e))?;
         (session, tokenizer)
@@ -250,7 +259,10 @@ fn cached_ort_model(
             .parent()
             .ok_or("model_path must have a parent directory")?;
         let tokenizer_path = model_dir.join("tokenizer.json");
-        let session = Session::builder()?.commit_from_file(&config.model_path)?;
+        let session = Session::builder()?
+            .with_intra_threads(1)?
+            .with_deterministic_compute(true)?
+            .commit_from_file(&config.model_path)?;
         let tokenizer = tokenizers::Tokenizer::from_file(&tokenizer_path)
             .map_err(|e| format!("failed to load tokenizer: {}", e))?;
         (session, tokenizer)