Commit 9391cd5

tcconnally and claude authored
fix(embedding): deterministic ONNX inference — bit-reproducible recall (#310) (#317)
The embedding-backed dense/hybrid/auto recall metrics were not reproducible run-to-run: two full-500 LongMemEval runs of the SAME binary/command gave 84.6% vs 85.0% recall@1 (signatures 9babb85 vs 2477b51). fts5 was always exact; only the modes that depend on the bundled ONNX model drifted, because multi-threaded ORT reduces in nondeterministic order → tiny FP differences → borderline cosine ranks flip on ~2-3 of 500 questions.

Pin the bundled (and file-backed) ONNX session to a single intra-op thread and enable deterministic compute, so the same input yields a byte-identical embedding every run. The model is tiny (MiniLM-L6, short inputs) and results are LRU-cached, so the single-thread cost is negligible.

Validated: two full-500 runs of the rebuilt binary now produce IDENTICAL signatures (9babb85 == 9babb85). Recall gate still PASS (no quality regression).

README determinism wording updated to reflect that all modes are now reproducible run-to-run (reverting the caveat #309 had added).

Closes #310.

Co-authored-by: tcconnally <hermes@perseus.observer>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 19f9f38 commit 9391cd5
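
As background for the commit message's point about reduction order: floating-point addition is not associative, so summing the same values in a different order can round to a different result. The sketch below is illustrative only and is not part of this commit. `sum_in_order` and the three input values are hypothetical, chosen so the effect is large enough to see in `f32`.

```rust
/// Sum `values` left to right, visiting them in the sequence given by `order`
/// (a list of indices into `values`). Hypothetical helper for illustration.
fn sum_in_order(values: &[f32], order: &[usize]) -> f32 {
    order.iter().fold(0.0_f32, |acc, &i| acc + values[i])
}

fn main() {
    // Illustrative values: a large pair that cancels, plus one small term.
    let values = [1.0e8_f32, 1.0, -1.0e8];

    // Order A: the small term is added to 1e8 first and is rounded away,
    // because adjacent f32 values near 1e8 are 8 apart.
    let a = sum_in_order(&values, &[0, 1, 2]);

    // Order B: the large terms cancel first, so the small term survives.
    let b = sum_in_order(&values, &[0, 2, 1]);

    println!("order A = {a}, order B = {b}");
    println!("bit-identical: {}", a.to_bits() == b.to_bits());
}
```

A multi-threaded reduction can combine partial sums in an order that depends on scheduling, which is presumably why the same input produced slightly different embeddings between runs. With a single intra-op thread the order is fixed.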

2 files changed

Lines changed: 24 additions & 10 deletions

benchmark/longmemeval/README.md

Lines changed: 10 additions & 8 deletions
@@ -38,11 +38,12 @@ curl -L https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/m
 python benchmark/longmemeval/run.py --data longmemeval_s_cleaned.json
 ```

-Output: a signed `report.json` plus a console table. The run is offline. `fts5`
-is bit-identical run-to-run and the RRF fusion step is deterministic (#247); the
-embedding-backed `dense`/`hybrid`/`auto` metrics vary by ~0.3% across runs
-because the ONNX backend's float math is not bit-reproducible (#310). Treat the
-hybrid headline as a representative number ±~0.3%, not a byte-exact signature.
+Output: a signed `report.json` plus a console table. The run is offline and the
+metrics are **deterministic run-to-run** across every mode: `fts5` and the RRF
+fusion step always were (#247), and the embedding-backed `dense`/`hybrid`/`auto`
+modes are now too — the bundled ONNX backend is pinned to single-threaded,
+deterministic inference (#310), so the same input yields a byte-identical
+embedding (and therefore a byte-identical signature) on every run.

 ## Method

@@ -85,7 +86,8 @@ step. `auto` == `hybrid` to the digit, confirming the default equals the ceiling
 (Standalone dense, measured separately, is 77.0% / 93.8% — so fusing the keyword arm
 adds ~8 points of recall@1 over dense alone.) **#309** raised the keyword arm to equal
 weight in the RRF fusion (it had been under-weighted at 0.5), lifting the default from
-82.2% / 0.884 MRR to the numbers above. The headline carries ~0.3% run-to-run noise (#310).
+82.2% / 0.884 MRR to the numbers above. These numbers are now reproducible run-to-run
+across all modes (deterministic embeddings, #310).

 By question type (default/auto recall@1 / recall@5):

@@ -102,8 +104,8 @@ Equal-weight fusion (#309) improved 5 of the 6 types vs the old 0.5 weight; the
 `single-session-preference` set (n=30) traded down (63.3→56.7 recall@1) as the net
 across all 500 rose. Reproduce the default experience:
 `python benchmark/longmemeval/run.py --data longmemeval_s_cleaned.json --skip-explicit-embed --modes auto fts5`
-(one representative signature `9babb85...`; `fts5` is exact, the hybrid number moves
-~0.3% run-to-run per #310). Drop the flags to also measure the explicit dense/hybrid modes.
+(signature `9babb85...`, byte-identical run-to-run now that embeddings are deterministic,
+#310). Drop the flags to also measure the explicit dense/hybrid modes.
 <!-- RESULTS-END -->

 ## Stage 2: QA accuracy (answer generation + LongMemEval's official judge)
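
The rank flips that the old README wording attributed to float noise can be illustrated with a small sketch: when two candidates score within a few units in the last place (ULPs) of each other, a drift of that size in either score changes which one ranks first. This is an illustration under stated assumptions, not the project's ranking code. `top1` and `nudge_up` are hypothetical helpers, and 0.75 is an arbitrary similarity score.

```rust
/// Index of the highest score, keeping the lowest index on exact ties.
/// Hypothetical helper; assumes the scores contain no NaN.
fn top1(scores: &[f32]) -> Option<usize> {
    let mut best: Option<usize> = None;
    for (i, &s) in scores.iter().enumerate() {
        match best {
            Some(b) if scores[b] >= s => {} // current best still wins
            _ => best = Some(i),
        }
    }
    best
}

/// Move a positive, finite f32 up by `ulps` representable steps.
/// Only valid for positive finite inputs far from overflow.
fn nudge_up(x: f32, ulps: u32) -> f32 {
    f32::from_bits(x.to_bits() + ulps)
}

fn main() {
    let base = 0.75_f32;

    // Run A: candidate 1 leads candidate 0 by a single ULP.
    let run_a = [base, nudge_up(base, 1)];

    // Run B: candidate 0 drifts up by two ULPs, candidate 1 is unchanged.
    let run_b = [nudge_up(base, 2), nudge_up(base, 1)];

    println!("run A top-1: {:?}", top1(&run_a));
    println!("run B top-1: {:?}", top1(&run_b));
}
```

Two flipped questions out of 500 is 0.4 points of recall@1, which matches the 84.6% vs 85.0% gap (423 vs 425 hits) reported in the commit message.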

src/embedding.rs

Lines changed: 14 additions & 2 deletions
@@ -240,7 +240,16 @@ fn cached_ort_model(
     let (session, tokenizer) = if config.bundled {
         // #237: load the compiled-in model + tokenizer straight from memory — no
         // file on disk, no network.
-        let session = Session::builder()?.commit_from_memory(BUNDLED_MODEL)?;
+        // #310: pin a single intra-op thread + deterministic compute so the
+        // embedding is bit-reproducible run-to-run. Multi-threaded ORT reduces
+        // in nondeterministic order, producing tiny FP differences that flip
+        // near-tied cosine ranks (≈0.3% of LongMemEval at scale). The model is
+        // tiny (MiniLM-L6, short inputs) and results are LRU-cached, so the
+        // single-thread cost is negligible.
+        let session = Session::builder()?
+            .with_intra_threads(1)?
+            .with_deterministic_compute(true)?
+            .commit_from_memory(BUNDLED_MODEL)?;
         let tokenizer = tokenizers::Tokenizer::from_bytes(BUNDLED_TOKENIZER)
             .map_err(|e| format!("failed to load bundled tokenizer: {}", e))?;
         (session, tokenizer)
@@ -250,7 +259,10 @@ fn cached_ort_model(
             .parent()
             .ok_or("model_path must have a parent directory")?;
         let tokenizer_path = model_dir.join("tokenizer.json");
-        let session = Session::builder()?.commit_from_file(&config.model_path)?;
+        let session = Session::builder()?
+            .with_intra_threads(1)?
+            .with_deterministic_compute(true)?
+            .commit_from_file(&config.model_path)?;
         let tokenizer = tokenizers::Tokenizer::from_file(&tokenizer_path)
             .map_err(|e| format!("failed to load tokenizer: {}", e))?;
         (session, tokenizer)
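
One way to check the property this change targets is to embed the same input several times and compare the raw bits of every component, since `==` on floats treats `0.0` and `-0.0` as equal and would hide some differences. The sketch below assumes a hypothetical embedder passed in as a closure. `bit_identical`, `is_deterministic`, and the stand-in `fake_embed` are illustrative names that do not exist in `src/embedding.rs`; a real check would call the ONNX-backed embedder instead.

```rust
/// True when both vectors have the same length and every component has
/// exactly the same bit pattern.
fn bit_identical(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}

/// Embed `input` `runs` times and report whether every result matches the
/// first one bit for bit. `embed` is a hypothetical stand-in for the real
/// embedding call.
fn is_deterministic<F>(mut embed: F, input: &str, runs: usize) -> bool
where
    F: FnMut(&str) -> Vec<f32>,
{
    let reference = embed(input);
    (1..runs).all(|_| bit_identical(&reference, &embed(input)))
}

fn main() {
    // Stand-in embedder (assumption): maps each byte to a value in [0, 1].
    let fake_embed = |text: &str| -> Vec<f32> {
        text.bytes().map(|b| f32::from(b) / 255.0).collect()
    };

    println!("deterministic: {}", is_deterministic(fake_embed, "hello", 3));

    // Equal under `==`, but not bit-identical.
    println!("0.0 vs -0.0: {}", bit_identical(&[0.0_f32], &[-0.0_f32]));
}
```

Because embeddings are LRU-cached according to the commit message, a check like this would need to bypass or clear the cache between calls. Otherwise it only compares a cached value with itself.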
