Commit 98a233f

tcconnally and claude committed
perf(remember): stored per-row dedup signatures — exact near-duplicate verdicts without the O(M·N) per-insert trigram rebuild (#392)
find_near_duplicate rebuilt every same-category candidate's character-trigram set from its body on each insert (~30µs per candidate at 1KB), so remember() stalled ~1.6s at 50k entities — 0.6 inserts/s, a hang to an MCP client. Owner decision on #392: lossless signatures.

Design (schema v10, lazy backfill):

- src/dedup.rs: trigrams packed injectively into u64s (3×21-bit scalar values), stored per row as a delta-varint sorted set. Jaccard over packed sets is bit-identical to the old HashSet<[char;3]> Jaccard.
- Two side tables split by access pattern: dedup_signatures (freshness guard body_len, set size, 256-bucket prune histogram — small rows the scan LEFT JOINs for every candidate) and dedup_signature_blobs (the multi-KB exact set, point-fetched only for candidates surviving both prunes). Keeps the scan's hot page footprint tiny, which is what matters when interleaved writers invalidate pooled page caches.
- Two provably lossless prunes before any blob is touched: the exact set-size ceiling min/(a+b-min), and the histogram intersection ceiling Σ min(hA[j], hB[j]) — both only ever skip candidates whose best possible similarity is below the threshold; the merge itself early-abandons on the same bound. Verdict expression on completion is the identical f64 comparison the exhaustive scan evaluates.
- Signatures derive from the STORED body_json column value, ciphertext when encryption is on: that both preserves the historical encrypted-store dedup behavior exactly (plaintext-vs-ciphertext comparison, effectively never a match) and leaks nothing about the plaintext — unlike entities_fts, which stores the full plaintext body.
- Writes ride the same transaction as the row (create, update, revive, rekey_aad refresh); rows without a fresh signature take the old rebuild path with identical verdicts and are backfilled in bounded batches (512/scan) — no eager migration pass.
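
As an illustration of the first design bullet, the packing and the delta-varint storage format can be sketched roughly as follows. This is a minimal sketch, not the code in src/dedup.rs: the function names, the field order inside the u64, and the choice of LEB128-style varints are all assumptions made for the example. The only facts taken from the commit message are the 3×21-bit packing and the sorted delta-varint set.

```rust
/// Pack three Unicode scalar values into one u64, 21 bits each.
/// Every `char` is at most 0x10FFFF, which is below 2^21, so distinct
/// trigrams always map to distinct u64s (the packing is injective).
/// Field order is an assumption of this sketch.
fn pack_trigram(a: char, b: char, c: char) -> u64 {
    ((a as u64) << 42) | ((b as u64) << 21) | (c as u64)
}

/// Sorted, deduplicated packed character trigrams of `text`.
fn trigram_set(text: &str) -> Vec<u64> {
    let chars: Vec<char> = text.chars().collect();
    let mut set: Vec<u64> = chars
        .windows(3)
        .map(|w| pack_trigram(w[0], w[1], w[2]))
        .collect();
    set.sort_unstable();
    set.dedup();
    set
}

/// Encode a strictly increasing set as varints of successive deltas
/// (7 payload bits per byte, high bit set means "more bytes follow").
fn encode_delta_varint(sorted: &[u64]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u64;
    for &v in sorted {
        let mut delta = v - prev;
        prev = v;
        loop {
            let byte = (delta & 0x7f) as u8;
            delta >>= 7;
            if delta == 0 {
                out.push(byte);
                break;
            }
            out.push(byte | 0x80);
        }
    }
    out
}

/// Decode the format above. Returns None for truncated or overlong
/// input so a malformed blob can be detected instead of trusted.
fn decode_delta_varint(bytes: &[u8]) -> Option<Vec<u64>> {
    let mut out = Vec::new();
    let (mut prev, mut acc, mut shift) = (0u64, 0u64, 0u32);
    for &b in bytes {
        if shift >= 64 {
            return None; // more continuation bytes than a u64 can hold
        }
        acc |= ((b & 0x7f) as u64) << shift;
        if b & 0x80 == 0 {
            prev = prev.checked_add(acc)?;
            out.push(prev);
            acc = 0;
            shift = 0;
        } else {
            shift += 7;
        }
    }
    if shift != 0 { None } else { Some(out) }
}
```

Under these assumptions, `decode_delta_varint(&encode_delta_varint(&trigram_set(body)))` returns the original set. Neighbouring trigrams in a sorted set tend to share their high bits, so the deltas are usually far smaller than the raw 63-bit values, which is what makes delta encoding attractive for a per-row blob.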

Equivalence proof: randomized property test pits the new scan against the verbatim pre-#392 implementation on identical stores (clones, near-clones, unrelated, tiny, unicode, threshold-boundary sweeps; signed, unsigned and mixed rows; exact + FTS-prefilter variants) and asserts the identical Option<matched id>, plus targeted stale/malformed-signature repair, encryption, and migration tests.

Measured (release, 1KB uniform-length bodies — the length prefilter's worst case; medians over 15 probes): single-insert dedup scan @50k 1628.5ms → 69.3ms (23.5x); bulk import of 5,000 into a fresh store (dedup ON) 123.6s pre (#392) → 11.8s total (~10x); @10k scan 327ms → 3ms-class. #[ignore] bench with env-tunable scale included.

Closes #392.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
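
The core of the equivalence claim is that a Jaccard computed by merging two sorted packed sets yields the same f64, bit for bit, as one computed over `HashSet<[char; 3]>`. A small sketch of why: both paths reduce to the same two integers (intersection size and union size) before the single division. The function names below are hypothetical, and returning 0.0 for two empty sets is an assumption of this example rather than documented behavior.

```rust
use std::cmp::Ordering;
use std::collections::HashSet;

/// Reference path: character trigrams in a HashSet, as in the old scan.
fn char_trigrams(s: &str) -> HashSet<[char; 3]> {
    let chars: Vec<char> = s.chars().collect();
    chars.windows(3).map(|w| [w[0], w[1], w[2]]).collect()
}

fn jaccard_reference(a: &str, b: &str) -> f64 {
    let (sa, sb) = (char_trigrams(a), char_trigrams(b));
    let inter = sa.intersection(&sb).count();
    let union = sa.len() + sb.len() - inter;
    // Assumption: empty-vs-empty scores 0.0 in this sketch.
    if union == 0 { 0.0 } else { inter as f64 / union as f64 }
}

/// Packed path: 3 x 21-bit scalar values per u64, sorted and deduplicated.
fn packed_trigrams(s: &str) -> Vec<u64> {
    let chars: Vec<char> = s.chars().collect();
    let mut v: Vec<u64> = chars
        .windows(3)
        .map(|w| ((w[0] as u64) << 42) | ((w[1] as u64) << 21) | (w[2] as u64))
        .collect();
    v.sort_unstable();
    v.dedup();
    v
}

/// Count the intersection with a linear merge of two sorted sets,
/// then evaluate the same integer-to-f64 division as the reference.
fn jaccard_packed(a: &[u64], b: &[u64]) -> f64 {
    let (mut i, mut j, mut inter) = (0usize, 0usize, 0usize);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                inter += 1;
                i += 1;
                j += 1;
            }
        }
    }
    let union = a.len() + b.len() - inter;
    if union == 0 { 0.0 } else { inter as f64 / union as f64 }
}
```

A property test in this spirit would generate random string pairs and assert `jaccard_reference(a, b).to_bits() == jaccard_packed(&packed_trigrams(a), &packed_trigrams(b)).to_bits()`. Comparing bit patterns rather than using an epsilon matches the "identical f64 comparison" wording in the commit message.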
1 parent 2a559bd commit 98a233f

5 files changed

Lines changed: 1492 additions & 38 deletions

CHANGELOG.md

Lines changed: 19 additions & 0 deletions
@@ -15,6 +15,25 @@ All notable changes to Perseus Vault (formerly Mimir/Mneme) are documented here.
   existing deterministic pick, unchanged.
 
 ### Fixed
+- `remember()`'s near-duplicate scan no longer rebuilds every candidate's
+  trigram set per insert (#392) — the O(M·N) cost that made a single write
+  stall ~1.6s at 50k same-category entities (0.6 inserts/s). Each row's
+  packed trigram set is now stored once at write time (schema v10:
+  `dedup_signatures` + `dedup_signature_blobs`, derived from the stored —
+  i.e. possibly encrypted — `body_json` column value) and the scan computes
+  its verdict from the stored signature behind two provably lossless prunes
+  (exact set-size ceiling, 256-bucket histogram intersection ceiling).
+  Dedup semantics are EXACT: the new path returns the identical
+  match-or-not and matched id as the exhaustive trigram-Jaccard scan
+  (randomized property-tested against the old implementation, including
+  threshold-boundary, tiny-body, unicode and encrypted stores). Existing
+  rows need no migration pass: unsigned rows take the old rebuild path and
+  are backfilled lazily in bounded batches (512/scan). Measured (release,
+  1KB uniform-length bodies — the length prefilter's worst case; medians
+  over 15 probes): single-insert dedup scan @50k 1628.5ms → 69.3ms
+  (23.5x); bulk import of 5,000 (fresh store, dedup ON) 123.6s (pre, per
+  #392) → 11.8s total (~10x). The opt-in `MIMIR_DEDUP_FTS_PREFILTER` path
+  is unchanged and composes with the stored signatures.
 - `follow()`'s row resolution no longer collapses real DB errors into
   "not found" (#396, the #394 principle): only `QueryReturnedNoRows` maps to
   the clean `found: false` report; a locked file or corruption error now
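
The two prunes named in the `remember()` entry can be sketched as upper bounds on Jaccard similarity. The sketch below is an illustration under assumptions, not the shipped code: `bucket_of` is a hypothetical bucket function (the commit does not say how trigrams map to the 256 buckets), histograms are recomputed here rather than read from `dedup_signatures`, and the `>=` threshold comparison is assumed.

```rust
use std::cmp::Ordering;

const BUCKETS: usize = 256;

/// Hypothetical bucket function: top 8 bits of a multiplicative mix.
fn bucket_of(packed: u64) -> usize {
    (packed.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 56) as usize
}

/// Per-bucket counts of a packed trigram set.
fn histogram(set: &[u64]) -> [u32; BUCKETS] {
    let mut h = [0u32; BUCKETS];
    for &v in set {
        h[bucket_of(v)] += 1;
    }
    h
}

/// inter / (a + b - inter), the Jaccard value for a given intersection size.
fn ratio(inter: usize, a: usize, b: usize) -> f64 {
    let union = a + b - inter;
    if union == 0 { 0.0 } else { inter as f64 / union as f64 }
}

/// Set-size ceiling: the intersection cannot exceed the smaller set,
/// so similarity is at most min / (a + b - min).
fn size_ceiling(a: usize, b: usize) -> f64 {
    ratio(a.min(b), a, b)
}

/// Histogram ceiling: within each bucket the sets share at most
/// min(hA[j], hB[j]) elements, so the sum bounds the intersection.
fn histogram_ceiling(ha: &[u32; BUCKETS], hb: &[u32; BUCKETS], a: usize, b: usize) -> f64 {
    let cap: usize = ha.iter().zip(hb.iter()).map(|(&x, &y)| x.min(y) as usize).sum();
    ratio(cap, a, b)
}

/// Exact Jaccard over two sorted, deduplicated sets.
fn jaccard_sorted(a: &[u64], b: &[u64]) -> f64 {
    let (mut i, mut j, mut inter) = (0usize, 0usize, 0usize);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                inter += 1;
                i += 1;
                j += 1;
            }
        }
    }
    ratio(inter, a.len(), b.len())
}

/// Apply both prunes before the exact merge. A candidate is skipped only
/// when its ceiling is already below the threshold, so the verdict is the
/// same as running the exact comparison on every candidate.
fn is_near_duplicate(a: &[u64], b: &[u64], threshold: f64) -> bool {
    if size_ceiling(a.len(), b.len()) < threshold {
        return false;
    }
    if histogram_ceiling(&histogram(a), &histogram(b), a.len(), b.len()) < threshold {
        return false;
    }
    jaccard_sorted(a, b) >= threshold
}
```

Both ceilings use the same `ratio` expression as the exact path with a larger or equal intersection count, and correctly rounded f64 division is monotone, so neither ceiling can fall below the exact score. That is the sense in which a prune of this shape is lossless: it can only reject candidates the exact comparison would also reject.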
