Skip to content

Commit f70dc3a

Browse files
tcconnallytcconnallyclaude
authored
perf(remember): stored per-row dedup signatures — exact verdicts without the O(M·N) trigram rebuild (closes #392) (#419)
* perf(remember): stored per-row dedup signatures — exact near-duplicate verdicts without the O(M·N) per-insert trigram rebuild (#392) find_near_duplicate rebuilt every same-category candidate's character- trigram set from its body on each insert (~30µs per candidate at 1KB), so remember() stalled ~1.6s at 50k entities — 0.6 inserts/s, a hang to an MCP client. Owner decision on #392: lossless signatures. Design (schema v10, lazy backfill): - src/dedup.rs: trigrams packed injectively into u64s (3×21-bit scalar values), stored per row as a delta-varint sorted set. Jaccard over packed sets is bit-identical to the old HashSet<[char;3]> Jaccard. - Two side tables split by access pattern: dedup_signatures (freshness guard body_len, set size, 256-bucket prune histogram — small rows the scan LEFT JOINs for every candidate) and dedup_signature_blobs (the multi-KB exact set, point-fetched only for candidates surviving both prunes). Keeps the scan's hot page footprint tiny, which is what matters when interleaved writers invalidate pooled page caches. - Two provably lossless prunes before any blob is touched: the exact set-size ceiling min/(a+b-min), and the histogram intersection ceiling Σ min(hA[j], hB[j]) — both only ever skip candidates whose best possible similarity is below the threshold; the merge itself early-abandons on the same bound. Verdict expression on completion is the identical f64 comparison the exhaustive scan evaluates. - Signatures derive from the STORED body_json column value, ciphertext when encryption is on: that both preserves the historical encrypted- store dedup behavior exactly (plaintext-vs-ciphertext comparison, effectively never a match) and leaks nothing about the plaintext — unlike entities_fts, which stores the full plaintext body. - Writes ride the same transaction as the row (create, update, revive, rekey_aad refresh); rows without a fresh signature take the old rebuild path with identical verdicts and are backfilled in bounded batches (512/scan) — no eager migration pass. Equivalence proof: randomized property test pits the new scan against the verbatim pre-#392 implementation on identical stores (clones, near-clones, unrelated, tiny, unicode, threshold-boundary sweeps; signed, unsigned and mixed rows; exact + FTS-prefilter variants) and asserts the identical Option<matched id>, plus targeted stale/ malformed-signature repair, encryption, and migration tests. Measured (release, 1KB uniform-length bodies — the length prefilter's worst case; medians over 15 probes): single-insert dedup scan @50k 1628.5ms → 69.3ms (23.5x); bulk import of 5,000 into a fresh store (dedup ON) 123.6s pre (#392) → 11.8s total (~10x); @10k scan 327ms → 3ms-class. #[ignore] bench with env-tunable scale included. Closes #392. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(dedup-sig): (len, hash) freshness guard + race-safe lazy backfill (#392 review) Review found two defects sharing one root cause — a length-only freshness guard cannot detect a same-length body rewrite: 1. Backfill flush race (losslessness violation): the scan queued rebuilt signatures and flushed them AFTER via unconditional INSERT OR REPLACE. A remember() update committing a same-length different body (+ fresh signature) between scan and flush was then overwritten by the stale signature — body_len matched, blob was well-formed, so it was trusted forever and dedup verdicts silently diverged from the exhaustive scan. 2. Rollback hazard: a pre-v10 binary running against a v10 store rewrites body_json without touching the signature tables; a same-length rewrite (which AES-GCM re-encryption always is) left an undetectably stale signature. Fix: - dedup_signatures gains body_hash (stable inline 64-bit chunked hash — deliberately NOT std's DefaultHasher, since the value is persisted and the algorithm must never drift across Rust versions). The scan trusts a signature only while BOTH body_len and body_hash match the fetched body; mismatches fall back to the exact rebuild path (identical verdicts) and self-heal. v10 stores are rollback-safe. - flush_dedup_sig_backfill takes the write lock up front (BEGIN IMMEDIATE) and re-verifies each row's CURRENT body (length + hash) under that lock before writing, so a racing update always wins regardless of arrival order. Both defects are pinned by regression tests that fail pre-fix: dedup_sig_backfill_flush_loses_to_concurrent_rewrite (stale flush overwrote the fresh signature; verdict None vs reference Some) and dedup_sig_hash_guard_catches_same_length_rewrite (direct same-length SQL rewrite; verdict None vs reference Some). Plus a pinned-stability unit test for the hash and an env-tunable seed count for the equivalence property test (MIMIR_DEDUP_PROP_SEEDS). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs(changelog): refresh #392 measurements post hash-guard The (len, hash) freshness guard verifies a 64-bit content hash per candidate row, which honestly costs ~0.5µs/row: @50k the paired-run numbers are now 1363.3ms -> 89.4ms (15.3x) and bulk-5000 15.1s (~8x). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(dedup-sig): pin body_hash64 to literal constants (review follow-up) The stability test only checked self-consistency (body_hash64(x) == build_row_signature(x).body_hash — both call the same fn), so a real algorithm drift (e.g. rotate_left(23)->(24)) still passed while silently invalidating every persisted v10 signature across binaries. Add hardcoded literal pins for "perseus-vault" (-4349344705766122978) and "" (1530470515733238723); verified a rotate-constant tweak now fails the test, then reverted. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: tcconnally <hermes@perseus.observer> Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
1 parent 19161cf commit f70dc3a

5 files changed

Lines changed: 1756 additions & 38 deletions

File tree

CHANGELOG.md

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -113,6 +113,33 @@ All notable changes to Perseus Vault (formerly Mimir/Mneme) are documented here.
113113
`mimir_history`/`mimir_as_of` and the journal append-only forever.
114114
`PurgeReport` gains `history_rows_deleted` / `journal_rows_redacted`
115115
(dry_run previews both with the same predicates).
116+
- `remember()`'s near-duplicate scan no longer rebuilds every candidate's
117+
trigram set per insert (#392) — the O(M·N) cost that made a single write
118+
stall ~1.6s at 50k same-category entities (0.6 inserts/s). Each row's
119+
packed trigram set is now stored once at write time (schema v10:
120+
`dedup_signatures` + `dedup_signature_blobs`, derived from the stored —
121+
i.e. possibly encrypted — `body_json` column value) and the scan computes
122+
its verdict from the stored signature behind two provably lossless prunes
123+
(exact set-size ceiling, 256-bucket histogram intersection ceiling).
124+
Dedup semantics are EXACT: the new path returns the identical
125+
match-or-not and matched id as the exhaustive trigram-Jaccard scan
126+
(randomized property-tested against the old implementation, including
127+
threshold-boundary, tiny-body, unicode and encrypted stores). Existing
128+
rows need no migration pass: unsigned rows take the old rebuild path and
129+
are backfilled lazily in bounded batches (512/scan). A signature is
130+
trusted only while BOTH the stored body's byte length and a stable
131+
64-bit content hash still match, so same-length rewrites by
132+
signature-unaware writers — e.g. a rolled-back pre-v10 binary running
133+
against a v10 store — read as stale and self-heal instead of poisoning
134+
verdicts (rollback-safe; dropping the two side tables is also always a
135+
safe reset). The lazy write-back re-verifies the row's current body
136+
under the write lock before landing, so a backfill can never overwrite
137+
the fresher signature a concurrent update just committed. Measured
138+
(release, 1KB uniform-length bodies — the length prefilter's worst
139+
case; medians over 15 probes, same run for both paths): single-insert
140+
dedup scan @50k 1363.3ms → 89.4ms (15.3x); bulk import of 5,000
141+
(fresh store, dedup ON) 123.6s (pre, per #392) → 15.1s total (~8x). The opt-in `MIMIR_DEDUP_FTS_PREFILTER` path
142+
is unchanged and composes with the stored signatures.
116143
- `follow()`'s row resolution no longer collapses real DB errors into
117144
"not found" (#396, the #394 principle): only `QueryReturnedNoRows` maps to
118145
the clean `found: false` report; a locked file or corruption error now

0 commit comments

Comments
 (0)