Commit 98a233f
perf(remember): stored per-row dedup signatures — exact near-duplicate verdicts without the O(M·N) per-insert trigram rebuild (#392)
find_near_duplicate rebuilt every same-category candidate's character-
trigram set from its body on each insert (~30µs per candidate at 1KB),
so remember() stalled ~1.6s at 50k entities — 0.6 inserts/s, a hang to
an MCP client. Owner decision on #392: lossless signatures.
Design (schema v10, lazy backfill):
- src/dedup.rs: trigrams packed injectively into u64s (3×21-bit scalar
values), stored per row as a delta-varint sorted set. Jaccard over
packed sets is bit-identical to the old HashSet<[char;3]> Jaccard.
- Two side tables split by access pattern: dedup_signatures (freshness
guard body_len, set size, 256-bucket prune histogram — small rows the
scan LEFT JOINs for every candidate) and dedup_signature_blobs (the
multi-KB exact set, point-fetched only for candidates surviving both
prunes). Keeps the scan's hot page footprint tiny, which is what
matters when interleaved writers invalidate pooled page caches.
- Two provably lossless prunes before any blob is touched: the exact
set-size ceiling min/(a+b-min), and the histogram intersection
ceiling Σ min(hA[j], hB[j]) — both only ever skip candidates whose
best possible similarity is below the threshold; the merge itself
early-abandons on the same bound. Verdict expression on completion
is the identical f64 comparison the exhaustive scan evaluates.
- Signatures derive from the STORED body_json column value, ciphertext
when encryption is on: that both preserves the historical encrypted-
store dedup behavior exactly (plaintext-vs-ciphertext comparison,
effectively never a match) and leaks nothing about the plaintext —
unlike entities_fts, which stores the full plaintext body.
- Writes ride the same transaction as the row (create, update, revive,
rekey_aad refresh); rows without a fresh signature take the old
rebuild path with identical verdicts and are backfilled in bounded
batches (512/scan) — no eager migration pass.
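
As a rough illustration of the packing and storage format described above, here is a minimal sketch. It is not the code in src/dedup.rs: the function names, the LEB128-style varint layout, and the empty-set convention in `jaccard` are all assumptions made for this example, and `decode` assumes well-formed input.

```rust
use std::cmp::Ordering;

/// Pack three chars into one u64, 21 bits each. Unicode scalar values are
/// below 0x110000 < 2^21, so distinct trigrams get distinct u64s.
fn pack_trigram(t: [char; 3]) -> u64 {
    ((t[0] as u64) << 42) | ((t[1] as u64) << 21) | (t[2] as u64)
}

/// Sorted, deduplicated set of packed character trigrams for a body.
fn signature(body: &str) -> Vec<u64> {
    let chars: Vec<char> = body.chars().collect();
    let mut sig: Vec<u64> = chars
        .windows(3)
        .map(|w| pack_trigram([w[0], w[1], w[2]]))
        .collect();
    sig.sort_unstable();
    sig.dedup();
    sig
}

/// Delta-varint encoding (hypothetical layout): each value is stored as the
/// difference from its predecessor, 7 bits per byte, high bit = "more follows".
fn encode(sig: &[u64]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u64;
    for &v in sig {
        let mut delta = v - prev;
        prev = v;
        loop {
            let byte = (delta & 0x7f) as u8;
            delta >>= 7;
            if delta == 0 {
                out.push(byte);
                break;
            }
            out.push(byte | 0x80);
        }
    }
    out
}

/// Inverse of `encode`. Assumes the bytes came from `encode`.
fn decode(bytes: &[u8]) -> Vec<u64> {
    let mut out = Vec::new();
    let (mut prev, mut cur, mut shift) = (0u64, 0u64, 0u32);
    for &b in bytes {
        cur |= ((b & 0x7f) as u64) << shift;
        if b & 0x80 == 0 {
            prev += cur;
            out.push(prev);
            cur = 0;
            shift = 0;
        } else {
            shift += 7;
        }
    }
    out
}

/// Jaccard similarity by a linear merge over two sorted sets.
/// Returning 0.0 for two empty sets is an assumption of this sketch.
fn jaccard(a: &[u64], b: &[u64]) -> f64 {
    if a.is_empty() && b.is_empty() {
        return 0.0;
    }
    let (mut i, mut j, mut inter) = (0usize, 0usize, 0usize);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                inter += 1;
                i += 1;
                j += 1;
            }
        }
    }
    inter as f64 / (a.len() + b.len() - inter) as f64
}
```

Because the packing is injective, the intersection and union sizes match those of a set of `[char; 3]` values, so the resulting f64 quotient is the same number either way.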
Equivalence proof: randomized property test pits the new scan against
the verbatim pre-#392 implementation on identical stores (clones,
near-clones, unrelated, tiny, unicode, threshold-boundary sweeps;
signed, unsigned and mixed rows; exact + FTS-prefilter variants) and
asserts the identical Option<matched id>, plus targeted stale/
malformed-signature repair, encryption, and migration tests.
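
To make the losslessness argument concrete, the sketch below shows why neither prune nor the early-abandoning merge can change a verdict: each ceiling is an upper bound on the true Jaccard, so a candidate is skipped only when even its best case falls below the threshold. All names here are hypothetical, the bucket hash is an arbitrary choice for the example, and the `>=` comparison against the threshold is an assumption (the real verdict expression is not shown in this commit message).

```rust
use std::cmp::Ordering;

/// Jaccard value implied by an intersection size `inter` for sets of
/// size `a` and `b`. Increasing in `inter`, so an upper bound on the
/// intersection yields an upper bound on the similarity.
fn bound(inter: usize, a: usize, b: usize) -> f64 {
    let union = a + b - inter;
    if union == 0 { 0.0 } else { inter as f64 / union as f64 }
}

/// Set-size ceiling: the intersection is at most min(a, b),
/// giving min / (a + b - min).
fn size_ceiling(a: usize, b: usize) -> f64 {
    bound(a.min(b), a, b)
}

/// 256-bucket histogram of a packed set (the bucket hash is hypothetical).
fn histogram(set: &[u64]) -> [u32; 256] {
    let mut h = [0u32; 256];
    for &v in set {
        h[(v.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 56) as usize] += 1;
    }
    h
}

/// Histogram ceiling: equal elements land in the same bucket, so the
/// intersection is at most the sum of per-bucket minima.
fn histogram_ceiling(ha: &[u32; 256], hb: &[u32; 256], a: usize, b: usize) -> f64 {
    let inter_ub: usize = ha.iter().zip(hb).map(|(x, y)| (*x).min(*y) as usize).sum();
    bound(inter_ub, a, b)
}

/// Prunes first, then an exact merge that abandons once even a perfect
/// match of the remaining tails could not reach the threshold.
fn is_near_duplicate(a: &[u64], b: &[u64], threshold: f64) -> bool {
    let (na, nb) = (a.len(), b.len());
    if size_ceiling(na, nb) < threshold {
        return false;
    }
    if histogram_ceiling(&histogram(a), &histogram(b), na, nb) < threshold {
        return false;
    }
    let (mut i, mut j, mut inter) = (0usize, 0usize, 0usize);
    while i < na && j < nb {
        // Best case: every element left in the shorter tail still matches.
        let best = inter + (na - i).min(nb - j);
        if bound(best, na, nb) < threshold {
            return false;
        }
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                inter += 1;
                i += 1;
                j += 1;
            }
        }
    }
    bound(inter, na, nb) >= threshold
}
```

For two 100-element sets sharing 90 elements the exact similarity is 90/110, so this sketch accepts the pair at a threshold of 0.8 and rejects it at 0.85, with or without the prunes.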
Measured (release, 1KB uniform-length bodies — the length prefilter's
worst case; medians over 15 probes): single-insert dedup scan @50k
1628.5ms → 69.3ms (23.5x); bulk import of 5,000 into a fresh store
(dedup ON) 123.6s pre-#392 → 11.8s total (~10x); @10k scan
327ms → 3ms-class. #[ignore] bench with env-tunable scale included.
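
For readers who want to reproduce numbers of this kind, a median-over-probes harness with an environment-tunable scale might look like the sketch below. The environment variable name `DEDUP_BENCH_ENTITIES` and all function names are hypothetical; the bench shipped in this commit is not shown here and may be organised differently.

```rust
use std::time::{Duration, Instant};

/// Store size for the bench, read from a hypothetical environment
/// variable, falling back to `default` when it is unset or unparsable.
fn bench_scale(default: usize) -> usize {
    std::env::var("DEDUP_BENCH_ENTITIES")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(default)
}

/// Median of a non-empty sample set (upper median for even counts).
fn median(mut samples: Vec<Duration>) -> Duration {
    assert!(!samples.is_empty(), "need at least one sample");
    samples.sort();
    samples[samples.len() / 2]
}

/// Time `f` once per probe and report the median wall-clock duration.
fn median_probe<F: FnMut()>(probes: usize, mut f: F) -> Duration {
    let samples: Vec<Duration> = (0..probes)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .collect();
    median(samples)
}
```

A caller would wrap the dedup scan in the closure, for example `median_probe(15, || { /* one scan */ })`, inside a test marked `#[ignore]` so it only runs on request.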
Closes #392.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
1 parent 2a559bd commit 98a233f
5 files changed
Lines changed: 1492 additions & 38 deletions