Commit f70dc3a
* perf(remember): stored per-row dedup signatures — exact near-duplicate verdicts without the O(M·N) per-insert trigram rebuild (#392)
find_near_duplicate rebuilt every same-category candidate's character-
trigram set from its body on each insert (~30µs per candidate at 1KB),
so remember() stalled ~1.6s at 50k entities: 0.6 inserts/s, which an
MCP client experiences as a hang. Owner decision on #392: lossless
signatures.
Design (schema v10, lazy backfill):
- src/dedup.rs: trigrams packed injectively into u64s (3×21-bit scalar
values), stored per row as a delta-varint sorted set. Jaccard over
packed sets is bit-identical to the old HashSet<[char;3]> Jaccard.
- Two side tables split by access pattern: dedup_signatures (freshness
guard body_len, set size, 256-bucket prune histogram — small rows the
scan LEFT JOINs for every candidate) and dedup_signature_blobs (the
multi-KB exact set, point-fetched only for candidates surviving both
prunes). Keeps the scan's hot page footprint tiny, which is what
matters when interleaved writers invalidate pooled page caches.
- Two provably lossless prunes before any blob is touched: the exact
set-size ceiling min/(a+b-min), and the histogram intersection
ceiling Σ min(hA[j], hB[j]) — both only ever skip candidates whose
best possible similarity is below the threshold; the merge itself
early-abandons on the same bound. Verdict expression on completion
is the identical f64 comparison the exhaustive scan evaluates.
- Signatures derive from the STORED body_json column value, ciphertext
when encryption is on: that both preserves the historical encrypted-
store dedup behavior exactly (plaintext-vs-ciphertext comparison,
effectively never a match) and leaks nothing about the plaintext —
unlike entities_fts, which stores the full plaintext body.
- Writes ride the same transaction as the row (create, update, revive,
rekey_aad refresh); rows without a fresh signature take the old
rebuild path with identical verdicts and are backfilled in bounded
batches (512/scan) — no eager migration pass.
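
A minimal sketch of the packing, delta-varint encoding, and sorted-set
Jaccard described in the design list above. All function names here
(pack_trigram, trigram_set, encode, decode, jaccard) are illustrative
and are not taken from src/dedup.rs; the varint layout (little-endian
base-128 gaps) and the 0.0 result for two empty sets are assumptions.

```rust
use std::cmp::Ordering;

/// Pack three chars into one u64, 21 bits each. A Unicode scalar value is at
/// most 0x10FFFF (< 2^21), so the three fields never overlap and distinct
/// trigrams always produce distinct u64s.
fn pack_trigram(a: char, b: char, c: char) -> u64 {
    ((a as u64) << 42) | ((b as u64) << 21) | (c as u64)
}

/// Sorted, deduplicated set of packed trigrams for a body.
fn trigram_set(body: &str) -> Vec<u64> {
    let chars: Vec<char> = body.chars().collect();
    let mut set: Vec<u64> = chars
        .windows(3)
        .map(|w| pack_trigram(w[0], w[1], w[2]))
        .collect();
    set.sort_unstable();
    set.dedup();
    set
}

/// Delta-varint: store each gap to the previous value as base-128 bytes,
/// low 7 bits first, high bit set on every byte except the last of a gap.
fn encode(set: &[u64]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u64;
    for &v in set {
        let mut delta = v - prev; // input is sorted, so this never underflows
        prev = v;
        loop {
            let byte = (delta & 0x7f) as u8;
            delta >>= 7;
            if delta == 0 {
                out.push(byte);
                break;
            }
            out.push(byte | 0x80);
        }
    }
    out
}

/// Inverse of `encode`. Returns None for truncated or overflowing input.
fn decode(bytes: &[u8]) -> Option<Vec<u64>> {
    let (mut out, mut prev, mut delta, mut shift) = (Vec::new(), 0u64, 0u64, 0u32);
    for &b in bytes {
        if shift >= 64 {
            return None;
        }
        delta |= ((b & 0x7f) as u64) << shift;
        if b & 0x80 == 0 {
            prev = prev.checked_add(delta)?;
            out.push(prev);
            delta = 0;
            shift = 0;
        } else {
            shift += 7;
        }
    }
    if shift != 0 { None } else { Some(out) }
}

/// Jaccard similarity computed by merging two sorted sets.
fn jaccard(a: &[u64], b: &[u64]) -> f64 {
    let (mut i, mut j, mut inter) = (0, 0, 0usize);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                inter += 1;
                i += 1;
                j += 1;
            }
        }
    }
    let union = a.len() + b.len() - inter;
    if union == 0 { 0.0 } else { inter as f64 / union as f64 }
}
```

Because the intersection and union counts are the same integers a
HashSet<[char;3]> would produce, the final f64 division yields the same
bits; only the set representation changes.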
Equivalence check: a randomized property test pits the new scan
against the verbatim pre-#392 implementation on identical stores
(clones, near-clones, unrelated, tiny, unicode, threshold-boundary
sweeps; signed, unsigned and mixed rows; exact + FTS-prefilter
variants) and asserts the identical Option<matched id>. Targeted tests
cover stale/malformed-signature repair, encryption, and migration.
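
The shape of such an equivalence test can be sketched as follows,
assuming a much smaller setup than the real one: a single pair of
bodies instead of a store, only the set-size ceiling min/(a+b-min) as
the prune, and a hand-rolled xorshift generator instead of a property-
testing crate. Every name below (reference_jaccard, pruned_is_dup,
count_mismatches) is hypothetical.

```rust
use std::collections::HashSet;

fn char_trigrams(s: &str) -> HashSet<[char; 3]> {
    let c: Vec<char> = s.chars().collect();
    c.windows(3).map(|w| [w[0], w[1], w[2]]).collect()
}

/// Reference: exhaustive Jaccard over HashSet<[char; 3]>.
fn reference_jaccard(a: &str, b: &str) -> f64 {
    let (sa, sb) = (char_trigrams(a), char_trigrams(b));
    let inter = sa.intersection(&sb).count();
    let union = sa.len() + sb.len() - inter;
    if union == 0 { 0.0 } else { inter as f64 / union as f64 }
}

/// Sorted packed trigram set (3 x 21 bits per u64).
fn packed_set(s: &str) -> Vec<u64> {
    let c: Vec<char> = s.chars().collect();
    let mut v: Vec<u64> = c
        .windows(3)
        .map(|w| ((w[0] as u64) << 42) | ((w[1] as u64) << 21) | (w[2] as u64))
        .collect();
    v.sort_unstable();
    v.dedup();
    v
}

/// Candidate path: skip the merge when the best possible similarity given
/// the two set sizes is already below the threshold.
fn pruned_is_dup(a: &[u64], b: &[u64], threshold: f64) -> bool {
    let lo = a.len().min(b.len());
    let denom = a.len() + b.len() - lo;
    if denom == 0 {
        return 0.0 >= threshold;
    }
    // Intersection <= lo and union >= denom, so this bounds the true value.
    if (lo as f64) / (denom as f64) < threshold {
        return false;
    }
    let (mut i, mut j, mut inter) = (0, 0, 0usize);
    while i < a.len() && j < b.len() {
        if a[i] < b[j] {
            i += 1;
        } else if a[i] > b[j] {
            j += 1;
        } else {
            inter += 1;
            i += 1;
            j += 1;
        }
    }
    inter as f64 / (a.len() + b.len() - inter) as f64 >= threshold
}

fn xorshift(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

fn random_body(rng: &mut u64) -> String {
    const ALPHABET: [char; 7] = ['a', 'b', 'c', ' ', 'é', 'λ', '🦀'];
    let len = (xorshift(rng) % 40) as usize;
    (0..len).map(|_| ALPHABET[(xorshift(rng) % 7) as usize]).collect()
}

/// Runs `seeds` random cases and counts verdict disagreements.
fn count_mismatches(seeds: u64) -> usize {
    let mut mismatches = 0;
    for seed in 1..=seeds {
        let mut rng = seed.wrapping_mul(0x9E37_79B9_7F4A_7C15) | 1; // never zero
        let a = random_body(&mut rng);
        // Half the cases are near-clones (last char changed), half unrelated.
        let b = if xorshift(&mut rng) % 2 == 0 {
            let mut chars: Vec<char> = a.chars().collect();
            if let Some(c) = chars.last_mut() {
                *c = 'z';
            }
            chars.into_iter().collect()
        } else {
            random_body(&mut rng)
        };
        for t in [0.1, 0.5, 0.85, 1.0] {
            let expected = reference_jaccard(&a, &b) >= t;
            let got = pruned_is_dup(&packed_set(&a), &packed_set(&b), t);
            if expected != got {
                mismatches += 1;
            }
        }
    }
    mismatches
}
```

The prune is safe in floating point as well as on paper: IEEE division
is correctly rounded and therefore monotone, so a ceiling below the
threshold implies the exact similarity is below it too.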
Measured (release, 1KB uniform-length bodies — the length prefilter's
worst case; medians over 15 probes): single-insert dedup scan @50k
1628.5ms → 69.3ms (23.5x); bulk import of 5,000 into a fresh store
(dedup ON) 123.6s pre (#392) → 11.8s total (~10x); @10k scan
327ms → 3ms-class. #[ignore] bench with env-tunable scale included.
Closes #392.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(dedup-sig): (len, hash) freshness guard + race-safe lazy backfill (#392 review)
Review found two defects sharing one root cause — a length-only
freshness guard cannot detect a same-length body rewrite:
1. Backfill flush race (losslessness violation): the scan queued
rebuilt signatures and flushed them AFTER via unconditional
INSERT OR REPLACE. A remember() update committing a same-length
different body (+ fresh signature) between scan and flush was then
overwritten by the stale signature — body_len matched, blob was
well-formed, so it was trusted forever and dedup verdicts silently
diverged from the exhaustive scan.
2. Rollback hazard: a pre-v10 binary running against a v10 store
rewrites body_json without touching the signature tables; a
same-length rewrite (which AES-GCM re-encryption always is) left an
undetectably stale signature.
Fix:
- dedup_signatures gains body_hash (stable inline 64-bit chunked hash
— deliberately NOT std's DefaultHasher, since the value is persisted
and the algorithm must never drift across Rust versions). The scan
trusts a signature only while BOTH body_len and body_hash match the
fetched body; mismatches fall back to the exact rebuild path
(identical verdicts) and self-heal. v10 stores are rollback-safe.
- flush_dedup_sig_backfill takes the write lock up front
(BEGIN IMMEDIATE) and re-verifies each row's CURRENT body
(length + hash) under that lock before writing, so a racing update
always wins regardless of arrival order.
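
The guard and the re-verifying flush can be sketched in memory as
below. This is an illustration under assumptions, not the code in this
commit: FNV-1a stands in for the commit's chunked 64-bit hash, two
HashMaps stand in for the entity and signature tables, and the
exclusive `&mut self` borrow plays the role of the BEGIN IMMEDIATE
write lock. Guard, Store, update and flush_backfill are hypothetical
names.

```rust
use std::collections::HashMap;

/// Stand-in stable hash (64-bit FNV-1a). NOT the hash used by the commit;
/// it only needs to be deterministic for this sketch.
fn stable_hash64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// The freshness guard stored next to a signature.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Guard {
    body_len: usize,
    body_hash: u64,
}

impl Guard {
    fn of(body: &str) -> Self {
        Guard { body_len: body.len(), body_hash: stable_hash64(body.as_bytes()) }
    }

    /// A signature is trusted only while BOTH fields match the fetched body.
    fn matches(&self, body: &str) -> bool {
        self.body_len == body.len() && self.body_hash == stable_hash64(body.as_bytes())
    }
}

#[derive(Default)]
struct Store {
    bodies: HashMap<u32, String>,
    guards: HashMap<u32, Guard>,
}

impl Store {
    /// An update writes the body and its fresh guard together, mirroring
    /// "writes ride the same transaction as the row".
    fn update(&mut self, id: u32, body: &str) {
        self.bodies.insert(id, body.to_string());
        self.guards.insert(id, Guard::of(body));
    }

    /// Flush signatures queued by an earlier scan. Each one is re-checked
    /// against the row's CURRENT body and dropped if the body has changed,
    /// so a concurrent update is never overwritten by a stale signature.
    fn flush_backfill(&mut self, queued: &[(u32, Guard)]) -> usize {
        let mut written = 0;
        for &(id, guard) in queued {
            let current = match self.bodies.get(&id) {
                Some(body) => body,
                None => continue, // row deleted since the scan
            };
            if guard.matches(current) {
                self.guards.insert(id, guard);
                written += 1;
            }
        }
        written
    }
}
```

In the racing scenario from defect 1, a scan queues Guard::of("alpha-1"),
an update then stores "alpha-2" (same length), and the later flush
writes nothing because the hash no longer matches.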
Both defects are pinned by regression tests that fail pre-fix:
dedup_sig_backfill_flush_loses_to_concurrent_rewrite (stale flush
overwrote the fresh signature; verdict None vs reference Some) and
dedup_sig_hash_guard_catches_same_length_rewrite (direct same-length
SQL rewrite; verdict None vs reference Some). Plus a pinned-stability
unit test for the hash and an env-tunable seed count for the
equivalence property test (MIMIR_DEDUP_PROP_SEEDS).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(changelog): refresh #392 measurements post hash-guard
The (len, hash) freshness guard verifies a 64-bit content hash per
candidate row, which costs about 0.5µs per row. Paired-run numbers
@50k are now 1363.3ms -> 89.4ms (15.3x); bulk-5000 takes 15.1s (~8x).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test(dedup-sig): pin body_hash64 to literal constants (review follow-up)
The stability test only checked self-consistency (body_hash64(x) ==
build_row_signature(x).body_hash — both call the same fn), so a real
algorithm drift (e.g. rotate_left(23)->(24)) still passed while
silently invalidating every persisted v10 signature across binaries.
Add hardcoded literal pins for "perseus-vault" (-4349344705766122978)
and "" (1530470515733238723); verified a rotate-constant tweak now
fails the test, then reverted.
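
The difference between a self-consistency check and a literal pin can
be illustrated with a stand-in hash. The sketch below uses 64-bit
FNV-1a, whose published test vectors are known, rather than
body_hash64; the function name stable_hash64 and both constants belong
to FNV-1a and are unrelated to the literals pinned in this commit.

```rust
/// Stand-in persisted hash: 64-bit FNV-1a (an assumption for illustration).
fn stable_hash64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    h
}

/// Weak: both sides call the same function, so any change to the
/// algorithm changes both values together and the test still passes.
#[test]
fn hash_is_self_consistent() {
    assert_eq!(stable_hash64(b"a"), stable_hash64(b"a"));
}

/// Strong: literals recorded once. Changing the offset basis, the prime,
/// or the mixing order makes this fail, which is the signal that values
/// already persisted on disk would no longer match.
#[test]
fn hash_is_pinned_to_literals() {
    assert_eq!(stable_hash64(b""), 0xcbf2_9ce4_8422_2325);
    assert_eq!(stable_hash64(b"a"), 0xaf63_dc4c_8601_ec8c);
}
```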
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---------
Co-authored-by: tcconnally <hermes@perseus.observer>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
1 parent 19161cf commit f70dc3a
5 files changed
Lines changed: 1756 additions & 38 deletions