Skip to content

Commit bbb6bde

Browse files
tcconnallytcconnallyclaude
authored
fix(embed): clear embedding on content change — lag means absent not stale; drain queue on shutdown (#418)
The reviewer demonstrated (cap=1 + 400ms fake backend) that a content-changing UPDATE whose deferred embed job was dropped on queue overflow kept the PREVIOUS body's embedding: non-NULL, so mimir_embed batch repair (WHERE embedding IS NULL) never touched it and dense search served the wrong vector permanently. Even without overflow, the update-lag window served the old body's vector. Fix: the content_changed stamp UPDATE in remember's update path now also sets embedding = NULL, emb_sig = NULL — same transaction that rewrites the FTS row. Lag/drop now means ABSENT (row out of dense search, keyword still finds it) instead of STALE, and every dropped job is genuinely NULL-scan-recoverable, making the documented overflow contract true. Cleared regardless of backend config: a vector for a body the row no longer has is wrong in any configuration. Identical re-asserts (gated on content_changed) still keep their vector. Also, per review suggestions: - Shutdown now DRAINS queued jobs instead of skipping them (bounded by the existing 5s Drop grace, detached past it): CLI one-shot writes get their embeddings as they did synchronously pre-#393; the shutdown-skip flag is gone. - debug_assert against pending-counter underflow at both decrement sites (saturating_sub no longer masks accounting drift in debug/test builds). - Overflow warning, code comments, and CHANGELOG updated so every claim matches the fixed behavior. Regression test overflow_dropped_update_clears_stale_vector_and_batch_repairs_it (deterministic via a fake-server accept counter — no sleeps for the worker-busy window): update under overflow-drop -> embedding is NULL, then batch embed_entity repairs it with the CURRENT body's vector. Pre-fix proof at the previous head: FAILED with "a dropped update job must leave the embedding ABSENT (NULL), not body A's stale vector". Full default suite 282 passed / 0 failed; lite (--no-default-features) 275 passed / 0 failed. Co-authored-by: tcconnally <hermes@perseus.observer> Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
1 parent 6a940e4 commit bbb6bde

2 files changed

Lines changed: 226 additions & 96 deletions

File tree

CHANGELOG.md

Lines changed: 32 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -22,20 +22,40 @@ All notable changes to Perseus Vault (formerly Mimir/Mneme) are documented here.
2222
with a rate-limited warning) and return immediately; the worker drains up
2323
to 32 jobs per wake, embeds, and stores each vector through a stale guard
2424
(an atomic conditional UPDATE against the entity's current FTS plaintext),
25-
so a queued embed can never overwrite a newer body's vector. Deferral is
26-
within the existing contract — auto-embed already ran post-commit with
27-
non-fatal failures; a row simply doesn't surface in dense/hybrid search
28-
until embedded (now milliseconds later; dropped-on-overflow rows are
29-
recoverable via `mimir_embed` batch mode or their next change). Explicit
30-
`mimir_embed` stays synchronous. The write path also no longer consults
31-
the #219 embedding session cache (new/changed bodies can never hit it —
32-
each write paid up to 256 full-body string compares for nothing), and the
25+
so a queued embed can never overwrite a newer body's vector. A
26+
content-changing UPDATE also clears the row's stored embedding inside the
27+
write transaction, so embed lag — or a dropped job — means the row is
28+
ABSENT from dense search (keyword search still finds it), never served
29+
with the previous body's stale vector, and every unembedded row is
30+
genuinely recoverable via `mimir_embed` batch mode
31+
(`WHERE embedding IS NULL`) or its next change. Deferral is within the
32+
existing contract — auto-embed already ran post-commit with non-fatal
33+
failures; a row simply doesn't surface in dense/hybrid search until
34+
embedded (now milliseconds later). Explicit `mimir_embed` stays
35+
synchronous. The write path also no longer consults the #219 embedding
36+
session cache (new/changed bodies can never hit it — each write paid up
37+
to 256 full-body string compares for nothing), and the
3338
misconfigured-backend log (enabled, model missing, no endpoint — formerly
3439
one eprintln per write) is rate-limited to once per minute. `Drop` for
35-
`Database` signals the worker and waits up to 5s: the in-flight embed
36-
finishes, remaining queued jobs are dropped. Measured (debug profile,
37-
1KB bodies, bundled ONNX, n=40, median-of-5-runs): write median
38-
7,714µs → ~159µs (~48×).
40+
`Database` disconnects the queue and waits up to 5s while the worker
41+
DRAINS the remaining jobs (CLI one-shot writes still get embedded, as
42+
they did synchronously pre-#393); a drain that outlives the grace
43+
continues on the detached thread, and any never-embedded row is NULL —
44+
batch-recoverable, never stale. Measured (debug profile, 1KB bodies,
45+
bundled ONNX, n=40, median-of-5-runs): write median 7,714µs → ~159µs
46+
(~48×).
47+
- **Empty-string `workspace_hash` is now STRICT everywhere (#408).**
48+
`list_entities`/`count_entities` (the dashboard entity list and
49+
its `total`) and `get_entity_graph` treated `Some("")` as *unscoped*
50+
no filter, every workspace's rows — while `recall`/`recall_when`/`follow`
51+
treat `""` with strict equality (only the global `''` rows). The same
52+
argument value meant two different scopes depending on the surface. All
53+
three now apply strict equality for any `Some`, including `Some("")`;
54+
`None`/omitting the param remains the unscoped view. On the web API this
55+
means `?workspace=` (present but empty) now returns only global-`''`
56+
rows instead of everything — omit the parameter entirely for the
57+
unscoped view. The bundled dashboard never sends the parameter, so its
58+
behavior is unchanged.
3959

4060
### Fixed
4161
- `follow()`'s row resolution no longer collapses real DB errors into
@@ -107,20 +127,6 @@ All notable changes to Perseus Vault (formerly Mimir/Mneme) are documented here.
107127
rejected with 400; the responses now echo the effective `limit` (and
108128
`offset` for `/api/entities`).
109129

110-
### Changed
111-
- **Empty-string `workspace_hash` is now STRICT everywhere (#408).**
112-
`list_entities`/`count_entities` (the dashboard entity list and
113-
its `total`) and `get_entity_graph` treated `Some("")` as *unscoped*
114-
no filter, every workspace's rows — while `recall`/`recall_when`/`follow`
115-
treat `""` with strict equality (only the global `''` rows). The same
116-
argument value meant two different scopes depending on the surface. All
117-
three now apply strict equality for any `Some`, including `Some("")`;
118-
`None`/omitting the param remains the unscoped view. On the web API this
119-
means `?workspace=` (present but empty) now returns only global-`''`
120-
rows instead of everything — omit the parameter entirely for the
121-
unscoped view. The bundled dashboard never sends the parameter, so its
122-
behavior is unchanged.
123-
124130
## [2.14.0] - 2026-07-02
125131

126132
### Added

0 commit comments

Comments
 (0)