Commit 254c584

tcconnally and claude committed
fix(db): bound cohere's writer-lock window — split the maintenance transaction and chunk the decay pass
cohere previously ran promotion, a full-table decay UPDATE, link building, and archive inside ONE BEGIN IMMEDIATE, so the writer-lock hold grew linearly with store size (measured 0.34s @100k / 0.68s @200k release on a fast dev box; 4.45s @100k on the #400 deep-dive box) and crossed the default 5000ms busy_timeout just past ~130k entities — every concurrent remember failed SQLITE_BUSY during maintenance runs.

The decay pass needs no atomicity with promotion or link building, so cohere now runs three bounded lock windows: promotion; the ×0.95 decay groom chunked at COHERE_DECAY_CHUNK_ROWS (1000) rowids per drop-safe IMMEDIATE transaction with a 2ms inter-chunk yield (SQLite's coarse busy handler otherwise starves waiters across back-to-back chunks); and link+archive (candidate_budget-bounded, links read-modify-write kept under one IMMEDIATE tx so the #388 clobber analysis still holds). The new non-atomicity boundaries are documented at the split site; every step is idempotent and each transaction stays drop-safe.

Longest single hold @200k: 0.68s -> 0.09s; write work per hold is now bounded by the chunk size, and the residual linear component (the promote/archive full-table read scans) is ~7.5x shallower.

Preserves the #399/#405 no-op write skip (floored rows are not rewritten — chunk UPDATEs carry the same epsilon guard) and cohere's documented standalone ×0.95 decay semantics (deliberately NOT deduplicated into decay_tick's Ebbinghaus recompute; only the chunking structure is aligned).

Regression tests: chunk-transaction count for a store above the chunk size; no-op skip write-count for floored rows; a concurrent writer with a ~250ms fine-grained wait budget succeeding during cohere @150k (fails 3/3 on pre-fix main, passes 3/3 post-fix); plus an #[ignore] lock-window measurement harness (writer-lock hold probe over a seeded store, MIMIR_COHERE_MEASURE_ROWS).

Also adds a store-size ops note for MIMIR_BUSY_TIMEOUT_MS at the pool-config site.

Closes #400

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
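
The chunked decay pass described in the message can be sketched as follows. This is a minimal Python illustration using the stdlib `sqlite3` module, not the project's code, which is not shown on this page. The `entities` table, the `strength` column, and the `FLOOR` and `EPSILON` values are hypothetical; only the 1000-row chunk size, the ×0.95 factor, the 2ms yield, and the one-IMMEDIATE-transaction-per-chunk structure come from the commit message.

```python
import sqlite3
import time

CHUNK_ROWS = 1000    # mirrors COHERE_DECAY_CHUNK_ROWS from the commit message
DECAY = 0.95         # the x0.95 groom factor
FLOOR = 0.05         # hypothetical strength floor (assumption)
EPSILON = 1e-9       # hypothetical no-op guard threshold (assumption)
YIELD_SECS = 0.002   # 2 ms pause so waiting writers can take the lock


def seed(conn, n):
    """Create a hypothetical entities table; every fifth row starts at the floor."""
    conn.execute(
        "CREATE TABLE entities (id INTEGER PRIMARY KEY, strength REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO entities (id, strength) VALUES (?, ?)",
        [(i, FLOOR if i % 5 == 0 else 1.0) for i in range(1, n + 1)],
    )


def decay_in_chunks(conn, chunk_rows=CHUNK_ROWS):
    """Apply one decay step in rowid-bounded chunks.

    Each chunk runs in its own BEGIN IMMEDIATE transaction, so the writer
    lock is held for at most chunk_rows rows of write work at a time.
    Returns (transactions_run, rows_written).
    """
    max_rowid = conn.execute(
        "SELECT COALESCE(MAX(rowid), 0) FROM entities"
    ).fetchone()[0]
    txs = 0
    written = 0
    start = 0
    while start < max_rowid:
        end = start + chunk_rows
        conn.execute("BEGIN IMMEDIATE")
        try:
            # The guard skips rows already at the floor, so they are not rewritten.
            cur = conn.execute(
                "UPDATE entities SET strength = MAX(strength * ?, ?) "
                "WHERE rowid > ? AND rowid <= ? AND strength - ? > ?",
                (DECAY, FLOOR, start, end, FLOOR, EPSILON),
            )
            written += cur.rowcount
            conn.execute("COMMIT")
        except BaseException:
            conn.execute("ROLLBACK")  # a failed chunk leaves earlier chunks committed
            raise
        txs += 1
        start = end
        time.sleep(YIELD_SECS)  # release the lock window before the next chunk
    return txs, written
```

Opened with `sqlite3.connect(":memory:", isolation_level=None)` and seeded with 2,500 rows, this pass runs three transactions, and the 500 rows already at the floor are skipped by the guard rather than rewritten.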
1 parent a56e9fd commit 254c584

2 files changed

Lines changed: 501 additions & 29 deletions

CHANGELOG.md

Lines changed: 21 additions & 0 deletions
@@ -3,6 +3,27 @@
 All notable changes to Perseus Vault (formerly Mimir/Mneme) are documented here. This project adheres to
 [Semantic Versioning](https://semver.org/).
 
+## [Unreleased]
+
+### Fixed
+- `mimir_cohere` no longer holds one writer lock for the whole grooming pass
+(#400). The single BEGIN IMMEDIATE previously spanned promotion, a
+full-table decay UPDATE, link building, and archive — a lock window linear
+in store size (~4.4s @100k entities) that crossed the default 5000ms
+`busy_timeout` just past ~130k entities, so concurrent `remember`s failed
+SQLITE_BUSY during every maintenance run. cohere now runs three bounded
+windows: promotion, a decay pass chunked at 1000 rows per drop-safe
+transaction (with a 2ms inter-chunk yield so waiting writers can actually
+acquire the lock), and link+archive. Longest single hold measured @200k
+entities (release, ~450B rows): 0.68s → 0.09s; the write work under any
+one lock is now bounded by the chunk size, and the remaining linear
+component (the promote/archive full-table read scans) is ~7.5x shallower
+than before. Preserves the #399/#405 no-op write skip (floored
+rows are not rewritten), cohere's documented ×0.95 standalone decay
+semantics, and per-transaction drop-safety (#388); the run stays correct
+under interleaved writers, and the new non-atomicity boundaries are
+documented at the split site.
+
 ## [2.14.0] - 2026-07-02
 
 ### Added
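
The failure mode the changelog entry describes, a writer giving up with SQLITE_BUSY because another connection holds the writer lock longer than its busy timeout, can be reproduced in miniature. The sketch below uses Python's stdlib `sqlite3`. The `notes` table and the 50ms timeout are illustrative assumptions, and the project's actual pool configuration (MIMIR_BUSY_TIMEOUT_MS) is not shown on this page.

```python
import os
import sqlite3
import tempfile


def try_write(conn, value):
    """Attempt one autocommit INSERT; return True on success, False on SQLITE_BUSY."""
    try:
        conn.execute("INSERT INTO notes (body) VALUES (?)", (value,))
        return True
    except sqlite3.OperationalError as exc:
        # Python surfaces SQLITE_BUSY as OperationalError("database is locked").
        if "locked" in str(exc):
            return False
        raise


def demo():
    """Return (write_during_hold, write_after_release) for a short-timeout writer."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        # Stand-in for the maintenance pass holding the writer lock.
        maint = sqlite3.connect(path, isolation_level=None)
        maint.execute("CREATE TABLE notes (body TEXT)")
        # Stand-in for a concurrent writer with a small wait budget (50 ms).
        writer = sqlite3.connect(path, isolation_level=None, timeout=0.05)
        try:
            maint.execute("BEGIN IMMEDIATE")        # take the writer lock
            during = try_write(writer, "blocked")   # waits 50 ms, then gives up
            maint.execute("COMMIT")                 # release the lock
            after = try_write(writer, "ok")         # succeeds immediately
        finally:
            maint.close()
            writer.close()
        return during, after
```

Here `demo()` returns `(False, True)`: the write fails while the lock is held past the writer's timeout and succeeds once it is released. Keeping each hold well under the writer's timeout, as the chunked pass aims to, is what gives a waiting writer a chance to get in between chunks.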
