Commit 2efb38f

tcconnally and claude committed
fix: drive selective FTS recall from the match set, not the ranking index (#401)
FTS recall cost was O(total entities), not O(hits): the plan drove from idx_entities_recall and probed an up-to-100k-rowid IN-list materialized from the FTS subquery, so a 20-hit rare-term query cost the same ~5-6ms as a 33k-hit query at 100k entities.

Selective queries now materialize the FTS-matched rowids FIRST; when the full match set fits under FTS_DRIVEN_MAX_MATCHES (512) the same ranking ORDER BY runs over just those rows via INTEGER PRIMARY KEY lookups (NOT INDEXED pins the plan - the rowid PK remains explicitly usable under it). Larger match sets keep the legacy rank-index plan, which is efficient exactly when matches are dense. A query whose FTS terms match nothing short-circuits without touching the entities table.

Result semantics are byte-identical (same filters, ranking order, LIMIT/OFFSET) - equivalence-tested old-plan vs new-plan, including ranking-key ties, filter parity, probe-overflow fallback, OFFSET paging and the zero-match case. EXPLAIN QUERY PLAN test asserts the selective arm SEARCHes entities by rowid and never scans the ranking index.

Measured @100k (release, p50/50 iters): rare term (20 hits) 5.1ms -> 0.08ms (~66x); 1-hit term 3.4ms -> 0.04ms. Dense-match queries pay a small fixed probe cost (common ~33k-hit term +~0.4ms) - intrinsic FTS5 prefix-doclist materialization; they were and remain O(corpus).

Recall gate green locally: auto recall@5 0.958, MRR 0.910, fts5 0.208.

Closes #401

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
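The plan selection described in the commit message can be sketched roughly as follows. This is an illustration in Python with the standard `sqlite3` module, not the project's own code, and it assumes an SQLite build with FTS5 enabled. Only `FTS_DRIVEN_MAX_MATCHES` (512), `idx_entities_recall` and `NOT INDEXED` come from the commit; the schema, the `entities_fts` table, the `score` ranking key and the `recall()` helper are hypothetical.

```python
import sqlite3

# Cap taken from the commit message; every other name here is hypothetical.
FTS_DRIVEN_MAX_MATCHES = 512

# Assumed ranking key (score, then id as tie-breaker); the real one is not shown.
RANKING = "ORDER BY e.score DESC, e.id ASC LIMIT ? OFFSET ?"


def recall(conn, term, limit=5, offset=0, cap=FTS_DRIVEN_MAX_MATCHES):
    """Return (plan_name, entity_ids), choosing the plan by match-set size."""
    # Probe: materialize at most cap + 1 matched rowids, enough to detect overflow.
    rowids = [r[0] for r in conn.execute(
        "SELECT rowid FROM entities_fts WHERE entities_fts MATCH ? LIMIT ?",
        (term, cap + 1))]
    if not rowids:
        # Zero matches: short-circuit without touching the entities table.
        return "empty", []
    if len(rowids) <= cap:
        # Selective arm: rank only the matched rows via rowid lookups.
        marks = ",".join("?" * len(rowids))
        sql = (f"SELECT e.id FROM entities AS e NOT INDEXED "
               f"WHERE e.id IN ({marks}) {RANKING}")
        params = (*rowids, limit, offset)
        plan = "fts-driven"
    else:
        # Dense arm: legacy query shape, free to drive from the ranking index.
        sql = ("SELECT e.id FROM entities AS e WHERE e.id IN "
               "(SELECT rowid FROM entities_fts WHERE entities_fts MATCH ?) "
               + RANKING)
        params = (term, limit, offset)
        plan = "rank-index"
    return plan, [r[0] for r in conn.execute(sql, params)]


# Toy corpus: 1000 entities, every 50th one also contains the term "rare".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entities (id INTEGER PRIMARY KEY, body TEXT, score REAL);
    CREATE INDEX idx_entities_recall ON entities (score DESC, id);
    CREATE VIRTUAL TABLE entities_fts USING fts5(body);
""")
conn.executemany(
    "INSERT INTO entities VALUES (?, ?, ?)",
    [(i, "common rare" if i % 50 == 0 else "common", float(i % 7))
     for i in range(1, 1001)])
conn.execute("INSERT INTO entities_fts (rowid, body) SELECT id, body FROM entities")

print(recall(conn, "rare"))    # 20 hits: fits under the cap
print(recall(conn, "common"))  # 1000 hits: exceeds the cap
print(recall(conn, "absent"))  # no hits
```

Passing `cap=0` forces the same query down the legacy arm, which gives a simple way to compare both plans' results in the spirit of the equivalence test mentioned above.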
1 parent d70a8c0 commit 2efb38f

2 files changed

Lines changed: 412 additions & 7 deletions

CHANGELOG.md

Lines changed: 14 additions & 0 deletions
@@ -19,6 +19,20 @@ All notable changes to Perseus Vault (formerly Mimir/Mneme) are documented here.
   "not found" (#396, the #394 principle): only `QueryReturnedNoRows` maps to
   the clean `found: false` report; a locked file or corruption error now
   propagates.
+- Selective FTS recall cost now tracks the number of HITS, not corpus size
+  (#401). Queries whose FTS match set fits under a cap (512 rows)
+  materialize the matched rowids first and run the ranking ORDER BY over
+  just those rows via INTEGER PRIMARY KEY lookups (`NOT INDEXED` pins the
+  plan), instead of scanning `idx_entities_recall` while probing an
+  up-to-100k-rowid FTS IN-list. Larger match sets keep the legacy
+  rank-index-driven plan, which is efficient exactly when matches are
+  dense. Result semantics are byte-identical (same filters, ranking order,
+  LIMIT/OFFSET — equivalence-tested against the legacy plan); a query
+  whose FTS terms match nothing now short-circuits without touching the
+  entities table. Measured @100k (release, p50/50 iters): rare-term
+  (20 hits) recall 5.1ms → 0.08ms (~64x); dense-match queries pay a small
+  fixed probe cost (common term ~33k hits: +~0.5ms, the intrinsic FTS5
+  prefix-doclist materialization — they were and remain O(corpus)).
 
 ## [2.14.0] - 2026-07-02
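The changelog entry says `NOT INDEXED` pins the plan while INTEGER PRIMARY KEY lookups stay available. One way to check that claim is to inspect `EXPLAIN QUERY PLAN` output, sketched here in Python with the standard `sqlite3` module. The table layout, the `score` column and the `plan_details()` helper are assumptions for illustration; only `idx_entities_recall` and `NOT INDEXED` appear in the changelog.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema; only the index name comes from the changelog.
    CREATE TABLE entities (id INTEGER PRIMARY KEY, body TEXT, score REAL);
    CREATE INDEX idx_entities_recall ON entities (score DESC, id);
""")


def plan_details(conn, sql, params=()):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    # Each EXPLAIN QUERY PLAN row is (id, parent, notused, detail).
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


# Selective arm: the matched rowids are bound as an explicit IN-list.
selective = ("SELECT id FROM entities NOT INDEXED WHERE id IN (?, ?, ?) "
             "ORDER BY score DESC, id LIMIT 5")
details = plan_details(conn, selective, (1, 2, 3))

for line in details:
    print(line)

# The rowid primary key is still used for lookups under NOT INDEXED...
uses_rowid_search = any("USING INTEGER PRIMARY KEY" in d for d in details)
# ...while the ranking index is never touched.
touches_rank_index = any("idx_entities_recall" in d for d in details)
```

The exact wording of the plan lines differs between SQLite versions, so the check matches on substrings rather than on whole lines.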