Commit a92a8df

tcconnally and claude committed
fix: drive selective FTS recall from the match set, not the ranking index (#401)
FTS recall cost was O(total entities), not O(hits): the plan drove from idx_entities_recall and probed an up-to-100k-rowid IN-list materialized from the FTS subquery, so a 20-hit rare-term query cost the same ~5-6ms as a 33k-hit query at 100k entities.

Selective queries now materialize the FTS-matched rowids FIRST; when the full match set fits under FTS_DRIVEN_MAX_MATCHES (512) the same ranking ORDER BY runs over just those rows via INTEGER PRIMARY KEY lookups (NOT INDEXED pins the plan - the rowid PK remains explicitly usable under it). Larger match sets keep the legacy rank-index plan, which is efficient exactly when matches are dense. A query whose FTS terms match nothing short-circuits without touching the entities table.

Result semantics are byte-identical (same filters, ranking order, LIMIT/OFFSET) - equivalence-tested old-plan vs new-plan, including ranking-key ties, filter parity, probe-overflow fallback, OFFSET paging and the zero-match case. EXPLAIN QUERY PLAN test asserts the selective arm SEARCHes entities by rowid and never scans the ranking index.

Measured @100k (release, p50/50 iters): rare term (20 hits) 5.1ms -> 0.08ms (~66x); 1-hit term 3.4ms -> 0.04ms. Dense-match queries pay a small fixed probe cost (common ~33k-hit term +~0.4ms) - intrinsic FTS5 prefix-doclist materialization; they were and remain O(corpus).

Recall gate green locally: auto recall@5 0.958, MRR 0.910, fts5 0.208.

Closes #401

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
1 parent a56e9fd commit a92a8df

2 files changed

Lines changed: 416 additions & 7 deletions

CHANGELOG.md

Lines changed: 18 additions & 0 deletions
@@ -3,6 +3,24 @@
 All notable changes to Perseus Vault (formerly Mimir/Mneme) are documented here. This project adheres to
 [Semantic Versioning](https://semver.org/).

+## [Unreleased]
+
+### Fixed
+- Selective FTS recall cost now tracks the number of HITS, not corpus size
+(#401). Queries whose FTS match set fits under a cap (512 rows)
+materialize the matched rowids first and run the ranking ORDER BY over
+just those rows via INTEGER PRIMARY KEY lookups (`NOT INDEXED` pins the
+plan), instead of scanning `idx_entities_recall` while probing an
+up-to-100k-rowid FTS IN-list. Larger match sets keep the legacy
+rank-index-driven plan, which is efficient exactly when matches are
+dense. Result semantics are byte-identical (same filters, ranking order,
+LIMIT/OFFSET — equivalence-tested against the legacy plan); a query
+whose FTS terms match nothing now short-circuits without touching the
+entities table. Measured @100k (release, p50/50 iters): rare-term
+(20 hits) recall 5.1ms → 0.08ms (~64x); dense-match queries pay a small
+fixed probe cost (common term ~33k hits: +~0.5ms, the intrinsic FTS5
+prefix-doclist materialization — they were and remain O(corpus)).
+
 ## [2.14.0] - 2026-07-02

 ### Added
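The changelog entry's claim that `NOT INDEXED` pins the plan to rowid lookups can be checked with `EXPLAIN QUERY PLAN`. The sketch below is an illustration under assumptions rather than the project's actual test: the table layout, the `importance` column, and the `plan_details` helper are hypothetical, and only `idx_entities_recall` and `NOT INDEXED` come from the entry itself.

```python
import sqlite3


def plan_details(conn, sql, params=()):
    # Each EXPLAIN QUERY PLAN row is (id, parent, notused, detail).
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


conn = sqlite3.connect(":memory:")
# Assumed schema for illustration only.
conn.executescript("""
    CREATE TABLE entities (id INTEGER PRIMARY KEY, body TEXT, importance REAL);
    CREATE INDEX idx_entities_recall ON entities (importance DESC, id);
""")

# Selective arm shape: a rowid IN-list, with secondary indexes ruled out.
SELECTIVE_SQL = (
    "SELECT id FROM entities NOT INDEXED "
    "WHERE id IN (?, ?, ?) ORDER BY importance DESC, id LIMIT 5"
)

details = plan_details(conn, SELECTIVE_SQL, (1, 2, 3))
uses_rowid = any("USING INTEGER PRIMARY KEY" in d for d in details)
touches_rank_index = any("idx_entities_recall" in d for d in details)
```

The substring checks are used because the exact wording of the plan detail differs between SQLite versions, for example `SEARCH TABLE entities ...` in older releases versus `SEARCH entities ...` in newer ones.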
