v1.0.2

github-actions released this 03 Jun 06:41
· 61 commits to main since this release
e71c479

Removed

  • Reverted the entire query-word-importance feature (#163 exemption + #164 ranking). Validation showed both layers were inert: the #164 incidentalMatchWeight re-ranking and the #163 semantic exemption gate changed result ordering on zero real queries. Removed: the query_word_importance plumbing into the WASM score_results input and the JS fallback weighting; the incidentalMatchWeight config; the aiQueryWordImportance flag; the expand-prompt classification instruction and its query_word_importance parsing in AiEndpointHandler; the contentWords exemption gate in scolta.js; and the bundled WASM that carried the scoring weight. The #156 frequency guard (#161) and the Fix A/D typed-query-term exemption with its expandSubwordDenyList veto (#162) are unchanged. The browser tree is back to its #162 state, and the bundled WASM matches the reverted scolta-core.

Fixed

  • Sub-word frequency guard no longer drops words the user actually typed (#156 follow-up). The #156 guard used corpus frequency as a proxy for "non-discriminating / generic," but in a topical corpus a word is often high-frequency because it is the subject matter: the proxy conflates "common because central" with "common because generic." As a result, searching spicy on a recipe corpus silently dropped the typed subject word (~6.4% of docs, above the 0.05/0.10 threshold) from the decomposed expansion terms, collapsing recall. The frequency check is now bypassed for any sub-word that is a token of the user's actual query (Fix A), with a per-site expandSubwordDenyList veto for the rare word that is both typed and genuinely generic (Fix D). Non-query expansion-derived words (e.g. fried, tender) remain frequency-gated as before.
  • customStopWords now applied consistently in scolta.js (#156 follow-up). extractSearchTerms() previously filtered against only the built-in STOPWORDS set, so JS query tokenization ignored customStopWords while the WASM scorer honored it. JS now strips the union of STOPWORDS and customStopWords, matching WASM.
  • Broad multi-word queries recover the recall lost in v1.0.0 (closes #156). Commit 690a2288 removed the sub-word expansion block from scolta.js, causing broad-query result counts to drop 4–50× on high-overlap corpora. Sub-word expansion is reintroduced behind a corpus-frequency guard: a constituent word of a multi-word expansion term is added as a search term only when that word's corpus frequency is below expandSubwordMaxFrequency (default 0.05; 0.10 for the content_catalog and none presets). This restores low-frequency domain words ("vegetarian", "cuisine") while blocking high-frequency noise ("recipes", "cooking") that polluted pre-v1.0.0 results. Frequency is measured against the same active filters the search uses (including the language partition when autoLanguageFilter is on), so the denominator matches the corpus actually being searched. The guard applies in both the relevance and native-sort code paths. Setting the threshold to 0 reproduces v1.0.0 (no sub-words); a value >= 1.0 reproduces the pre-v1.0.0 behavior (all sub-words).
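The three fixes above can be sketched roughly as follows. This is an illustrative reconstruction, not the code in scolta.js: only extractSearchTerms, STOPWORDS, customStopWords, expandSubwordMaxFrequency, and expandSubwordDenyList are names from these notes. The function selectSubwords, the frequencyOf callback, the stop-word entries, and all frequencies other than the ~6.4% figure for spicy are hypothetical. The sketch also assumes that a threshold of 0 disables the typed-word exemption as well, which is one reading of "disable sub-word expansion entirely."

```javascript
// Stand-in for the built-in stop-word set; the real list is not shown in these notes.
const STOPWORDS = new Set(["the", "a", "of", "and", "for"]);

// Tokenize a query and strip the union of built-in and custom stop words,
// mirroring the described alignment between JS tokenization and the WASM scorer.
function extractSearchTerms(query, customStopWords = []) {
  const stop = new Set([
    ...STOPWORDS,
    ...customStopWords.map((w) => w.toLowerCase()),
  ]);
  return query
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0 && !stop.has(token));
}

// Hypothetical helper: choose which constituent words of a multi-word
// expansion term may be added as standalone search terms.
function selectSubwords(expansionTerm, queryTokens, frequencyOf, config = {}) {
  const { expandSubwordMaxFrequency = 0.05, expandSubwordDenyList = [] } = config;
  const words = expansionTerm.toLowerCase().split(/\s+/).filter(Boolean);

  if (words.length < 2) return []; // only multi-word terms are decomposed
  if (expandSubwordMaxFrequency <= 0) return []; // assumption: 0 disables everything
  if (expandSubwordMaxFrequency >= 1) return words; // pre-v1.0.0: all sub-words

  const typed = new Set(queryTokens);
  const denied = new Set(expandSubwordDenyList.map((w) => w.toLowerCase()));

  return words.filter((word) => {
    // Fix A: a word the user typed bypasses the frequency check,
    // unless Fix D's deny list vetoes the exemption.
    if (typed.has(word) && !denied.has(word)) return true;
    // Everything else stays frequency-gated.
    return frequencyOf(word) < expandSubwordMaxFrequency;
  });
}

// Illustrative frequencies; only the spicy value comes from the notes.
const sampleFrequencies = { spicy: 0.064, fried: 0.12, tofu: 0.01 };
const frequencyOf = (word) => sampleFrequencies[word] ?? 0;

const queryTokens = extractSearchTerms("spicy");
const kept = selectSubwords("spicy fried tofu", queryTokens, frequencyOf, {
  expandSubwordMaxFrequency: 0.05,
});
console.log(kept);
```

In this sketch the typed word spicy survives despite exceeding the threshold, fried is still blocked, and adding spicy to expandSubwordDenyList would send it back through the frequency check.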

Added

  • expandSubwordDenyList config option (default []). Guard-only veto list for the sub-word query-term exemption (#156 follow-up). Words here are never auto-exempted from the sub-word frequency guard even when the user types them, so a site can stop a typed generic word (e.g. hot on a recipe corpus) from re-flooding results via the exemption. Unlike customStopWords this does NOT affect relevance scoring or query tokenization — the word stays searchable and scorable. Configurable via expand_subword_deny_list in all platform adapters.
  • expandSubwordMaxFrequency config option (default 0.05). Maximum corpus frequency (fraction of indexed documents) for a multi-word expansion term's constituent word to be added as a standalone search term. Configurable via expand_subword_max_frequency in all platform adapters. Presets content_catalog and none ship 0.10; reference, ecommerce, and blog use the 0.05 default. Set to 0 to disable sub-word expansion entirely.
  • Result-count baseline regression test (tests/js/result-count-baseline.test.js + tests/fixtures/result-count-baseline.json). Drives the real scolta.js guard against a synthetic corpus built from real measured frequencies and asserts merged result counts stay within a per-demo band, flagging both recall collapse (sub-word block removed) and precision spikes (high-frequency noise admitted). This is the regression guard whose absence let 690a2288 ship a silent count drop.
  • Sub-word frequency guard behavioral test (tests/js/subword-frequency-guard.test.js). Executes scolta.js against a recording Pagefind mock and asserts only sub-words below the threshold feed results, covering the boundary behaviors (0 and >= 1).
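The band assertion described for the result-count baseline test might look something like the sketch below. The fixture shape (a per-demo min/max band), the demo names, the counts, and the function names are all assumptions for illustration; the actual layout of result-count-baseline.json is not given in these notes.

```javascript
// Classify one measured result count against its allowed band.
// Below the band suggests recall collapse (e.g. the sub-word block removed);
// above it suggests a precision spike (high-frequency noise admitted).
function classifyResultCount(count, band) {
  if (!Number.isFinite(count)) return "missing";
  if (count < band.min) return "recall-collapse";
  if (count > band.max) return "precision-spike";
  return "ok";
}

// Return every demo whose measured count falls outside its band.
function findBaselineViolations(baseline, measured) {
  return Object.entries(baseline)
    .map(([demo, band]) => ({
      demo,
      count: measured[demo],
      status: classifyResultCount(measured[demo], band),
    }))
    .filter((entry) => entry.status !== "ok");
}

// Hypothetical fixture and measurements, not real project data.
const baseline = {
  recipes: { min: 40, max: 120 },
  docs: { min: 10, max: 30 },
};
const measured = { recipes: 8, docs: 22 };

console.log(findBaselineViolations(baseline, measured));
```

Checking both edges of the band is the point of the design: a one-sided floor would catch a repeat of the 690a2288 count drop but would miss a guard that became too permissive.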

Changed

  • Opened 1.0.2-dev development cycle.
  • Scoring default tuning from the full-matrix sweep (120+ queries, 9 demos). crossListBonus default 0.15 -> 0.05 (a smaller tie-breaker that doesn't override single-source precision); recencyBoostMax default 0.5 -> 0.25, with preset overrides reference: 0, content_catalog: 0, and blog: 0.25 (recency adds noise on non-time-sensitive content); titleMatchBoost default 1.0 -> 2.0 (improves top-1 precision across all demos; already shipped in the reference/content_catalog presets). These are ranking-quality changes independent of the #156 recall fix.
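For reference, the defaults and preset overrides named in these notes can be collected into one resolution sketch. The values are taken from the entries above; the object names, the resolveConfig function, and the merge order (defaults, then preset, then site overrides) are assumptions, and real presets may set further keys not listed here.

```javascript
// Defaults as stated in these release notes.
const DEFAULTS = {
  crossListBonus: 0.05, // was 0.15
  recencyBoostMax: 0.25, // was 0.5
  titleMatchBoost: 2.0, // was 1.0
  expandSubwordMaxFrequency: 0.05,
  expandSubwordDenyList: [],
};

// Only the per-preset values mentioned in the notes; other keys are omitted.
const PRESET_OVERRIDES = {
  reference: { recencyBoostMax: 0 },
  content_catalog: { recencyBoostMax: 0, expandSubwordMaxFrequency: 0.1 },
  blog: { recencyBoostMax: 0.25 },
  ecommerce: {},
  none: { expandSubwordMaxFrequency: 0.1 },
};

// Hypothetical resolver: later sources win over earlier ones.
function resolveConfig(preset, siteOverrides = {}) {
  return {
    ...DEFAULTS,
    ...(PRESET_OVERRIDES[preset] ?? {}),
    ...siteOverrides,
  };
}

console.log(resolveConfig("content_catalog", { expandSubwordDenyList: ["hot"] }));
```

A site that relied on the old ranking could, under this assumed merge order, pin the previous values explicitly, for example resolveConfig("blog", { crossListBonus: 0.15, titleMatchBoost: 1.0 }).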