| name | web-to-semantic |
|---|---|
| description | Wire the self-hosted web-ingest engine (A) to the query-indexed semantic layer (B) for the WEB source kind. Recall first; fetch only the gap; distill one sourced fact per page; verify the passage at its source; then GATE every fact through the fetch manifest so only pages we actually fetched can be persisted, closing the hole where a plausible-but-never-fetched URL would pass the schema's source-presence check. Specializes semantic-ingest-loop; does not fork it. |
| category | research |
| trigger | When a topic must be answered from the open web and the result should become cheaply recallable; when "did we actually fetch the page this claim cites, or did we just believe it exists?" must be provable before a web fact is stored; whenever web-ingest and the semantic layer are used together. |
| agents | |
| shapes | |
| input | A topic or question to answer from the web. The agent's domain. Optional acquisition budget (pages/depth). |
| output | A validated semantic-layer.yaml entry in which every fact's source is a URL this run fetched, Cortex memories for full content, sources[] auto-filled from the manifest, and gaps recording every claim that could not be grounded (blocked / TLS-failed / dropped). |
| zetetic_gate | |
| composes | |
| aliases | |
| hand_off | |

semantic-ingest-loop specialized for the web source kind, with one structural addition: a
manifest-membership gate inserted between distill and persist. It is the only place A
(tools/web-ingest.sh) and B (tools/semantic-layer.sh) are wired together — the tools import none
of each other (DIP / coding-standards §5). The wiring is two gates in series:
- membership (`tools/manifest-gate.sh`): is the fact's source URL one we actually fetched?
- presence (`semantic_layer.validate_entry`, unchanged): does the fact carry a source string at all?

A fact must pass both. Presence alone (B's existing check) cannot distinguish a real fetched URL from a plausible one the model invented; membership can, because it has the fetch manifest.
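
The difference between the two checks can be shown with two stand-in predicates. This is a minimal sketch, not the real `validate_entry` or `tools/manifest-gate.sh`: `has_source` and `is_fetched` are hypothetical names, and the fact shape (`claim`, `source`) is an assumption.

```python
from typing import Any, Dict, Set


def has_source(fact: Dict[str, Any]) -> bool:
    """Presence (stand-in): the fact carries a non-empty source string."""
    return bool(str(fact.get("source", "")).strip())


def is_fetched(fact: Dict[str, Any], manifest: Set[str]) -> bool:
    """Membership (stand-in): the source is a URL this run actually fetched."""
    return fact.get("source") in manifest


# A plausible URL that was never fetched (hypothetical example data).
invented = {"claim": "some claim", "source": "https://example.com/never-fetched"}
manifest = {"https://example.com/fetched"}

print(has_source(invented), is_fetched(invented, manifest))  # True False
```

Presence accepts the invented URL; only membership rejects it, which is why both checks run in series.
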
- INV-1: recall before ingest. Query the semantic layer and Cortex before fetching a single page. A fresh recall HIT fetches nothing. (semantic-ingest-loop Phase 1; unchanged.)
- INV-2: source in manifest. Every persisted fact's source URL MUST be a page this run fetched (present in the manifest). Stricter than §8's "has a source": the source must be one we fetched and read, not one we believe exists. The membership gate enforces it; ungrounded facts never reach `record`. Fail closed: what is not provably grounded is dropped, not stored.
- INV-3: no coupling. `web_ingest`, `semantic_layer`, and `manifest_gate` import none of each other. This skill is the sole wiring point, so each tool stays independently testable and replaceable (swap the ingestion engine without touching the layer; OCP).
- INV-4: never close without the learn signal. On exit the entry records `outcome` + `gaps` (firecrawl's `missingContent`). Blocked, TLS-failed, and gate-dropped facts all become gaps that seed the next visit's backlog. A topic never closes without it.

Steps 0, 1, 3, and 7 ARE `semantic-ingest-loop` Phases 0, 1, 3, and 5; follow them there. This skill adds Step 2 (build the manifest), Step 4 (verify-at-source annotation), Step 5 (the gate), and specializes Step 6 (persist) to auto-fill `sources[]` from the manifest.

Step 0. Restate the topic as one natural-language question (the human key). Derive a stable kebab-case `query_id` and 1–3 aliases. The same topic must always map to the same id, or recall misses and the topic forks. (= semantic-ingest-loop Phase 0.)
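
One way to keep the id stable is to derive it mechanically from the restated question. The sketch below is an assumption, not part of `semantic-ingest-loop`: `to_query_id` is a hypothetical helper that folds accents, lowercases, and joins alphanumeric runs with hyphens.

```python
import re
import unicodedata


def to_query_id(question: str) -> str:
    """Hypothetical helper: turn a question into a deterministic kebab-case id."""
    # Fold accented characters to ASCII so "café" and "cafe" collapse to one id.
    folded = unicodedata.normalize("NFKD", question)
    ascii_text = folded.encode("ascii", "ignore").decode("ascii")
    # Keep lowercase alphanumeric runs; punctuation and whitespace act as separators.
    words = re.findall(r"[a-z0-9]+", ascii_text.lower())
    return "-".join(words)


print(to_query_id("How does HTTP/3 avoid head-of-line blocking?"))
print(to_query_id("  how does http/3 avoid HEAD-OF-LINE blocking  "))
# both print: how-does-http-3-avoid-head-of-line-blocking
```

A pure function like this only guarantees that identical wording maps to the same id. Rephrasings of the same topic still rely on the aliases to avoid a fork.
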

Step 1. Run `tools/semantic-layer.sh query "<topic>"` and `cortex recall` the same query.
- HIT & `status: fresh` → load `summary`, `facts`, `pointers.cortex_memories`; jump to Step 6 (persist only if you learned something new; a clean HIT still owes Step 7).
- HIT & `stale`/`superseded`, or MISS → continue. Existing `gaps` from a prior visit are the priority acquisition backlog.

(= semantic-ingest-loop Phase 1.)

Step 2. Fetch only the gap with the self-hosted engine; run diverse framings (peirce), not one query:

    tools/web-ingest.sh map "<seed-url>" --search "<term>"                          # discover candidate URLs
    tools/web-ingest.sh scrape "<url>" --out /tmp/ingest/<query_id>                 # fetch one page
    tools/web-ingest.sh crawl "<seed-url>" --depth 1 --out /tmp/ingest/<query_id>   # fetch a small set

The manifest is the set of URLs you FETCHED, not the ones you discovered. Build it from the `url` field of each fetched record:
- with `--out DIR`: the `url` values in `DIR/index.json`.
- to stdout: the `.url` of each scrape/crawl record.

Never build the manifest from `map` output or a record's `links[]`: those are discovered, not fetched. Grounding a fact against a link you never opened is exactly the hole this gate closes. `map` only tells Step 2 what to scrape next.
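
The manifest rule can be sketched as a small function over fetched records. The shape of `index.json` is an assumption here (a JSON array of records, each with a `url` and a `links` list), and `manifest_from_records` is a hypothetical helper, not part of `tools/web-ingest.sh`.

```python
import json
from typing import Any, Dict, List


def manifest_from_records(records: List[Dict[str, Any]]) -> List[str]:
    """Collect the URL of every fetched record, in order, without duplicates."""
    manifest: List[str] = []
    seen = set()
    for record in records:
        url = record.get("url")
        # record["links"] is deliberately ignored: links are discovered, not fetched.
        if isinstance(url, str) and url and url not in seen:
            seen.add(url)
            manifest.append(url)
    return manifest


# Assumed content of DIR/index.json (hypothetical example data).
index_json = """[
  {"url": "https://example.com/a", "links": ["https://example.com/b"]},
  {"url": "https://example.com/c", "links": []},
  {"url": "https://example.com/a", "links": []}
]"""
manifest = manifest_from_records(json.loads(index_json))
print(manifest)  # ['https://example.com/a', 'https://example.com/c']
```

Note that `https://example.com/b` never enters the manifest: it appears only as a link on a fetched page.
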

Step 3. One fact = one verifiable claim + the exact page URL it came from (the URL of the page whose markdown you read, not a parent or a sibling link). No URL → drop it. (= semantic-ingest-loop Phase 3.)

Step 4. Read the actual passage in the fetched markdown that supports each load-bearing claim; do not trust a snippet. Optionally record the proof inline by annotating the source: `"<url> -> \"the exact quoted passage\""`. The gate strips the `-> "…"` suffix before comparing, so the annotation is free to keep. Use two-independent-methods for high-stakes facts.
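
A minimal sketch of the annotation handling, assuming the annotated source has the form `<url> -> "quote"` once JSON-decoded. Both helpers are hypothetical; how the real gate parses the suffix is not specified here.

```python
from typing import Optional, Tuple


def split_annotation(source: str) -> Tuple[str, Optional[str]]:
    """Split '<url> -> "quote"' into (url, quote); quote is None if absent."""
    url, arrow, rest = source.partition(" -> ")
    if not arrow:
        return source.strip(), None
    return url.strip(), rest.strip().strip('"')


def passage_present(markdown: str, quote: str) -> bool:
    """Check the quoted passage occurs in the fetched markdown, ignoring whitespace runs."""
    def normalise(text: str) -> str:
        return " ".join(text.split())
    return normalise(quote) in normalise(markdown)


url, quote = split_annotation('https://example.com/a -> "the exact quoted passage"')
page = "Intro.\nThe page says the exact   quoted\npassage here."
print(url, passage_present(page, quote))  # https://example.com/a True
```

The bare URL is what gets compared against the manifest; the quote is what a verify-at-source check can look for in the page text.
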

Step 5. Mass-balance the entry against what you fetched: every claimed source must appear as a fetched input. Build `{"entry": <draft>, "manifest": [<fetched urls>]}` and pipe it through the membership gate:

    echo "$payload" | tools/manifest-gate.sh          # all grounded -> entry on stdout, exit 0
    echo "$payload" | tools/manifest-gate.sh --drop   # keep grounded facts, demote the rest to gaps

- exit 0 → the emitted entry is grounded; carry it to Step 6.
- exit 3 → ≥1 ungrounded fact. The gate emits `{"gaps":[…]}` and withholds the entry. Do not weaken the gate. Either fetch the missing page (back to Step 2) or run `--drop` to persist the grounded survivors and let the ungrounded claims become gaps. If `--drop` leaves zero facts, the fetched pages support no claim: widen acquisition; do not lower the bar.
- exit 2 → malformed payload (e.g. you forgot the manifest). Fix the payload.
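
The gate's behaviour can be approximated in a few lines. This is a sketch under stated assumptions, not the source of `tools/manifest-gate.sh`: the fact fields (`claim`, `source`), the gap record shape, and the exit code returned in drop mode (0 here) are all assumptions.

```python
from typing import Any, Dict, Tuple


def bare_url(source: str) -> str:
    """Drop an optional ' -> "quote"' annotation, leaving the URL to compare."""
    return source.partition(" -> ")[0].strip()


def gate(payload: Any, drop: bool = False) -> Tuple[int, Dict[str, Any]]:
    """Return (exit_code, output) following the exit codes described above."""
    if (not isinstance(payload, dict)
            or not isinstance(payload.get("entry"), dict)
            or not isinstance(payload.get("manifest"), list)):
        return 2, {"error": "malformed payload"}

    entry, fetched = payload["entry"], set(payload["manifest"])
    grounded, gaps = [], []
    for fact in entry.get("facts", []):
        if bare_url(str(fact.get("source", ""))) in fetched:
            grounded.append(fact)
        else:
            # Assumed gap shape; the real gate's gap fields are not specified here.
            gaps.append({"claim": fact.get("claim"), "source": fact.get("source")})

    if not gaps:
        return 0, entry
    if not drop:
        return 3, {"gaps": gaps}  # fail closed: withhold the whole entry
    # Assumption: in drop mode the survivors are emitted with exit code 0.
    survivors = dict(entry, facts=grounded, gaps=entry.get("gaps", []) + gaps)
    return 0, survivors


payload = {
    "entry": {"query_id": "demo", "facts": [
        {"claim": "A", "source": 'https://example.com/a -> "quoted"'},
        {"claim": "B", "source": "https://example.com/never-fetched"},
    ]},
    "manifest": ["https://example.com/a"],
}
print(gate(payload)[0])             # 3
print(gate(payload, drop=True)[0])  # 0
print(gate({"entry": {}})[0])       # 2
```

In this sketch the default path never returns a partial entry; keeping survivors requires the explicit `drop=True`, mirroring the deliberate `--drop` choice.
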

Step 6. Persist:
- `cortex remember` each grounded, durable fact → returns a memory id.
- Build the final entry: put the ids in `facts[].cortex_id` and `pointers.cortex_memories`; auto-fill `sources[]` from the manifest, one `{ref: <url>, kind: web, accessed: <date>, valuable: …}` per fetched page.
- `echo '<entry-json>' | tools/semantic-layer.sh record`. This runs the presence gate (`validate_entry`, §8), the second gate in the series, and upserts by `query_id`.

(Persist mechanics = semantic-ingest-loop Phase 4, with sources[] sourced from the manifest.)
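
Auto-filling `sources[]` amounts to mapping each manifest URL to one record. The sketch below assumes an ISO date string for `accessed` and leaves `valuable` out, since its value is not specified in this skill; `sources_from_manifest` is a hypothetical helper.

```python
from datetime import date
from typing import Dict, List, Optional


def sources_from_manifest(manifest: List[str],
                          accessed: Optional[str] = None) -> List[Dict[str, str]]:
    """Build one source record per fetched page (the 'valuable' field is left to the caller)."""
    # Assumption: 'accessed' is an ISO-8601 date; default to today if not supplied.
    accessed = accessed or date.today().isoformat()
    return [{"ref": url, "kind": "web", "accessed": accessed} for url in manifest]


sources = sources_from_manifest(
    ["https://example.com/a", "https://example.com/c"], accessed="2024-01-01"
)
print(sources[0])  # {'ref': 'https://example.com/a', 'kind': 'web', 'accessed': '2024-01-01'}
```

Because the list is derived from the manifest rather than from the facts, a page that was fetched but yielded no fact still appears as a source.
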

Step 7. Close with the learn signal (INV-4):

    echo '{"outcome":{...},"gaps":[...],"status":"fresh"}' | tools/semantic-layer.sh feedback <query_id>

`gaps` = everything you could not ground: pages blocked by robots.txt, TLS/network failures, and every fact the manifest gate dropped (the gate's `{"gaps":[…]}` output drops straight in here). Set freshness by volatility: fast-moving topic → short `ttl_days`; `watch: true` to resurface it. (= semantic-ingest-loop Phase 5; never skipped, even on a clean HIT.)

## Web -> Semantic: [query]
### query_id: [slug] recall: [HIT-fresh | HIT-stale | MISS] manifest: [N pages fetched]
### Grounded facts persisted (each fetched + sourced):
- [claim] — [url] (cortex: [id])
### Gate result: [all-grounded | dropped K ungrounded | --drop survivors=M]
### Gaps recorded (next ingest backlog):
- [topic] — [open|researching] (blocked / TLS / ungrounded)
### Outcome: [good|partial|bad] Revalidate after: [date]
- Two gates in series, both mandatory: membership (fetched?) then presence (sourced?). A fact that passes one but not the other is not stored.
- Manifest = fetched, never discovered. `map` / `links[]` describe what could be fetched; only the `url` of a fetched record may enter the manifest.
- Fail closed. The gate's default is exit 3 (withhold the whole entry). `--drop` is a deliberate choice to keep survivors, not a way to wave claims through.
- No coupling, no forking. The tools stay decoupled; this skill stays a thin specialization of `semantic-ingest-loop`: fix the loop once, both inherit it.