zetetic-team-subagents/skills/research/web-to-semantic.md at main · cdeust/zetetic-team-subagents

name

web-to-semantic

description

Wire the self-hosted web-ingest engine (A) to the query-indexed semantic layer (B) for the WEB source kind. Recall first; fetch only the gap; distill one sourced fact per page; verify the passage at its source; then GATE every fact through the fetch manifest so only pages we actually fetched can be persisted — closing the hole where a plausible-but-never-fetched URL would pass the schema's source-presence check. Specializes semantic-ingest-loop; does not fork it.

What this skill is

This skill is semantic-ingest-loop specialized for the web source kind, with one structural addition: a manifest-membership gate inserted between distill and persist. It is the only place where A (tools/web-ingest.sh) and B (tools/semantic-layer.sh) are wired together; neither tool imports the other (DIP / coding-standards §5). The wiring is two gates in series:

membership — tools/manifest-gate.sh: is the fact's source URL one we actually fetched?
presence — semantic_layer.validate_entry (unchanged): does the fact carry a source string at all?

A fact must pass both. Presence alone (B's existing check) cannot distinguish a real fetched URL from a plausible one the model invented; membership can, because it has the fetch manifest.
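The difference between the two checks can be sketched in a few lines of Python. This is an illustration only: `has_source` and `is_fetched` are hypothetical stand-ins, not the real `validate_entry` or `manifest-gate.sh` logic, and the fact shape (a dict with a `source` key) and the example URLs are assumptions.

```python
# Illustrative only: hypothetical stand-ins for the two gates.
# Assumption: a fact is a dict with "claim" and "source" keys.

def has_source(fact: dict) -> bool:
    """Presence: the fact carries a non-empty source string."""
    source = fact.get("source")
    return isinstance(source, str) and bool(source.strip())


def is_fetched(fact: dict, manifest: set) -> bool:
    """Membership: the fact's source is a URL this run actually fetched."""
    return fact.get("source") in manifest


manifest = {"https://example.com/docs/install"}
real = {"claim": "claim read on a fetched page",
        "source": "https://example.com/docs/install"}
invented = {"claim": "claim with a plausible but never-fetched URL",
            "source": "https://example.com/docs/legacy"}

# Both facts pass presence; only the fetched one passes membership.
for fact in (real, invented):
    print(has_source(fact), is_fetched(fact, manifest))
```

The invented URL is well-formed and non-empty, so a presence check alone would accept it; only the comparison against the manifest rejects it.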

Invariants (state these before acting)

INV-1 — recall before ingest. Query the semantic layer and Cortex before fetching a single page. A fresh recall HIT fetches nothing. (semantic-ingest-loop Phase 1; unchanged.)
INV-2 — source in manifest. Every persisted fact's source URL MUST be a page this run fetched (present in the manifest). Stricter than §8's "has a source": the source must be one we fetched and read, not one we believe exists. The membership gate enforces it; ungrounded facts never reach record. Fail closed — what is not provably grounded is dropped, not stored.
INV-3 — no coupling. web_ingest, semantic_layer, and manifest_gate import none of each other. This skill is the sole wiring point, so each tool stays independently testable and replaceable (swap the ingestion engine without touching the layer — OCP).
INV-4 — never close without the learn signal. On exit the entry records outcome + gaps (firecrawl's missingContent). Blocked, TLS-failed, and gate-dropped facts all become gaps that seed the next visit's backlog. A topic never closes without it.

Procedure (7 steps + a normalize pre-step)

Steps 0, 1, 3, and 7 ARE semantic-ingest-loop Phases 0, 1, 3, and 5 — follow them there. This skill adds Step 2 (build the manifest), Step 4 (verify-at-source annotation), Step 5 (the gate), and specializes Step 6 (persist) to auto-fill sources[] from the manifest.

Step 0 — Normalize the query (shannon + ranganathan)

Restate the topic as one natural-language question (the human key). Derive a stable kebab-case query_id and 1–3 aliases. Same topic must always map to the same id, or recall misses and the topic forks. (= semantic-ingest-loop Phase 0.)
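One way to derive a stable kebab-case id is sketched below. The helper name `query_id_for` and the normalization rules are assumptions for illustration, not the loop's actual implementation.

```python
import re
import unicodedata


def query_id_for(question: str) -> str:
    """Derive a kebab-case id from a question (hypothetical helper)."""
    # Fold accented characters to ASCII and lowercase the text.
    text = unicodedata.normalize("NFKD", question)
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    # Collapse every run of non-alphanumerics into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")


print(query_id_for("How does Firecrawl handle robots.txt?"))
# how-does-firecrawl-handle-robots-txt
```

A purely textual slug like this is stable only for identical wording. Paraphrases of the same topic still need to be caught by the aliases, which is why the step asks for 1–3 of them.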

Step 1 — RECALL FIRST (INV-1)

tools/semantic-layer.sh query "<topic>" and cortex recall the same query.

HIT & status: fresh → load summary, facts, pointers.cortex_memories; jump to Step 6 (persist only if you learned something new — a clean HIT still owes Step 7).
HIT & stale/superseded, or MISS → continue. Existing gaps from a prior visit are the priority acquisition backlog. (= semantic-ingest-loop Phase 1.)
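The branching above can be written down as a small decision function. This is a sketch under assumptions: `plan_after_recall` is a hypothetical helper, and the recalled entry is assumed to be a dict with `status` and `gaps` keys (or `None` on a MISS).

```python
from typing import Optional


def plan_after_recall(recall: Optional[dict]) -> tuple:
    """Return (next_step, backlog) for a recall result.

    recall is None on a MISS, otherwise the recalled entry as a dict.
    """
    if recall is None:
        return 2, []  # MISS: acquire from scratch
    if recall.get("status") == "fresh":
        return 6, []  # fresh HIT: fetch nothing, go to persist/feedback
    # Stale or superseded HIT: prior gaps lead the acquisition backlog.
    return 2, list(recall.get("gaps") or [])
```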

Step 2 — Acquire, and BUILD THE MANIFEST (peirce)

Fetch only the gap, with the self-hosted engine — run diverse framings (peirce), not one query:

```shell
tools/web-ingest.sh map    "<seed-url>" --search "<term>"                       # discover candidate URLs
tools/web-ingest.sh scrape "<url>" --out /tmp/ingest/<query_id>                 # fetch one page
tools/web-ingest.sh crawl  "<seed-url>" --depth 1 --out /tmp/ingest/<query_id>  # fetch a small set
```

The manifest is the set of URLs you FETCHED — not the ones you discovered. Build it from the url field of each fetched record:

with --out DIR: the url values in DIR/index.json.
to stdout: the .url of each scrape/crawl record.

Never build the manifest from map output or a record's links[] — those are discovered, not fetched. Grounding a fact against a link you never opened is exactly the hole this gate closes. map only tells Step 2 what to scrape next.
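A minimal sketch of building the manifest from an `--out` directory follows. It assumes that `index.json` holds a JSON array of records, each with a `url` key; the text only says the `url` values live there, so adjust the parsing if the real layout differs. `build_manifest` is a hypothetical helper name.

```python
import json
from pathlib import Path


def build_manifest(out_dir: str) -> list:
    """Collect the url of every fetched record in DIR/index.json.

    Assumption: index.json is a JSON array of objects with a "url" key.
    """
    index_path = Path(out_dir, "index.json")
    records = json.loads(index_path.read_text(encoding="utf-8"))
    manifest = []
    for record in records:
        url = record.get("url")
        # Only the record's own url counts. Its links[] are discovered,
        # not fetched, so they are deliberately ignored.
        if url and url not in manifest:
            manifest.append(url)
    return manifest
```

The function never reads `links[]`, which keeps discovered-but-unopened URLs out of the manifest by construction.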

Step 3 — Distill (feynman)

One fact = one verifiable claim + the exact page URL it came from (the URL of the page whose markdown you read, not a parent or a sibling link). No URL → drop it. (= semantic-ingest-loop Phase 3.)

Step 4 — Verify at the source (popper)

Read the actual passage in the fetched markdown that supports each load-bearing claim — do not trust a snippet. Optionally record the proof inline by annotating the source: "<url> -> \"the exact quoted passage\"". The gate strips the -> "…" suffix before comparing, so the annotation is free to keep. Use two-independent-methods for high-stakes facts.
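The suffix stripping described above might look like the following. This is a sketch, not the gate's real parser: `strip_annotation` is a hypothetical name, and it assumes the separator is exactly ` -> ` with surrounding spaces.

```python
def strip_annotation(source: str) -> str:
    """Drop an optional ' -> "quoted passage"' suffix, keeping the URL."""
    # partition splits on the first " -> "; a bare URL comes back unchanged.
    url, _sep, _passage = source.partition(" -> ")
    return url.strip()


print(strip_annotation('https://example.com/a -> "the exact quoted passage"'))
# https://example.com/a
```

Because a URL cannot contain a raw space, splitting on the first ` -> ` leaves the URL intact even when the quoted passage itself contains an arrow.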

Step 5 — GATE on the manifest (lavoisier + liskov) — INV-2

Mass-balance the entry against what you fetched: every claimed source must appear as a fetched input. Build {"entry": <draft>, "manifest": [<fetched urls>]} and pipe it through the membership gate:

```shell
echo "$payload" | tools/manifest-gate.sh            # all grounded -> entry on stdout, exit 0
echo "$payload" | tools/manifest-gate.sh --drop     # keep grounded facts, demote the rest to gaps
```

exit 0 → the emitted entry is grounded; carry it to Step 6.
exit 3 → ≥1 ungrounded fact. The gate emits {"gaps":[…]} and withholds the entry. Do not weaken the gate. Either fetch the missing page (back to Step 2) or run --drop to persist the grounded survivors and let the ungrounded claims become gaps. If --drop leaves zero facts, the fetched pages support no claim — widen acquisition; do not lower the bar.
exit 2 → malformed payload (e.g. you forgot the manifest). Fix the payload.
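The exit-code contract above can be modelled as a pure function. This is a sketch under stated assumptions, not the code of `tools/manifest-gate.sh`: the `source` key on each fact, the shape of a gap (here, the dropped fact itself), and exit 0 under `--drop` are all assumptions, since the text does not specify them.

```python
from typing import Any


def gate(payload: Any, drop: bool = False) -> tuple:
    """Return (exit_code, output) for a {"entry", "manifest"} payload."""
    if not isinstance(payload, dict):
        return 2, {"error": "payload must be a JSON object"}
    entry, manifest = payload.get("entry"), payload.get("manifest")
    if not isinstance(entry, dict) or not isinstance(manifest, list):
        return 2, {"error": "payload needs an entry object and a manifest list"}

    fetched = {url for url in manifest if isinstance(url, str)}
    grounded, ungrounded = [], []
    for fact in entry.get("facts") or []:
        source = fact.get("source", "") if isinstance(fact, dict) else ""
        # Ignore an optional ' -> "passage"' annotation when comparing.
        url = str(source).partition(" -> ")[0].strip()
        (grounded if url in fetched else ungrounded).append(fact)

    if ungrounded and not drop:
        return 3, {"gaps": ungrounded}  # fail closed: withhold the entry

    kept = dict(entry, facts=grounded)
    if ungrounded:
        # --drop: survivors stay, the rest is demoted to gaps.
        kept["gaps"] = list(entry.get("gaps") or []) + ungrounded
    return 0, kept
```

Note that a fact with no source at all is treated as ungrounded rather than as an error, so the function fails closed on missing data as well as on invented URLs.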

Step 6 — Persist (dual write; presence gate runs here)

cortex remember each grounded, durable fact → returns a memory id.
Build the final entry: put the ids in facts[].cortex_id and pointers.cortex_memories; auto-fill sources[] from the manifest, with one {ref: <url>, kind: web, accessed: <date>, valuable: …} per fetched page.
echo '<entry-json>' | tools/semantic-layer.sh record. This runs the presence gate (validate_entry, §8) — the second gate in the series — and upserts by query_id. (Persist mechanics = semantic-ingest-loop Phase 4, with sources[] sourced from the manifest.)
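Auto-filling sources[] from the manifest could be sketched as below. `sources_from_manifest` is a hypothetical helper, the ISO date format for `accessed` is an assumption, and the `valuable` field is left out because the text does not say how its value is decided.

```python
from datetime import date
from typing import Optional


def sources_from_manifest(manifest: list, accessed: Optional[str] = None) -> list:
    """Build one sources[] record per fetched page.

    Assumption: "accessed" is an ISO date string (YYYY-MM-DD).
    The "valuable" field is not set here; the caller fills it in.
    """
    stamp = accessed or date.today().isoformat()
    return [{"ref": url, "kind": "web", "accessed": stamp} for url in manifest]
```

Deriving sources[] from the manifest rather than from the facts means every fetched page is recorded, including pages that yielded no surviving fact.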

Step 7 — Feedback, the learn signal (INV-4)

```shell
echo '{"outcome":{...},"gaps":[...],"status":"fresh"}' | tools/semantic-layer.sh feedback <query_id>
```

gaps = everything you could not ground: pages blocked by robots.txt, TLS/network failures, and every fact the manifest gate dropped (the gate's {"gaps":[…]} output drops straight in here). Set freshness by volatility — fast-moving topic → short ttl_days; watch: true to resurface it. (= semantic-ingest-loop Phase 5; never skipped, even on a clean HIT.)

Output Format

## Web -> Semantic: [query]

### query_id: [slug]   recall: [HIT-fresh | HIT-stale | MISS]   manifest: [N pages fetched]

### Grounded facts persisted (each fetched + sourced):
- [claim]  — [url]  (cortex: [id])

### Gate result: [all-grounded | dropped K ungrounded | --drop survivors=M]

### Gaps recorded (next ingest backlog):
- [topic] — [open|researching]   (blocked / TLS / ungrounded)

### Outcome: [good|partial|bad]   Revalidate after: [date]

Invariants recap (the gate is the point)

Two gates in series, both mandatory: membership (fetched?) then presence (sourced?). A fact that passes one but not the other is not stored.
Manifest = fetched, never discovered. map/links[] describe what could be fetched; only the url of a fetched record may enter the manifest.
Fail closed. When any fact is ungrounded, the gate's default is exit 3 (withhold the whole entry). --drop is a deliberate choice to keep survivors, not a way to wave claims through.
No coupling, no forking. The tools stay decoupled; this skill stays a thin specialization of semantic-ingest-loop — fix the loop once, both inherit it.
