A local-first web-to-RAG ingestion quality kit for turning crawled pages into auditable chunks, retrieval fixtures, success reports, and cost reports.
Most crawling examples stop at extracted HTML or Markdown. This repo focuses on the handoff after extraction: deterministic chunks, metadata that survives debugging, retrieval checks, and transparent cost estimates before scaling.
Ships today:
- Local page fixtures instead of live crawling
- Deterministic word-window chunking
- Chunk metadata with source URL, title, ordinal, token estimate, and version
- Retrieval fixture evaluation for known queries
- Ingestion success report
- Cost report with page and chunk estimates
- FastAPI health and reports endpoints
- Local-only demo, tests, CI, package metadata, trust docs, production docs, and public boundary scan
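As an illustration of the chunking and metadata items above, here is a minimal sketch of deterministic word-window chunking. It is not the package's actual implementation: the `Chunk` class, the `chunk_page` function, the `window` and `overlap` defaults, and the 4-tokens-per-3-words heuristic are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    # Hypothetical record mirroring the metadata fields listed above.
    source_url: str
    title: str
    ordinal: int          # position of the chunk within its page
    text: str
    token_estimate: int   # rough estimate, not a real tokenizer count
    version: str


def chunk_page(source_url, title, text, window=200, overlap=40, version="v1"):
    """Split text into fixed-size word windows with a fixed overlap."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    words = text.split()
    step = window - overlap
    chunks = []
    for ordinal, start in enumerate(range(0, len(words), step)):
        piece = words[start:start + window]
        chunks.append(Chunk(
            source_url=source_url,
            title=title,
            ordinal=ordinal,
            text=" ".join(piece),
            # Assumed heuristic: about 4 tokens per 3 words, rounded.
            token_estimate=(len(piece) * 4 + 2) // 3,
            version=version,
        ))
        if start + window >= len(words):
            break  # this window already reached the end of the page
    return chunks
```

Because the split depends only on the input text and the two integer parameters, running it twice on the same page yields identical chunks and ordinals, which is what makes a chunk traceable back to its source while debugging.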
    python3 -m venv .venv
    source .venv/bin/activate
    python -m pip install -e '.[dev]'
    pytest
    python examples/local-only-demo/demo_client.py

Run the FastAPI app:

    uvicorn prodkit_rag_ingestion.app:app --reload

Read reports:

    curl -s http://127.0.0.1:8000/reports

Inspect pages and chunks:

    curl -s http://127.0.0.1:8000/pages
    curl -s http://127.0.0.1:8000/chunks

Generate fixtures:

    python scripts/generate_reports.py

Run the public boundary scan:

    python scripts/scan_public_boundary.py

- examples/fixtures/chunks.jsonl
- examples/fixtures/ingestion-success-report.json
- examples/fixtures/retrieval-report.json
- examples/fixtures/cost-report.json
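To give a sense of what a retrieval fixture check involves, the sketch below scores chunks against known queries and records whether the expected source URL appears in the top `k` results. The field names (`query`, `expected_url`, `source_url`, `text`), the word-overlap scorer, and the report shape are assumptions for illustration. They are not the schema of the fixture files listed above.

```python
def score(query, text):
    """Count distinct query words that also appear in the chunk text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def evaluate_fixtures(chunks, fixtures, k=3):
    """Return one row per query plus an overall hit rate."""
    rows = []
    for fixture in fixtures:
        # sorted() is stable, so ties keep input order and runs are repeatable.
        ranked = sorted(
            chunks,
            key=lambda c: score(fixture["query"], c["text"]),
            reverse=True,
        )
        top_urls = [c["source_url"] for c in ranked[:k]]
        rows.append({
            "query": fixture["query"],
            "expected_url": fixture["expected_url"],
            "hit": fixture["expected_url"] in top_urls,
        })
    hits = sum(row["hit"] for row in rows)
    return {
        "k": k,
        "queries": len(rows),
        "hits": hits,
        "hit_rate": hits / len(rows) if rows else 0.0,
        "rows": rows,
    }
```

A lexical scorer like this is deliberately simple. It needs no embedding model or network access, so a regression in chunking shows up as a changed hit rate rather than as noise from a model.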
This is not a hosted crawler, a live scraping service, or a vector database. It is a local reference kit for testing the quality of web-to-RAG ingestion before connecting live crawling and indexing systems.
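Since the kit is meant to run before any live system is connected, a cost estimate can be derived from page and chunk counts alone. The following is a rough sketch under stated assumptions: `estimate_cost`, its parameters, and the prices passed to it are hypothetical placeholders, not real vendor rates or the format of the kit's cost report.

```python
def estimate_cost(pages, chunk_token_estimates,
                  price_per_page=0.0, price_per_1k_tokens=0.0):
    """Estimate crawl and embedding cost from local counts only."""
    total_tokens = sum(chunk_token_estimates)
    crawl_cost = pages * price_per_page
    embedding_cost = total_tokens / 1000 * price_per_1k_tokens
    return {
        "pages": pages,
        "chunks": len(chunk_token_estimates),
        "estimated_tokens": total_tokens,
        "crawl_cost": round(crawl_cost, 6),
        "embedding_cost": round(embedding_cost, 6),
        "total_cost": round(crawl_cost + embedding_cost, 6),
    }
```

Keeping each input and each line item visible in the result makes it easy to see which assumption drives the total when page counts grow.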
- Production docs map
- Chunk metadata
- Retrieval fixtures
- Success reports
- Cost controls
- Deployment
- Observability
- Troubleshooting
MIT