prodkit-labs/crawlee-rag-ingestion-production-kit

Crawlee RAG Ingestion Production Kit

Python FastAPI License: MIT

A local-first web-to-RAG ingestion quality kit for turning crawled pages into auditable chunks, retrieval fixtures, success reports, and cost reports.

Most crawling examples stop at extracted HTML or Markdown. This repo focuses on the handoff after extraction: deterministic chunks, metadata that survives debugging, retrieval checks, and transparent cost estimates before scaling.
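
To make "deterministic chunks" and "metadata that survives debugging" concrete, here is a minimal sketch of word-window chunking. It is not the kit's actual implementation: the `Chunk` dataclass, the `chunk_page` function, the `window`/`overlap` parameters, the hashed `chunk_id`, and the 4/3 tokens-per-word heuristic are all assumptions made for illustration.

```python
import hashlib
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    # Hypothetical record shape; field names are assumptions, not the kit's schema.
    chunk_id: str
    source_url: str
    title: str
    ordinal: int
    token_estimate: int
    version: str
    text: str


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 tokens for every 3 words.
    return math.ceil(len(text.split()) * 4 / 3)


def chunk_page(source_url, title, text, window=200, overlap=40, version="v1"):
    """Split text into fixed-size word windows with a fixed overlap.

    The same input always yields the same chunks and the same ids,
    because nothing depends on time, randomness, or iteration order.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    words = text.split()
    step = window - overlap
    chunks = []
    for ordinal, start in enumerate(range(0, len(words), step)):
        piece = " ".join(words[start:start + window])
        # Content-derived id: stable across runs, changes when the text changes.
        key = f"{source_url}|{version}|{ordinal}|{piece}"
        chunk_id = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
        chunks.append(
            Chunk(chunk_id, source_url, title, ordinal,
                  estimate_tokens(piece), version, piece)
        )
        if start + window >= len(words):
            break  # the last window already reached the end of the page
    return chunks
```

Because the id is derived from the URL, version, ordinal, and text, two runs over the same page can be compared with a plain equality check, which is what makes a chunk diff useful when debugging.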

Ships today:

  • Local page fixtures instead of live crawling
  • Deterministic word-window chunking
  • Chunk metadata with source URL, title, ordinal, token estimate, and version
  • Retrieval fixture evaluation for known queries
  • Ingestion success report
  • Cost report with page and chunk estimates
  • FastAPI health and reports endpoints
  • Local-only demo, tests, CI, package metadata, trust docs, production docs, and public boundary scan
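
As an illustration of what "retrieval fixture evaluation for known queries" can look like, the sketch below scores chunks by simple term overlap and checks whether the expected source URL appears in the top k results. The scoring method, the `evaluate` function, and the `query`/`expected_url`/`text`/`source_url`/`ordinal` keys are assumptions for this example and may not match the kit's own evaluator or schema.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercased alphanumeric terms; deliberately simple.
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, chunk_text):
    # Count how often each distinct query term occurs in the chunk.
    counts = Counter(tokenize(chunk_text))
    return sum(counts[term] for term in set(tokenize(query)))


def evaluate(chunks, fixtures, k=3):
    """Return a small report: did each known query retrieve its expected page?"""
    results = []
    for fixture in fixtures:
        # Tie-break on URL and ordinal so the ranking is deterministic.
        ranked = sorted(
            chunks,
            key=lambda c: (-score(fixture["query"], c["text"]),
                           c["source_url"], c["ordinal"]),
        )
        top_urls = [c["source_url"] for c in ranked[:k]]
        results.append({
            "query": fixture["query"],
            "expected_url": fixture["expected_url"],
            "hit": fixture["expected_url"] in top_urls,
        })
    hits = sum(1 for r in results if r["hit"])
    return {
        "k": k,
        "total": len(results),
        "hits": hits,
        "hit_rate": hits / len(results) if results else 0.0,
        "results": results,
    }
```

A lexical scorer like this is only a stand-in for a real retriever, but it is enough to catch chunking regressions locally: if a known query stops finding its page after a chunker change, the hit rate drops.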

Quickstart

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -e '.[dev]'
pytest
python examples/local-only-demo/demo_client.py

Run the FastAPI app:

uvicorn prodkit_rag_ingestion.app:app --reload

Read reports:

curl -s http://127.0.0.1:8000/reports

Inspect pages and chunks:

curl -s http://127.0.0.1:8000/pages
curl -s http://127.0.0.1:8000/chunks

Generate fixtures:

python scripts/generate_reports.py

Run the public boundary scan:

python scripts/scan_public_boundary.py

Output Files

  • examples/fixtures/chunks.jsonl
  • examples/fixtures/ingestion-success-report.json
  • examples/fixtures/retrieval-report.json
  • examples/fixtures/cost-report.json
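
For orientation, a cost report with page and chunk estimates could be derived from JSONL chunk records roughly as follows. This is a sketch under stated assumptions: the record keys (`source_url`, `token_estimate`), the report fields, and the `price_per_1k_tokens` parameter are hypothetical, and the price is a placeholder to be replaced with your provider's real rate. The actual files in `examples/fixtures/` may use a different shape.

```python
import json


def read_jsonl(lines):
    # JSONL: one JSON object per non-empty line.
    return [json.loads(line) for line in lines if line.strip()]


def build_cost_report(chunks, price_per_1k_tokens):
    """Summarize pages, chunks, and an estimated embedding cost."""
    pages = {c["source_url"] for c in chunks}
    total_tokens = sum(c["token_estimate"] for c in chunks)
    return {
        "pages": len(pages),
        "chunks": len(chunks),
        "avg_chunks_per_page": len(chunks) / len(pages) if pages else 0.0,
        "total_token_estimate": total_tokens,
        # Placeholder rate supplied by the caller (assumption, not a real price).
        "price_per_1k_tokens": price_per_1k_tokens,
        "estimated_cost": round(total_tokens / 1000 * price_per_1k_tokens, 6),
    }


def to_json(report):
    # Sorted keys keep the output stable, so report diffs stay readable.
    return json.dumps(report, indent=2, sort_keys=True)
```

Keeping every input to the estimate visible in the report (token total, rate, page and chunk counts) is what makes the number auditable before scaling up a crawl.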

What This Is Not

This is not a hosted crawler, a live scraping service, or a vector database. It is a local reference kit for testing the quality of web-to-RAG ingestion before connecting live crawling and indexing systems.

Production Guides

License

MIT
