Crawlee RAG Ingestion Production Kit

A local-first web-to-RAG ingestion quality kit for turning crawled pages into auditable chunks, retrieval fixtures, success reports, and cost reports.

Most crawling examples stop at extracted HTML or Markdown. This repo focuses on the handoff after extraction: deterministic chunks, metadata that survives debugging, retrieval checks, and transparent cost estimates before scaling.

Ships today:

Local page fixtures instead of live crawling
Deterministic word-window chunking
Chunk metadata with source URL, title, ordinal, token estimate, and version
Retrieval fixture evaluation for known queries
Ingestion success report
Cost report with page and chunk estimates
FastAPI health and reports endpoints
Local-only demo, tests, CI, package metadata, trust docs, production docs, and public boundary scan
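
To make the chunking and metadata items above concrete, here is a minimal sketch of deterministic word-window chunking. It is not the kit's actual implementation: the `Chunk` dataclass, the `chunk_page` function, the `window` and `overlap` parameters, the version label, and the token heuristic are all assumptions made for illustration.

```python
from dataclasses import dataclass

CHUNKER_VERSION = "example-v1"  # hypothetical version label, not the kit's


@dataclass(frozen=True)
class Chunk:
    source_url: str
    title: str
    ordinal: int
    text: str
    token_estimate: int
    version: str


def chunk_page(source_url, title, text, window=50, overlap=10):
    """Split text into fixed-size word windows.

    The same input always yields the same chunks, so a chunk can be
    traced back to its page by (source_url, ordinal, version).
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    words = text.split()
    step = window - overlap
    chunks = []
    for ordinal, start in enumerate(range(0, len(words), step)):
        piece = words[start:start + window]
        chunks.append(Chunk(
            source_url=source_url,
            title=title,
            ordinal=ordinal,
            text=" ".join(piece),
            # Rough heuristic (assumption): about 4 tokens per 3 words.
            token_estimate=(len(piece) * 4 + 2) // 3,
            version=CHUNKER_VERSION,
        ))
        if start + window >= len(words):
            break  # this window already covers the tail of the page
    return chunks
```

Calling `chunk_page` twice on the same page text returns equal chunk lists, which is what makes diffs between ingestion runs meaningful. Bumping the version label whenever `window` or `overlap` changes keeps old and new chunks distinguishable.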

Quickstart

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -e '.[dev]'
pytest
python examples/local-only-demo/demo_client.py

Run the FastAPI app:

uvicorn prodkit_rag_ingestion.app:app --reload

Read reports:

curl -s http://127.0.0.1:8000/reports

Inspect pages and chunks:

curl -s http://127.0.0.1:8000/pages
curl -s http://127.0.0.1:8000/chunks

Generate the report fixtures:

python scripts/generate_reports.py

Run the public boundary scan:

python scripts/scan_public_boundary.py

Output Files

examples/fixtures/chunks.jsonl
examples/fixtures/ingestion-success-report.json
examples/fixtures/retrieval-report.json
examples/fixtures/cost-report.json
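
As a rough illustration of how a cost report could be derived from a chunk file, the sketch below parses JSONL chunk records and totals pages, chunks, and estimated tokens. The field names (`source_url`, `token_estimate`), the report keys, and the price constant are assumptions for this example and may not match the files the kit writes.

```python
import json

# Hypothetical price (assumption); substitute your embedding provider's rate.
ASSUMED_USD_PER_1K_TOKENS = 0.0001


def load_chunks(jsonl_text):
    """Parse one JSON object per non-empty line of JSONL text."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def cost_report(chunks, usd_per_1k_tokens=ASSUMED_USD_PER_1K_TOKENS):
    """Summarise page count, chunk count, and an estimated embedding cost."""
    pages = {c["source_url"] for c in chunks}
    total_tokens = sum(c["token_estimate"] for c in chunks)
    return {
        "page_count": len(pages),
        "chunk_count": len(chunks),
        "estimated_tokens": total_tokens,
        "estimated_usd": round(total_tokens / 1000 * usd_per_1k_tokens, 6),
    }
```

Keeping the price as an explicit parameter means the estimate can be rerun against different providers without touching the chunk data.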

What This Is Not

This is not a hosted crawler, a live scraping service, or a vector database. It is a local reference kit for testing the quality of web-to-RAG ingestion before connecting live crawling and indexing systems.
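
One kind of quality check this implies is a retrieval fixture evaluation: for each known query, see whether the best-matching chunk comes from the expected page. The sketch below uses plain word overlap as a stand-in scorer so it runs without a vector database. The `tokenize`, `top_chunk`, and `evaluate` functions, the fixture keys (`query`, `expected_url`), and the scoring method are all assumptions, not the kit's actual evaluator.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_chunk(query, chunks):
    """Return the chunk sharing the most words with the query.

    Assumes chunks is non-empty. Ties go to the earliest chunk,
    which keeps the result deterministic.
    """
    q = tokenize(query)
    return max(chunks, key=lambda c: len(q & tokenize(c["text"])))


def evaluate(fixtures, chunks):
    """Score each fixture query as a hit or miss and report the hit rate."""
    results = []
    for fx in fixtures:
        best = top_chunk(fx["query"], chunks)
        results.append({
            "query": fx["query"],
            "expected_url": fx["expected_url"],
            "retrieved_url": best["source_url"],
            "hit": best["source_url"] == fx["expected_url"],
        })
    hits = sum(r["hit"] for r in results)
    return {"hit_rate": hits / len(results) if results else 0.0, "results": results}
```

Because the scorer is deterministic, a drop in hit rate after a chunking change points at the chunking change rather than at retrieval noise.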

Production Guides

License

MIT
