You are building a production-style data engineering reliability project.
Project name: Production Data Pipeline Reliability & Incident Response Lab
Primary goal: Build a realistic local data engineering reliability lab that simulates a DAG-based event pipeline, injects production-style failures, detects incidents, protects downstream data products, and produces runbooks/postmortems.
Large companies run critical pipelines for payments, ads, fraud, supply chain, finance, healthcare, and product analytics. In production, the challenge is not only whether the pipeline runs, but whether it stays reliable when its assumptions break.
This project must demonstrate the mindset of a production data engineer:
- identify assumptions
- break assumptions intentionally
- detect the failure
- contain the impact
- recover safely
- document the incident
- prevent recurrence
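As an illustration only, the loop above could look like the following minimal sketch. It assumes a single broken assumption ("event_id is unique within a batch"); the names `Event`, `inject_duplicates`, `duplicate_rate`, `contain`, and `recover`, and the 1% threshold, are hypothetical and chosen for the example, not prescribed by this brief.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A synthetic event; both fields are illustrative."""
    event_id: str
    amount_cents: int


def inject_duplicates(batch: list[Event], every: int) -> list[Event]:
    """Break the uniqueness assumption by re-emitting every Nth event."""
    out: list[Event] = []
    for i, event in enumerate(batch):
        out.append(event)
        if i % every == 0:
            out.append(event)  # deliberate duplicate
    return out


def duplicate_rate(batch: list[Event]) -> float:
    """Detect: fraction of rows whose event_id repeats an earlier row."""
    if not batch:
        return 0.0
    unique = len({e.event_id for e in batch})
    return (len(batch) - unique) / len(batch)


def contain(batch: list[Event], max_dup_rate: float) -> tuple[list[Event], list[Event]]:
    """Contain: return (published, quarantined); a failing batch is held back whole."""
    if duplicate_rate(batch) > max_dup_rate:
        return [], batch
    return batch, []


def recover(batch: list[Event]) -> list[Event]:
    """Recover: keep the first occurrence of each event_id, preserving order."""
    seen: set[str] = set()
    out: list[Event] = []
    for event in batch:
        if event.event_id not in seen:
            seen.add(event.event_id)
            out.append(event)
    return out


clean = [Event(f"evt-{i:04d}", 100 + i) for i in range(10)]
broken = inject_duplicates(clean, every=5)  # duplicates evt-0000 and evt-0005

published, quarantined = contain(broken, max_dup_rate=0.01)
repaired = recover(quarantined)
```

Quarantining the whole batch rather than dropping individual rows is one possible containment choice; it keeps downstream tables untouched until the recovery step has been reviewed.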
The system should answer:
“What happens when the pipeline runs wrong and nobody notices for six hours?”
Engineering requirements:
- Write clean, modular, production-style Python.
- Use Python 3.12.
- Use type hints.
- Use docstrings for public functions.
- Use structured logging.
- Add error handling.
- Use synthetic data only; never use real or sensitive data.
- Do not require external services in V0.1.
- Keep V0.1 deterministic and locally runnable.
- Simulate event streaming with deterministic micro-batches.
- Every failure scenario must have detection evidence.
- Every incident must include reason codes and a recommended action.
- Every recovery pattern should be documented.
- Every major pipeline stage must have tests.
- The README must stay public-facing and recruiter-friendly.
- The technical docs must be rigorous enough for senior data engineers.