This project simulates the failure modes that real data engineering teams face in production: duplicate retries, late-arriving events, schema drift, SLA breaches, corrupt payloads, unsafe backfills, missing batches, hot partitions, and downstream metric mismatches.
A basic ETL project asks: “Does the pipeline run?”
This project asks: “What happens when the pipeline runs wrong, nobody notices for six hours, and downstream teams already consumed the bad data?”
This project demonstrates the production data engineering mindset: do not assume the pipeline is correct — prove it, monitor it, break it intentionally, and design safe recovery.
What was built: a deterministic local reliability lab that generates 250k+ synthetic events, injects 14 production-style failures, runs a DAG-style pipeline, validates contracts, deduplicates retries, handles late data, quarantines unsafe records, detects incidents, writes reliability scorecards, and generates runbooks/postmortems.
Validation proof: 61 pytest tests passing, ruff check . passing, reproducible local pipeline commands, and JSON/CSV evidence files in data/scorecards/.
Large companies depend on data pipelines for payments, ads, fraud detection, product analytics, supply chain operations, finance reporting, healthcare claims, and machine learning features.
At production scale, the risk is not only pipeline failure. The bigger risk is silent pipeline failure:
- duplicates from unsafe retries
- late events corrupting daily metrics
- schema drift breaking downstream jobs
- missing batches passing unnoticed
- corrupted payloads entering trusted tables
- unsafe backfills overwriting correct data
- SLA breaches delaying dashboards or models
- downstream metrics changing unexpectedly
Build a production-style Data Engineering reliability lab that intentionally injects common pipeline failure scenarios, detects them, quarantines unsafe data, calculates reliability scorecards, creates incident timelines, and generates runbooks/postmortems.
Can this pipeline remain trustworthy when assumptions break?
Most ETL projects only prove happy-path transformation. This lab intentionally breaks production assumptions and then proves detection, containment, recovery evidence, and incident documentation.
flowchart LR
A["Synthetic Events"] --> B["Failure Injection"]
B --> C["Landing"]
C --> D["Data Contract Validation"]
D --> E["Bronze Load"]
E --> F["Idempotency Check"]
F --> G["Deduplication"]
G --> H["Watermarking"]
H --> I["Silver Transform"]
I --> J["Gold Aggregation"]
J --> K["Reconciliation"]
K --> L["Monitoring"]
L --> M["Incident Detection"]
M --> N["Runbooks/Postmortems"]
M --> O["API/Dashboard"]
flowchart TD
A["source_event_generation"] --> B["landing"]
B --> C["contract_validation"]
C --> D["bronze_load"]
D --> E["idempotency_check"]
E --> F["deduplication"]
F --> G["watermark_late_data_handling"]
G --> H["silver_transform"]
H --> I["gold_aggregation"]
I --> J["reconciliation"]
J --> K["monitoring"]
K --> L["incident_detection"]
L --> M["runbook_generation"]
M --> N["postmortem_generation"]
The lab injects controlled failures and records them in data/incidents/injected_failure_manifest.json.
| Failure scenario | Production risk | Detection evidence |
|---|---|---|
| Duplicate retry storm | Retries double-count transactions or events | Deduplication and idempotency reports |
| Late / out-of-order events | Daily metrics change after publishing | Watermark validation report |
| Schema drift new/missing columns | Downstream jobs break silently | Schema drift and contract reports |
| Corrupt payloads | Malformed records enter trusted tables | Contract validation and quarantine |
| Missing / partial batches | Incomplete source data looks valid | SLA, row-count, and incident reports |
| Hot partition skew | One key overloads processing | Monitoring and incident detection |
| Unsafe backfill overlap | Correct data is overwritten | Backfill safety report |
| Null spike | Required fields degrade suddenly | Data contract report |
| Downstream metric mismatch | Gold tables no longer reconcile | Downstream impact report |
| SLA breach | Dashboards or ML features are delayed | SLA breach and stage summary reports |
| Control | What it prevents | Evidence file |
|---|---|---|
| Data contracts | Schema, null, payload, and row-count violations | data_contract_validation_report.* |
| Idempotency checks | Unsafe replay and duplicate retries | idempotency_control_report.* |
| Deduplication | Gold double-counting | deduplication_accuracy_report.* |
| Watermarking | Late data corrupting trusted windows | watermark_validation_report.* |
| Quarantine | Bad records reaching silver/gold tables | data/quarantine/ |
| SLA monitoring | Missed operational deadlines | sla_breach_report.*, sla_stage_summary.* |
| Reconciliation | Metric mismatches between layers | downstream_impact_report.* |
| Backfill guard | Overlapping unsafe replay windows | backfill_safety_report.* |
The pipeline writes business-readable and reviewer-friendly evidence to data/scorecards/:
pipeline_reliability_scorecard.json/csv: summarizes reliability score, incident detection, SLA compliance, dedupe accuracy, contract pass rate, watermark policy, and reconciliation.incident_detection_accuracy_report.json/csv: compares injected failures with detected incidents and severity matching.incident_detection_report.json/csv: lists detected incident types, sources, severity, containment, recovery, and prevention actions.sla_breach_report.json/csvandsla_stage_summary.json/csv: show stage-level duration, p95 duration, breach rate, impacted tables, and linked incidents.deduplication_accuracy_report.json/csvandidempotency_control_report.json/csv: prove retry duplicates were detected and did not double-count gold metrics.watermark_validation_report.json/csv: proves late and out-of-order events are handled according to the configured policy.schema_drift_report.json/csv: shows missing/unexpected columns, affected sources, severity, and recommended action.backfill_safety_report.json/csv: proves unsafe overlapping backfills are blocked.data_contract_validation_report.json/csv: captures required-column, type, schema-version, null-rate, uniqueness, timestamp, row-count, value-range, and payload checks.downstream_impact_report.json/csv: explains affected tables, metrics, rows, containment, and recovery status.
Each incident includes a timeline with failure injection, detection, quarantine, alert creation, impact assessment, recovery start, recovery completion, and prevention action. The outputs are stored in:
data/incidents/incidents.csvdata/incidents/incident_timelines.csvdocs/runbooks/docs/postmortems/
Runbooks live in docs/runbooks/ and postmortems live in docs/postmortems/. Each incident includes symptoms, detection method, containment, recovery, validation, prevention, timeline, and follow-up actions.
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m src.data_generation.generate_reference_data
python -m src.data_generation.generate_events
python -m src.failures.inject_failures
python -m src.pipeline.run_all
python -m pytest
python -m ruff check .Current validation: 61 passed and ruff check . passes.
Launch the API:
python -m uvicorn src.api.main:app --reloadLaunch the dashboard:
python -m streamlit run src/dashboard/app.pyEndpoints include:
GET /healthGET /pipeline-summaryGET /pipeline-runsGET /incidentsGET /incidents/{incident_id}GET /scorecardsGET /sla-statusGET /data-contractsGET /quarantineGET /downstream-impactPOST /simulate-failurePOST /run-pipelinePOST /run-backfill
Example summary request:
curl http://127.0.0.1:8000/pipeline-summaryExample deterministic simulation request:
curl -X POST http://127.0.0.1:8000/simulate-failure \
-H "Content-Type: application/json" \
-d '{"failure_type": "duplicate_retry_storm"}'The Streamlit dashboard includes executive overview, DAG status, incident timeline, failure scenario detection, SLA monitoring, data contracts, deduplication/idempotency, late data/watermarking, quarantine, backfill safety, downstream impact, runbooks, and postmortems.
Screenshot capture guidance lives in docs/screenshots/. Recommended portfolio screenshots:
- Executive Overview
- Pipeline DAG Status
- Incident Timeline
- Failure Scenario Detection
- SLA Monitoring
- Data Contract Validation
- Deduplication & Idempotency
- Late Data / Watermarking
- Downstream Impact
- Runbooks & Postmortems
- Synthetic data only.
- Local DuckDB instead of cloud warehouse.
- Simulated micro-batches instead of Kafka/Spark/Flink.
- Deterministic incident detection rules.
- No cloud deployment.
- No authentication.
- No real pager/alerting integration.
- No OpenLineage/Marquez lineage integration yet.
- Kafka/Redpanda streaming ingestion.
- Spark Structured Streaming.
- Flink event-time processing.
- Airflow DAG orchestration.
- dbt models.
- Great Expectations.
- OpenLineage/Marquez.
- Snowflake/Databricks deployment.
- Prometheus/Grafana observability.
- Slack/PagerDuty alerting.
- Cloud deployment.
- Role-based access control.
Production data pipelines fail in ways that basic tutorials do not cover. Retries create duplicates, late data changes metrics, schemas drift, jobs miss SLAs, and bad data can silently flow into dashboards, finance reporting, fraud features, or operational systems.
Build a reliability lab that simulates realistic pipeline incidents and demonstrates how to detect, contain, recover, and document failures.
Created synthetic event pipelines, DAG-style processing, failure injection, data contract validation, idempotency checks, deduplication, watermarking, SLA monitoring, anomaly detection, quarantine handling, reliability scorecards, runbooks, postmortems, API endpoints, dashboards, tests, and CI/CD.
Produced a reproducible portfolio project that demonstrates production data engineering reliability patterns instead of only basic ETL execution.
V0.1: Working baseline. V0.2: Reliability evidence hardening.