Production Data Pipeline Reliability & Incident Response Lab

Executive Summary

This project simulates the failure modes that real data engineering teams face in production: duplicate retries, late-arriving events, schema drift, SLA breaches, corrupt payloads, unsafe backfills, missing batches, hot partitions, and downstream metric mismatches.

A basic ETL project asks: “Does the pipeline run?”

This project asks: “What happens when the pipeline runs wrong, nobody notices for six hours, and downstream teams already consumed the bad data?”

It demonstrates the production data engineering mindset: do not assume the pipeline is correct. Prove it, monitor it, break it intentionally, and design safe recovery.

What was built: a deterministic local reliability lab that generates 250k+ synthetic events, injects 14 production-style failures, runs a DAG-style pipeline, validates contracts, deduplicates retries, handles late data, quarantines unsafe records, detects incidents, writes reliability scorecards, and generates runbooks/postmortems.

Validation proof: 61 pytest tests passing, ruff check . passing, reproducible local pipeline commands, and JSON/CSV evidence files in data/scorecards/.

Business Problem

Large companies depend on data pipelines for payments, ads, fraud detection, product analytics, supply chain operations, finance reporting, healthcare claims, and machine learning features.

At production scale, the risk is not only pipeline failure. The bigger risk is silent pipeline failure:

duplicates from unsafe retries
late events corrupting daily metrics
schema drift breaking downstream jobs
missing batches passing unnoticed
corrupted payloads entering trusted tables
unsafe backfills overwriting correct data
SLA breaches delaying dashboards or models
downstream metrics changing unexpectedly

Project Goal

Build a production-style Data Engineering reliability lab that intentionally injects common pipeline failure scenarios, detects them, quarantines unsafe data, calculates reliability scorecards, creates incident timelines, and generates runbooks/postmortems.

Core Business Question

Can this pipeline remain trustworthy when assumptions break?

Why This Is Not A Generic ETL Project

Most ETL projects only prove happy-path transformation. This lab intentionally breaks production assumptions and then proves detection, containment, recovery evidence, and incident documentation.

Architecture

flowchart LR
    A["Synthetic Events"] --> B["Failure Injection"]
    B --> C["Landing"]
    C --> D["Data Contract Validation"]
    D --> E["Bronze Load"]
    E --> F["Idempotency Check"]
    F --> G["Deduplication"]
    G --> H["Watermarking"]
    H --> I["Silver Transform"]
    I --> J["Gold Aggregation"]
    J --> K["Reconciliation"]
    K --> L["Monitoring"]
    L --> M["Incident Detection"]
    M --> N["Runbooks/Postmortems"]
    M --> O["API/Dashboard"]

DAG Flow

flowchart TD
    A["source_event_generation"] --> B["landing"]
    B --> C["contract_validation"]
    C --> D["bronze_load"]
    D --> E["idempotency_check"]
    E --> F["deduplication"]
    F --> G["watermark_late_data_handling"]
    G --> H["silver_transform"]
    H --> I["gold_aggregation"]
    I --> J["reconciliation"]
    J --> K["monitoring"]
    K --> L["incident_detection"]
    L --> M["runbook_generation"]
    M --> N["postmortem_generation"]
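
The stage ordering above can be expressed as a dependency graph and resolved with the standard library. This is a minimal sketch, not the project's actual runner: the stage names come from the DAG, while `execution_order`, `run_pipeline`, and the `handlers` mapping are hypothetical names introduced for illustration.

```python
from graphlib import TopologicalSorter

# Stage names taken from the DAG above, listed in flow order.
STAGES = [
    "source_event_generation",
    "landing",
    "contract_validation",
    "bronze_load",
    "idempotency_check",
    "deduplication",
    "watermark_late_data_handling",
    "silver_transform",
    "gold_aggregation",
    "reconciliation",
    "monitoring",
    "incident_detection",
    "runbook_generation",
    "postmortem_generation",
]

# Map each stage to the set of stages it depends on (a linear chain here).
DEPENDENCIES = {STAGES[0]: set()}
for previous, current in zip(STAGES, STAGES[1:]):
    DEPENDENCIES[current] = {previous}


def execution_order(dependencies):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def run_pipeline(dependencies, handlers):
    """Run handlers in dependency order and stop at the first failing stage.

    `handlers` maps a stage name to a zero-argument callable returning True
    on success. Stages without a handler are treated as successful no-ops.
    Returns (completed_stages, failed_stage_or_None).
    """
    completed = []
    for stage in execution_order(dependencies):
        handler = handlers.get(stage, lambda: True)
        if not handler():
            return completed, stage
        completed.append(stage)
    return completed, None
```

Stopping at the first failed stage keeps downstream stages such as `gold_aggregation` from consuming output that an upstream check rejected.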

Failure Scenarios

The lab injects controlled failures and records them in data/incidents/injected_failure_manifest.json.

Failure scenario	Production risk	Detection evidence
Duplicate retry storm	Retries double-count transactions or events	Deduplication and idempotency reports
Late / out-of-order events	Daily metrics change after publishing	Watermark validation report
Schema drift (new/missing columns)	Downstream jobs break silently	Schema drift and contract reports
Corrupt payloads	Malformed records enter trusted tables	Contract validation and quarantine
Missing / partial batches	Incomplete source data looks valid	SLA, row-count, and incident reports
Hot partition skew	One key overloads processing	Monitoring and incident detection
Unsafe backfill overlap	Correct data is overwritten	Backfill safety report
Null spike	Required fields degrade suddenly	Data contract report
Downstream metric mismatch	Gold tables no longer reconcile	Downstream impact report
SLA breach	Dashboards or ML features are delayed	SLA breach and stage summary reports
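
To make the first row of the table concrete, the sketch below injects a seeded duplicate retry storm and then checks that every injected duplicate is detected. It assumes `event_id` acts as the idempotency key and that the manifest is a simple list of IDs. The function names and record layout are illustrative and are not taken from `src.failures.inject_failures`.

```python
import random


def make_events(count):
    """Build a small synthetic event list with unique event IDs."""
    return [{"event_id": f"evt-{i:05d}", "amount": 10 + i} for i in range(count)]


def inject_duplicate_retries(events, retry_rate, seed):
    """Duplicate a seeded random subset of events, as an unsafe retry would.

    Returns the corrupted stream and a manifest of the duplicated event IDs.
    A fixed seed makes the injection reproducible across runs.
    """
    rng = random.Random(seed)
    corrupted, manifest = [], []
    for event in events:
        corrupted.append(event)
        if rng.random() < retry_rate:
            corrupted.append(dict(event))  # the retried copy
            manifest.append(event["event_id"])
    return corrupted, manifest


def detect_duplicates(events):
    """Return the event IDs that appear more than once, in arrival order."""
    seen, duplicates = set(), []
    for event in events:
        if event["event_id"] in seen:
            duplicates.append(event["event_id"])
        else:
            seen.add(event["event_id"])
    return duplicates


if __name__ == "__main__":
    stream, manifest = inject_duplicate_retries(make_events(1000), 0.1, seed=42)
    detected = detect_duplicates(stream)
    print(f"injected={len(manifest)} detected={len(detected)}")
```

Comparing the detection output with the injection manifest is what lets a lab like this report detection accuracy rather than just a count of flagged rows.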

Reliability Controls

Control	What it prevents	Evidence file
Data contracts	Schema, null, payload, and row-count violations	`data_contract_validation_report.*`
Idempotency checks	Unsafe replay and duplicate retries	`idempotency_control_report.*`
Deduplication	Gold double-counting	`deduplication_accuracy_report.*`
Watermarking	Late data corrupting trusted windows	`watermark_validation_report.*`
Quarantine	Bad records reaching silver/gold tables	`data/quarantine/`
SLA monitoring	Missed operational deadlines	`sla_breach_report.*`, `sla_stage_summary.*`
Reconciliation	Metric mismatches between layers	`downstream_impact_report.*`
Backfill guard	Overlapping unsafe replay windows	`backfill_safety_report.*`
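
The watermarking control can be illustrated with a short sketch. It assumes one common policy: the watermark trails the maximum event time seen so far by a fixed allowed lateness, and events older than the watermark are set aside as late. The policy this lab actually configures may differ, and `classify_by_watermark` and the `event_time` field are hypothetical names.

```python
from datetime import datetime, timedelta


def classify_by_watermark(events, allowed_lateness):
    """Split events, processed in arrival order, into on-time and late.

    The watermark is the maximum event time observed so far minus
    `allowed_lateness`. An event whose event time falls before the current
    watermark is classified as late; everything else is on time.
    """
    max_event_time = None
    on_time, late = [], []
    for event in events:
        event_time = event["event_time"]
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - allowed_lateness
        if event_time < watermark:
            late.append(event)
        else:
            on_time.append(event)
    return on_time, late


if __name__ == "__main__":
    base = datetime(2024, 1, 1, 12, 0)
    arrivals = [
        {"id": 1, "event_time": base},
        {"id": 2, "event_time": base + timedelta(minutes=10)},
        {"id": 3, "event_time": base + timedelta(minutes=8)},  # out of order, tolerated
        {"id": 4, "event_time": base + timedelta(minutes=1)},  # beyond allowed lateness
    ]
    on_time, late = classify_by_watermark(arrivals, timedelta(minutes=5))
    print([e["id"] for e in on_time], [e["id"] for e in late])
```

Routing late events to a separate path, instead of dropping them or merging them silently, is what keeps already-published windows stable while still leaving the records available for a controlled correction.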

Evidence Generated by the Pipeline

The pipeline writes business-readable and reviewer-friendly evidence to data/scorecards/:

pipeline_reliability_scorecard.json/csv: summarizes reliability score, incident detection, SLA compliance, dedupe accuracy, contract pass rate, watermark policy, and reconciliation.
incident_detection_accuracy_report.json/csv: compares injected failures with detected incidents and severity matching.
incident_detection_report.json/csv: lists detected incident types, sources, severity, containment, recovery, and prevention actions.
sla_breach_report.json/csv and sla_stage_summary.json/csv: show stage-level duration, p95 duration, breach rate, impacted tables, and linked incidents.
deduplication_accuracy_report.json/csv and idempotency_control_report.json/csv: prove retry duplicates were detected and did not double-count gold metrics.
watermark_validation_report.json/csv: proves late and out-of-order events are handled according to the configured policy.
schema_drift_report.json/csv: shows missing/unexpected columns, affected sources, severity, and recommended action.
backfill_safety_report.json/csv: proves unsafe overlapping backfills are blocked.
data_contract_validation_report.json/csv: captures required-column, type, schema-version, null-rate, uniqueness, timestamp, row-count, value-range, and payload checks.
downstream_impact_report.json/csv: explains affected tables, metrics, rows, containment, and recovery status.
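
Two of the contract checks listed above, required columns and null rate, can be sketched as follows. This is an illustrative reduction under stated assumptions: the report fields, the `max_null_rate` threshold, and the `validate_contract` name are hypothetical and do not reflect the schema of `data_contract_validation_report`.

```python
def validate_contract(rows, required_columns, max_null_rate):
    """Check rows against a minimal contract and quarantine violations.

    A row is quarantined when a required column is missing or null.
    The report also flags columns whose null rate exceeds `max_null_rate`.
    Returns (valid_rows, quarantined_rows, report).
    """
    valid, quarantined = [], []
    null_counts = {column: 0 for column in required_columns}

    for row in rows:
        missing = [c for c in required_columns if c not in row]
        nulls = [c for c in required_columns if c in row and row[c] is None]
        for column in nulls:
            null_counts[column] += 1
        if missing or nulls:
            quarantined.append({"row": row, "missing": missing, "nulls": nulls})
        else:
            valid.append(row)

    total = len(rows)
    null_rates = {
        column: (null_counts[column] / total if total else 0.0)
        for column in required_columns
    }
    report = {
        "rows_checked": total,
        "rows_quarantined": len(quarantined),
        "null_rates": null_rates,
        "null_rate_breaches": sorted(
            column for column, rate in null_rates.items() if rate > max_null_rate
        ),
    }
    return valid, quarantined, report


if __name__ == "__main__":
    sample = [
        {"event_id": "a", "amount": 1.0},
        {"event_id": "b", "amount": None},  # null in a required column
        {"event_id": "c"},                  # required column missing
        {"event_id": "d", "amount": 2.0},
    ]
    valid, quarantined, report = validate_contract(sample, ["event_id", "amount"], 0.2)
    print(report)
```

Keeping the rejected rows together with the reason they failed is what makes a quarantine directory usable as evidence later.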

Incident Lifecycle

Each incident includes a timeline with failure injection, detection, quarantine, alert creation, impact assessment, recovery start, recovery completion, and prevention action. The outputs are stored in:

data/incidents/incidents.csv
data/incidents/incident_timelines.csv
docs/runbooks/
docs/postmortems/
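
The lifecycle phases named above lend themselves to simple timing metrics. The sketch below assumes a timeline is a mapping from phase name to timestamp. The phase identifiers and `summarize_timeline` are hypothetical and are not the column names used in `incident_timelines.csv`.

```python
from datetime import datetime, timedelta

# Phase order follows the lifecycle described above; identifiers are assumed.
PHASES = [
    "failure_injected",
    "detected",
    "quarantined",
    "alert_created",
    "impact_assessed",
    "recovery_started",
    "recovery_completed",
    "prevention_action",
]


def summarize_timeline(timeline):
    """Validate phase ordering and compute detection and recovery durations.

    `timeline` maps each phase name to a datetime. Raises ValueError when a
    phase is missing, so incomplete incidents are not summarized silently.
    """
    missing = [phase for phase in PHASES if phase not in timeline]
    if missing:
        raise ValueError(f"timeline is missing phases: {missing}")

    times = [timeline[phase] for phase in PHASES]
    in_order = all(earlier <= later for earlier, later in zip(times, times[1:]))

    detect = timeline["detected"] - timeline["failure_injected"]
    recover = timeline["recovery_completed"] - timeline["detected"]
    return {
        "in_order": in_order,
        "minutes_to_detect": detect.total_seconds() / 60,
        "minutes_to_recover": recover.total_seconds() / 60,
    }


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    example = {
        phase: start + timedelta(minutes=5 * index)
        for index, phase in enumerate(PHASES)
    }
    print(summarize_timeline(example))
```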

Runbooks And Postmortems

Runbooks live in docs/runbooks/ and postmortems live in docs/postmortems/. Each document covers the incident's symptoms, detection method, containment, recovery, validation, prevention, timeline, and follow-up actions.

Quickstart

python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m src.data_generation.generate_reference_data
python -m src.data_generation.generate_events
python -m src.failures.inject_failures
python -m src.pipeline.run_all
python -m pytest
python -m ruff check .

Current validation: 61 passed and ruff check . passes.

Launch the API:

python -m uvicorn src.api.main:app --reload

Launch the dashboard:

python -m streamlit run src/dashboard/app.py

API

Endpoints include:

GET /health
GET /pipeline-summary
GET /pipeline-runs
GET /incidents
GET /incidents/{incident_id}
GET /scorecards
GET /sla-status
GET /data-contracts
GET /quarantine
GET /downstream-impact
POST /simulate-failure
POST /run-pipeline
POST /run-backfill

Example summary request:

curl http://127.0.0.1:8000/pipeline-summary

Example deterministic simulation request:

curl -X POST http://127.0.0.1:8000/simulate-failure \
  -H "Content-Type: application/json" \
  -d '{"failure_type": "duplicate_retry_storm"}'
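
The same simulation request can be built from Python with only the standard library. This sketch mirrors the curl call above. The helper names are hypothetical, and the assumption that the endpoint returns a JSON body is not confirmed by this README.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # local uvicorn address from the curl example


def build_simulate_failure_request(failure_type, base_url=BASE_URL):
    """Build the POST /simulate-failure request without sending it."""
    body = json.dumps({"failure_type": failure_type}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/simulate-failure",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the API to be running locally (see "Launch the API" above).
    request = build_simulate_failure_request("duplicate_retry_storm")
    print(send(request))
```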

Dashboard

The Streamlit dashboard includes executive overview, DAG status, incident timeline, failure scenario detection, SLA monitoring, data contracts, deduplication/idempotency, late data/watermarking, quarantine, backfill safety, downstream impact, runbooks, and postmortems.

Screenshots

Screenshot capture guidance lives in docs/screenshots/. Recommended portfolio screenshots:

Executive Overview
Pipeline DAG Status
Incident Timeline
Failure Scenario Detection
SLA Monitoring
Data Contract Validation
Deduplication & Idempotency
Late Data / Watermarking
Downstream Impact
Runbooks & Postmortems

Known Limitations

Synthetic data only.
Local DuckDB instead of cloud warehouse.
Simulated micro-batches instead of Kafka/Spark/Flink.
Deterministic incident detection rules.
No cloud deployment.
No authentication.
No real pager/alerting integration.
No OpenLineage/Marquez lineage integration yet.

Future Enhancements

Kafka/Redpanda streaming ingestion.
Spark Structured Streaming.
Flink event-time processing.
Airflow DAG orchestration.
dbt models.
Great Expectations.
OpenLineage/Marquez.
Snowflake/Databricks deployment.
Prometheus/Grafana observability.
Slack/PagerDuty alerting.
Cloud deployment.
Role-based access control.

STAR Story

Situation

Production data pipelines fail in ways that basic tutorials do not cover. Retries create duplicates, late data changes metrics, schemas drift, jobs miss SLAs, and bad data can silently flow into dashboards, finance reporting, fraud features, or operational systems.

Task

Build a reliability lab that simulates realistic pipeline incidents and demonstrates how to detect, contain, recover, and document failures.

Action

Created synthetic event pipelines, DAG-style processing, failure injection, data contract validation, idempotency checks, deduplication, watermarking, SLA monitoring, anomaly detection, quarantine handling, reliability scorecards, runbooks, postmortems, API endpoints, dashboards, tests, and CI/CD.

Result

Produced a reproducible portfolio project that demonstrates production data engineering reliability patterns instead of only basic ETL execution.

Project Status

V0.1: Working baseline. V0.2: Reliability evidence hardening.
