Skip to content

mohilamin/data-pipeline-reliability-incident-lab

Repository files navigation

Production Data Pipeline Reliability & Incident Response Lab

Python 3.12 Tests Ruff FastAPI Streamlit DuckDB Docker

Executive Summary

This project simulates the failure modes that real data engineering teams face in production: duplicate retries, late-arriving events, schema drift, SLA breaches, corrupt payloads, unsafe backfills, missing batches, hot partitions, and downstream metric mismatches.

A basic ETL project asks: “Does the pipeline run?”

This project asks: “What happens when the pipeline runs wrong, nobody notices for six hours, and downstream teams already consumed the bad data?”

This project demonstrates the production data engineering mindset: do not assume the pipeline is correct — prove it, monitor it, break it intentionally, and design safe recovery.

What was built: a deterministic local reliability lab that generates 250k+ synthetic events, injects 14 production-style failures, runs a DAG-style pipeline, validates contracts, deduplicates retries, handles late data, quarantines unsafe records, detects incidents, writes reliability scorecards, and generates runbooks/postmortems.

Validation proof: 61 pytest tests passing, ruff check . passing, reproducible local pipeline commands, and JSON/CSV evidence files in data/scorecards/.

Business Problem

Large companies depend on data pipelines for payments, ads, fraud detection, product analytics, supply chain operations, finance reporting, healthcare claims, and machine learning features.

At production scale, the risk is not only pipeline failure. The bigger risk is silent pipeline failure:

  • duplicates from unsafe retries
  • late events corrupting daily metrics
  • schema drift breaking downstream jobs
  • missing batches passing unnoticed
  • corrupted payloads entering trusted tables
  • unsafe backfills overwriting correct data
  • SLA breaches delaying dashboards or models
  • downstream metrics changing unexpectedly

Project Goal

Build a production-style Data Engineering reliability lab that intentionally injects common pipeline failure scenarios, detects them, quarantines unsafe data, calculates reliability scorecards, creates incident timelines, and generates runbooks/postmortems.

Core Business Question

Can this pipeline remain trustworthy when assumptions break?

Why This Is Not A Generic ETL Project

Most ETL projects only prove happy-path transformation. This lab intentionally breaks production assumptions and then proves detection, containment, recovery evidence, and incident documentation.

Architecture

flowchart LR
    A["Synthetic Events"] --> B["Failure Injection"]
    B --> C["Landing"]
    C --> D["Data Contract Validation"]
    D --> E["Bronze Load"]
    E --> F["Idempotency Check"]
    F --> G["Deduplication"]
    G --> H["Watermarking"]
    H --> I["Silver Transform"]
    I --> J["Gold Aggregation"]
    J --> K["Reconciliation"]
    K --> L["Monitoring"]
    L --> M["Incident Detection"]
    M --> N["Runbooks/Postmortems"]
    M --> O["API/Dashboard"]
Loading

DAG Flow

flowchart TD
    A["source_event_generation"] --> B["landing"]
    B --> C["contract_validation"]
    C --> D["bronze_load"]
    D --> E["idempotency_check"]
    E --> F["deduplication"]
    F --> G["watermark_late_data_handling"]
    G --> H["silver_transform"]
    H --> I["gold_aggregation"]
    I --> J["reconciliation"]
    J --> K["monitoring"]
    K --> L["incident_detection"]
    L --> M["runbook_generation"]
    M --> N["postmortem_generation"]
Loading

Failure Scenarios

The lab injects controlled failures and records them in data/incidents/injected_failure_manifest.json.

Failure scenario Production risk Detection evidence
Duplicate retry storm Retries double-count transactions or events Deduplication and idempotency reports
Late / out-of-order events Daily metrics change after publishing Watermark validation report
Schema drift new/missing columns Downstream jobs break silently Schema drift and contract reports
Corrupt payloads Malformed records enter trusted tables Contract validation and quarantine
Missing / partial batches Incomplete source data looks valid SLA, row-count, and incident reports
Hot partition skew One key overloads processing Monitoring and incident detection
Unsafe backfill overlap Correct data is overwritten Backfill safety report
Null spike Required fields degrade suddenly Data contract report
Downstream metric mismatch Gold tables no longer reconcile Downstream impact report
SLA breach Dashboards or ML features are delayed SLA breach and stage summary reports

Reliability Controls

Control What it prevents Evidence file
Data contracts Schema, null, payload, and row-count violations data_contract_validation_report.*
Idempotency checks Unsafe replay and duplicate retries idempotency_control_report.*
Deduplication Gold double-counting deduplication_accuracy_report.*
Watermarking Late data corrupting trusted windows watermark_validation_report.*
Quarantine Bad records reaching silver/gold tables data/quarantine/
SLA monitoring Missed operational deadlines sla_breach_report.*, sla_stage_summary.*
Reconciliation Metric mismatches between layers downstream_impact_report.*
Backfill guard Overlapping unsafe replay windows backfill_safety_report.*

Evidence Generated by the Pipeline

The pipeline writes business-readable and reviewer-friendly evidence to data/scorecards/:

  • pipeline_reliability_scorecard.json/csv: summarizes reliability score, incident detection, SLA compliance, dedupe accuracy, contract pass rate, watermark policy, and reconciliation.
  • incident_detection_accuracy_report.json/csv: compares injected failures with detected incidents and severity matching.
  • incident_detection_report.json/csv: lists detected incident types, sources, severity, containment, recovery, and prevention actions.
  • sla_breach_report.json/csv and sla_stage_summary.json/csv: show stage-level duration, p95 duration, breach rate, impacted tables, and linked incidents.
  • deduplication_accuracy_report.json/csv and idempotency_control_report.json/csv: prove retry duplicates were detected and did not double-count gold metrics.
  • watermark_validation_report.json/csv: proves late and out-of-order events are handled according to the configured policy.
  • schema_drift_report.json/csv: shows missing/unexpected columns, affected sources, severity, and recommended action.
  • backfill_safety_report.json/csv: proves unsafe overlapping backfills are blocked.
  • data_contract_validation_report.json/csv: captures required-column, type, schema-version, null-rate, uniqueness, timestamp, row-count, value-range, and payload checks.
  • downstream_impact_report.json/csv: explains affected tables, metrics, rows, containment, and recovery status.

Incident Lifecycle

Each incident includes a timeline with failure injection, detection, quarantine, alert creation, impact assessment, recovery start, recovery completion, and prevention action. The outputs are stored in:

  • data/incidents/incidents.csv
  • data/incidents/incident_timelines.csv
  • docs/runbooks/
  • docs/postmortems/

Runbooks And Postmortems

Runbooks live in docs/runbooks/ and postmortems live in docs/postmortems/. Each incident includes symptoms, detection method, containment, recovery, validation, prevention, timeline, and follow-up actions.

Quickstart

python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m src.data_generation.generate_reference_data
python -m src.data_generation.generate_events
python -m src.failures.inject_failures
python -m src.pipeline.run_all
python -m pytest
python -m ruff check .

Current validation: 61 passed and ruff check . passes.

Launch the API:

python -m uvicorn src.api.main:app --reload

Launch the dashboard:

python -m streamlit run src/dashboard/app.py

API

Endpoints include:

  • GET /health
  • GET /pipeline-summary
  • GET /pipeline-runs
  • GET /incidents
  • GET /incidents/{incident_id}
  • GET /scorecards
  • GET /sla-status
  • GET /data-contracts
  • GET /quarantine
  • GET /downstream-impact
  • POST /simulate-failure
  • POST /run-pipeline
  • POST /run-backfill

Example summary request:

curl http://127.0.0.1:8000/pipeline-summary

Example deterministic simulation request:

curl -X POST http://127.0.0.1:8000/simulate-failure \
  -H "Content-Type: application/json" \
  -d '{"failure_type": "duplicate_retry_storm"}'

Dashboard

The Streamlit dashboard includes executive overview, DAG status, incident timeline, failure scenario detection, SLA monitoring, data contracts, deduplication/idempotency, late data/watermarking, quarantine, backfill safety, downstream impact, runbooks, and postmortems.

Screenshots

Screenshot capture guidance lives in docs/screenshots/. Recommended portfolio screenshots:

  • Executive Overview
  • Pipeline DAG Status
  • Incident Timeline
  • Failure Scenario Detection
  • SLA Monitoring
  • Data Contract Validation
  • Deduplication & Idempotency
  • Late Data / Watermarking
  • Downstream Impact
  • Runbooks & Postmortems

Known Limitations

  • Synthetic data only.
  • Local DuckDB instead of cloud warehouse.
  • Simulated micro-batches instead of Kafka/Spark/Flink.
  • Deterministic incident detection rules.
  • No cloud deployment.
  • No authentication.
  • No real pager/alerting integration.
  • No OpenLineage/Marquez lineage integration yet.

Future Enhancements

  • Kafka/Redpanda streaming ingestion.
  • Spark Structured Streaming.
  • Flink event-time processing.
  • Airflow DAG orchestration.
  • dbt models.
  • Great Expectations.
  • OpenLineage/Marquez.
  • Snowflake/Databricks deployment.
  • Prometheus/Grafana observability.
  • Slack/PagerDuty alerting.
  • Cloud deployment.
  • Role-based access control.

STAR Story

Situation

Production data pipelines fail in ways that basic tutorials do not cover. Retries create duplicates, late data changes metrics, schemas drift, jobs miss SLAs, and bad data can silently flow into dashboards, finance reporting, fraud features, or operational systems.

Task

Build a reliability lab that simulates realistic pipeline incidents and demonstrates how to detect, contain, recover, and document failures.

Action

Created synthetic event pipelines, DAG-style processing, failure injection, data contract validation, idempotency checks, deduplication, watermarking, SLA monitoring, anomaly detection, quarantine handling, reliability scorecards, runbooks, postmortems, API endpoints, dashboards, tests, and CI/CD.

Result

Produced a reproducible portfolio project that demonstrates production data engineering reliability patterns instead of only basic ETL execution.

Project Status

V0.1: Working baseline. V0.2: Reliability evidence hardening.

About

Production-style data pipeline reliability lab with failure injection, data contracts, idempotency, SLA monitoring, incident detection, runbooks, and postmortems.

Topics

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages