Distributed Scrape Pipeline

An end-to-end production-grade platform for large-scale web data ingestion, ETL processing, natural language querying, and automated deployment

Full Architecture Walkthrough — 1 Hour Deep Dive

▶ Watch on YouTube — A complete technical breakdown of every layer of this platform: ingestion design, ETL pipeline internals, Kafka integration, LangChain SQL agent, CI/CD pipeline, and production deployment trade-offs. Highly recommended before diving into the codebase.


Overview
Architecture
Core Components
CI/CD Pipeline
Technology Stack
Getting Started
Repository Structure
Acknowledgements

Overview

This platform is a production-deployed, end-to-end distributed data engineering system built for large-scale web data ingestion, structured processing, and governed data access. Developed over approximately 5–6 months, it covers every layer of the modern data engineering lifecycle: from raw seed URL ingestion, through a structured multi-stage ETL pipeline, to a natural language query interface backed by an LLM SQL agent.

The system is deployed on a VPS with domain-level isolation, NGINX reverse proxying, SSL termination, and fully automated CI/CD via GitHub Actions — with zero manual intervention required post-merge.

Design Principles

Fault isolation — Celery workers, FastAPI services, and ingestion infrastructure are independently deployable and failure-contained.
Schema consistency — The ETL pipeline enforces data quality guarantees and idempotent processing at each stage before persistence.
Observability-first — Flower task monitoring, an analytics dashboard, and Kafka log streaming provide full operational visibility.
Change-aware automation — CI/CD pipelines detect directory-level changes and trigger only the affected service workflow, minimizing blast radius.
Governed data access — The LLM SQL agent enforces schema constraints and query boundaries, preventing arbitrary execution against production data.

Architecture

For a full spoken walkthrough of every component, trade-off, and design decision, watch the 1-hour YouTube breakdown linked at the top of this README.

The platform is structured across six distinct layers: Ingestion → ETL → Storage → Streaming → Access → Deployment. Each layer is independently deployable, failure-isolated, and observable. The sections below cover each in detail.

Core Components

1. Ingestion Layer

The ingestion subsystem accepts client-provided seed URLs and distributes scraping workloads across a pool of Celery workers, backed by Redis as the message broker and task state store.

Key characteristics:

Parallel collection — Multiple workers operate concurrently with configurable concurrency limits per queue.
Controlled retries — Tasks that fail due to transient errors (timeouts, proxy blocks) are automatically re-queued with exponential back-off policies.
Fault isolation — Worker failures do not propagate; tasks are returned to the queue and redistributed.
Selenium-based scraping — Each worker spawns isolated Selenium sessions to handle JavaScript-rendered pages, dynamic content, and session management.
Proxy rotation — Requests are routed through a rotating proxy pool, preventing IP-level blocking and maintaining ingestion stability under varying source constraints.
Ingestion control — Redis coordinates rate limiting and deduplication signals across the worker fleet to prevent redundant work.
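As a rough illustration of the retry policy described above, the sketch below computes a capped exponential back-off delay with full jitter. The helper name `backoff_delay`, the base delay, and the cap are assumptions chosen for illustration, not values taken from this repository. In Celery, comparable behaviour is usually configured on the task itself through the `autoretry_for`, `retry_backoff`, `retry_backoff_max`, and `retry_jitter` options.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 300.0,
                  jitter: bool = True,
                  rng: Optional[random.Random] = None) -> float:
    """Return the wait in seconds before retry number `attempt` (0-based).

    Hypothetical helper: `base` and `cap` are illustrative defaults.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # Double the delay on every attempt, but never exceed the cap.
    ceiling = min(cap, base * (2 ** attempt))
    if not jitter:
        return ceiling
    # Full jitter: draw uniformly from [0, ceiling] so that many workers
    # hitting the same failing source do not all retry at the same moment.
    source = rng if rng is not None else random
    return source.uniform(0.0, ceiling)
```

Without jitter the delays grow as 2, 4, 8, 16 seconds and so on until they reach the cap. With jitter enabled, each delay is a random value below that ceiling.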

2. ETL Pipeline

Raw scraped data flows through a five-stage structured ETL pipeline before persisting to PostgreSQL. Each stage is independently testable and designed for idempotent re-execution.

Stage	Responsibility
Cleanse	Remove null fields, strip HTML artifacts, normalize whitespace and encoding
Transform	Apply schema mappings, type coercions, and structural normalization
Validate	Enforce schema contracts, range constraints, and mandatory field presence
Enrich	Augment records with derived fields, computed metadata, and cross-references
Load	Idempotent upsert into PostgreSQL, preventing duplicate writes on retry

Data quality guarantees are enforced at the validation stage, with records failing validation routed to a dead-letter store for inspection rather than silently dropped.
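The stage sequence and dead-letter routing described above can be sketched as a chain of pure functions. This is a minimal sketch, not the repository's pipeline code: the field names (`url`, `price`, `price_band`), the validation rules, and the in-memory dead-letter list are all assumptions, and the Load stage is represented by a plain list.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


class ValidationError(Exception):
    """Raised when a record violates the (hypothetical) schema contract."""


def cleanse(rec: Record) -> Record:
    # Drop null fields and collapse runs of whitespace in string values.
    return {k: " ".join(v.split()) if isinstance(v, str) else v
            for k, v in rec.items() if v is not None}


def transform(rec: Record) -> Record:
    # Coerce the assumed `price` field from text such as "12,500" to a float.
    out = dict(rec)
    if "price" in out:
        out["price"] = float(str(out["price"]).replace(",", ""))
    return out


def validate(rec: Record) -> Record:
    # Mandatory field presence plus a simple range constraint.
    for name in ("url", "price"):
        if name not in rec:
            raise ValidationError(f"missing field: {name}")
    if rec["price"] <= 0:
        raise ValidationError("price must be positive")
    return rec


def enrich(rec: Record) -> Record:
    # Add a derived field; the threshold is purely illustrative.
    band = "high" if rec["price"] >= 50_000 else "standard"
    return {**rec, "price_band": band}


STAGES = (cleanse, transform, validate, enrich)


@dataclass
class PipelineResult:
    loaded: List[Record] = field(default_factory=list)
    dead_letters: List[Record] = field(default_factory=list)


def run_pipeline(records: Iterable[Record]) -> PipelineResult:
    result = PipelineResult()
    for raw in records:
        rec = raw
        try:
            for stage in STAGES:
                rec = stage(rec)
        except (ValidationError, ValueError) as exc:
            # Failed records are kept for inspection instead of being dropped.
            result.dead_letters.append({"record": raw, "error": str(exc)})
            continue
        result.loaded.append(rec)  # stand-in for the Load stage
    return result
```

Because each stage returns a new record and has no side effects, re-running the chain on the same input gives the same output, which is what makes stage-level re-execution safe.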

3. Storage Layer

PostgreSQL — Primary storage for all processed and enriched datasets. Schema migrations are versioned and applied as part of the deployment pipeline.
Redis — Dual-purpose: acts as the Celery message broker and provides ephemeral storage for task state, ingestion coordination signals, and deduplication keys.
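An idempotent load is commonly written as an `INSERT ... ON CONFLICT ... DO UPDATE` statement. The sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in so that it runs without a server. PostgreSQL accepts the same `ON CONFLICT` clause, although the placeholder style depends on the driver. The `listings` table and its columns are hypothetical.

```python
import sqlite3

# Upsert keyed on the natural key (url): a retry overwrites the same row
# instead of inserting a duplicate.
UPSERT_SQL = """
INSERT INTO listings (url, title, price)
VALUES (?, ?, ?)
ON CONFLICT (url) DO UPDATE SET
    title = excluded.title,
    price = excluded.price
"""


def load(conn: sqlite3.Connection, rows) -> None:
    # The connection context manager commits on success and rolls back on error.
    with conn:
        conn.executemany(UPSERT_SQL, rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (url TEXT PRIMARY KEY, title TEXT, price REAL)"
)

batch = [("https://example.com/a", "Sedan", 12500.0)]
load(conn, batch)
load(conn, batch)  # simulated retry of the same batch

count = conn.execute("SELECT COUNT(*) FROM listings").fetchone()[0]
```

After both calls the table still holds a single row for the URL, so a retried task cannot produce duplicate writes.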

4. Log Streaming — Apache Kafka

All platform-level operational logs — task lifecycle events, ETL stage transitions, error records, and pipeline health signals — are published to an Apache Kafka broker. This decouples log production from log consumption and enables:

Real-time log aggregation across distributed workers.
Durable audit trails for compliance and post-incident analysis.
Downstream consumers — The analytics dashboard and alerting systems consume from Kafka topics, enabling near-real-time observability without coupling to application internals.
Replay capability — Kafka's log retention allows historical log replay for debugging or metric recomputation.
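To make the log stream concrete, here is one possible shape for an operational log event and its serialization to a key/value pair of bytes, which is the form Kafka producers send. The field names and the idea of keying by task ID are assumptions for illustration. The actual publishing call is omitted because this README does not state which Kafka client `utils/kafka_producer.py` uses.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class LogEvent:
    service: str    # e.g. "scraper" or "etl" (illustrative values)
    event: str      # e.g. "task_started", "stage_completed", "task_failed"
    level: str      # "INFO", "WARNING", "ERROR"
    task_id: str
    timestamp: str  # ISO 8601, UTC
    detail: dict


def make_event(service: str, event: str, level: str, task_id: str,
               detail: Optional[dict] = None,
               now: Optional[datetime] = None) -> LogEvent:
    now = now or datetime.now(timezone.utc)
    return LogEvent(service, event, level, task_id, now.isoformat(), detail or {})


def encode(event: LogEvent) -> Tuple[bytes, bytes]:
    """Return (key, value) bytes suitable for a Kafka producer."""
    # Messages that share a key are routed to the same partition by the
    # default partitioner, which preserves ordering per task.
    key = event.task_id.encode("utf-8")
    value = json.dumps(asdict(event), sort_keys=True).encode("utf-8")
    return key, value
```

A consumer such as the alerting path could then decode the value with `json.loads` and filter on `level == "ERROR"`.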

5. Observability & Monitoring

Tool	Purpose
Flower	Real-time task-level monitoring — active workers, task states, retry counts, failure rates
Analytics Dashboard	Time-series views of ingestion volume, pipeline throughput, latency percentiles, and overall platform health
Kafka Log Consumer	Aggregated operational log stream with filtering and alerting on error-class events

The combination of task-level telemetry (Flower), business-level metrics (dashboard), and raw log aggregation (Kafka) provides three distinct observability planes covering operations, product, and engineering use cases.

6. API Service Layer

Processed datasets are exposed via a FastAPI service, deployed independently of the ingestion workers on the same VPS. Keeping the two apart limits the impact of scraping workload spikes on API availability.

RESTful endpoints with automatic OpenAPI schema generation.
Independent horizontal scalability — the API layer can be scaled without touching worker infrastructure.
Domain-level separation — API and ingestion services are served under distinct subdomains via NGINX.
SSL termination — All external traffic is encrypted via certificates managed at the NGINX layer.

7. Natural Language Query Interface

The platform includes a governed natural language querying interface built on LangChain and PostgreSQL, accessible via a React-based conversational UI.

Architecture:

A LangChain SQL agent with access to the PostgreSQL schema is initialized with a constrained system prompt that enforces query boundaries.
User natural-language queries are translated by the agent into parameterized SQL statements, executed against the database, and returned as structured results or prose summaries.
Schema constraints and query boundaries are enforced at the agent prompt level to guard against arbitrary DDL/DML execution, cross-schema leakage, and unbounded scans.
The React frontend provides a chat-style interface optimized for exploratory data analysis without requiring SQL knowledge.
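Prompt-level boundaries can be complemented by a check on the generated SQL before it reaches the database. The sketch below is one conservative way to do that and is not taken from this repository: the keyword list, the default row limit, and the function name `guard_sql` are assumptions. A regular-expression check is not a SQL parser, so a read-only database role remains the stronger control.

```python
import re

# Keywords that should never appear in an agent-generated query.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|revoke|copy)\b",
    re.IGNORECASE,
)
STARTS_READ_ONLY = re.compile(r"^(select|with)\b", re.IGNORECASE)
HAS_LIMIT = re.compile(r"\blimit\s+\d+\s*$", re.IGNORECASE)


def guard_sql(sql: str, max_rows: int = 500) -> str:
    """Return a bounded, read-only statement or raise ValueError."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        raise ValueError("multiple statements are not allowed")
    if not STARTS_READ_ONLY.match(statement):
        raise ValueError("only SELECT queries are allowed")
    # Deliberately strict: this also rejects queries that merely mention a
    # forbidden word inside a string literal.
    if FORBIDDEN.search(statement):
        raise ValueError("statement contains a forbidden keyword")
    # Bound the result size when the query does not already end in a LIMIT.
    if not HAS_LIMIT.search(statement):
        statement = f"{statement} LIMIT {max_rows}"
    return statement
```

For example, `guard_sql("SELECT make, price FROM cars;")` returns the query with `LIMIT 500` appended, while `guard_sql("DROP TABLE cars")` raises `ValueError`. The `cars` table name here is only a placeholder.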

8. Infrastructure & Deployment

The platform is hosted on a VPS with the following infrastructure configuration:

NGINX acts as the edge reverse proxy, routing traffic to the appropriate service by subdomain/path and terminating SSL.
SSL certificates secure all external-facing services.
Docker containerizes all services, ensuring environment parity between development and production.
Monorepo architecture — all services (ingestion workers, ETL pipeline, FastAPI, frontend) live in a single repository with shared tooling and configuration.

CI/CD Pipeline

The platform uses a change-aware, monorepo CI/CD pipeline built on GitHub Actions. The pipeline detects which service directories have changed in a given commit and triggers only the affected workflow — eliminating unnecessary builds and deployments.

git push → main
     │
     ▼
┌────────────────────────────────────────────┐
│         GitHub Actions Orchestrator         │
│       (Directory-Level Change Detection)    │
└─────────────────┬──────────────────────────┘
                  │
        ┌─────────┴──────────┐
        │                    │
        ▼                    ▼
 Frontend Changed?     Backend Changed?
        │                    │
        ▼                    ▼
┌───────────────┐    ┌───────────────────────┐
│ Frontend CI   │    │ Backend CI            │
│               │    │                       │
│ 1. Run Tests  │    │ 1. Lint + Test        │
│ 2. Build      │    │ 2. Build Docker Image │
│    (VM)       │    │ 3. Push → Docker Hub  │
│ 3. SSH →      │    │ 4. SSH → VPS          │
│    Transfer   │    │ 5. Pull & Run         │
│    Assets     │    │    Containers         │
│ 4. Serve via  │    │                       │
│    NGINX      │    └───────────────────────┘
└───────────────┘
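The change-detection step at the top of the diagram amounts to mapping changed file paths to pipelines. In GitHub Actions this is often expressed with `paths` filters on the workflow trigger. The Python sketch below shows the same logic in isolation. The directory names come from the repository layout in this README, but which directories count as "backend" is an assumption.

```python
from typing import Iterable, Set

# Assumed mapping from path prefixes to the pipeline they should trigger.
PIPELINES = {
    "frontend": ("frontend/",),
    "backend": ("api/", "core/", "utils/", "workers/",
                "Dockerfile", "requirements.txt", "pyproject.toml"),
}


def affected_pipelines(changed_files: Iterable[str]) -> Set[str]:
    """Return the names of the pipelines touched by a set of changed paths."""
    affected = set()
    for path in changed_files:
        for name, prefixes in PIPELINES.items():
            # str.startswith accepts a tuple and matches any of its entries.
            if path.startswith(prefixes):
                affected.add(name)
    return affected
```

The list of changed files could come from `git diff --name-only` between the previous and current commit. A change that touches only `README.md` maps to no pipeline, so nothing is built or deployed.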

Frontend Workflow

Changes detected in the frontend/ directory trigger the frontend pipeline.
Automated tests are executed in a GitHub Actions runner.
A production build is compiled on a virtual machine.
Static assets are securely transferred to the VPS via SSH.
NGINX serves the updated build immediately — no container restart required.

Backend Workflow

Changes detected in backend service directories trigger the backend pipeline.
Code is linted and a full test suite is executed.
A versioned Docker image is built from the predefined Dockerfile.
The image is pushed to Docker Hub with a commit-SHA tag.
The VPS is accessed via SSH; the latest image is pulled and containers are restarted automatically.

The result: a commit merged to main produces a tested, fully deployed production update with zero manual steps and consistent, reproducible builds.

Technology Stack

Layer	Technology
Scraping / Ingestion	Selenium, Celery, Proxy Rotation
Task Queue / Broker	Redis
ETL Pipeline	Custom Python pipeline (Cleanse → Transform → Validate → Enrich → Load)
Primary Database	PostgreSQL
Log Streaming	Apache Kafka
API Framework	FastAPI
NLQ Agent	LangChain (SQL Agent)
Conversational UI	React
Task Monitoring	Flower
Containerization	Docker, Docker Compose
Reverse Proxy / SSL	NGINX, SSL Certificates
CI/CD	GitHub Actions
Hosting	VPS (Linux)

Getting Started

This platform is fully containerized. Docker is the only prerequisite. Every service — PostgreSQL, Redis, Kafka, Celery workers, FastAPI, and Flower — is orchestrated through Docker Compose. There is no local environment setup, no manual dependency installation, and no service bootstrapping required beyond a single command.

Prerequisite

Tool	Purpose
Docker + Docker Compose	Runs the entire platform stack

That's it.

1. Clone the Repository

git clone https://github.com/your-username/scrape-pipeline.git
cd scrape-pipeline

2. Configure Environment Variables

cp .env.example .env
# Open .env and fill in your values
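For orientation, a settings loader that reads these values at startup might look like the sketch below. The variable names (`DATABASE_URL`, `REDIS_URL`, `KAFKA_BOOTSTRAP_SERVERS`) are assumptions and may not match this project. The authoritative list is whatever `.env.example` contains.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    redis_url: str
    kafka_bootstrap_servers: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, failing fast if any is unset."""
    required = ("DATABASE_URL", "REDIS_URL", "KAFKA_BOOTSTRAP_SERVERS")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Settings(
        database_url=env["DATABASE_URL"],
        redis_url=env["REDIS_URL"],
        kafka_bootstrap_servers=env["KAFKA_BOOTSTRAP_SERVERS"],
    )
```

Failing at startup with the names of the missing variables is usually easier to diagnose than a connection error raised later by a worker.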

3. Start the Full Stack

Development:

docker compose -f docker-compose.yml -f docker-compose.dev.yml up --build -d

Production:

docker compose -f docker-compose.yml -f docker-compose.prod.yml up --build -d

This single command brings up the complete platform:

✅ PostgreSQL (primary data store)
✅ Redis (task broker + coordination)
✅ Apache Kafka (log streaming)
✅ Celery workers (distributed ingestion)
✅ FastAPI service (data API)
✅ Flower (task monitoring dashboard)
✅ Analytics dashboard

4. Access the Services

Service	URL
FastAPI Interactive Docs	`http://localhost:8000/docs`
Flower Task Monitor	`http://localhost:5555`
Analytics Dashboard	`http://localhost:8080`
React NLQ Chat Interface	`http://localhost:3000`

Zero-Touch Production Deployment

In production, you never run any of the above manually. Every merge to main automatically:

Runs the full CI suite (lint → test → quality checks)
Builds versioned Docker images
Pushes images to Docker Hub
SSHes into the VPS
Pulls the latest images and restarts containers

The entire pipeline — from git push to live production — completes without a single manual step. See CI/CD Pipeline for the full workflow breakdown.

Repository Structure

This is a monorepo — all services, workers, and infrastructure live under a single repository, enabling unified versioning, shared tooling, and change-aware CI/CD.

scrape-pipeline/
│
├── .github/
│   └── workflows/
│       ├── cd-dev.yml          # Continuous deployment — development environment
│       ├── cd-prod.yml         # Continuous deployment — production environment
│       └── ci.yml              # Continuous integration — lint, test, quality checks
│
├── api/                        # FastAPI service layer
│   ├── v1/
│   │   └── routes/
│   │       ├── agent.py        # LangChain SQL agent endpoint
│   │       ├── cars.py         # Domain-specific data routes
│   │       └── failures.py     # Pipeline failure reporting routes
│   ├── main.py                 # FastAPI application entrypoint
│   └── architecture-diagram    # System architecture reference
│
├── core/                       # Shared application core
│   ├── celery_app.py           # Celery application factory & configuration
│   ├── config.py               # Centralized settings & environment loading
│   └── db/                     # Database connection and session management
│
├── extra/                      # Developer utilities
│   ├── db_table_to_excel_file.py   # Export PostgreSQL tables to Excel
│   └── local_test_scrape.py        # Local scraping test harness
│
├── frontend/                   # React conversational UI (NLQ interface)
│   ├── src/
│   ├── .gitignore
│   ├── eslint.config.js
│   ├── index.html
│   ├── package-lock.json
│   ├── package.json
│   ├── README.md
│   ├── sedan.png
│   └── vite.config.js          # Vite build configuration
│
├── utils/                      # Shared utility modules
│   ├── agent.py                # LangChain agent utilities
│   ├── extractors.py           # Data extraction helpers
│   ├── kafka_producer.py       # Kafka log producer client
│   ├── redis_client.py         # Redis connection and helpers
│   ├── repo.py                 # Repository/data access patterns
│   ├── scraping_utils.py       # Scraping support utilities
│   └── selenium_driver.py      # Selenium WebDriver factory & proxy config
│
├── workers/                    # Celery worker definitions
│   ├── crawler.py              # URL crawling task definitions
│   ├── scheduler.py            # Periodic task scheduling
│   └── scraper.py              # Core scraping task logic
│
├── docker-compose.dev.yml      # Docker Compose — development stack
├── docker-compose.override.yml # Local overrides for development
├── docker-compose.prod.yml     # Docker Compose — production stack
├── docker-compose.yml          # Base Docker Compose configuration
├── Dockerfile                  # Multi-stage production Docker image
├── pyproject.toml              # Python project metadata & dependencies
├── .env.example                # Environment variable template
├── .gitignore
├── README.md
└── requirements.txt            # Python dependency lockfile

Acknowledgements

Special thanks to Kashif Sohail for his mentorship and technical guidance throughout the full lifecycle of this project. His insights were instrumental in shaping the architecture, operational design, and production deployment strategy of this platform.

🎬 Watch the Full Architecture Walkthrough

▶ 1-Hour YouTube Deep Dive — Architecture, Trade-offs & Implementation

Every design decision, every trade-off, every component — explained in full.

For more details, contact: m.safi.ullah@outlook.com
