An end-to-end production-grade platform for large-scale web data ingestion, ETL processing, natural language querying, and automated deployment
▶ Watch on YouTube — A complete technical breakdown of every layer of this platform: ingestion design, ETL pipeline internals, Kafka integration, LangChain SQL agent, CI/CD pipeline, and production deployment trade-offs. Highly recommended before diving into the codebase.
- Overview
- Architecture
- Core Components
- CI/CD Pipeline
- Technology Stack
- Getting Started
- Repository Structure
- Acknowledgements
This platform is a production-deployed, end-to-end distributed data engineering system for large-scale web data ingestion, structured processing, and governed data access. Built over approximately 5–6 months, it covers every layer of the data engineering lifecycle: raw seed URL ingestion, a structured multi-stage ETL pipeline, and a natural language query interface backed by an LLM SQL agent.
The system is deployed on a VPS with domain-level isolation, NGINX reverse proxying, SSL termination, and fully automated CI/CD via GitHub Actions — with zero manual intervention required post-merge.
- Fault isolation — Celery workers, FastAPI services, and ingestion infrastructure are independently deployable and failure-contained.
- Schema consistency — The ETL pipeline enforces data quality guarantees and idempotent processing at each stage before persistence.
- Observability-first — Flower task monitoring, an analytics dashboard, and Kafka log streaming provide full operational visibility.
- Change-aware automation — CI/CD pipelines detect directory-level changes and trigger only the affected service workflow, minimizing blast radius.
- Governed data access — The LLM SQL agent enforces schema constraints and query boundaries, preventing arbitrary execution against production data.
For a full spoken walkthrough of every component, trade-off, and design decision in the architecture, watch the 1-hour YouTube breakdown.
The platform is structured across six distinct layers: Ingestion → ETL → Storage → Streaming → Access → Deployment. Each layer is independently deployable, failure-isolated, and observable. The sections below cover each in detail.
The ingestion subsystem accepts client-provided seed URLs and distributes scraping workloads across a pool of Celery workers, backed by Redis as the message broker and task state store.
Key characteristics:
- Parallel collection — Multiple workers operate concurrently with configurable concurrency limits per queue.
- Controlled retries — Tasks that fail due to transient errors (timeouts, proxy blocks) are automatically re-queued with exponential back-off policies.
- Fault isolation — Worker failures do not propagate; tasks are returned to the queue and redistributed.
- Selenium-based scraping — Each worker spawns isolated Selenium sessions to handle JavaScript-rendered pages, dynamic content, and session management.
- Proxy rotation — Requests are routed through a rotating proxy pool, preventing IP-level blocking and maintaining ingestion stability under varying source constraints.
- Ingestion control — Redis coordinates rate limiting and deduplication signals across the worker fleet to prevent redundant work.
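The controlled-retry behaviour listed above can be sketched without Celery. This is a minimal illustration under stated assumptions, not the project's code: `TransientError`, `backoff_delay`, and `run_with_retries` are hypothetical names, and the base delay, cap, and retry limit are assumed values. In the platform itself, the equivalent policy would be configured on the Celery tasks.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, proxy blocks)."""


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Delay before retry number `attempt` (0-based): base * 2**attempt, capped."""
    return min(cap, base * (2 ** attempt))


def run_with_retries(
    task: Callable[[], T],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
    jitter: bool = True,
) -> T:
    """Run `task`, retrying transient failures with exponential back-off."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure to the caller
            delay = backoff_delay(attempt)
            if jitter:
                # Random jitter keeps many workers from retrying in lockstep.
                delay = random.uniform(0, delay)
            sleep(delay)
    raise AssertionError("unreachable")
```

Only errors marked as transient are retried; any other exception propagates immediately, which matches the idea that permanent failures should not consume retry budget.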
Raw scraped data flows through a five-stage structured ETL pipeline before persisting to PostgreSQL. Each stage is independently testable and designed for idempotent re-execution.
| Stage | Responsibility |
|---|---|
| Cleanse | Remove null fields, strip HTML artifacts, normalize whitespace and encoding |
| Transform | Apply schema mappings, type coercions, and structural normalization |
| Validate | Enforce schema contracts, range constraints, and mandatory field presence |
| Enrich | Augment records with derived fields, computed metadata, and cross-references |
| Load | Idempotent upsert into PostgreSQL, preventing duplicate writes on retry |
Data quality guarantees are enforced at the validation stage, with records failing validation routed to a dead-letter store for inspection rather than silently dropped.
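The five stages and the dead-letter routing can be sketched as plain functions. This is an illustrative sketch, not the project's pipeline: the `listings` table, the `id`/`title`/`price` fields, and every function name are hypothetical, and an in-memory SQLite database stands in for PostgreSQL (both accept the `INSERT ... ON CONFLICT ... DO UPDATE` upsert form used here).

```python
import sqlite3
from typing import Any

Record = dict[str, Any]


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE listings ("
        "id TEXT PRIMARY KEY, title TEXT, price INTEGER, title_length INTEGER)"
    )
    return conn


def cleanse(raw: Record) -> Record:
    """Drop null fields and normalise whitespace in string values."""
    return {
        k: " ".join(v.split()) if isinstance(v, str) else v
        for k, v in raw.items()
        if v is not None
    }


def transform(rec: Record) -> Record:
    """Coerce the price to an integer when it is present."""
    out = dict(rec)
    if "price" in out:
        out["price"] = int(str(out["price"]).replace(",", ""))
    return out


def validate(rec: Record) -> None:
    """Raise ValueError when a mandatory field is missing or out of range."""
    for name in ("id", "title", "price"):
        if name not in rec:
            raise ValueError(f"missing field: {name}")
    if rec["price"] <= 0:
        raise ValueError("price must be positive")


def enrich(rec: Record) -> Record:
    """Add a derived field computed from existing ones."""
    return {**rec, "title_length": len(rec["title"])}


def load(conn: sqlite3.Connection, rec: Record) -> None:
    """Upsert keyed on id, so re-running the pipeline does not duplicate rows."""
    conn.execute(
        "INSERT INTO listings (id, title, price, title_length) "
        "VALUES (:id, :title, :price, :title_length) "
        "ON CONFLICT(id) DO UPDATE SET title = excluded.title, "
        "price = excluded.price, title_length = excluded.title_length",
        rec,
    )


def run_pipeline(
    conn: sqlite3.Connection, raw_records: list[Record]
) -> list[tuple[Record, str]]:
    """Process records; return (raw record, reason) pairs for the dead-letter store."""
    dead_letters = []
    for raw in raw_records:
        try:
            rec = transform(cleanse(raw))
            validate(rec)
            load(conn, enrich(rec))
        except ValueError as exc:
            dead_letters.append((raw, str(exc)))
    conn.commit()
    return dead_letters
```

Running `run_pipeline` twice over the same input leaves one row per `id`, which is the property the Load stage relies on when a task is retried.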
- PostgreSQL — Primary storage for all processed and enriched datasets. Schema migrations are versioned and applied as part of the deployment pipeline.
- Redis — Dual-purpose: acts as the Celery message broker and provides ephemeral storage for task state, ingestion coordination signals, and deduplication keys.
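A deduplication key of the kind Redis holds here could be derived as follows. This is a sketch under assumptions: `dedup_key`, `SeenStore`, the `seen:` prefix, and the normalisation rules are hypothetical, and an in-memory set stands in for Redis. Against a real Redis instance, the claim step would typically be a single atomic `SET key value NX EX ttl`, so that two workers cannot both claim the same URL.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def dedup_key(url: str) -> str:
    """Build a stable key for a URL: lower-case scheme and host, drop the fragment."""
    parts = urlsplit(url.strip())
    normalised = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )
    digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
    return f"seen:{digest}"


class SeenStore:
    """In-memory stand-in for the Redis key space used for deduplication."""

    def __init__(self) -> None:
        self._keys: set[str] = set()

    def claim(self, url: str) -> bool:
        """Return True the first time a URL is claimed, False afterwards."""
        key = dedup_key(url)
        if key in self._keys:
            return False
        self._keys.add(key)
        return True
```

A worker would call `claim(url)` before scraping and skip the URL when it returns `False`.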
All platform-level operational logs — task lifecycle events, ETL stage transitions, error records, and pipeline health signals — are published to an Apache Kafka broker. This decouples log production from log consumption and enables:
- Real-time log aggregation across distributed workers.
- Durable audit trails for compliance and post-incident analysis.
- Downstream consumers — The analytics dashboard and alerting systems consume from Kafka topics, enabling near-real-time observability without coupling to application internals.
- Replay capability — Kafka's log retention allows historical log replay for debugging or metric recomputation.
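The decoupling and replay properties listed above can be illustrated with a small append-only log. This sketch does not use a Kafka client: `LogEvent`, `InMemoryTopic`, and the field names are hypothetical, and the in-memory list only mimics the offset-based reads that a Kafka topic partition provides.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class LogEvent:
    """Hypothetical envelope for an operational log record."""

    service: str      # e.g. "scraper" or "etl"
    event_type: str   # e.g. "task_started" or "stage_failed"
    payload: dict
    timestamp: float = field(default_factory=time.time)

    def to_bytes(self) -> bytes:
        """Serialise to UTF-8 JSON, the form a producer would send."""
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")


class InMemoryTopic:
    """Append-only log standing in for a single topic partition."""

    def __init__(self) -> None:
        self._log: list[bytes] = []

    def publish(self, event: LogEvent) -> int:
        """Append an event and return its offset."""
        self._log.append(event.to_bytes())
        return len(self._log) - 1

    def replay(self, from_offset: int = 0) -> list[dict]:
        """Re-read every event from `from_offset` onwards, without consuming it."""
        return [json.loads(message) for message in self._log[from_offset:]]
```

Because reading does not remove messages, a dashboard consumer and a debugging session can each read the same events from their own offsets.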
| Tool | Purpose |
|---|---|
| Flower | Real-time task-level monitoring — active workers, task states, retry counts, failure rates |
| Analytics Dashboard | Time-series views of ingestion volume, pipeline throughput, latency percentiles, and overall platform health |
| Kafka Log Consumer | Aggregated operational log stream with filtering and alerting on error-class events |
The combination of task-level telemetry (Flower), business-level metrics (dashboard), and raw log aggregation (Kafka) provides three distinct observability planes covering operations, product, and engineering use cases.
Processed datasets are exposed via a FastAPI service, deployed independently from the ingestion workers on the same VPS environment. This separation of concerns ensures that API availability is not impacted by scraping workload spikes.
- RESTful endpoints with automatic OpenAPI schema generation.
- Independent horizontal scalability — the API layer can be scaled without touching worker infrastructure.
- Domain-level separation — API and ingestion services are served under distinct subdomains via NGINX.
- SSL termination — All external traffic is encrypted via certificates managed at the NGINX layer.
The platform includes a governed natural language querying interface built on LangChain and PostgreSQL, accessible via a React-based conversational UI.
Architecture:
- A LangChain SQL agent with access to the PostgreSQL schema is initialized with a constrained system prompt that enforces query boundaries.
- User natural-language queries are translated by the agent into parameterized SQL statements, executed against the database, and returned as structured results or prose summaries.
- Schema constraints and query boundaries are enforced at the agent prompt level, preventing arbitrary DDL/DML execution, cross-schema leakage, or unbounded scans.
- The React frontend provides a chat-style interface optimized for exploratory data analysis without requiring SQL knowledge.
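Prompt-level boundaries are sometimes paired with a check in code before a generated statement reaches the database. The sketch below is an assumption about how such a check could look, not a description of this project's agent: `check_query`, the `cars` table name, and the default row limit are hypothetical, and the keyword matching is deliberately crude (it would also reject a harmless string literal containing a word such as "update").

```python
import re

# Keywords that indicate a write or schema change.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|revoke|copy)\b",
    re.IGNORECASE,
)


def check_query(sql: str, allowed_tables: set[str], max_rows: int = 100) -> str:
    """Return a bounded, read-only statement or raise ValueError."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"(?i)select\b", statement):
        raise ValueError("only SELECT statements are allowed")
    if FORBIDDEN.search(statement):
        raise ValueError("statement contains a forbidden keyword")
    # Collect table names that follow FROM or JOIN and compare with the allow-list.
    tables = re.findall(r"(?i)\b(?:from|join)\s+([a-z_][a-z0-9_.]*)", statement)
    unknown = {t.lower() for t in tables} - allowed_tables
    if unknown:
        raise ValueError(f"tables not allowed: {sorted(unknown)}")
    # Bound the scan when the statement does not already end with a LIMIT.
    if not re.search(r"(?i)\blimit\s+\d+\s*$", statement):
        statement = f"{statement} LIMIT {max_rows}"
    return statement
```

A check like this complements, rather than replaces, database-side controls such as connecting the agent with a role that only has `SELECT` privileges.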
The platform is hosted on a VPS with the following infrastructure configuration:
- NGINX acts as the edge reverse proxy, routing traffic to the appropriate service by subdomain/path and terminating SSL.
- SSL certificates secure all external-facing services.
- Docker containerizes all services, ensuring environment parity between development and production.
- Monorepo architecture — all services (ingestion workers, ETL pipeline, FastAPI, frontend) live in a single repository with shared tooling and configuration.
The platform uses a change-aware, monorepo CI/CD pipeline built on GitHub Actions. The pipeline detects which service directories have changed in a given commit and triggers only the affected workflow — eliminating unnecessary builds and deployments.
git push → main
│
▼
┌────────────────────────────────────────────┐
│ GitHub Actions Orchestrator │
│ (Directory-Level Change Detection) │
└─────────────────┬──────────────────────────┘
│
┌─────────┴──────────┐
│ │
▼ ▼
Frontend Changed? Backend Changed?
│ │
▼ ▼
┌───────────────┐ ┌───────────────────────┐
│ Frontend CI │ │ Backend CI │
│ │ │ │
│ 1. Run Tests │ │ 1. Lint + Test │
│ 2. Build │ │ 2. Build Docker Image │
│ (VM) │ │ 3. Push → Docker Hub │
│ 3. SSH → │ │ 4. SSH → VPS │
│ Transfer │ │ 5. Pull & Run │
│ Assets │ │ Containers │
│ 4. Serve via │ │ │
│ NGINX │ └───────────────────────┘
└───────────────┘
- Changes detected in the `frontend/` directory trigger the frontend pipeline.
- Automated tests are executed in a GitHub Actions runner.
- A production build is compiled on a virtual machine.
- Static assets are securely transferred to the VPS via SSH.
- NGINX serves the updated build immediately — no container restart required.
- Changes detected in backend service directories trigger the backend pipeline.
- Code is linted and a full test suite is executed.
- A versioned Docker image is built from the predefined `Dockerfile`.
- The image is pushed to Docker Hub with a commit-SHA tag.
- The VPS is accessed via SSH; the latest image is pulled and containers are restarted automatically.
The result: A commit merged to main produces a fully deployed, tested, production update with zero manual steps and consistent, reproducible builds.
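The directory-level change detection can be sketched as a mapping from changed paths to pipelines. This is an illustration only: `PIPELINE_BY_DIR`, `affected_pipelines`, and `image_tag` are hypothetical names, the choice of which directories count as backend is an assumption based on the repository layout, and the source does not say which mechanism the workflows use. In GitHub Actions the same effect is commonly achieved with `paths` filters on the workflow trigger.

```python
from pathlib import PurePosixPath

# Assumed mapping from top-level directory to the pipeline it triggers.
PIPELINE_BY_DIR = {
    "frontend": "frontend",
    "api": "backend",
    "core": "backend",
    "utils": "backend",
    "workers": "backend",
}


def affected_pipelines(changed_paths: list[str]) -> set[str]:
    """Return the pipelines whose directories contain at least one changed file."""
    pipelines = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        # Files at the repository root (a single path component) trigger nothing here.
        if len(parts) > 1 and parts[0] in PIPELINE_BY_DIR:
            pipelines.add(PIPELINE_BY_DIR[parts[0]])
    return pipelines


def image_tag(repository: str, commit_sha: str) -> str:
    """Build a Docker image reference tagged with the commit SHA."""
    return f"{repository}:{commit_sha}"
```

The list of changed paths would come from the commit itself, for example the output of `git diff --name-only` between the previous and current revisions.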
| Layer | Technology |
|---|---|
| Scraping / Ingestion | Selenium, Celery, Proxy Rotation |
| Task Queue / Broker | Redis |
| ETL Pipeline | Custom Python pipeline (Cleanse → Transform → Validate → Enrich → Load) |
| Primary Database | PostgreSQL |
| Log Streaming | Apache Kafka |
| API Framework | FastAPI |
| NLQ Agent | LangChain (SQL Agent) |
| Conversational UI | React |
| Task Monitoring | Flower |
| Containerization | Docker, Docker Compose |
| Reverse Proxy / SSL | NGINX, SSL Certificates |
| CI/CD | GitHub Actions |
| Hosting | VPS (Linux) |
This platform is fully containerized. Docker is the only prerequisite. Every service — PostgreSQL, Redis, Kafka, Celery workers, FastAPI, and Flower — is orchestrated through Docker Compose. There is no local environment setup, no manual dependency installation, and no service bootstrapping required beyond a single command.
| Tool | Purpose |
|---|---|
| Docker + Docker Compose | Runs the entire platform stack |
That's it.
    git clone https://github.com/your-username/scrape-pipeline.git
    cd scrape-pipeline
    cp .env.example .env
    # Open .env and fill in your values

Development:

    docker compose -f docker-compose.yml -f docker-compose.dev.yml up --build -d

Production:

    docker compose -f docker-compose.yml -f docker-compose.prod.yml up --build -d

This single command brings up the complete platform:
- ✅ PostgreSQL (primary data store)
- ✅ Redis (task broker + coordination)
- ✅ Apache Kafka (log streaming)
- ✅ Celery workers (distributed ingestion)
- ✅ FastAPI service (data API)
- ✅ Flower (task monitoring dashboard)
- ✅ Analytics dashboard
| Service | URL |
|---|---|
| FastAPI Interactive Docs | http://localhost:8000/docs |
| Flower Task Monitor | http://localhost:5555 |
| Analytics Dashboard | http://localhost:8080 |
| React NLQ Chat Interface | http://localhost:3000 |
In production, you never run any of the above manually. Every merge to main automatically:
- Runs the full CI suite (lint → test → quality checks)
- Builds versioned Docker images
- Pushes images to Docker Hub
- SSHes into the VPS
- Pulls the latest images and restarts containers
The entire pipeline — from git push to live production — completes without a single manual step. See CI/CD Pipeline for the full workflow breakdown.
This is a monorepo — all services, workers, and infrastructure live under a single repository, enabling unified versioning, shared tooling, and change-aware CI/CD.
scrape-pipeline/
│
├── .github/
│   └── workflows/
│       ├── cd-dev.yml                 # Continuous deployment: development environment
│       ├── cd-prod.yml                # Continuous deployment: production environment
│       └── ci.yml                     # Continuous integration: lint, test, quality checks
│
├── api/                               # FastAPI service layer
│   ├── v1/
│   │   └── routes/
│   │       ├── agent.py               # LangChain SQL agent endpoint
│   │       ├── cars.py                # Domain-specific data routes
│   │       └── failures.py            # Pipeline failure reporting routes
│   ├── main.py                        # FastAPI application entrypoint
│   └── architecture-diagram           # System architecture reference
│
├── core/                              # Shared application core
│   ├── celery_app.py                  # Celery application factory & configuration
│   ├── config.py                      # Centralized settings & environment loading
│   └── db/                            # Database connection and session management
│
├── extra/                             # Developer utilities
│   ├── db_table_to_excel_file.py      # Export PostgreSQL tables to Excel
│   └── local_test_scrape.py           # Local scraping test harness
│
├── frontend/                          # React conversational UI (NLQ interface)
│   ├── src/
│   ├── .gitignore
│   ├── eslint.config.js
│   ├── index.html
│   ├── package-lock.json
│   ├── package.json
│   ├── README.md
│   ├── sedan.png
│   └── vite.config.js                 # Vite build configuration
│
├── utils/                             # Shared utility modules
│   ├── agent.py                       # LangChain agent utilities
│   ├── extractors.py                  # Data extraction helpers
│   ├── kafka_producer.py              # Kafka log producer client
│   ├── redis_client.py                # Redis connection and helpers
│   ├── repo.py                        # Repository/data access patterns
│   ├── scraping_utils.py              # Scraping support utilities
│   └── selenium_driver.py             # Selenium WebDriver factory & proxy config
│
├── workers/                           # Celery worker definitions
│   ├── crawler.py                     # URL crawling task definitions
│   ├── scheduler.py                   # Periodic task scheduling
│   └── scraper.py                     # Core scraping task logic
│
├── docker-compose.dev.yml             # Docker Compose: development stack
├── docker-compose.override.yml        # Local overrides for development
├── docker-compose.prod.yml            # Docker Compose: production stack
├── docker-compose.yml                 # Base Docker Compose configuration
├── Dockerfile                         # Multi-stage production Docker image
├── pyproject.toml                     # Python project metadata & dependencies
├── .env.example                       # Environment variable template
├── .gitignore
├── README.md
└── requirements.txt                   # Python dependency lockfile
Special thanks to Kashif Sohail for his mentorship and technical guidance throughout the full lifecycle of this project. His insights were instrumental in shaping the architecture, operational design, and production deployment strategy of this platform.
Contact here for more details: m.safi.ullah@outlook.com
