An enterprise-grade Big Data engineering project transforming raw transaction data into serverless analytics via a modern stack: Hadoop, Hive, Spark, Delta Lake, Sqoop, AWS S3, Athena, and QuickSight.
- Project Overview
- Infrastructure & Cluster Setup
- Data Architecture & Modeling
- ELT Pipeline (Medallion)
- Cloud Native & FinOps Optimization
- Business Intelligence (QuickSight)
- Implementation Gallery
- Contact
This project simulates a real-world enterprise data migration and modernization strategy. It extracts transactional data from a simulated "On-Premise" environment (hosted on AWS EC2), performs heavy distributed processing using a Hadoop/Spark cluster, and ultimately serves the data via a Cloud-Native, serverless architecture to optimize costs (FinOps).
Key Features:
- Hybrid Architecture: Bridging IaaS (EC2 Cluster) with PaaS/SaaS (S3, Athena, QuickSight).
- Medallion Pipeline: Ingestion of raw data into Bronze (HiveQL), Silver, and Gold layers (SparkSQL).
- Advanced Modeling: Snowflake schema design with Slowly Changing Dimensions (SCD Type 2) in the Silver layer.
- FinOps Strategy: Decoupling storage and compute by shutting down the heavy processing cluster and querying data serverlessly via Athena.
I provisioned and configured a distributed Big Data cluster from scratch using 4 AWS EC2 instances (t3.medium).
- Machine 1 (Source & Metastore): Hosts the source SQL Server database (AdventureWorks2022) and a PostgreSQL database acting as the Hive Metastore.
- Machine 2 (Master Node): The brain of the cluster (Hadoop NameNode, Hadoop SecNameNode, Spark Master, Hive, Sqoop).
- Machines 3 & 4 (Worker Nodes): The processing muscle (Hadoop DataNodes, Spark Workers).
The project focuses on the InternetSales business process. The data undergoes rigorous transformation to ensure analytical performance and historical tracking.
I designed a normalized Snowflake schema for the Data Warehouse layer. To maintain historical accuracy of business entities (like product price changes or customer addresses), I implemented SCD Type 2 in the Silver layer.
The data flows through a strict Medallion architecture, ensuring data quality and ACID compliance using Delta Lake.
- Ingestion (Extract & Load): Apache Sqoop extracts raw tables from the SQL Server and loads them directly into HDFS natively, preventing network bottlenecks.
- 🥉 Bronze Layer: Raw ingested data managed via HiveQL.
- 🥈 Silver Layer: Cleaned, filtered, and standardized data processed via SparkSQL, acting as the Enterprise Data Warehouse (with SCD Type 2).
- 🥇 Gold Layer (Data Marts): Business-level aggregations. I built the
monthly_salesaggregate table via SparkSQL to serve the BI layer directly.
The Problem: Keeping a 4-node EC2 cluster running 24/7 just to serve a dashboard via Spark Thrift Server is highly inefficient and expensive. The Solution: Separation of Storage and Compute.
I exported the gold.agg_monthly_sales table from local HDFS to an Amazon S3 bucket in Parquet format. I then cataloged this S3 data using Amazon Athena.
- Impact: The EC2 cluster can be safely terminated after the ELT job completes. BI users query the data using Athena's serverless SQL engine, meaning we only pay for queries executed, reducing infrastructure costs by +80%.
The final data product is an interactive dashboard built in Amazon QuickSight, natively connected to the Athena serverless engine.
Focus: Monthly revenue trends, gross margin tracking, tax calculations, and territorial performance.
.jpg)
Proof of Concept: The following captures demonstrate the actual deployment and orchestration of the infrastructure on AWS.
| 1. Compute & Cluster | 2. Storage Layer |
|---|---|
![]() |
![]() |
| EC2 t3.medium instances running the Hadoop/Spark/Hive cluster. | S3 Bucket. |
| 3. Serverless Querying (Athena) | 4. Data Visualization (QuickSight) |
|---|---|
![]() |
![]() |
| SQL Validation of the Gold aggregates via Amazon Athena. | Final business dashboard showing sales performance. |





