🚵‍♂️ AdventureWorks Hybrid Data Lakehouse

On-Premise Cluster to Cloud-Native BI | Medallion Architecture

An enterprise-grade Big Data engineering project transforming raw transaction data into serverless analytics via a modern stack: Hadoop, Hive, Spark, Delta Lake, Sqoop, AWS S3, Athena, and QuickSight.

Project Architecture

🔭 Project Overview

This project simulates a real-world enterprise data migration and modernization strategy. It extracts transactional data from a simulated "On-Premise" environment (hosted on AWS EC2), performs heavy distributed processing using a Hadoop/Spark cluster, and ultimately serves the data via a Cloud-Native, serverless architecture to optimize costs (FinOps).

Key Features:

Hybrid Architecture: Bridging IaaS (EC2 Cluster) with PaaS/SaaS (S3, Athena, QuickSight).
Medallion Pipeline: Raw data is ingested into a Bronze layer (HiveQL), then refined into Silver and Gold layers (SparkSQL).
Advanced Modeling: Snowflake schema design with Slowly Changing Dimensions (SCD Type 2) in the Silver layer.
FinOps Strategy: Decoupling storage and compute by shutting down the heavy processing cluster and querying data serverlessly via Athena.

🖥️ Infrastructure & Cluster Setup (IaaS)

I provisioned and configured a distributed Big Data cluster from scratch using 4 AWS EC2 instances (t3.medium).

Machine 1 (Source & Metastore): Hosts the source SQL Server database (AdventureWorks2022) and a PostgreSQL database acting as the Hive Metastore.
Machine 2 (Master Node): The brain of the cluster (Hadoop NameNode, Hadoop Secondary NameNode, Spark Master, Hive, Sqoop).
Machines 3 & 4 (Worker Nodes): The processing muscle (Hadoop DataNodes, Spark Workers).

🏗️ Data Architecture & Modeling

The project focuses on the InternetSales business process. The data undergoes rigorous transformation to ensure analytical performance and historical tracking.

The Snowflake Schema & SCD Type 2

I designed a normalized Snowflake schema for the Data Warehouse layer. To maintain historical accuracy of business entities (like product price changes or customer addresses), I implemented SCD Type 2 in the Silver layer.

Snowflake Schema illustrating the InternetSales dimensions and facts.
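The SCD Type 2 idea can be sketched in plain Python. This is only an illustration of the technique, not the project's SparkSQL implementation: the `ProductVersion` record, its field names, and the `apply_scd2` helper are all hypothetical. When a tracked attribute changes, the current row is closed (end date set, flag cleared) and a new current row is appended, so earlier values stay queryable.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ProductVersion:
    """One version of a dimension row (hypothetical layout)."""
    product_id: int            # business (natural) key
    list_price: float          # tracked attribute
    valid_from: date
    valid_to: Optional[date]   # None while the row is the current version
    is_current: bool


def apply_scd2(history: List[ProductVersion], product_id: int,
               new_price: float, change_date: date) -> List[ProductVersion]:
    """Return a new history with the price change applied as SCD Type 2."""
    current = next((r for r in history
                    if r.product_id == product_id and r.is_current), None)
    if current is not None and current.list_price == new_price:
        return list(history)   # attribute unchanged: keep history as-is

    result = []
    for row in history:
        if row is current:
            # Close the old version instead of overwriting it.
            row = replace(row, valid_to=change_date, is_current=False)
        result.append(row)
    # Open a new current version starting at the change date.
    result.append(ProductVersion(product_id, new_price, change_date, None, True))
    return result


history = [ProductVersion(1, 100.0, date(2024, 1, 1), None, True)]
history = apply_scd2(history, 1, 120.0, date(2024, 6, 1))
for row in history:
    print(row)
```

In a Delta Lake table the same close-and-append pattern is typically expressed as a merge against the current rows, but the exact statements used in this project are not shown here.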

🌪️ ELT Pipeline (Medallion Architecture)

The data flows through a strict Medallion architecture, ensuring data quality and ACID compliance using Delta Lake.

Ingestion (Extract & Load): Apache Sqoop extracts raw tables from SQL Server and loads them directly into HDFS as parallel map tasks, so the transfer is not funnelled through a single connection.
🥉 Bronze Layer: Raw ingested data managed via HiveQL.
🥈 Silver Layer: Cleaned, filtered, and standardized data processed via SparkSQL, acting as the Enterprise Data Warehouse (with SCD Type 2).
🥇 Gold Layer (Data Marts): Business-level aggregations. I built the agg_monthly_sales aggregate table via SparkSQL to serve the BI layer directly.
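For the ingestion step, a Sqoop import can be assembled as an argument list before it is handed to a shell or scheduler. The sketch below is an assumption about what such a call might look like, not the contents of sqoop_script.txt: the host name, user, password file, target directory, and the `build_sqoop_import` helper are all hypothetical. The flags themselves (`--connect`, `--username`, `--password-file`, `--table`, `--target-dir`, `--num-mappers`, and the SQL Server connector's `--schema` extra argument) are standard Sqoop 1.4 options.

```python
import shlex
from typing import List


def build_sqoop_import(host: str, database: str, schema: str, table: str,
                       username: str, password_file: str,
                       target_dir: str, mappers: int = 4) -> List[str]:
    """Build the argv for a Sqoop import from SQL Server into HDFS."""
    jdbc_url = f"jdbc:sqlserver://{host}:1433;databaseName={database}"
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,   # avoids a password on the command line
        "--table", table,
        "--target-dir", target_dir,         # HDFS directory for the raw files
        "--num-mappers", str(mappers),      # number of parallel map tasks
        "--", "--schema", schema,           # connector-specific extra argument
    ]


# Hypothetical values, for illustration only.
cmd = build_sqoop_import(
    host="source-db.internal",
    database="AdventureWorks2022",
    schema="Sales",
    table="SalesOrderHeader",
    username="etl_user",
    password_file="/user/etl/.sqoop_password",
    target_dir="/data/bronze/sales_order_header",
)
print(shlex.join(cmd))
```

Building the command as a list keeps quoting explicit; `shlex.join` is used only to print a copy-pasteable form.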

☁️ Cloud Native & FinOps Optimization

The Problem: Keeping a 4-node EC2 cluster running 24/7 just to serve a dashboard via Spark Thrift Server is highly inefficient and expensive.

The Solution: Separation of Storage and Compute.

I exported the gold.agg_monthly_sales table from local HDFS to an Amazon S3 bucket in Parquet format. I then cataloged this S3 data using Amazon Athena.
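Cataloging Parquet files in Athena usually comes down to one `CREATE EXTERNAL TABLE` statement that points at the S3 prefix. The helper below sketches how that DDL could be generated. The bucket name, database name, and column list are placeholders, since the real schema of agg_monthly_sales is not given here.

```python
from typing import Dict


def build_athena_ddl(database: str, table: str,
                     columns: Dict[str, str], s3_location: str) -> str:
    """Return a CREATE EXTERNAL TABLE statement for Parquet data on S3."""
    if not s3_location.endswith("/"):
        s3_location += "/"   # LOCATION should name a folder, not a file
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"  {cols}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{s3_location}'"
    )


# Hypothetical columns and bucket, for illustration only.
ddl = build_athena_ddl(
    database="gold",
    table="agg_monthly_sales",
    columns={"order_year": "int", "order_month": "int", "total_revenue": "double"},
    s3_location="s3://example-bucket/gold/agg_monthly_sales",
)
print(ddl)
```

The resulting statement can be pasted into the Athena query editor or submitted programmatically; either way the data stays in S3 and only metadata is created.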

Impact: The EC2 cluster can be safely terminated after the ELT job completes. BI users query the data through Athena's serverless SQL engine, so we pay only for the data each query scans, reducing infrastructure costs by more than 80%.
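The reasoning behind the saving is simple arithmetic: an always-on cluster is billed for every hour of the month, while the decoupled setup is billed for the ELT window plus the data Athena scans. The sketch below makes that comparison explicit. Every number in it (hourly rate, ELT hours, terabytes scanned, price per terabyte) is a placeholder chosen for illustration, not a measurement from this project, and S3 storage charges are ignored.

```python
def cluster_cost(nodes: int, hourly_rate: float, hours: float) -> float:
    """Cost of running `nodes` instances for `hours` at `hourly_rate` each."""
    return nodes * hourly_rate * hours


def athena_cost(tb_scanned: float, price_per_tb: float) -> float:
    """Athena bills by the amount of data scanned."""
    return tb_scanned * price_per_tb


def savings_fraction(before: float, after: float) -> float:
    """Fraction of the original cost that is saved (0.8 means 80%)."""
    return 1 - after / before


# Placeholder inputs: replace with real prices and usage before drawing conclusions.
always_on = cluster_cost(nodes=4, hourly_rate=0.05, hours=730)
decoupled = cluster_cost(nodes=4, hourly_rate=0.05, hours=30) + athena_cost(0.1, 5.0)

print(f"always-on: {always_on:.2f}")
print(f"decoupled: {decoupled:.2f}")
print(f"saving:    {savings_fraction(always_on, decoupled):.1%}")
```

The saving depends almost entirely on how short the ELT window is relative to the month, which is why terminating the cluster after the job matters more than the Athena query charges.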

📊 Business Intelligence (Amazon QuickSight)

The final data product is an interactive dashboard built in Amazon QuickSight, natively connected to the Athena serverless engine.

InternetSales Performance Dashboard

Focus: Monthly revenue trends, gross margin tracking, tax calculations, and territorial performance.
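The monthly figures behind such a dashboard can be illustrated with a small aggregation. The project builds its aggregate in SparkSQL; the pure-Python version below only shows the shape of the calculation, and the input field names (`order_date`, `sales_amount`, `product_cost`, `tax_amount`) are hypothetical. Gross margin is taken here as revenue minus product cost.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple


def aggregate_monthly(sales: List[dict]) -> Dict[Tuple[int, int], dict]:
    """Sum revenue, cost and tax per (year, month) and derive gross margin."""
    totals = defaultdict(lambda: {"revenue": 0.0, "cost": 0.0, "tax": 0.0})
    for sale in sales:
        key = (sale["order_date"].year, sale["order_date"].month)
        totals[key]["revenue"] += sale["sales_amount"]
        totals[key]["cost"] += sale["product_cost"]
        totals[key]["tax"] += sale["tax_amount"]

    result = {}
    for key in sorted(totals):
        t = totals[key]
        margin = t["revenue"] - t["cost"]
        result[key] = {
            **t,
            "gross_margin": margin,
            # Guard against months with zero revenue.
            "gross_margin_pct": margin / t["revenue"] if t["revenue"] else 0.0,
        }
    return result


# Made-up sample rows, for illustration only.
sample = [
    {"order_date": date(2024, 1, 5), "sales_amount": 100.0, "product_cost": 60.0, "tax_amount": 8.0},
    {"order_date": date(2024, 1, 20), "sales_amount": 50.0, "product_cost": 30.0, "tax_amount": 4.0},
    {"order_date": date(2024, 2, 3), "sales_amount": 200.0, "product_cost": 150.0, "tax_amount": 16.0},
]
for month, row in aggregate_monthly(sample).items():
    print(month, row)
```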

📸 Implementation Gallery

Proof of Concept: The following screenshots show the actual deployment and orchestration of the infrastructure on AWS.

1. Compute & Cluster: EC2 t3.medium instances running the Hadoop/Spark/Hive cluster.
2. Storage Layer: S3 Bucket.
3. Serverless Querying (Athena): SQL validation of the Gold aggregates via Amazon Athena.
4. Data Visualization (QuickSight): Final business dashboard showing sales performance.

📨 Contact Me

LinkedIn • Gmail

Made with ❤️ by Abderrahmane Chahiri
