Skip to content

ChahiriAbderrahmane/Sales-analytics-Data-Lakehouse

Repository files navigation

🚵‍♂️ AdventureWorks Hybrid Data Lakehouse

On-Premise Cluster to Cloud-Native BI | Medallion Architecture

An enterprise-grade Big Data engineering project transforming raw transaction data into serverless analytics via a modern stack: Hadoop, Hive, Spark, Delta Lake, Sqoop, AWS S3, Athena, and QuickSight.


Project Architecture

Project Architecture


📝 Table of Contents

  1. Project Overview
  2. Infrastructure & Cluster Setup
  3. Data Architecture & Modeling
  4. ELT Pipeline (Medallion)
  5. Cloud Native & FinOps Optimization
  6. Business Intelligence (QuickSight)
  7. Implementation Gallery
  8. Contact

🔭 Project Overview

This project simulates a real-world enterprise data migration and modernization strategy. It extracts transactional data from a simulated "On-Premise" environment (hosted on AWS EC2), performs heavy distributed processing using a Hadoop/Spark cluster, and ultimately serves the data via a Cloud-Native, serverless architecture to optimize costs (FinOps).

Key Features:

  • Hybrid Architecture: Bridging IaaS (EC2 Cluster) with PaaS/SaaS (S3, Athena, QuickSight).
  • Medallion Pipeline: Ingestion of raw data into Bronze (HiveQL), Silver, and Gold layers (SparkSQL).
  • Advanced Modeling: Snowflake schema design with Slowly Changing Dimensions (SCD Type 2) in the Silver layer.
  • FinOps Strategy: Decoupling storage and compute by shutting down the heavy processing cluster and querying data serverlessly via Athena.

🖥️ Infrastructure & Cluster Setup (IaaS)

I provisioned and configured a distributed Big Data cluster from scratch using 4 AWS EC2 instances (t3.medium).

  • Machine 1 (Source & Metastore): Hosts the source SQL Server database (AdventureWorks2022) and a PostgreSQL database acting as the Hive Metastore.
  • Machine 2 (Master Node): The brain of the cluster (Hadoop NameNode, Hadoop SecNameNode, Spark Master, Hive, Sqoop).
  • Machines 3 & 4 (Worker Nodes): The processing muscle (Hadoop DataNodes, Spark Workers).

🏗️ Data Architecture & Modeling

The project focuses on the InternetSales business process. The data undergoes rigorous transformation to ensure analytical performance and historical tracking.

The Snowflake Schema & SCD Type 2

I designed a normalized Snowflake schema for the Data Warehouse layer. To maintain historical accuracy of business entities (like product price changes or customer addresses), I implemented SCD Type 2 in the Silver layer.

Snowflake Schema

Snowflake Schema illustrating the InternetSales dimensions and facts.

🌪️ ELT Pipeline (Medallion Architecture)

The data flows through a strict Medallion architecture, ensuring data quality and ACID compliance using Delta Lake.

  1. Ingestion (Extract & Load): Apache Sqoop extracts raw tables from the SQL Server and loads them directly into HDFS natively, preventing network bottlenecks.
  2. 🥉 Bronze Layer: Raw ingested data managed via HiveQL.
  3. 🥈 Silver Layer: Cleaned, filtered, and standardized data processed via SparkSQL, acting as the Enterprise Data Warehouse (with SCD Type 2).
  4. 🥇 Gold Layer (Data Marts): Business-level aggregations. I built the monthly_sales aggregate table via SparkSQL to serve the BI layer directly.

☁️ Cloud Native & FinOps Optimization

The Problem: Keeping a 4-node EC2 cluster running 24/7 just to serve a dashboard via Spark Thrift Server is highly inefficient and expensive. The Solution: Separation of Storage and Compute.

I exported the gold.agg_monthly_sales table from local HDFS to an Amazon S3 bucket in Parquet format. I then cataloged this S3 data using Amazon Athena.

  • Impact: The EC2 cluster can be safely terminated after the ELT job completes. BI users query the data using Athena's serverless SQL engine, meaning we only pay for queries executed, reducing infrastructure costs by +80%.

📊 Business Intelligence (Amazon QuickSight)

The final data product is an interactive dashboard built in Amazon QuickSight, natively connected to the Athena serverless engine.

InternetSales Performance Dashboard

Focus: Monthly revenue trends, gross margin tracking, tax calculations, and territorial performance. QuickSight Dashboard

📸 Implementation Gallery

Proof of Concept: The following captures demonstrate the actual deployment and orchestration of the infrastructure on AWS.

1. Compute & Cluster 2. Storage Layer
Cluster Instance S3 Bucket
EC2 t3.medium instances running the Hadoop/Spark/Hive cluster. S3 Bucket.
3. Serverless Querying (Athena) 4. Data Visualization (QuickSight)
Athena QuickSight
SQL Validation of the Gold aggregates via Amazon Athena. Final business dashboard showing sales performance.

📨 Contact Me

LinkedInGmail

Made with ❤️ by Abderrahmane Chahiri

About

This project simulates a real-world enterprise data migration and modernization strategy. It extracts transactional data from a simulated "On-Premise" environment (hosted on AWS EC2), performs heavy distributed processing using a Hadoop/Spark cluster, and ultimately serves the data via a Cloud-Native, serverless architecture to optimize costs .

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages