End-to-End E-commerce ETL Pipeline using AWS Glue, Amazon S3 and Snowflake

Ruben Encinas Moreno

Project Overview

This project demonstrates the design and implementation of an end-to-end batch ETL data pipeline for an e-commerce clickstream and transactions dataset, using AWS Glue, Amazon S3, and Snowflake.

The objective of the project is to ingest raw CSV data, transform it into analytics-ready Parquet format, and load it into a cloud data warehouse following dimensional modeling best practices.

Tools: AWS S3, AWS Glue (PySpark), AWS IAM, Snowflake
Pipeline Type: Batch ETL
Focus: Cloud data engineering, data lake architecture, and analytics-ready modeling

Project Architecture

The following diagram illustrates the complete data flow from raw source files to analytics-ready tables in Snowflake.

High-level flow:

Raw CSV files sourced from Kaggle
Data stored in Amazon S3 (Raw Zone)
Transformation using AWS Glue ETL jobs
Curated data written back to S3 in Parquet format (Processed Zone)
Data loaded into Snowflake for analytics and reporting

Dataset Source

The dataset used in this project comes from Kaggle and represents e-commerce transactions, sessions, events, and customer interactions.

Steps I Followed

1. Amazon S3 Setup (Raw and Processed Zones)

An Amazon S3 bucket was created to act as the data lake, with separate folders for raw and processed data.

2. AWS Glue IAM Configuration

An IAM role was created to allow AWS Glue to read from the raw S3 zone and write transformed data to the processed zone.

3. AWS Glue ETL Jobs (PySpark)

A separate AWS Glue job was created for each table. The screenshot below shows the Glue job execution for the customers table as an example.

After transformation, the data is stored in Parquet format in the processed zone.

4. Snowflake Warehouse and Integration Setup

A Snowflake virtual warehouse was created to support data loading and querying.

A storage integration was then configured to securely connect Snowflake with Amazon S3.

An IAM role was created in AWS to allow Snowflake read access to the processed S3 data.

5. Snowflake External Stages and Tables

External stages were created in Snowflake to reference the processed Parquet files in S3.

Dimension and fact tables were then created following a star schema design.

Data was loaded into Snowflake using COPY INTO.

Once loaded, the data was validated using SELECT queries.

6. Data Quality Checks

Basic data quality validations were performed to ensure data integrity.

Primary key validation:

Sanity check (orders vs order items):

7. Analytics Views

Two analytical views were created to demonstrate business insights.

Daily revenue view:

Top products by revenue:

Project Structure

aws-snowflake/
│
├── data/
│   └── raw/                         
│       ├── customers.csv
│       ├── products.csv
│       ├── orders.csv
│       ├── order_items.csv
│       ├── sessions.csv
│       ├── events.csv
│       └── reviews.csv
│
├── glue/                            
│   ├── customers/
│   │   └── glue_customers_etl.py
│   ├── products/
│   │   └── glue_products_etl.py
│   ├── orders/
│   │   └── glue_orders_etl.py
│   ├── order_items/
│   │   └── glue_order_items_etl.py
│   ├── sessions/
│   │   └── glue_sessions_etl.py
│   ├── events/
│   │   └── glue_events_etl.py
│   └── reviews/
│       └── glue_reviews_etl.py
│
├── snowflake/                       
│   ├── 00_setup/                    
│   ├── 01_integration/              
│   ├── 02_stages/                   
│   ├── 03_tables/                   
│   ├── 04_loads/                    
│   ├── 05_validation/               
│   └── 06_views/                    
│
├── images/                          
│
├── .gitignore
└── README.md

Scripts Used

AWS Glue (PySpark)
ETL logic for each table is implemented using AWS Glue jobs written in Python. These scripts handle data cleaning, type casting, and transformation from CSV to Parquet.
Snowflake (SQL)
SQL scripts are used to create warehouses, storage integrations, external stages, dimension and fact tables, data quality validations, and analytics views.

Learnings

Designing cloud-native ETL pipelines using AWS services.
Implementing data lake architecture with raw and processed zones.
Writing PySpark ETL jobs with AWS Glue.
Integrating Snowflake with Amazon S3 using secure storage integrations.
Applying dimensional modeling and validating data quality.

Time Spent

~6 hours (design, implementation, debugging, and documentation).

How to Reproduce

Clone this repository.
Upload raw CSV files to the S3 raw zone.
Run AWS Glue jobs to process the data.
Configure Snowflake storage integration and external stages.
Execute Snowflake SQL scripts to create tables, load data, and generate views.

Credits

This project was completed as a self-directed cloud data engineering portfolio project.

Dataset source: Kaggle – E-commerce Transactions and Clickstream Data

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

End-to-End E-commerce ETL Pipeline using AWS Glue, Amazon S3 and Snowflake

Ruben Encinas Moreno

Project Overview

Project Architecture

Dataset Source

Steps I Followed

1. Amazon S3 Setup (Raw and Processed Zones)

2. AWS Glue IAM Configuration

3. AWS Glue ETL Jobs (PySpark)

4. Snowflake Warehouse and Integration Setup

5. Snowflake External Stages and Tables

6. Data Quality Checks

7. Analytics Views

Project Structure

Scripts Used

Learnings

Time Spent

How to Reproduce

Credits

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
data/raw		data/raw
glue		glue
images		images
snowflake		snowflake
.gitignore		.gitignore
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

End-to-End E-commerce ETL Pipeline using AWS Glue, Amazon S3 and Snowflake

Ruben Encinas Moreno

Project Overview

Project Architecture

Dataset Source

Steps I Followed

1. Amazon S3 Setup (Raw and Processed Zones)

2. AWS Glue IAM Configuration

3. AWS Glue ETL Jobs (PySpark)

4. Snowflake Warehouse and Integration Setup

5. Snowflake External Stages and Tables

6. Data Quality Checks

7. Analytics Views

Project Structure

Scripts Used

Learnings

Time Spent

How to Reproduce

Credits

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages