This project demonstrates the design and implementation of an end-to-end batch ETL data pipeline for an e-commerce clickstream and transactions dataset, using AWS Glue, Amazon S3, and Snowflake.
The objective of the project is to ingest raw CSV data, transform it into analytics-ready Parquet format, and load it into a cloud data warehouse following dimensional modeling best practices.
- Tools: AWS S3, AWS Glue (PySpark), AWS IAM, Snowflake
- Pipeline Type: Batch ETL
- Focus: Cloud data engineering, data lake architecture, and analytics-ready modeling
The following diagram illustrates the complete data flow from raw source files to analytics-ready tables in Snowflake.
High-level flow:
- Raw CSV files sourced from Kaggle
- Data stored in Amazon S3 (Raw Zone)
- Transformation using AWS Glue ETL jobs
- Curated data written back to S3 in Parquet format (Processed Zone)
- Data loaded into Snowflake for analytics and reporting
The dataset used in this project comes from Kaggle and represents e-commerce transactions, sessions, events, and customer interactions.
An Amazon S3 bucket was created to act as the data lake, with separate folders for raw and processed data.
An IAM role was created to allow AWS Glue to read from the raw S3 zone and write transformed data to the processed zone.
A separate AWS Glue job was created for each table. The screenshot below shows the Glue job execution for the customers table as an example.
After transformation, the data is stored in Parquet format in the processed zone.
A Snowflake virtual warehouse was created to support data loading and querying.
A storage integration was then configured to securely connect Snowflake with Amazon S3.
An IAM role was created in AWS to allow Snowflake read access to the processed S3 data.
External stages were created in Snowflake to reference the processed Parquet files in S3.
Dimension and fact tables were then created following a star schema design.
Data was loaded into Snowflake using COPY INTO.
Once loaded, the data was validated using SELECT queries.
Basic data quality validations were performed to ensure data integrity.
Primary key validation:
Sanity check (orders vs order items):
Two analytical views were created to demonstrate business insights.
Daily revenue view:
Top products by revenue:
aws-snowflake/
│
├── data/
│ └── raw/
│ ├── customers.csv
│ ├── products.csv
│ ├── orders.csv
│ ├── order_items.csv
│ ├── sessions.csv
│ ├── events.csv
│ └── reviews.csv
│
├── glue/
│ ├── customers/
│ │ └── glue_customers_etl.py
│ ├── products/
│ │ └── glue_products_etl.py
│ ├── orders/
│ │ └── glue_orders_etl.py
│ ├── order_items/
│ │ └── glue_order_items_etl.py
│ ├── sessions/
│ │ └── glue_sessions_etl.py
│ ├── events/
│ │ └── glue_events_etl.py
│ └── reviews/
│ └── glue_reviews_etl.py
│
├── snowflake/
│ ├── 00_setup/
│ ├── 01_integration/
│ ├── 02_stages/
│ ├── 03_tables/
│ ├── 04_loads/
│ ├── 05_validation/
│ └── 06_views/
│
├── images/
│
├── .gitignore
└── README.md-
AWS Glue (PySpark)
ETL logic for each table is implemented using AWS Glue jobs written in Python. These scripts handle data cleaning, type casting, and transformation from CSV to Parquet. -
Snowflake (SQL)
SQL scripts are used to create warehouses, storage integrations, external stages, dimension and fact tables, data quality validations, and analytics views.
- Designing cloud-native ETL pipelines using AWS services.
- Implementing data lake architecture with raw and processed zones.
- Writing PySpark ETL jobs with AWS Glue.
- Integrating Snowflake with Amazon S3 using secure storage integrations.
- Applying dimensional modeling and validating data quality.
~6 hours (design, implementation, debugging, and documentation).
- Clone this repository.
- Upload raw CSV files to the S3 raw zone.
- Run AWS Glue jobs to process the data.
- Configure Snowflake storage integration and external stages.
- Execute Snowflake SQL scripts to create tables, load data, and generate views.
This project was completed as a self-directed cloud data engineering portfolio project.
Dataset source: Kaggle – E-commerce Transactions and Clickstream Data



















