An end-to-end data engineering project that collects survey data from a web application, processes it through an automated pipeline, and visualizes insights using Power BI.
This project demonstrates a modern data engineering workflow:
- Data is collected from a web-based survey application (HTML, CSS, JS, FastAPI)
- Stored in Supabase (PostgreSQL)
- Extracted and processed using Apache Airflow
- Stored in Amazon S3 (Data Lake)
- Queried using Amazon Athena
- Visualized in Power BI over an ODBC connection
Frontend (HTML/CSS/JS + FastAPI) -> Supabase (PostgreSQL) -> Apache Airflow (ETL Pipeline) -> Amazon S3 (Raw + Processed Data) -> Amazon Athena (SQL Queries) -> Power BI (Dashboard)
- Frontend: HTML, CSS, JavaScript
- Backend: FastAPI
- Database: Supabase (PostgreSQL)
- Orchestration: Apache Airflow (Docker)
- Cloud Storage: Amazon S3
- Query Engine: Amazon Athena
- Visualization: Power BI (ODBC)
- Language and libraries: Python (Pandas, SQLAlchemy, Boto3)
- Automated data ingestion using Airflow DAGs
- Multi-stage ETL pipeline (Extract, Transform, Load)
- Data cleaning and preprocessing using Pandas
- Storage of raw and processed data in S3
- SQL-based querying using Athena
- Interactive dashboards in Power BI
- Secure credential management using a .env file
- Extract
  - Fetch data from Supabase using SQLAlchemy
- Transform
  - Clean data using Pandas
- Load
  - Store raw data in S3 (raw layer)
  - Store processed data in S3 (processed layer)
- Catalog
  - AWS Glue Crawler automatically infers schema and updates the Data Catalog
- Query
  - Amazon Athena queries data using Glue Data Catalog
- Visualize
  - Power BI connects to Athena via ODBC
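
The Extract, Transform, and Load steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual DAG code: the column names in `FIELDS`, the `survey_<date>.csv` key layout, and the `extract`/`upload` callables are all assumptions made for the example. It also uses only the standard library `csv` module in place of Pandas so that it runs without dependencies; in the real pipeline, `extract` would wrap a SQLAlchemy query against Supabase and `upload` would wrap a Boto3 S3 `put_object` call.

```python
import csv
import io
from datetime import date

# Hypothetical column names; the real survey schema is not shown in this README.
FIELDS = ["name", "age", "city", "profession", "salary"]


def transform(rows):
    """Clean raw survey rows: trim text, drop incomplete rows, cast numbers."""
    cleaned = []
    for row in rows:
        values = {}
        for key in FIELDS:
            raw = row.get(key)
            values[key] = "" if raw is None else str(raw).strip()
        if not all(values.values()):
            continue  # skip rows with a missing field
        try:
            values["age"] = int(values["age"])
            values["salary"] = float(values["salary"])
        except ValueError:
            continue  # skip rows whose numeric fields do not parse
        values["city"] = values["city"].title()
        values["profession"] = values["profession"].title()
        cleaned.append(values)
    return cleaned


def to_csv(rows):
    """Serialize a list of row dicts to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def layer_key(layer, run_date):
    """Build an object key such as 'raw/survey_2024-01-05.csv' (assumed layout)."""
    return f"{layer}/survey_{run_date.isoformat()}.csv"


def run_pipeline(extract, upload, run_date):
    """Write the untouched extract to the raw layer, then the cleaned rows."""
    raw_rows = extract()
    upload(layer_key("raw", run_date), to_csv(raw_rows))
    processed = transform(raw_rows)
    upload(layer_key("processed", run_date), to_csv(processed))
    return len(raw_rows), len(processed)
```

Writing the raw extract before cleaning keeps an unmodified copy in the raw layer, so the transform can be rerun later without querying Supabase again.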
- Analyze average salary by profession
- Distribution of users across cities
- Age-based demographic analysis
- Survey trend insights
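
As an illustration of the first analysis above, the query below computes the average salary per profession. The table name `survey_processed` and its columns are hypothetical, since the actual Glue table schema is not listed here. The SQL uses only standard aggregates, so it should carry over to Athena largely unchanged; the sketch runs it against an in-memory SQLite database purely so the example is self-contained.

```python
import sqlite3

# Hypothetical table and column names; adjust to the real Glue Data Catalog table.
AVG_SALARY_BY_PROFESSION = """
SELECT profession,
       AVG(salary) AS avg_salary,
       COUNT(*)    AS responses
FROM survey_processed
GROUP BY profession
ORDER BY avg_salary DESC
"""


def average_salary_by_profession(rows):
    """Run the aggregate on (profession, city, age, salary) tuples via SQLite."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE survey_processed "
            "(profession TEXT, city TEXT, age INTEGER, salary REAL)"
        )
        conn.executemany(
            "INSERT INTO survey_processed VALUES (?, ?, ?, ?)", rows
        )
        return conn.execute(AVG_SALARY_BY_PROFESSION).fetchall()
    finally:
        conn.close()
```

The city distribution and age-based breakdowns would follow the same pattern, grouping by `city` or by an age bucket instead of `profession`.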
- Credentials managed using a .env file
- IAM users used for controlled AWS access
- Separation of ingestion and analytics roles
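
One way the .env handling could look is sketched below. This is an assumption about the approach, not the project's code: the helper names `load_env_file` and `require` are made up for the example, and `SUPABASE_DB_URL` is a hypothetical variable name. Many projects use the `python-dotenv` package for this instead; the hand-written parser here only avoids the extra dependency.

```python
import os


def load_env_file(path, environ=None):
    """Read KEY=VALUE lines from a .env file into environ (os.environ by default).

    Values already present in environ are kept, so real environment
    variables take precedence over the file.
    """
    environ = os.environ if environ is None else environ
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            environ.setdefault(key.strip(), value.strip().strip("'\""))
    return environ


def require(name, environ=None):
    """Return a setting or fail loudly, so a missing secret is caught at startup."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required setting: {name}")
    return value
```

A typical call would be `load_env_file(".env")` followed by `require("SUPABASE_DB_URL")`. The .env file itself should be listed in `.gitignore` so that credentials never reach the repository.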
- Convert CSV to Parquet for optimized queries
- Implement incremental data loading
- Add partitioning for better performance
- Integrate Apache Spark for large-scale processing
- Automate Power BI refresh
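
Two of the planned improvements above, partitioning and incremental loading, could be approached along these lines. None of this exists in the project yet: the `year=/month=/day=` key layout, the `submitted_at` field, and both function names are assumptions for illustration. The `key=value` path style is the Hive-style convention that Athena can use for partition pruning.

```python
from datetime import datetime


def partitioned_key(prefix, ts, filename="survey.parquet"):
    """Build a Hive-style partitioned object key from a timestamp."""
    return f"{prefix}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/{filename}"


def new_rows_since(rows, watermark, ts_field="submitted_at"):
    """Keep only rows newer than the last processed timestamp (the watermark).

    Returns the fresh rows and the watermark to store for the next run.
    """
    fresh = [row for row in rows if row[ts_field] > watermark]
    new_watermark = max((row[ts_field] for row in fresh), default=watermark)
    return fresh, new_watermark
```

Each pipeline run would then load only rows submitted after the stored watermark and write them under the partition for their date, rather than re-exporting the full table.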
This project demonstrates:
- End-to-end data pipeline design
- Cloud data engineering (AWS)
- Workflow orchestration (Airflow)
- Data visualization (Power BI)
- Real-world system integration
Maneesh Karlapudi
- GitHub: https://github.com/maneesh6531
- LinkedIn: https://www.linkedin.com/in/maneeshkarlapudi
This project was built as part of hands-on learning in data engineering and cloud-based analytics systems.