A production-style e-commerce metrics layer built with dbt and BigQuery on the Olist Brazil dataset from Kaggle. The project demonstrates a full analytics engineering workflow: raw data ingestion, layered dbt transformations, a MetricFlow semantic layer, automated CI/CD, and a self-hosted Metabase dashboard.
- Architecture
- Project Structure
- Data Model
- Metrics Defined
- Dashboards
- Tech Stack
- Getting Started
- CI/CD
- Known Data Quality Issues
- Raw CSVs (Kaggle) → downloaded manually into
data/ - Python ingestion (
scripts/load_to_bq.py) → loads toolist_rawin BigQuery - dbt Staging → views with renaming, casting, timezone conversion
- dbt Intermediate → joins, aggregations, derived fields
- dbt Marts → domain fact and dimension tables materialized as tables
- MetricFlow → semantic layer with metric definitions
- Metabase → self-hosted dashboards via Docker on
olist_prod
dbt-olist-metrics/
├── .github/workflows/ci.yml
├── docs/screenshots/
├── olist_metrics/
│ ├── models/
│ │ ├── staging/
│ │ ├── intermediate/
│ │ ├── marts/
│ │ │ ├── core/
│ │ │ ├── finance/
│ │ │ ├── marketing/
│ │ │ ├── operations/
│ │ │ ├── product/
│ │ │ └── ml/
│ │ └── metrics/
│ ├── tests/
│ ├── seeds/
│ └── dbt_project.yml
├── scripts/load_to_bq.py
├── docker-compose.yml
└── pyproject.toml
| Table | Description |
|---|---|
orders |
One row per order, central fact |
order_items |
One row per item within an order |
order_payments |
Payment transactions per order |
order_reviews |
Customer reviews per order |
customers |
Customer dimension |
products |
Product dimension |
sellers |
Seller dimension |
geolocation |
ZIP code to lat/lng mapping |
product_category_name_translation |
Portuguese to English category mapping |
All staging models are materialized as views. Transformations applied:
- Column renaming and type casting
- Financial columns cast to
NUMERICrounded to 2 decimals - Timestamps converted to
DATETIMEinAmerica/Sao_Paulotimezone - String columns lowercased for consistency
freight_valuerenamed toshipping_value- Typos fixed (
product_name_lenght→product_name_length)
| Model | Description | Grain |
|---|---|---|
int_orders_with_payments |
Orders + aggregated payment info | order |
int_orders_with_items |
Orders + item/freight aggregates | order |
int_orders_with_reviews |
Orders + review scores | order |
int_orders_enriched |
Central enriched order table joining all above | order |
int_customers_with_orders |
Customer order history aggregates | customer |
int_products_with_category |
Products + English category translation | product |
| Model | Description | Grain |
|---|---|---|
fct_orders |
Central order fact table | order |
dim_customers |
Customer dimension with RFM segmentation | customer |
dim_products |
Product dimension with physical attributes | product |
| Model | Description | Grain |
|---|---|---|
fct_revenue |
Order-level revenue, excludes canceled orders | order |
fct_payments |
Payment transaction fact table | payment transaction |
| Model | Description | Grain |
|---|---|---|
fct_customer_orders |
Customer order history with RFM scores | customer |
fct_cac_payback |
Monthly CAC and payback period (synthetic spend data) | month |
| Model | Description | Grain |
|---|---|---|
fct_deliveries |
Delivery performance per order | order |
fct_seller_performance |
Seller metrics by month | seller + month |
| Model | Description | Grain |
|---|---|---|
fct_category_performance |
Revenue and delivery metrics by category | category + month |
fct_order_reviews |
Order-level review metrics | month + state |
fct_product_reviews |
Product-level reviews (single-product orders only) | product + month |
| Model | Description | Grain |
|---|---|---|
fct_customer_features |
Customer features for churn/LTV models | customer |
fct_order_features |
Order features for delivery delay prediction | order |
fct_product_features |
Product features for recommendation models | product |
Four high-volume fact tables use incremental materialization with a 90-day lookback window to capture late-arriving data (status updates, reviews, delivery timestamps):
| Model | Partition | Cluster |
|---|---|---|
fct_orders |
order_purchase_timestamp (day) |
customer_state, order_status, delivery_status |
fct_revenue |
order_date (day) |
customer_state, order_status |
fct_deliveries |
order_purchase_timestamp (day) |
delivery_status, customer_state, order_status |
fct_payments |
order_date (day) |
customer_state, payment_type |
First run requires --full-refresh to apply partition and cluster configuration. Subsequent runs merge only the last 90 days.
Metrics are defined using MetricFlow on top of the marts layer.
| Metric | Definition |
|---|---|
gmv |
Sum of total_payment_value on non-canceled orders |
order_count |
Total number of orders placed |
avg_order_value |
Average total_payment_value per order |
customer_count |
Total unique customers |
repeat_customer_count |
Customers with more than one order |
repeat_purchase_rate |
repeat_customer_count / customer_count |
conversion_rate |
Delivered orders / total orders |
late_delivery_rate |
Late deliveries / total delivered orders |
Dashboards are built in self-hosted Metabase (Docker) on top of olist_prod BigQuery tables.
| Tool | Purpose |
|---|---|
| dbt Core | Data transformation and testing |
| BigQuery | Cloud data warehouse (GCP) |
| Python | Data ingestion (google-cloud-bigquery, pandas) |
| MetricFlow | Semantic layer and metric definitions |
| Metabase | Self-hosted BI dashboards |
| Docker | Metabase containerization |
| GitHub Actions | CI/CD pipeline |
| Poetry | Python dependency management |
- Python 3.11+
- Poetry
- Docker Desktop
- GCP project with BigQuery enabled
gcloudauthenticated locally (gcloud auth application-default login)
git clone git@github.com:yourusername/dbt-olist-metrics.git
cd dbt-olist-metricspoetry installCreate a .env file at the root:
GCP_PROJECT_ID=your-gcp-project-id GCP_DATASET_ID=olist_raw GCP_LOCATION=europe-west1
Download the Olist dataset from Kaggle and place the CSV files in the data/ folder. Then run:
poetry run python scripts/load_to_bq.pyAdd the following to ~/.dbt/profiles.yml:
olist_metrics:
outputs:
dev:
type: bigquery
method: oauth
project: your-gcp-project-id
dataset: olist_dev
location: europe-west1
threads: 4
target: devcd olist_metrics
poetry run dbt deps
poetry run dbt seed
poetry run dbt builddocker compose up -dOpen http://localhost:3000 and connect to BigQuery using your service account credentials, pointing to the olist_prod dataset.
GitHub Actions runs dbt build --target prod on every PR and push to main. The pipeline:
- Sets up Python 3.11 and Poetry
- Installs project dependencies
- Configures dbt profiles using GitHub secrets
- Runs
dbt depsto install dbt packages - Runs
dbt seedto load seed data - Runs
dbt buildto build all models and run all tests againstolist_prod
Two GitHub secrets are required:
GCP_PROJECT_IDGCP_SERVICE_ACCOUNT_KEY
| Issue | Location | Handling |
|---|---|---|
customer_unique_id maps to multiple customer_id values with different zip codes |
olist_raw.customers |
Grouped by customer_unique_id in int_customers_with_orders, arbitrary zip kept via max() |
| ~100 rows with unescaped quotes in review comments | olist_raw.order_reviews |
Loaded with allow_quoted_newlines=True |
| 11947 geolocation coordinates outside Brazil's bounding box | olist_raw.geolocation |
Singular test with severity: warn, geolocation not used in any mart |
| 1 order with no payment record | olist_raw.order_payments |
not_null test on total_payment_value set to severity: warn |
| 8 delivered orders with missing delivery timestamps | olist_raw.orders |
ML feature tests set to severity: warn |
| Products with no English category translation | product_category_name_translation |
not_null test on product_category_name_english set to severity: warn |
| Marketing spend data is synthetic | seeds/marketing_spend.csv |
Documented in model comments and README |



