A production-grade streaming data platform for processing Formula 1 telemetry in real-time. Built with Kafka, Spark Structured Streaming, Delta Lake, dbt, and Great Expectations.
┌─────────────────┐ ┌──────────────────┐ ┌───────────────────────┐
│ Telemetry │ │ Spark Streaming │ │ Delta Lake │
│ Sources │────▶│ Processing │───▶│ Storage │
│ (50 cars) │ │ (Enrich/Aggr) │ │ (Bronze/Silver/Gold)│
└─────────────────┘ └──────────────────┘ └───────────────────────┘
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Kafka │ │ dbt │
│ Re-stream │ │ Gold │
│ (Enriched) │ │ Models │
└──────────────┘ └──────────────┘
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Prometheus │ │ Grafana │
│ + Grafana │ │ Dashboards │
└─────────────┘ └─────────────┘
- Real-time Processing: Sub-second latency streaming pipeline processing 10K+ events/sec
- Medallion Architecture: Bronze (raw) → Silver (enriched) → Gold (aggregated) layers
- Data Quality: Great Expectations validation with quarantine policy for failed records
- Schema Governance: Protobuf + JSON Schema with schema registry integration
- Exactly-Once Semantics: Checkpoint-based recovery with idempotent writes
- Observability: Prometheus metrics + Grafana dashboards for monitoring
| Component | Technology | Purpose |
|---|---|---|
| Messaging | Apache Kafka 7.5 | Event streaming backbone |
| Processing | Spark Structured Streaming 3.4 | Real-time stream processing |
| Storage | Delta Lake 2.4 | ACID storage layer |
| Analytics | dbt 1.7 | Transformation and gold layer |
| Quality | Great Expectations 0.18 | Data validation |
| Orchestration | Docker Compose | Container management |
| Monitoring | Prometheus + Grafana | Observability |
- Docker 24.0+
- Docker Compose 2.20+
- 16GB RAM available
# Start all services
make up
# Wait for services to be healthy
make wait
# Initialize schemas and seed test data
make init
# View logs
make logs
# Stop services
make down| Service | URL | Credentials |
|---|---|---|
| Spark Master | http://localhost:8080 | - |
| Grafana | http://localhost:3000 | admin/admin |
| Prometheus | http://localhost:9090 | - |
| MinIO Console | http://localhost:9001 | minioadmin/minioadmin |
| Jupyter Notebook | http://localhost:8888 | - |
| Schema Registry | http://localhost:8081 | - |
f1-telemetry/
├── proto/ # Protocol Buffer schemas
│ └── telemetry_event.proto
├── schemas/ # JSON Schemas
│ └── telemetry_event.json
├── spark/ # Spark streaming jobs
│ └── telemetry_stream.py
├── sql/ # Delta Lake DDL
│ └── delta_ddl.sql
├── docker/ # Docker Compose config
│ └── docker-compose.yml
├── dbt/ # dbt analytics engineering
│ ├── models/
│ │ └── silver/
│ ├── profiles.yml
│ └── dbt_project.yml
├── expectations/ # Great Expectations suites
│ └── telemetry_suite.py
├── monitoring/ # Observability configs
│ ├── prometheus.yml
│ └── grafana/
├── data/ # Local Delta Lake storage
├── tests/ # Test suites
├── Makefile # Orchestration
├── .env.example # Environment template
└── README.md
| Event Type | Frequency | Description |
|---|---|---|
position |
10Hz | GPS coordinates, speed, heading |
dynamics |
20Hz | Throttle, brake, gear, G-forces |
power_unit |
50Hz | Engine RPM, ERS, temperatures |
tire |
5Hz/corner | Pressure, temperature, wear |
aerodynamics |
10Hz | DRS, wing angles, brake balance |
system |
On-change | Errors, warnings, flags |
Bronze (bronze.telemetry_events)
- Raw events as received from Kafka
- JSON payload for schema flexibility
- Partitioned by:
session_date,event_type
Silver (silver.telemetry_enriched)
- Deduplicated, validated, flattened
- Denormalized session context
- Partitioned by:
session_date,car_number
Gold (gold.lap_metrics, gold.session_metrics)
- Aggregated metrics per lap/session
- Materialized views for downstream consumers
Copy .env.example to .env and adjust:
# Kafka
KAFKA_BOOTSTRAP_SERVERS=kafka:9092
# Spark
SPARK_CHECKPOINT_PATH=/data/checkpoints
SPARK_SHUFFLE_PARTITIONS=20
# Delta Lake
DELTA_STORAGE_PATH=/data
# dbt
DBT_TARGET=prod| Parameter | Default | Description |
|---|---|---|
maxOffsetsPerTrigger |
10000 | Max events per micro-batch |
triggerInterval |
10s | Micro-batch frequency |
checkpointInterval |
30s | Checkpoint save frequency |
# Inside Spark container
spark-submit \
--master spark://spark-master:7077 \
--packages io.delta:delta-core_2.12:2.4.0 \
/app/telemetry_stream.py# In dbt container
dbt deps
dbt run --select silver
dbt run --select gold# Run Great Expectations
python -m great_expectations checkpoint run telemetry_checkpoint| Metric | Description | Alert Threshold |
|---|---|---|
streaming_lag |
Kafka-to-processing lag | > 5 seconds |
processing_latency |
Event processing time | > 1 second |
validation_failure_rate |
Failed validation % | > 1% |
checkpoint_age |
Time since last checkpoint | > 60 seconds |
- Stream Health: Processing rate, lag, errors
- Data Quality: Validation pass rates, quarantine metrics
- System Resources: CPU, memory, disk usage
make testmake lintmake clean # Remove all data and containers
make reset # Rebuild containers- Better replay capabilities
- Exactly-once semantics with transactions
- Larger ecosystem and tooling
- Unified batch/streaming API
- Better Delta Lake integration
- More mature ecosystem
- Native Spark integration
- ACID transactions
- Time travel capabilities
- Bronze:
session_date/event_type- optimize for raw event access - Silver:
session_date/car_number- optimize for car-centric queries - Gold:
session_date- optimize for aggregate queries
- Horizontal scaling: Add Spark executors
- Partition scaling: Increase Kafka partitions
- Storage scaling: Tier to S3/ADLS
- Checkpoint every 30 seconds
- Minimum 3x Kafka replication
- Dead letter queue for failed records
- Enable Kafka SASL authentication
- Encrypt Delta Lake at rest
- RBAC for dbt models
MIT License
Contributions welcome. Please open an issue to discuss changes.