This document explains the XGBoost lead scoring pipeline: what it predicts, how it's trained, how MLflow tracks experiments, how the FastAPI endpoint serves predictions, and the current state of the n8n automation integration.
- What the model does
- Input features
- Training pipeline
- MLflow experiment tracking
- Saved model artifacts
- FastAPI scoring endpoint
- n8n lead routing automation
- How to retrain
- How to view experiment results
- Current status and roadmap
The lead scoring model predicts the probability that a website visitor will convert into a closed-won deal (is_won).
This probability — the "lead score" — is used in two ways:
- Dashboards: The Streamlit app and HTML dashboards surface the score as a KPI alongside channel attribution and pipeline metrics
- Automation: The n8n workflow calls the FastAPI endpoint and routes high-score leads (score > 0.8) to the sales team via Slack in real time
The model is intentionally lightweight — four behavioural features, one training script, one JSON file. The goal is a working ML-in-the-loop pipeline, not a production model requiring feature stores.
Features are derived from fct_lead_scoring_features (a dbt mart table) and exported to data/lead_scoring_features.csv.
| Feature | Type | Description |
|---|---|---|
| `sessions` | int | Total GA4 sessions for this visitor |
| `engaged_sessions` | int | Sessions where engagement time > 10 seconds |
| `page_views` | int | Total pages viewed across all sessions |
| `is_first_visit` | bool (0/1) | Whether this is the visitor's first recorded session |
Target variable: is_won — binary (1 = contact became a closed-won deal, 0 = otherwise)
Training data: 93,000 rows sourced from the Olist e-commerce dataset joined with synthetic marketing engagement data.
The feature set is defined in dbt_project/models/metrics/metrics.yml under the leads semantic model. If you add new features to the dbt model, update the feature list in both ml/src/train.py and api/main.py.
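As an illustration of the feature contract described above, here is a minimal sketch of reading the four features and the target from the exported CSV. It assumes the file has a header row with exactly these column names and that `is_first_visit` is stored as 0/1. The `load_features` helper is hypothetical and not part of the repo; the real `ml/src/train.py` may load the data differently.

```python
import csv
import io

# Column names taken from the feature table above; the CSV layout is an assumption.
FEATURES = ["sessions", "engaged_sessions", "page_views", "is_first_visit"]
TARGET = "is_won"


def load_features(fileobj):
    """Read feature rows and labels from an open CSV file object.

    Returns (X, y): X is a list of float lists in FEATURES order,
    y is a list of 0/1 integers.
    """
    reader = csv.DictReader(fileobj)
    missing = [c for c in FEATURES + [TARGET] if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"CSV is missing columns: {missing}")
    X, y = [], []
    for row in reader:
        X.append([float(row[name]) for name in FEATURES])
        y.append(int(row[TARGET]))
    return X, y


# Tiny in-memory stand-in for data/lead_scoring_features.csv (made-up values).
sample = io.StringIO(
    "sessions,engaged_sessions,page_views,is_first_visit,is_won\n"
    "5,3,12,0,1\n"
    "1,0,1,1,0\n"
)
X, y = load_features(sample)
```

Failing fast on a missing column is one way to catch the case where a feature was added to the dbt model but not to the training or API code.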
Script: ml/src/train.py

    data/lead_scoring_features.csv
            │
            ▼
    train_test_split (80 / 20, random_state=42)
            │
            ▼
    XGBClassifier fit
            │
            ├──► MLflow run (logs params + metrics + model artifact)
            │
            └──► ml/lead_scoring_model.json (local copy for the API)

XGBoost hyperparameters:
| Parameter | Value |
|---|---|
| `max_depth` | 5 |
| `learning_rate` | 0.1 |
| `n_estimators` | 100 |
| `objective` | binary:logistic |
| `eval_metric` | auc |
| `random_state` | 42 |
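The split-and-fit steps with these hyperparameters could look roughly like the sketch below. It is not the contents of `ml/src/train.py`: the `train` function, its arguments, and the use of pandas are assumptions. The heavy dependencies are imported inside the function so the constants can be reused without them installed.

```python
FEATURES = ["sessions", "engaged_sessions", "page_views", "is_first_visit"]
TARGET = "is_won"

# Hyperparameters copied from the table above.
PARAMS = {
    "max_depth": 5,
    "learning_rate": 0.1,
    "n_estimators": 100,
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "random_state": 42,
}


def train(csv_path="data/lead_scoring_features.csv",
          model_path="ml/lead_scoring_model.json"):
    """Fit an XGBClassifier on an 80/20 split and save it as native JSON."""
    # Imported lazily: only needed when training actually runs.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    df = pd.read_csv(csv_path)
    X_train, X_test, y_train, y_test = train_test_split(
        df[FEATURES], df[TARGET], test_size=0.2, random_state=42
    )
    model = XGBClassifier(**PARAMS)
    model.fit(X_train, y_train)
    # The .json suffix makes XGBoost write its native JSON format.
    model.save_model(model_path)
    return model, X_test, y_test
```

Returning the held-out split alongside the model keeps evaluation separate from fitting, so the same test rows can be scored and logged afterwards.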
Evaluation metrics logged per run:
| Metric | Description |
|---|---|
| `accuracy` | Fraction of correct predictions |
| `precision` | True positives / (true positives + false positives) |
| `recall` | True positives / (true positives + false negatives) |
| `f1` | Harmonic mean of precision and recall |
| `auc` | Area under the ROC curve; primary model selection metric |
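To make the definitions in the table concrete, the sketch below computes the four threshold-based metrics from confusion-matrix counts. The function name and the counts are made up for illustration and are not results from the real model. AUC is left out because it depends on the ranking of predicted probabilities rather than on counts; the training script presumably relies on a library such as scikit-learn for all five.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative counts only.
metrics = classification_metrics(tp=30, fp=10, tn=50, fn=10)
```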
Tracking URI: file://ml/mlflow/ (local file-based, no server required)
Experiment name: Lead Scoring Optimization
Every call to python ml/src/train.py creates a new MLflow run under this experiment. MLflow records:
- Parameters: all XGBoost hyperparameters
- Metrics: accuracy, precision, recall, F1, AUC on the test set
- Artefact: the trained XGBoost model (logged via `mlflow.xgboost.log_model`)
- Metadata: run name, source type, git commit hash
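The logging steps listed above could be sketched as follows. The `log_run` helper is hypothetical, and the caller is assumed to pass the same tracking URI the project uses. The `lead_scoring_xgb_model` folder name comes from the tracking directory structure in this section. Newer MLflow releases prefer a `name` argument over `artifact_path` in `log_model`, so treat the keyword as version-dependent.

```python
EXPERIMENT_NAME = "Lead Scoring Optimization"
ARTIFACT_PATH = "lead_scoring_xgb_model"


def log_run(model, params, metrics, tracking_uri):
    """Record one training run: hyperparameters, test metrics, model artefact.

    Returns the MLflow run ID.
    """
    # Imported lazily so this sketch can be imported without MLflow installed.
    import mlflow
    import mlflow.xgboost

    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(EXPERIMENT_NAME)
    with mlflow.start_run() as run:
        mlflow.log_params(params)    # e.g. max_depth, learning_rate
        mlflow.log_metrics(metrics)  # e.g. accuracy, precision, recall, f1, auc
        mlflow.xgboost.log_model(model, artifact_path=ARTIFACT_PATH)
        return run.info.run_id
```

MLflow records metadata such as the source type and git commit on its own when the run starts, so the sketch does not set them explicitly.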
The tracking directory structure:

    ml/mlflow/
    └── <experiment_id>/
        └── <run_id>/
            ├── params/            # hyperparameters
            ├── metrics/           # evaluation scores
            ├── artifacts/
            │   └── lead_scoring_xgb_model/
            │       ├── MLmodel    # model metadata
            │       └── model.xgb  # serialised model
            └── tags/

After training, two artefacts are produced:
| File | Format | Used by |
|---|---|---|
| `ml/lead_scoring_model.json` | XGBoost native JSON | FastAPI scoring endpoint |
| `ml/mlflow/<experiment_id>/<run_id>/artifacts/` | MLflow artefact store | MLflow UI, model registry (planned) |
The JSON file is loaded by api/main.py at startup. MLflow is not needed at inference time: the API depends only on xgboost and fastapi.
Location: api/main.py
Start the server:

    cd api && uvicorn main:app --port 8000 --reload

Endpoint: POST /score
Request body:

    {
      "sessions": 5,
      "engaged_sessions": 3,
      "page_views": 12,
      "channel": "Paid Search",
      "is_first_visit": false
    }

Response:

    {
      "score": 0.84,
      "label": "high",
      "model_version": "1.0"
    }

| Field | Description |
|---|---|
| `score` | Float 0–1, probability of is_won = 1 |
| `label` | "high" if score > 0.8, otherwise "low" |
| `model_version` | Version string from the model JSON |
The channel field in the request is accepted but not currently used as a model feature — it is reserved for a future model version that will include channel as a categorical feature.
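A minimal sketch of how such an endpoint could be structured is shown below. It is not the actual `api/main.py`: the `create_app` factory, the helper names, and the hard-coded `MODEL_VERSION` are assumptions, and the real API is described as reading the version from the model JSON. The model is loaded once when the app is created, and `channel` is accepted in the request but dropped before scoring.

```python
HIGH_SCORE_THRESHOLD = 0.8
FEATURES = ["sessions", "engaged_sessions", "page_views", "is_first_visit"]
MODEL_VERSION = "1.0"  # placeholder for illustration only


def label_for(score):
    """Map a probability to the response label."""
    return "high" if score > HIGH_SCORE_THRESHOLD else "low"


def to_feature_row(payload):
    """Order request fields as model features; `channel` is ignored."""
    return [float(payload[name]) for name in FEATURES]


def create_app(model_path="ml/lead_scoring_model.json"):
    """Build the FastAPI app, loading the model once at creation time."""
    # Imported lazily so the helpers above work without the API dependencies.
    import numpy as np
    import xgboost as xgb
    from fastapi import FastAPI
    from pydantic import BaseModel

    class Lead(BaseModel):
        sessions: int
        engaged_sessions: int
        page_views: int
        channel: str
        is_first_visit: bool

    booster = xgb.Booster()
    booster.load_model(model_path)
    app = FastAPI()

    @app.post("/score")
    def score(lead: Lead):
        row = np.array([to_feature_row(dict(lead))], dtype=float)
        dmatrix = xgb.DMatrix(row, feature_names=FEATURES)
        # With objective binary:logistic, predict() returns probabilities.
        probability = float(booster.predict(dmatrix)[0])
        return {
            "score": probability,
            "label": label_for(probability),
            "model_version": MODEL_VERSION,
        }

    return app
```

Keeping `label_for` and `to_feature_row` as plain functions makes the threshold and the feature order testable without starting a server or loading a model.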
Interactive docs: once running, visit http://localhost:8000/docs for the auto-generated Swagger UI.
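For a quick check from a script rather than the Swagger UI, the request can be built with the standard library alone. This is a sketch under the assumption that the server is running locally on port 8000 as started above; `build_score_request` and `score_lead` are hypothetical helper names.

```python
import json
import urllib.request

SCORE_URL = "http://localhost:8000/score"


def build_score_request(lead, url=SCORE_URL):
    """Build a POST request carrying the lead as a JSON body."""
    body = json.dumps(lead).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score_lead(lead, url=SCORE_URL, timeout=5):
    """Send the lead to the scoring API and return the parsed JSON response."""
    request = build_score_request(lead, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Same payload as the request body example above.
lead = {
    "sessions": 5,
    "engaged_sessions": 3,
    "page_views": 12,
    "channel": "Paid Search",
    "is_first_visit": False,
}
request = build_score_request(lead)
# score_lead(lead) performs the actual call and needs the server to be up.
```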
Workflow file: automation/n8n_workflow.json
The n8n workflow automates the step between scoring and sales action:
Incoming webhook (new lead event)
│
▼
POST http://api:8000/score
{ sessions, engaged_sessions, page_views, channel, is_first_visit }
│
▼
score > 0.8?
├── YES → Slack #sales-notifications
│ "🚀 High Value Lead Found! Email: ... Score: ..."
└── NO → (no action — lead stays in normal nurture sequence)
Nodes:
| Node | Type | Role |
|---|---|---|
| `Contact Hook` | Webhook trigger | Receives the lead event payload |
| `ML Score API` | HTTP Request | POSTs to the FastAPI /score endpoint |
| `High Score?` | IF condition | Routes on score > 0.8 |
| `Slack Alert` | Slack | Posts to #sales-notifications with lead email and score |
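The routing logic of the four nodes can be mirrored in a few lines of Python, which is handy for reasoning about the workflow without running n8n. This is an illustration only: the webhook payload shape, the `email` field name, and the exact Slack message format are assumptions, since the workflow file is not reproduced here.

```python
THRESHOLD = 0.8
SLACK_CHANNEL = "#sales-notifications"
SCORE_FIELDS = [
    "sessions",
    "engaged_sessions",
    "page_views",
    "channel",
    "is_first_visit",
]


def build_score_payload(event):
    """ML Score API node: forward only the fields the /score endpoint expects."""
    return {field: event[field] for field in SCORE_FIELDS}


def route_lead(event, score):
    """High Score? node: alert Slack above the threshold, otherwise do nothing."""
    if score > THRESHOLD:
        return {
            "action": "slack",
            "channel": SLACK_CHANNEL,
            # Message wording is illustrative, not the workflow's template.
            "text": f"High Value Lead Found! Email: {event['email']} Score: {score:.2f}",
        }
    return {"action": "none"}  # lead stays in the normal nurture sequence


# Hypothetical webhook event for illustration.
event = {
    "email": "lead@example.com",
    "sessions": 5,
    "engaged_sessions": 3,
    "page_views": 12,
    "channel": "Paid Search",
    "is_first_visit": False,
}
payload = build_score_payload(event)
decision = route_lead(event, score=0.84)
```

Note that the comparison is strict, so a lead scoring exactly 0.8 is not routed to Slack, matching the `score > 0.8` condition.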
| Component | Status | Notes |
|---|---|---|
| Workflow definition | ✅ Complete | automation/n8n_workflow.json is fully specified |
| FastAPI endpoint | ✅ Running | POST /score is live when uvicorn is started |
| n8n runtime | | n8n must be running separately; not started by scripts/run_mlflow_server.sh |
| Slack credentials | | The Slack node requires a Slack API token and channel ID configured in n8n |
| Production URL | | The workflow uses http://api:8000/score (Docker service name). Change to your actual API URL |

    # Install and start n8n
    npx n8n
    # Open the UI at http://localhost:5678
    # Import the workflow:
    # Menu → Workflows → Import from file → automation/n8n_workflow.json

To connect to the scoring API, update the ML Score API node URL from http://api:8000/score to http://localhost:8000/score for local development.
- In n8n, go to Credentials → New → Slack API
- Add your Slack Bot OAuth token (`xoxb-...`)
- Assign the credential to the `Slack Alert` node
- Update the `channel` field to your actual Slack channel name
Run the training script after generating fresh feature data:

    # 1. Ensure features are up to date (requires DuckDB + dbt)
    python scripts/load_duckdb.py
    cd dbt_project && dbt run --target duckdb --select fct_lead_scoring_features && cd ..
    # 2. Export features to CSV (done by load_duckdb.py automatically)
    # 3. Train
    python ml/src/train.py

The script will:
- Load `data/lead_scoring_features.csv`
- Train a new XGBoost model
- Create a new MLflow run under `Lead Scoring Optimization`
- Overwrite `ml/lead_scoring_model.json`
The FastAPI server will use the new model on its next startup (it loads the JSON at init time, not per request).
Start the MLflow UI:

    bash scripts/run_mlflow_server.sh
    # or directly:
    mlflow ui --backend-store-uri file://ml/mlflow --port 5001

Open http://localhost:5001 to browse:
- All runs under the `Lead Scoring Optimization` experiment
- Side-by-side metric comparison between runs
- Parameter importance charts
- Logged model artefacts
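Runs can also be compared programmatically. The sketch below uses `mlflow.search_runs` to fetch the best runs by test AUC; the `top_runs_by_auc` helper is hypothetical, and the caller is assumed to pass the same tracking URI used for training.

```python
EXPERIMENT_NAME = "Lead Scoring Optimization"


def top_runs_by_auc(tracking_uri, limit=5):
    """Return the highest-AUC runs of the experiment as a pandas DataFrame."""
    # Imported lazily so this sketch can be imported without MLflow installed.
    import mlflow

    mlflow.set_tracking_uri(tracking_uri)
    return mlflow.search_runs(
        experiment_names=[EXPERIMENT_NAME],
        order_by=["metrics.auc DESC"],
        max_results=limit,
    )
```

The returned frame has one row per run, with logged values exposed in columns such as `metrics.auc` and `params.max_depth`.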
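Compare runs from the CLI (`mlflow runs list` selects the experiment by ID, so substitute the `<experiment_id>` directory name found under ml/mlflow/):

    mlflow runs list --experiment-id <experiment_id>

| Item | Status |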
|---|---|
| XGBoost training on 93K rows | ✅ Complete |
| MLflow local experiment tracking | ✅ Complete |
| FastAPI `/score` endpoint | ✅ Complete |
| n8n workflow definition | ✅ Complete |
| n8n ↔ Slack integration | |
| n8n ↔ HubSpot (write-back) | 🔲 Not yet implemented — planned to update contact score in CRM |
| Channel as model feature | 🔲 Planned — requires one-hot encoding in the feature pipeline |
| MLflow Model Registry | 🔲 Planned — currently uses file-based artefact store only |
| Automated retraining trigger | 🔲 Planned — would hook into daily-synthetic-data.yml GitHub Actions workflow |