This repository presents an end-to-end churn prediction project built on the KKBox dataset, with a strong emphasis on methodological rigor, temporal correctness, and clean project structure.
The goal of this work was not simply to maximize attractive headline metrics. Instead, the project was designed to build a defensible churn prediction pipeline that avoids optimistic evaluation caused by temporal leakage. Early experiments showed that random train/test splits produced unrealistically strong results. The workflow was therefore redesigned around leakage-safe, time-aware validation, and the final model was selected using the competition’s primary metric: LogLoss.
The final selected model is a tuned XGBoost classifier trained on a leakage-safe February snapshot and evaluated on a March snapshot.
For subscription-based digital platforms, churn prediction is a high-value business problem. If users at risk of leaving can be identified early, the company can intervene through retention campaigns, incentives, or targeted engagement strategies before subscription loss occurs.
In the KKBox setting, the task is to predict whether a user will churn in the following period using historical transaction, subscription, and member-related information.
This project was built to answer four practical questions:
- Can churn be predicted reliably using transaction and member data?
- How much do evaluation results change after removing temporal leakage?
- Which feature groups remain stable under a leakage-safe time-aware setup?
- Which final model gives the best trustworthy LogLoss?
The project uses the KKBox churn competition data, with the following core files:
- `train.csv`
- `train_v2.csv`
- `transactions.csv`
- `transactions_v2.csv`
- `members_v3.csv`
Additional user log data was also explored during research and experimentation. However, user-log-based features were not retained in the final selected model, because they did not produce stable improvement under leakage-safe, time-aware evaluation.
One of the most important findings in this project was that random split validation was misleading.
Early experiments using a random train/test split produced extremely strong metrics. However, this setup was not reliable for a temporally structured churn problem, because future information could leak into training features, directly or indirectly.
To address this, the project was redesigned with a time-aware validation protocol:
- `train.csv` was treated as the February snapshot
- `train_v2.csv` was treated as the March snapshot
- features were built only from information available before each cutoff date
- final validation was performed from February to March
This produced lower but much more realistic and defensible performance.
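The cutoff rule in the protocol above can be illustrated with a minimal sketch. It assumes pandas and the competition's `msno` and `transaction_date` columns; the helper name `transactions_before` and the two cutoff dates are illustrative assumptions, not values taken from this repository.

```python
import pandas as pd


def transactions_before(transactions: pd.DataFrame, cutoff: str) -> pd.DataFrame:
    """Keep only rows dated strictly before the cutoff (hypothetical helper)."""
    dates = pd.to_datetime(
        transactions["transaction_date"].astype(str), format="%Y%m%d"
    )
    return transactions.loc[dates < pd.Timestamp(cutoff)].copy()


# Toy transaction history; dates use the YYYYMMDD integer format.
transactions = pd.DataFrame({
    "msno": ["a", "a", "b", "b"],
    "transaction_date": [20170115, 20170210, 20170120, 20170305],
})

# Assumed cutoffs: features for each snapshot only see earlier rows.
feb_history = transactions_before(transactions, "2017-02-01")
mar_history = transactions_before(transactions, "2017-03-01")

print(len(feb_history), len(mar_history))
```

Filtering the raw history once per snapshot, before any aggregation, is what keeps later transactions out of the features for that snapshot.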
| Aspect | Final Approach |
|---|---|
| Evaluation style | Time-aware |
| Leakage handling | Explicitly leakage-safe |
| Training snapshot | February |
| Validation snapshot | March |
| Final selection metric | LogLoss |
| Final selected model | Tuned XGBoost |
Feature engineering focused primarily on transaction and member information.
Examples include:
- number of distinct payment methods
- mean and last payment plan days
- mean and last listed price
- mean and last actual amount paid
- mean and last auto-renew behavior
- cancellation statistics
- earliest and latest transaction dates
- latest membership expiry date
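As a rough sketch of how per-user aggregates like these could be computed, the example below uses pandas named aggregation on toy data. The input column names follow the competition's `transactions` schema; the output column names (other than the `*_LAST` pattern that appears elsewhere in this README) are assumptions for illustration.

```python
import pandas as pd

# Toy transaction history for two users.
transactions = pd.DataFrame({
    "msno": ["a", "a", "b"],
    "payment_method_id": [41, 40, 41],
    "payment_plan_days": [30, 30, 30],
    "plan_list_price": [149, 149, 99],
    "actual_amount_paid": [149, 129, 99],
    "is_auto_renew": [1, 1, 0],
    "is_cancel": [0, 0, 1],
    "transaction_date": [20170105, 20170205, 20170110],
    "membership_expire_date": [20170205, 20170305, 20170210],
})

# Sort chronologically so that "last" means "most recent transaction".
transactions = transactions.sort_values(["msno", "transaction_date"])

features = transactions.groupby("msno").agg(
    PAYMENT_METHOD_NUNIQUE=("payment_method_id", "nunique"),
    PAYMENT_PLAN_DAYS_MEAN=("payment_plan_days", "mean"),
    PAYMENT_PLAN_DAYS_LAST=("payment_plan_days", "last"),
    PLAN_LIST_PRICE_MEAN=("plan_list_price", "mean"),
    PLAN_LIST_PRICE_LAST=("plan_list_price", "last"),
    ACTUAL_AMOUNT_PAID_MEAN=("actual_amount_paid", "mean"),
    ACTUAL_AMOUNT_PAID_LAST=("actual_amount_paid", "last"),
    IS_AUTO_RENEW_MEAN=("is_auto_renew", "mean"),
    IS_AUTO_RENEW_LAST=("is_auto_renew", "last"),
    IS_CANCEL_SUM=("is_cancel", "sum"),
    IS_CANCEL_LAST=("is_cancel", "last"),
    TRANSACTION_DATE_MIN=("transaction_date", "min"),
    TRANSACTION_DATE_MAX=("transaction_date", "max"),
    MEMBERSHIP_EXPIRE_DATE_MAX=("membership_expire_date", "max"),
)

print(features.T)
```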
Additional features were created from raw transaction and member signals:
- `NEW_NO_TRANSACTION`
- `NEW_NO_MEMBER_INFO`
- `NEW_GENDER_MISSING`
- `NEW_MEMBERSHIP_DURATION_DAYS`
- `NEW_LAST_TRANS_TO_EXPIRE_DAYS`
- `NEW_REG_TO_LAST_TRANS_DAYS`
- `NEW_PRICE_DIFF_LAST`
- `NEW_PRICE_DIFF_MEAN`
- `NEW_IS_DISCOUNT_USER`
- `NEW_CANCEL_RATE`
- `NEW_AUTO_RENEW_RATE`
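The exact definitions of these features live in the pipeline script. The sketch below shows one plausible reading of a few of them on toy data, assuming pandas. The formulas are assumptions for illustration: price difference as list price minus amount paid, the rates as per-user means, the discount flag as a positive mean price difference, and the missing-history flag derived from a left join.

```python
import pandas as pd

labels = pd.DataFrame({"msno": ["a", "b", "c"], "is_churn": [0, 1, 0]})

transactions = pd.DataFrame({
    "msno": ["a", "a", "b"],
    "plan_list_price": [149, 149, 99],
    "actual_amount_paid": [149, 129, 99],
    "is_auto_renew": [1, 1, 0],
    "is_cancel": [0, 0, 1],
    "transaction_date": [20170105, 20170205, 20170110],
}).sort_values(["msno", "transaction_date"])

# Assumed definition: discount per transaction = list price - amount paid.
transactions["price_diff"] = (
    transactions["plan_list_price"] - transactions["actual_amount_paid"]
)

per_user = transactions.groupby("msno").agg(
    NEW_PRICE_DIFF_LAST=("price_diff", "last"),
    NEW_PRICE_DIFF_MEAN=("price_diff", "mean"),
    NEW_CANCEL_RATE=("is_cancel", "mean"),
    NEW_AUTO_RENEW_RATE=("is_auto_renew", "mean"),
).reset_index()

# Assumed definition: a discount user paid less than list price on average.
per_user["NEW_IS_DISCOUNT_USER"] = (per_user["NEW_PRICE_DIFF_MEAN"] > 0).astype(int)

# Left join keeps every labelled user; users without history get a flag.
dataset = labels.merge(per_user, on="msno", how="left", indicator=True)
dataset["NEW_NO_TRANSACTION"] = (dataset["_merge"] == "left_only").astype(int)
dataset = dataset.drop(columns="_merge")

print(dataset)
```

Users with no transaction history keep missing values in the aggregate columns, which tree-based models such as XGBoost and LightGBM can handle natively.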
The following groups were researched during experimentation but excluded from the final selected model:
- user log aggregates
- time-window transaction features
- trend-style user log features
- decline-style user log features
These experiments were useful analytically, but they did not deliver stable improvement under the final leakage-safe time-aware setup.
The early random-split experiment produced very strong results, but it was not leakage-safe and was therefore rejected as the final methodology.
| Model | Accuracy | F1 | ROC_AUC | LogLoss | Note |
|---|---|---|---|---|---|
| Tuned LightGBM | 0.979304 | 0.887495 | 0.993241 | 0.054774 | Optimistic, not leakage-safe |
The leakage-safe, time-aware models below are the ones that matter for final model selection.
| Model | Accuracy | F1 | ROC_AUC | LogLoss | Role |
|---|---|---|---|---|---|
| LightGBM | 0.811707 | 0.114439 | 0.639168 | 0.382396 | Leakage-safe baseline |
| Tuned XGBoost | 0.809248 | 0.114385 | 0.643268 | 0.377304 | Final selected model |
The final selected model is a tuned XGBoost classifier, trained on the leakage-safe February snapshot and evaluated on the March snapshot.
This model was selected because it achieved the best final LogLoss among the trustworthy candidate models.
| Hyperparameter | Value |
|---|---|
| subsample | 0.8 |
| reg_lambda | 1.5 |
| reg_alpha | 0.01 |
| n_estimators | 200 |
| min_child_weight | 3 |
| max_depth | 8 |
| learning_rate | 0.05 |
| gamma | 0 |
| colsample_bytree | 0.9 |
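For reference, these values could be passed to XGBoost's scikit-learn wrapper roughly as follows. This is a sketch, not the repository's code: the function name `build_final_model`, the `objective`, `eval_metric`, and `random_state` settings are assumptions added for illustration.

```python
# Hyperparameters copied from the table above.
FINAL_XGB_PARAMS = {
    "subsample": 0.8,
    "reg_lambda": 1.5,
    "reg_alpha": 0.01,
    "n_estimators": 200,
    "min_child_weight": 3,
    "max_depth": 8,
    "learning_rate": 0.05,
    "gamma": 0,
    "colsample_bytree": 0.9,
}


def build_final_model(random_state: int = 42):
    """Return an unfitted classifier with the tuned values (hypothetical helper)."""
    # Imported lazily so the parameter dict is usable without xgboost installed.
    from xgboost import XGBClassifier

    return XGBClassifier(
        objective="binary:logistic",  # assumption: binary churn target
        eval_metric="logloss",        # assumption: matches the selection metric
        random_state=random_state,    # assumption: seed not stated in this README
        **FINAL_XGB_PARAMS,
    )
```

A typical use would be `model = build_final_model()` followed by `model.fit(X_train, y_train)` on the February matrix and `model.predict_proba(X_valid)[:, 1]` on the March matrix, where `X_train`, `y_train`, and `X_valid` are placeholders for the prepared data.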
| Metric | Value |
|---|---|
| Accuracy | 0.809247548817665 |
| F1 | 0.11438544480837737 |
| ROC_AUC | 0.6432683589855297 |
| LogLoss | 0.37730401356779536 |
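To make the metrics concrete, here is a small dependency-free sketch of binary LogLoss and F1 on toy predictions. The data and the 0.5 decision threshold are assumptions for illustration and are unrelated to the values in the table.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive (churn) class."""
    tp = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 1)
    fp = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 1)
    fn = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [0, 0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6]
y_pred = [int(p >= 0.5) for p in y_prob]  # assumed 0.5 threshold

accuracy = sum(int(y == yp) for y, yp in zip(y_true, y_pred)) / len(y_true)
print(accuracy, f1_score(y_true, y_pred), log_loss(y_true, y_prob))
```

LogLoss scores the predicted probabilities directly, while accuracy and F1 depend on the chosen threshold. In this toy case one missed churner leaves accuracy at 0.8 but pulls F1 down to about 0.67, which mirrors how a model can be accurate on the majority class and still score a low F1.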
| Comparison | LogLoss |
|---|---|
| Leakage-safe LightGBM | 0.38239637743883853 |
| Leakage-safe Tuned XGBoost | 0.37730401356779536 |
The tuned XGBoost lowered LogLoss from 0.3824 to 0.3773 (an absolute improvement of about 0.005) and also produced the highest ROC_AUC among the leakage-safe final candidates (0.6433 versus 0.6392).
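The selection rule itself is simple: pick the leakage-safe candidate with the lowest LogLoss. A minimal sketch using the two values from the comparison table (the variable names are illustrative):

```python
# LogLoss values copied from the comparison table above.
candidates = {
    "Leakage-safe LightGBM": 0.38239637743883853,
    "Leakage-safe Tuned XGBoost": 0.37730401356779536,
}

# Lower LogLoss is better, so the best candidate is the minimum.
best_name = min(candidates, key=candidates.get)
improvement = candidates["Leakage-safe LightGBM"] - candidates[best_name]

print(best_name, round(improvement, 6))
```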
The final ROC_AUC is approximately 0.64, indicating that the model performs better than random guessing but still operates in a moderate discrimination regime. This is consistent with the fact that the project deliberately removed optimistic leakage and evaluated the model under a realistic temporal shift.
The final model correctly identifies a large share of non-churn cases, but still misses many churn cases. In other words, the model is useful but conservative in flagging churn, which is reflected in the low F1 score (about 0.11).
The model relies most strongly on features related to subscription continuity, cancellation, and recent payment behavior.
Top signals include:
- `IS_AUTO_RENEW_LAST`
- `IS_CANCEL_LAST`
- `NEW_PRICE_DIFF_LAST`
- `PAYMENT_METHOD_ID_LAST_*`
- `NEW_AUTO_RENEW_RATE`
- `NEW_LAST_TRANS_TO_EXPIRE_DAYS`
- `NEW_NO_MEMBER_INFO`
This is business-consistent: the final model is driven primarily by recent renewal behavior, cancellation status, payment pattern changes, and expiry-related information.
| Finding | Interpretation |
|---|---|
| Random split produced extremely high metrics | Evaluation was overly optimistic |
| Time-aware validation reduced performance | Results became much more realistic |
| User-log-based features were tested | No stable final gain under leakage-safe validation |
| Transaction/member features remained strongest | They formed the final selected model |
| Tuned XGBoost achieved the best final LogLoss | It became the final production candidate |
The final metrics in this repository are lower than top public leaderboard results, and this is expected.
Likely reasons include:
- top teams used much heavier feature engineering
- top teams exploited user logs more aggressively at scale
- top solutions relied on complex ensembles and stacking
- leaderboard-oriented solutions were often optimized directly for competition scoring behavior
- this repository intentionally prioritized methodological correctness, temporal realism, and leakage-safe evaluation
In other words, this repository is designed as a clean, defensible machine learning project, not a leaderboard-hacking solution.
kkbox-churn-prediction/
│
├── app/
├── data/
│   └── raw/
├── models/
├── notebooks/
├── outputs/
│   └── final_model/
│       ├── final_results.json
│       ├── feature_importance.csv
│       └── model_comparison.csv
├── src/
│   ├── archive/
│   │   ├── data_loading_and_inspection.py
│   │   └── kkbox_time_aware_research.py
│   ├── experiments/
│   │   └── final_model_evaluation.py
│   └── final_model_logloss_selection.py
├── tests/
├── .gitignore
├── README.md
└── requirements.txt
Run the final model selection pipeline:
`python src/final_model_logloss_selection.py`

This script:
- builds leakage-safe February and March snapshots
- prepares final train and validation matrices
- trains a LightGBM baseline
- tunes XGBoost for LogLoss
- evaluates final performance on the March snapshot
- saves the final results and model artifacts
| File | Description |
|---|---|
| `outputs/final_model/final_results.json` | Final selected model metrics and best hyperparameters |
| `outputs/final_model/feature_importance.csv` | Feature importance values from the final tuned XGBoost model |
| `outputs/final_model/model_comparison.csv` | Comparison of optimistic and leakage-safe model results |
This repository intentionally separates:
- final production-like pipeline in `src/final_model_logloss_selection.py`
- research history / exploratory scripts in `src/archive/`
- experimental alternative modeling in `src/experiments/`
This structure keeps the main path of the project clean while preserving research traceability.
The strongest contribution of this project is not merely the final XGBoost model. The real contribution is the full workflow:
- identifying optimistic evaluation
- diagnosing temporal leakage risk
- redesigning validation around time-aware snapshots
- testing multiple feature paths
- and selecting a final model using a leakage-safe protocol and the correct competition metric
That is what makes this repository a professional and defensible churn modeling project.
Peyami Kenanoğlu