| Metric | Value |
|---|---|
| Model | L2-Regularised Logistic Regression with WoE Encoding |
| Test AUC | 0.7822 |
| Test Gini | 0.5644 |
| Baseline AUC | 0.68 |
| Improvement | +0.1022 AUC |
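
The Gini and improvement figures in the table follow directly from the AUC values through the standard relation Gini = 2 × AUC − 1. A minimal sketch of that arithmetic (the helper name `gini_from_auc` is illustrative and not taken from the project code):

```python
def gini_from_auc(auc: float) -> float:
    """Convert an AUC into a Gini coefficient using Gini = 2 * AUC - 1."""
    return 2.0 * auc - 1.0


test_auc = 0.7822      # reported test AUC
baseline_auc = 0.68    # reported baseline AUC

test_gini = round(gini_from_auc(test_auc), 4)
improvement = round(test_auc - baseline_auc, 4)

print(test_gini)    # 0.5644
print(improvement)  # 0.1022
```
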
The resulting framework supports interpretable and regulator-friendly lending decisions, portfolio risk management, approval strategy optimisation, and business-oriented credit-risk analysis.
dataquest_2026/
├── app.py # ← Run this: Streamlit EDA + Dashboard
├── task2_modelling.py # ← Run this: Feature engineering + model
├── research_report.ipynb # Detailed analysis and methodology
├── requirements.txt # pip install -r requirements.txt
├── README.md # Documentation
├── images/ # Generated visualisations
│ ├── roc_pr_curves.png
│ ├── iv_ranking.png
│ ├── coefficients.png
│ ├── confusion_matrix.png
│ └── score_distribution.png
├── utils/
│ ├── __init__.py
│ ├── data_cleaning.py # All cleaning logic
│ └── woe_iv.py # WoE/IV implementation
├── pages/ # Streamlit multi-page components
├── assets/ # Static resources
├── outputs/ # Generated outputs
│ ├── test_predictions.csv # Generated by task2_modelling.py
│ ├── feature_iv_table.csv # Generated by task2_modelling.py
│ └── model_scorecard.csv # Generated by task2_modelling.py
└── data/
    └── loan_book.csv # Input data
`pip install -r requirements.txt`

Copy `loan_book.csv` into the `dataquest_2026/` folder.

`python task2_modelling.py`

This produces `test_predictions.csv`, `feature_iv_table.csv`, and `model_scorecard.csv`.

`streamlit run app.py`

Open http://localhost:8501 in your browser.
| Tab | Content |
|---|---|
| 🏠 Home & Data Quality | KPIs, data quality report, critical issues found |
| 📊 Univariate Explorer | Any feature: distribution, WoE chart, IV table, ranking |
| 🔗 Bivariate Explorer | Scatter, default rate by bin, correlation heatmap |
| 🧠 Research Reference | GLM vs non-linear, WoE/IV theory, metrics, regulatory |
| 🤖 Model Performance | ROC curve, upload predictions, AUC comparison |
| 💼 Business Dashboard | Threshold simulator, volume vs risk, net value optimiser |
The ROC curve demonstrates the model’s ability to separate default and non-default applicants across different classification thresholds.
The IV ranking chart highlights the predictive strength of key affordability, behavioural, and delinquency variables used in the final logistic regression framework.
| Issue | Solution |
|---|---|
| `home_ownership`: 14 variants for 4 categories | Standardisation map (MORTGAGE/RENT/OWN/OTHER) |
| `loan_purpose`: 21 variants for 7 categories | Standardisation map |
| `application_date`: 3 mixed formats | Custom multi-format parser |
| `months_since_last_delinquency`: 50% missing | Binary flag + 999 sentinel |
| `annual_income`: extreme outliers (max $2M) | Winsorise at 1st–99th percentile |
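
The cleaning steps in the table can be sketched in plain Python as below. This is an illustration under stated assumptions, not the contents of `utils/data_cleaning.py`: the variant spellings in `HOME_OWNERSHIP_MAP`, the three entries in `DATE_FORMATS`, and the reading of the binary flag as "a delinquency value is present" are all hypothetical, since the README does not list them.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# Hypothetical variant spellings; the real 14 variants are not listed in the README.
HOME_OWNERSHIP_MAP = {
    "mortgage": "MORTGAGE", "mtg": "MORTGAGE",
    "rent": "RENT", "renting": "RENT",
    "own": "OWN", "owner": "OWN",
}

# Assumed formats; the README only says that three formats are mixed.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%b-%Y")


def standardise_home_ownership(raw: Optional[str]) -> str:
    """Map a raw label to MORTGAGE/RENT/OWN, falling back to OTHER."""
    if raw is None:
        return "OTHER"
    return HOME_OWNERSHIP_MAP.get(raw.strip().lower(), "OTHER")


def parse_application_date(raw: str) -> Optional[datetime]:
    """Try each known format in turn; return None if none of them match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    return None


def delinquency_features(months: Optional[float]) -> Tuple[int, float]:
    """Return (flag, value): flag is 0 and value is the 999 sentinel when missing."""
    if months is None:
        return 0, 999.0
    return 1, float(months)


def winsorise(values: List[float], lower: float = 0.01, upper: float = 0.99) -> List[float]:
    """Clip values to the [lower, upper] quantiles (linear interpolation)."""
    ordered = sorted(values)
    n = len(ordered)

    def quantile(q: float) -> float:
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

    lo_cap, hi_cap = quantile(lower), quantile(upper)
    return [min(max(v, lo_cap), hi_cap) for v in values]


incomes = [float(v) for v in range(1, 101)] + [2_000_000.0]
capped_incomes = winsorise(incomes)
```

The sentinel keeps the column numeric while the flag lets the model treat "no recorded delinquency" separately from any real month count.
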
| Feature | Formula | Rationale |
|---|---|---|
| `dti_x_rate` | DTI × Interest Rate | Compounded repayment pressure |
| `delinq_intensity` | `num_delinquencies` × `has_prior_delinquency` | Persistent bad payment behaviour |
| `log_income` | log(1 + income) | Linearises right-skewed distribution |
| `log_balance` | log(1 + balance) | Same rationale |
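
As a rough illustration, the four engineered features can be computed per record as follows. The input keys `dti`, `interest_rate`, and `balance` are assumed names for this sketch; only `annual_income`, `num_delinquencies`, and `has_prior_delinquency` appear in this README.

```python
import math
from typing import Dict


def engineer_features(row: Dict[str, float]) -> Dict[str, float]:
    """Compute the engineered features from one applicant record."""
    return {
        # Interaction term: debt-to-income multiplied by the interest rate.
        "dti_x_rate": row["dti"] * row["interest_rate"],
        # Zero unless the applicant has a prior delinquency.
        "delinq_intensity": row["num_delinquencies"] * row["has_prior_delinquency"],
        # log1p(x) = log(1 + x), which is defined at x = 0.
        "log_income": math.log1p(row["annual_income"]),
        "log_balance": math.log1p(row["balance"]),
    }


example = engineer_features({
    "dti": 0.30,
    "interest_rate": 0.12,
    "num_delinquencies": 2,
    "has_prior_delinquency": 1,
    "annual_income": 0.0,
    "balance": math.e - 1,
})
```
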
All behavioural, affordability, and delinquency variables were encoded using Weight of Evidence (WoE), transforming non-linear relationships into approximately linear inputs for logistic regression.
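
For reference, a minimal sketch of how WoE and IV are commonly computed for one binned feature. It assumes the convention WoE = ln(% of non-defaults / % of defaults) per bin, and adds a small smoothing constant to avoid division by zero. Both choices are assumptions of this sketch; `utils/woe_iv.py` may use a different sign convention or binning.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple


def woe_iv(
    bins: Sequence[Hashable],
    defaults: Sequence[int],
    smoothing: float = 0.5,
) -> Tuple[Dict[Hashable, float], float]:
    """Return ({bin: WoE}, IV) for one feature.

    bins     -- bin label for each applicant
    defaults -- 1 if the applicant defaulted, else 0
    """
    good: Counter = Counter()
    bad: Counter = Counter()
    for label, y in zip(bins, defaults):
        if y == 1:
            bad[label] += 1
        else:
            good[label] += 1

    labels = set(good) | set(bad)
    total_good = sum(good.values()) + smoothing * len(labels)
    total_bad = sum(bad.values()) + smoothing * len(labels)

    woe: Dict[Hashable, float] = {}
    iv = 0.0
    for label in labels:
        pct_good = (good[label] + smoothing) / total_good
        pct_bad = (bad[label] + smoothing) / total_bad
        woe[label] = math.log(pct_good / pct_bad)
        # Each bin contributes (share of goods - share of bads) * WoE to the IV.
        iv += (pct_good - pct_bad) * woe[label]
    return woe, iv


# Toy data: the "low" bin has 8 goods and 2 bads, the "high" bin has 5 and 5.
toy_bins = ["low"] * 10 + ["high"] * 10
toy_defaults = [0] * 8 + [1] * 2 + [0] * 5 + [1] * 5
toy_woe, toy_iv = woe_iv(toy_bins, toy_defaults, smoothing=0.0)
```

Under this convention a positive WoE marks a bin with proportionally fewer defaults than the portfolio, and replacing each raw value with its bin's WoE gives the logistic regression an input that is monotonic in the log-odds.
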
- Features: WoE-encoded behavioural, affordability, and delinquency variables
- Regulatory-safe model: excludes age-related variables
- Algorithm: Logistic Regression (`sklearn`)
- Regularisation: L2, C=0.01
- Class weights: `balanced` (handles 85:15 imbalance)
- Validation: 5-Fold Stratified Cross-Validation
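
The configuration above could be expressed in scikit-learn roughly as follows. This is a sketch on synthetic data, not the code in `task2_modelling.py`: the feature matrix, the random seeds, `max_iter`, and the use of `shuffle=True` are all assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the WoE-encoded feature matrix (not the project's data).
rng = np.random.default_rng(42)
n_samples = 2000
y = (rng.random(n_samples) < 0.15).astype(int)           # roughly 85:15 imbalance
X = rng.normal(size=(n_samples, 4)) + 0.8 * y[:, None]   # shift defaults so there is signal

# L2 is scikit-learn's default penalty, so only C and the class weights are set.
model = LogisticRegression(C=0.01, class_weight="balanced", max_iter=1000)

# Stratified folds keep the default rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

mean_auc = auc_scores.mean()
mean_gini = 2 * mean_auc - 1
print(f"CV AUC: {mean_auc:.4f}  Gini: {mean_gini:.4f}")
```

A small `C` means strong regularisation, which shrinks coefficients towards zero, and `class_weight="balanced"` reweights the minority default class instead of resampling it.
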
| Task | Status | Deliverable |
|---|---|---|
| Task 1: EDA Research | ✅ Complete | app.py Tabs 1–4 + Research Reference |
| Task 1: Interactive App | ✅ Complete | app.py (6-tab Streamlit app) |
| Task 2: Feature Engineering | ✅ Complete | task2_modelling.py |
| Task 2: Logistic Regression | ✅ Complete | AUC 0.7822 > 0.68 ✓ |
| Task 3: Business Dashboard | ✅ Complete | app.py Tab 6 — threshold simulator |
Features examined for regulatory risk:
- `age`: Direct protected attribute, excluded from regulatory deployment
- `email_domain_type`: Proxy for socioeconomic status
- `application_dow`: May proxy for employment type (shift worker vs office)
- `region`: Geographic redlining risk
See Research Reference tab in the app for full analysis.
This project was independently designed, analysed, and implemented by the author, with AI-assisted development tools used to improve productivity, debugging efficiency, and code quality.
- ChatGPT — debugging support, architecture refinement, dashboard enhancement, and technical guidance
- Claude AI — code review assistance and implementation validation
- VS Code Copilot / Chat — bug fixing, syntax troubleshooting, and development acceleration
AI tools were used as engineering assistants only. All modelling decisions, business interpretations, feature engineering strategies, and final implementation choices were independently evaluated and validated by the project author.
Tinyiko Patience Mathebula

