Tinyiko-Mathebula/dataquest-2026-credit-risk

DataQuest 2026 — Credit Risk Intelligence Platform

🎯 Project Summary

| Metric | Value |
| --- | --- |
| Model | L2-Regularised Logistic Regression with WoE Encoding |
| Test AUC | 0.7822 |
| Test Gini | 0.5644 |
| Baseline AUC | 0.68 |
| Improvement | +0.1022 AUC |
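
The Gini and improvement figures follow directly from the two AUC values via the standard identity Gini = 2 × AUC − 1. A minimal check, where the function name `gini_from_auc` is illustrative and not part of the repository:

```python
def gini_from_auc(auc: float) -> float:
    """Convert a ROC AUC into the Gini coefficient (Gini = 2 * AUC - 1)."""
    return 2.0 * auc - 1.0


test_auc = 0.7822
baseline_auc = 0.68

print(round(gini_from_auc(test_auc), 4))   # 0.5644
print(round(test_auc - baseline_auc, 4))   # 0.1022
```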

The resulting framework supports interpretable and regulator-friendly lending decisions, portfolio risk management, approval strategy optimisation, and business-oriented credit-risk analysis.


📁 Project Structure

dataquest_2026/
├── app.py                    # ← Run this: Streamlit EDA + Dashboard
├── task2_modelling.py        # ← Run this: Feature engineering + model
├── research_report.ipynb     # Detailed analysis and methodology
├── requirements.txt          # pip install -r requirements.txt
├── README.md                 # Documentation
├── images/                   # Generated visualisations
│   ├── roc_pr_curves.png
│   ├── iv_ranking.png
│   ├── coefficients.png
│   ├── confusion_matrix.png
│   └── score_distribution.png
├── utils/
│   ├── __init__.py
│   ├── data_cleaning.py      # All cleaning logic
│   └── woe_iv.py             # WoE/IV implementation
├── pages/                    # Streamlit multi-page components
├── assets/                   # Static resources
├── outputs/                  # Generated outputs
│   ├── test_predictions.csv  # Generated by task2_modelling.py
│   ├── feature_iv_table.csv  # Generated by task2_modelling.py
│   └── model_scorecard.csv   # Generated by task2_modelling.py
└── data/
    └── loan_book.csv         # Input data

🚀 Quick Start

Step 1 — Install dependencies

pip install -r requirements.txt

Step 2 — Place the data file

Copy loan_book.csv into the dataquest_2026/data/ folder, matching the project structure above.

Step 3 — Run the model (Task 2)

python task2_modelling.py

This writes test_predictions.csv, feature_iv_table.csv, and model_scorecard.csv to the outputs/ folder.

Step 4 — Launch the Streamlit app (Tasks 1 & 3)

streamlit run app.py

Open http://localhost:8501 in your browser.


📊 App Features (6 Tabs)

| Tab | Content |
| --- | --- |
| 🏠 Home & Data Quality | KPIs, data quality report, critical issues found |
| 📊 Univariate Explorer | Any feature: distribution, WoE chart, IV table, ranking |
| 🔗 Bivariate Explorer | Scatter, default rate by bin, correlation heatmap |
| 🧠 Research Reference | GLM vs non-linear models, WoE/IV theory, metrics, regulatory considerations |
| 🤖 Model Performance | ROC curve, upload predictions, AUC comparison |
| 💼 Business Dashboard | Threshold simulator, volume vs risk, net value optimiser |
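
The threshold simulator and net value optimiser in the Business Dashboard tab can be pictured with the sketch below. It is not the code in app.py: the function names, the approve-if-below-threshold rule, and the `profit_per_good` / `loss_per_bad` economics are all assumptions made for illustration.

```python
def simulate_threshold(pds, defaults, threshold,
                       profit_per_good=1.0, loss_per_bad=5.0):
    """Approve applicants whose predicted default probability is below
    `threshold`, then summarise volume, risk, and net value.

    profit_per_good and loss_per_bad are hypothetical unit economics.
    """
    approved = [y for p, y in zip(pds, defaults) if p < threshold]
    n_bad = sum(approved)
    n_good = len(approved) - n_bad
    return {
        "approval_rate": len(approved) / len(pds),
        "bad_rate": n_bad / len(approved) if approved else 0.0,
        "net_value": n_good * profit_per_good - n_bad * loss_per_bad,
    }


def best_threshold(pds, defaults, candidates, **economics):
    """Return the candidate threshold with the highest net value."""
    return max(
        candidates,
        key=lambda t: simulate_threshold(pds, defaults, t, **economics)["net_value"],
    )


# Toy predictions (1 = defaulted), purely illustrative.
pds = [0.05, 0.10, 0.20, 0.40, 0.60, 0.80]
defaults = [0, 0, 0, 1, 0, 1]

print(simulate_threshold(pds, defaults, 0.5))
print(best_threshold(pds, defaults, [0.3, 0.5, 0.7, 0.9]))
```

Lowering the threshold trades approval volume for a lower bad rate. The optimiser picks the point where that trade maximises net value under the assumed economics.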

📈 ROC Curve

The ROC curve demonstrates the model’s ability to separate default and non-default applicants across different classification thresholds.

[Figure: ROC curve]
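
The area under the ROC curve equals the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The sketch below computes it that way on toy data. It is a pairwise illustration of the metric, not the routine used in task2_modelling.py.

```python
def roc_auc(scores, labels):
    """AUC as the share of (defaulter, non-defaulter) pairs ranked correctly.

    Ties count as half a correct ranking. This is O(n_pos * n_neg), so it is
    suitable for illustration only.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: one defaulter (score 0.35) is out-ranked by a non-defaulter (0.4).
print(roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```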

📊 Information Value (IV) Ranking

The IV ranking chart highlights the predictive strength of key affordability, behavioural, and delinquency variables used in the final logistic regression framework.

[Figure: IV ranking]


🔬 Methodology

Data Issues Found & Fixed

| Issue | Solution |
| --- | --- |
| home_ownership: 14 variants for 4 categories | Standardisation map (MORTGAGE/RENT/OWN/OTHER) |
| loan_purpose: 21 variants for 7 categories | Standardisation map |
| application_date: 3 mixed formats | Custom multi-format parser |
| months_since_last_delinquency: 50% missing | Binary flag + 999 sentinel |
| annual_income: extreme outliers (max $2M) | Winsorise at 1st–99th percentile |
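
The fixes above can be sketched as small per-value helpers. This is not the code in utils/data_cleaning.py, and several details are assumptions: the raw spellings in `HOME_OWNERSHIP_MAP`, the three date formats, a flag that equals 1 when a delinquency value is present, and nearest-rank percentiles.

```python
from datetime import datetime

# Hypothetical raw spellings; the real data has 14 variants.
HOME_OWNERSHIP_MAP = {
    "mortgage": "MORTGAGE", "mtg": "MORTGAGE",
    "rent": "RENT", "renting": "RENT",
    "own": "OWN", "owner": "OWN",
}

# Assumed formats; the README only states that three are mixed.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y")


def standardise_home_ownership(raw):
    """Map a raw spelling to MORTGAGE/RENT/OWN, falling back to OTHER."""
    return HOME_OWNERSHIP_MAP.get(str(raw).strip().lower(), "OTHER")


def parse_application_date(raw):
    """Try each known format in turn; return None if none matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def encode_months_since_delinquency(value, sentinel=999):
    """Return (flag, filled_value); missing values get flag 0 and the sentinel."""
    missing = value is None or value != value  # NaN is not equal to itself
    return (0, sentinel) if missing else (1, value)


def winsorise(values, lower_pct=1.0, upper_pct=99.0):
    """Clip values to the nearest-rank lower and upper percentiles."""
    ordered = sorted(values)
    last = len(ordered) - 1
    low = ordered[round(lower_pct / 100 * last)]
    high = ordered[round(upper_pct / 100 * last)]
    return [min(max(v, low), high) for v in values]
```

With pandas, helpers like these would typically be applied column-wise, for example through `Series.map`.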

Feature Engineering (6 derived features; 4 shown below)

| Feature | Formula | Rationale |
| --- | --- | --- |
| dti_x_rate | DTI × Interest Rate | Compounded repayment pressure |
| delinq_intensity | num_delinquencies × has_prior_delinquency | Persistent bad payment behaviour |
| log_income | log(1 + income) | Linearises right-skewed distribution |
| log_balance | log(1 + balance) | Linearises right-skewed distribution (as for log_income) |
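
The formulas in the table translate directly into code. In the sketch below the output names come from the table, while the input keys `dti`, `interest_rate`, `annual_income`, and `balance` are assumed column names and the function `derive_features` is hypothetical.

```python
import math


def derive_features(row):
    """Compute the four derived features from the table for one record.

    `row` is a dict; its input keys are assumed names, not confirmed
    against loan_book.csv.
    """
    return {
        "dti_x_rate": row["dti"] * row["interest_rate"],
        "delinq_intensity": row["num_delinquencies"] * row["has_prior_delinquency"],
        # log1p(x) computes log(1 + x) accurately, including at x = 0.
        "log_income": math.log1p(row["annual_income"]),
        "log_balance": math.log1p(row["balance"]),
    }


example = {
    "dti": 0.5,
    "interest_rate": 10.0,
    "num_delinquencies": 3,
    "has_prior_delinquency": 1,
    "annual_income": 60000.0,
    "balance": 0.0,
}
print(derive_features(example))
```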

WoE Encoding

All behavioural, affordability, and delinquency variables were encoded using Weight of Evidence (WoE), transforming non-linear relationships into approximately linear inputs for logistic regression.

  • Features: WoE-encoded behavioural, affordability, and delinquency variables
  • Regulatory-safe model: excludes age-related variables
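
As a rough illustration of the encoding, the sketch below computes WoE per bin and the total IV for one already-binned feature. It uses one common convention, WoE = ln(share of goods / share of bads), and assumes every bin contains at least one good and one bad. The implementation in utils/woe_iv.py may use a different sign convention or add smoothing.

```python
import math
from collections import Counter


def woe_iv(bins, defaults):
    """Return ({bin: WoE}, IV) for a binned feature.

    bins     : bin label per applicant
    defaults : 1 for default ("bad"), 0 for non-default ("good")
    """
    good, bad = Counter(), Counter()
    for b, y in zip(bins, defaults):
        (bad if y == 1 else good)[b] += 1

    total_good = sum(good.values())
    total_bad = sum(bad.values())

    woe, iv = {}, 0.0
    for b in sorted(set(bins)):
        dist_good = good[b] / total_good
        dist_bad = bad[b] / total_bad
        woe[b] = math.log(dist_good / dist_bad)
        iv += (dist_good - dist_bad) * woe[b]
    return woe, iv


# Toy data: the "low" bin holds 8 goods and 2 bads, "high" holds 4 and 6.
bins = ["low"] * 10 + ["high"] * 10
defaults = [0] * 8 + [1] * 2 + [0] * 4 + [1] * 6

woe, iv = woe_iv(bins, defaults)
print(woe, round(iv, 4))
```

Under this convention a positive WoE marks a bin with proportionally more non-defaulters. Replacing each raw value with its bin's WoE gives the logistic regression an input that is approximately linear in the log-odds.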

Model

  • Algorithm: Logistic Regression (sklearn)
  • Regularisation: L2, C=0.01
  • Class weights: balanced (handles 85:15 imbalance)
  • Validation: 5-Fold Stratified Cross-Validation
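
A minimal sketch of this configuration in scikit-learn, run on synthetic data in place of the WoE-encoded features. The dataset, `max_iter`, and the fold shuffling and seeds are assumptions for illustration. Only C=0.01, balanced class weights, L2 regularisation, and 5-fold stratified CV come from the settings above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the WoE-encoded matrix (assumption):
# roughly 85% non-default, 15% default.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)

# L2 is scikit-learn's default penalty, so only C and class_weight are set.
model = LogisticRegression(C=0.01, class_weight="balanced", max_iter=1000)

# Stratified folds keep the default rate similar in every split.
# shuffle and random_state are illustrative choices.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print("Fold AUCs:", auc_scores.round(4))
print("Mean AUC:", auc_scores.mean().round(4))
```

A small C means strong regularisation, which shrinks coefficients and tends to keep a scorecard stable when WoE features are correlated.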

📋 Task Coverage

| Task | Status | Deliverable |
| --- | --- | --- |
| Task 1: EDA Research | ✅ Complete | app.py Tabs 1–4 (including Research Reference) |
| Task 1: Interactive App | ✅ Complete | app.py (6-tab Streamlit app) |
| Task 2: Feature Engineering | ✅ Complete | task2_modelling.py |
| Task 2: Logistic Regression | ✅ Complete | Test AUC 0.7822 vs baseline 0.68 ✓ |
| Task 3: Business Dashboard | ✅ Complete | app.py Tab 6 (threshold simulator) |

⚠️ Regulatory Considerations

Features examined for regulatory risk:

  • age: Direct protected attribute; excluded from the regulatory-safe model
  • email_domain_type: Proxy for socioeconomic status
  • application_dow: May proxy for employment type (shift worker vs office)
  • region: Geographic redlining risk

See Research Reference tab in the app for full analysis.


🤖 AI-Assisted Development

This project was independently designed, analysed, and implemented by the author, with AI-assisted development tools used to improve productivity, debugging efficiency, and code quality.

AI tools used

  • ChatGPT — debugging support, architecture refinement, dashboard enhancement, and technical guidance
  • Claude AI — code review assistance and implementation validation
  • VS Code Copilot / Chat — bug fixing, syntax troubleshooting, and development acceleration

AI tools were used as engineering assistants only. All modelling decisions, business interpretations, feature engineering strategies, and final implementation choices were independently evaluated and validated by the project author.


👤 Author

Tinyiko Patience Mathebula

About

Credit Risk Intelligence Platform — DataQuest 2026. Improved baseline model AUC from 0.68 to 0.7822 using feature engineering and WoE encoding. Interactive Streamlit dashboard for regulator-aware retail lending decisions. Python · Scikit-Learn · Logistic Regression.
