DataQuest 2026 — Credit Risk Intelligence Platform

🎯 Project Summary

Metric	Value
Model	L2-Regularised Logistic Regression with WoE Encoding
Test AUC	0.7822
Test Gini	0.5644
Baseline AUC	0.68
Improvement	+0.1022 AUC
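
The Gini and improvement figures in the table follow from the two AUC values by simple arithmetic, since Gini = 2 × AUC − 1. A minimal check using only the reported numbers (the function name is illustrative):

```python
def gini_from_auc(auc: float) -> float:
    """Gini coefficient implied by an AUC value: Gini = 2 * AUC - 1."""
    return 2.0 * auc - 1.0


test_auc = 0.7822      # reported test AUC
baseline_auc = 0.68    # reported baseline AUC

print(round(gini_from_auc(test_auc), 4))    # 0.5644
print(round(test_auc - baseline_auc, 4))    # 0.1022
```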

The framework supports interpretable, regulator-friendly lending decisions, portfolio risk management, approval-strategy optimisation, and business-oriented credit-risk analysis.

📁 Project Structure

dataquest_2026/
├── app.py                    # ← Run this: Streamlit EDA + Dashboard
├── task2_modelling.py        # ← Run this: Feature engineering + model
├── research_report.ipynb     # Detailed analysis and methodology
├── requirements.txt          # pip install -r requirements.txt
├── README.md                 # Documentation
├── images/                   # Generated visualisations
│   ├── roc_pr_curves.png
│   ├── iv_ranking.png
│   ├── coefficients.png
│   ├── confusion_matrix.png
│   └── score_distribution.png
├── utils/
│   ├── __init__.py
│   ├── data_cleaning.py      # All cleaning logic
│   └── woe_iv.py             # WoE/IV implementation
├── pages/                    # Streamlit multi-page components
├── assets/                   # Static resources
├── outputs/                  # Generated outputs
│   ├── test_predictions.csv  # Generated by task2_modelling.py
│   ├── feature_iv_table.csv  # Generated by task2_modelling.py
│   └── model_scorecard.csv   # Generated by task2_modelling.py
└── data/
    └── loan_book.csv         # Input data

🚀 Quick Start

Step 1 — Install dependencies

pip install -r requirements.txt

Step 2 — Place the data file

Copy loan_book.csv into the dataquest_2026/data/ folder, matching the project structure shown above.

Step 3 — Run the model (Task 2)

python task2_modelling.py

This produces test_predictions.csv, feature_iv_table.csv, and model_scorecard.csv.

Step 4 — Launch the Streamlit app (Tasks 1 & 3)

streamlit run app.py

Open http://localhost:8501 in your browser.

📊 App Features (6 Tabs)

Tab	Content
🏠 Home & Data Quality	KPIs, data quality report, critical issues found
📊 Univariate Explorer	Any feature: distribution, WoE chart, IV table, ranking
🔗 Bivariate Explorer	Scatter, default rate by bin, correlation heatmap
🧠 Research Reference	GLM vs non-linear, WoE/IV theory, metrics, regulatory
🤖 Model Performance	ROC curve, upload predictions, AUC comparison
💼 Business Dashboard	Threshold simulator, volume vs risk, net value optimiser
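
As a rough illustration of what a threshold simulator and net value optimiser of this kind compute, the sketch below approves applicants whose predicted default probability falls below a cut-off and reports approval rate, bad rate, and net value. The function name, the toy scores, and the unit economics (`profit_per_good`, `loss_per_bad`) are hypothetical and are not taken from `app.py`.

```python
def simulate_threshold(scores, defaults, threshold,
                       profit_per_good=200.0, loss_per_bad=500.0):
    """Approve applicants scoring below `threshold` and summarise the outcome.

    `profit_per_good` and `loss_per_bad` are hypothetical unit economics.
    """
    approved = [d for s, d in zip(scores, defaults) if s < threshold]
    n_approved = len(approved)
    n_bad = sum(approved)              # defaults among approved loans
    n_good = n_approved - n_bad
    return {
        "approval_rate": n_approved / len(scores),
        "bad_rate": n_bad / n_approved if n_approved else 0.0,
        "net_value": n_good * profit_per_good - n_bad * loss_per_bad,
    }


# Toy data: predicted default probabilities and observed outcomes (1 = default).
scores = [0.05, 0.10, 0.20, 0.35, 0.50, 0.65, 0.80, 0.90]
defaults = [0, 0, 0, 1, 0, 1, 1, 1]

candidates = (0.3, 0.4, 0.6, 1.0)
for t in candidates:
    print(t, simulate_threshold(scores, defaults, t))

# Pick the candidate cut-off with the highest net value.
best = max(candidates,
           key=lambda t: simulate_threshold(scores, defaults, t)["net_value"])
print("best threshold:", best)
```

Raising the cut-off increases volume but also admits riskier applicants, which is the volume-versus-risk trade-off the dashboard is meant to expose.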

📈 ROC Curve

The ROC curve (images/roc_pr_curves.png) shows how well the model separates defaulting from non-defaulting applicants across classification thresholds.
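
The area under the ROC curve has a direct interpretation: it is the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The sketch below computes AUC that way on toy data. It is an illustrative pairwise implementation, not the code used in the project, which presumably relies on a library routine.

```python
def auc_by_pairs(scores, labels):
    """AUC as the share of (defaulter, non-defaulter) pairs ranked correctly.

    Ties count as half a correct pair. This is O(n_pos * n_neg), so it is
    only meant for small illustrative inputs.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: predicted default probabilities and observed outcomes (1 = default).
toy_scores = [0.05, 0.10, 0.20, 0.35, 0.50, 0.65, 0.80, 0.90]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]

print(auc_by_pairs(toy_scores, toy_labels))  # 0.9375 (15 of 16 pairs ranked correctly)
```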

📊 Information Value (IV) Ranking

The IV ranking chart (images/iv_ranking.png) ranks the affordability, behavioural, and delinquency variables used in the final logistic regression by predictive strength.

🔬 Methodology

Data Issues Found & Fixed

Issue	Solution
`home_ownership`: 14 variants for 4 categories	Standardisation map (MORTGAGE/RENT/OWN/OTHER)
`loan_purpose`: 21 variants for 7 categories	Standardisation map
`application_date`: 3 mixed formats	Custom multi-format parser
`months_since_last_delinquency`: 50% missing	Binary flag + 999 sentinel
`annual_income`: extreme outliers (max $2M)	Winsorise at 1st–99th percentile
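
The fixes in the table can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `utils/data_cleaning.py`: the variant map is a small hypothetical subset of the 14 variants, the three date formats are guesses, and using `has_prior_delinquency` as the name of the binary missingness flag is an assumption.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Hypothetical subset of raw variants; unmapped values fall back to OTHER.
HOME_OWNERSHIP_MAP = {
    "mortgage": "MORTGAGE", "mtg": "MORTGAGE",
    "rent": "RENT", "renting": "RENT",
    "own": "OWN", "owner": "OWN",
}
# Assumed formats; the source only says that three formats are mixed.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def parse_mixed_date(value):
    """Try each known format in turn; return NaT if none matches."""
    if not isinstance(value, str):
        return pd.NaT
    for fmt in DATE_FORMATS:
        try:
            return pd.Timestamp(datetime.strptime(value.strip(), fmt))
        except ValueError:
            continue
    return pd.NaT


def clean_loan_book(df):
    out = df.copy()
    key = out["home_ownership"].astype(str).str.strip().str.lower()
    out["home_ownership"] = key.map(HOME_OWNERSHIP_MAP).fillna("OTHER")
    out["application_date"] = out["application_date"].map(parse_mixed_date)
    col = "months_since_last_delinquency"
    out["has_prior_delinquency"] = out[col].notna().astype(int)  # binary flag
    out[col] = out[col].fillna(999)                              # sentinel value
    lo, hi = out["annual_income"].quantile([0.01, 0.99])
    out["annual_income"] = out["annual_income"].clip(lower=lo, upper=hi)  # winsorise
    return out


raw = pd.DataFrame({
    "home_ownership": ["Mortgage ", "RENT", "owner", "n/a"],
    "application_date": ["2025-03-14", "14/03/2025", "20250314", "unknown"],
    "months_since_last_delinquency": [12.0, np.nan, 3.0, np.nan],
    "annual_income": [40_000.0, 55_000.0, 2_000_000.0, 62_000.0],
})
clean = clean_loan_book(raw)
print(clean)
```

Keeping the missingness flag alongside the sentinel lets the model treat "never delinquent" as its own category rather than as a very large number of months.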

Feature Engineering (6 derived features; 4 listed)

Feature	Formula	Rationale
`dti_x_rate`	DTI × Interest Rate	Compounded repayment pressure
`delinq_intensity`	num_delinquencies × has_prior_delinquency	Persistent bad payment behaviour
`log_income`	log(1 + income)	Linearises right-skewed distribution
`log_balance`	log(1 + balance)	Same rationale
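
The formulas in the table translate directly into column operations. The sketch below assumes a pandas DataFrame; `num_delinquencies`, `has_prior_delinquency`, and `annual_income` appear in this README, while the input column names `dti`, `interest_rate`, and `balance` are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def add_derived_features(df):
    """Add the four derived features listed in the table to a copy of `df`."""
    out = df.copy()
    # Compounded repayment pressure: debt-to-income scaled by the interest rate.
    out["dti_x_rate"] = out["dti"] * out["interest_rate"]
    # Delinquency count, zeroed out when there is no prior delinquency.
    out["delinq_intensity"] = out["num_delinquencies"] * out["has_prior_delinquency"]
    # log(1 + x) compresses the long right tail and is defined at x = 0.
    out["log_income"] = np.log1p(out["annual_income"])
    out["log_balance"] = np.log1p(out["balance"])
    return out


sample = pd.DataFrame({
    "dti": [0.40, 0.10],
    "interest_rate": [0.12, 0.08],
    "num_delinquencies": [3, 0],
    "has_prior_delinquency": [1, 0],
    "annual_income": [50_000.0, 120_000.0],
    "balance": [0.0, 2_500.0],
})
features = add_derived_features(sample)
print(features[["dti_x_rate", "delinq_intensity", "log_income", "log_balance"]])
```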

WoE Encoding

All behavioural, affordability, and delinquency variables were encoded using Weight of Evidence (WoE), transforming non-linear relationships into approximately linear inputs for logistic regression.

Features: WoE-encoded behavioural, affordability, and delinquency variables
Regulatory-safe model: excludes age-related variables
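
For readers unfamiliar with the encoding, here is a minimal sketch of how WoE and IV can be computed for one binned feature. It is not the implementation in `utils/woe_iv.py`. The sign convention (WoE = ln of share of goods over share of bads), the `smoothing` parameter, and the bin names are assumptions made for the example.

```python
import math
from collections import Counter


def woe_iv(bins, targets, smoothing=0.5):
    """Return ({bin: WoE}, IV) for a binned feature and a 0/1 default flag.

    Convention assumed here: WoE = ln(share of goods / share of bads), so
    riskier bins get a lower WoE. `smoothing` is added to every count to
    avoid division by zero in bins with no goods or no bads.
    """
    good = Counter(b for b, y in zip(bins, targets) if y == 0)
    bad = Counter(b for b, y in zip(bins, targets) if y == 1)
    labels = sorted(set(bins))
    total_good = sum(good[b] + smoothing for b in labels)
    total_bad = sum(bad[b] + smoothing for b in labels)

    woe, iv = {}, 0.0
    for b in labels:
        share_good = (good[b] + smoothing) / total_good
        share_bad = (bad[b] + smoothing) / total_bad
        woe[b] = math.log(share_good / share_bad)
        iv += (share_good - share_bad) * woe[b]
    return woe, iv


# Toy feature with two bins: 10% defaults in one, 40% in the other.
bins = ["low_dti"] * 100 + ["high_dti"] * 100
targets = [0] * 90 + [1] * 10 + [0] * 60 + [1] * 40

woe, iv = woe_iv(bins, targets, smoothing=0.0)
print(woe)            # low_dti is about ln(3), high_dti is about ln(0.5)
print(round(iv, 4))   # 0.7167
```

Replacing each bin with its WoE value puts the feature on a log-odds scale, which is why the encoded inputs behave approximately linearly in a logistic regression.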

Model

Algorithm: Logistic Regression (sklearn)
Regularisation: L2, C=0.01
Class weights: balanced (handles 85:15 imbalance)
Validation: 5-Fold Stratified Cross-Validation
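
The configuration above can be expressed in scikit-learn roughly as follows. This is a sketch on synthetic data, not the code in `task2_modelling.py`: the synthetic features, the random seed, `max_iter`, and shuffling the folds are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for WoE-encoded inputs, with a minority default class.
rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=(n, 5))
logit = -2.0 + 1.2 * X[:, 0] - 0.8 * X[:, 1]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# L2 is scikit-learn's default penalty, so only C needs to be set.
# A small C (0.01) means strong regularisation.
model = LogisticRegression(C=0.01, class_weight="balanced", max_iter=1000)

# Stratified folds keep the default rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print("fold AUCs:", np.round(auc_scores, 4))
print("mean AUC:", round(float(auc_scores.mean()), 4))
```

`class_weight="balanced"` reweights each class inversely to its frequency, so the minority default class is not swamped during fitting. It shifts the predicted probabilities, so a cut-off chosen on these scores should be calibrated separately.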

📋 Task Coverage

Task	Status	Deliverable
Task 1: EDA Research	✅ Complete	`app.py` Tabs 1–4 (including Research Reference)
Task 1: Interactive App	✅ Complete	`app.py` (6-tab Streamlit app)
Task 2: Feature Engineering	✅ Complete	`task2_modelling.py`
Task 2: Logistic Regression	✅ Complete	AUC 0.7822 > 0.68 ✓
Task 3: Business Dashboard	✅ Complete	`app.py` Tab 6 — threshold simulator

⚠️ Regulatory Considerations

Features examined for regulatory risk:

age: Direct protected attribute, excluded from the regulatory-safe model
email_domain_type: Proxy for socioeconomic status
application_dow: May proxy for employment type (shift worker vs office)
region: Geographic redlining risk

See the Research Reference tab in the app for the full analysis.

🤖 AI-Assisted Development

This project was independently designed, analysed, and implemented by the author, with AI-assisted development tools used to improve productivity, debugging efficiency, and code quality.

AI tools used

ChatGPT — debugging support, architecture refinement, dashboard enhancement, and technical guidance
Claude AI — code review assistance and implementation validation
VS Code Copilot / Chat — bug fixing, syntax troubleshooting, and development acceleration

AI tools were used as engineering assistants only. All modelling decisions, business interpretations, feature engineering strategies, and final implementation choices were independently evaluated and validated by the project author.

👤 Author

Tinyiko Patience Mathebula
