Premature Death Prediction

A regression study that identifies which county-level health factors most strongly predict premature death across 3,152 U.S. counties. Four models (Linear, Ridge, Lasso, Random Forest) were trained on 24 predictors from the 2025 County Health Rankings dataset to find both the magnitude and direction of each factor's effect on years of potential life lost.


Random Forest feature importance across 24 county health predictors.

The Problem

County Health Rankings publishes new data every year. Counties see where they rank but not which factors actually move mortality. A rank alone, high or low, does not say what to change. This project trains four models on the same 24 predictors to find which factors carry the most weight and in which direction, so local policy makers can prioritize concrete interventions instead of guessing from a leaderboard.

Data

Source: 2025 County Health Rankings (Robert Wood Johnson Foundation, UW Population Health Institute)
Scope: 3,152 U.S. counties, 796 variables per county
Target: Premature death raw value (years of potential life lost before age 75, per 100,000 population; abbreviated YPLL per 100K below)
Features: 24 selected predictors covering healthcare access, child poverty, housing quality, injury, education, and environment
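
The selection step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual loading code. It assumes the CSV carries descriptive column names in its first row and short variable codes in its second row, and that counties with a missing target or feature are dropped. The feature names, the codes in the sample, and all values are made up for the example.

```python
import io

import pandas as pd

TARGET = "Premature death raw value"  # target name as given in this README


def load_chr(source, feature_cols):
    """Read a County Health Rankings style CSV and return (X, y).

    Assumption: row 0 holds descriptive column names and row 1 holds
    short variable codes, so row 1 is skipped.
    """
    df = pd.read_csv(source, skiprows=[1])
    cols = [TARGET] + list(feature_cols)
    # Coerce to numeric, then drop counties missing the target or any feature
    # (dropping is an assumption; the notebook may impute instead).
    data = df[cols].apply(pd.to_numeric, errors="coerce").dropna()
    return data[list(feature_cols)], data[TARGET]


# Tiny in-memory stand-in for analytic_data2025_v2.csv (codes and values are made up).
sample = io.StringIO(
    "Name,Premature death raw value,Children in poverty raw value,Injury deaths raw value\n"
    "code_name,code_target,code_poverty,code_injury\n"
    "A,9000,0.20,80\n"
    "B,12000,0.31,110\n"
    "C,,0.15,60\n"
)
features = ["Children in poverty raw value", "Injury deaths raw value"]
X, y = load_chr(sample, features)
print(X.shape, y.tolist())
```

County C has no target value in the sample, so it is dropped before modeling.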

Distribution of premature death rates. The mean across counties sits near 10,400 YPLL per 100K. The long right tail captures counties in the rural South and Appalachia.

Approach

I trained four supervised regression models on an 80/20 split with random_state=42 for reproducibility:

Linear Regression as the interpretable baseline.
Ridge, alpha tuned with GridSearchCV.
Lasso, alpha tuned with GridSearchCV.
Random Forest with 300 trees and no max depth.

StandardScaler ran inside the pipeline for the three linear models. The Random Forest used raw inputs.
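
The setup above could look roughly like the sketch below. It runs on synthetic data from make_regression as a stand-in for the county file. The alpha grid, the 5-fold cross-validation, the Lasso max_iter, and the Random Forest random_state are assumptions for the example, not values taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 24 county predictors.
X, y = make_regression(
    n_samples=400, n_features=24, n_informative=8, noise=20.0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

alpha_grid = np.logspace(-2, 2, 9)  # hypothetical search range

models = {
    # Scaling lives inside each pipeline, so it is refit on every CV training fold.
    "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge": GridSearchCV(
        make_pipeline(StandardScaler(), Ridge()),
        {"ridge__alpha": alpha_grid},
        cv=5,
    ),
    "Lasso": GridSearchCV(
        make_pipeline(StandardScaler(), Lasso(max_iter=10_000)),
        {"lasso__alpha": alpha_grid},
        cv=5,
    ),
    # Trees are insensitive to feature scale, so the forest takes raw inputs.
    "Random Forest": RandomForestRegressor(
        n_estimators=300, max_depth=None, random_state=42
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R² on the held-out 20%

for name, r2 in scores.items():
    print(f"{name}: R² = {r2:.3f}")
```

Keeping StandardScaler inside the pipeline means GridSearchCV never sees scaling statistics computed from a validation fold.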

Correlation matrix of the 24 predictors plus the target, computed before model training.
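
A correlation matrix like the one in the figure can be computed with pandas. The sketch below uses three made-up columns rather than the real predictors. The column names and the relationships between them are invented for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
poverty = rng.normal(size=n)

# Hypothetical columns: two correlated predictors plus the target.
df = pd.DataFrame(
    {
        "children_in_poverty": poverty,
        "income_inequality": 0.7 * poverty + rng.normal(scale=0.5, size=n),
        "premature_death": 2.0 * poverty + rng.normal(size=n),
    }
)

corr = df.corr()  # pairwise Pearson correlations

# Predictors ordered by their correlation with the target.
target_corr = (
    corr["premature_death"].drop("premature_death").sort_values(ascending=False)
)
print(target_corr)
```

Strong correlations between predictors matter here because they make individual linear coefficients less stable, which is one reason to compare Ridge and Lasso against plain Linear Regression.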

Results

Model	R²	RMSE (YPLL per 100K)	MAE (YPLL per 100K)
Random Forest	0.756	1,965	1,229
Lasso (α=10)	0.743	2,015	1,290
Ridge (α=6.158)	0.742	2,018	1,295
Linear Regression	0.742	2,018	1,296
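
For reference, the three metrics in the table can be computed as in the sketch below. The four values are made up to keep the arithmetic easy to check by hand and are not taken from the study.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return R², RMSE, and MAE for a set of predictions."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    return {
        "R2": float(r2_score(y_true, y_pred)),
        "RMSE": rmse,
        "MAE": float(mean_absolute_error(y_true, y_pred)),
    }


# Made-up YPLL values for four hypothetical counties.
y_true = np.array([8000.0, 10000.0, 12000.0, 14000.0])
y_pred = np.array([8500.0, 9500.0, 12500.0, 13000.0])

report = regression_report(y_true, y_pred)
print(report)  # R2 = 0.9125, RMSE ≈ 661.4, MAE = 625.0
```

RMSE squares each error before averaging, so the single 1,000-point miss pulls it above MAE. The same pattern appears in the table, where RMSE is consistently higher than MAE.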

Random Forest predicted vs actual (R² = 0.756). The red dashed line is y = x.

Residual analysis. The left panel plots residuals against predicted values; the right panel shows the residual distribution. No systematic bias appears at either end of the range.
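
One numeric way to check for bias across the prediction range is to average the residuals within bins of predicted value. The sketch below does that on simulated predictions. The sign convention (actual minus predicted), the five bins, and the simulated data are all assumptions for the example.

```python
import numpy as np


def binned_residual_means(y_true, y_pred, n_bins=5):
    """Mean residual (actual minus predicted) within bins of predicted value."""
    residuals = y_true - y_pred
    order = np.argsort(y_pred)  # sort counties from lowest to highest prediction
    return [float(residuals[idx].mean()) for idx in np.array_split(order, n_bins)]


# Simulated, unbiased predictions with noise of about 2,000 YPLL.
rng = np.random.default_rng(0)
y_pred = rng.uniform(5000, 20000, size=1000)
y_true = y_pred + rng.normal(0, 2000, size=1000)

bin_means = binned_residual_means(y_true, y_pred)
print([round(m) for m in bin_means])
```

Bin means close to zero from the lowest to the highest bin are consistent with no systematic bias. A steady drift in one direction would suggest the model under- or over-predicts at one end.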

Key Findings

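The five strongest predictors, ranked by Lasso coefficient magnitude. Because StandardScaler ran before the Lasso fit, each coefficient is the change in YPLL per 100K associated with a one-standard-deviation increase in that predictor: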

Rank	Predictor	Lasso Coefficient	RF Importance
1	Injury Deaths	+1,816	high
2	Children in Poverty	+1,413	0.475
3	Severe Housing Problems	+296	moderate
4	Preventable Hospital Stays	+270	moderate
5	Income Inequality	+197	moderate

Random Forest had the highest R², but tree-based importance only measures how much a predictor's splits reduced prediction error, not whether the predictor pushes mortality up or down. Importances are non-negative and carry no sign. Lasso returns a signed coefficient for each predictor, which is the form a policy maker needs when deciding where to direct funding. For that reason, the policy section below works from the Lasso output.
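
The difference between the two outputs shows up clearly on toy data. The sketch below fits both models on three made-up predictors, one that raises the target, one that lowers it, and one that is pure noise. The predictor names, effect sizes, and the Random Forest random_state are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame(
    {
        "risk_factor": rng.normal(size=n),        # hypothetical, raises the target
        "protective_factor": rng.normal(size=n),  # hypothetical, lowers the target
        "noise": rng.normal(size=n),              # unrelated to the target
    }
)
y = (
    10000
    + 1500 * X["risk_factor"]
    - 1500 * X["protective_factor"]
    + rng.normal(scale=500, size=n)
)

lasso = make_pipeline(StandardScaler(), Lasso(alpha=10)).fit(X, y)
forest = RandomForestRegressor(n_estimators=300, random_state=42).fit(X, y)

summary = pd.DataFrame(
    {
        "lasso_coef": lasso.named_steps["lasso"].coef_,   # signed, per 1 SD
        "rf_importance": forest.feature_importances_,     # non-negative, sums to 1
    },
    index=X.columns,
)
print(summary.round(3))
```

The two real predictors get similar Random Forest importances even though they act in opposite directions. Only the Lasso column separates the factor to reduce from the factor to strengthen.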

Policy Recommendations

The three recommendations below use Allegheny County (Pittsburgh) as a test case. Each one maps to a top Lasso coefficient.

Child poverty reduction. Expand the Child Tax Credit at the federal level and stack county child care subsidies on top. Children in Poverty had the second-largest Lasso coefficient and an RF importance of 0.475, the single largest in the model.
Preventable hospital stay reduction. Fund community health education and primary care navigation in underserved zip codes. This addresses the fourth-ranked predictor and reduces avoidable admissions by getting conditions treated before they require hospitalization.
Housing quality. Pair Housing First placements with voucher expansion and rehabilitation of substandard units. Severe Housing Problems was the third-ranked predictor in the Lasso fit.

Tech Stack

Component	Technology
Language	Python 3.12
Environment	Jupyter Notebook
ML	scikit-learn (LinearRegression, Ridge, Lasso, RandomForestRegressor)
Data	pandas, numpy
Visualization	matplotlib, seaborn
Data Source	County Health Rankings 2025

Quick Start

git clone https://github.com/adarsh-rai-secure/premature-mortality-prediction.git
cd premature-mortality-prediction
pip install pandas numpy scikit-learn matplotlib seaborn
jupyter notebook mortality_prediction.ipynb

Author

Adarsh Rai. MS Information Security Policy and Management, Carnegie Mellon University.

License

MIT
