adarsh-rai-secure/premature-mortality-prediction

Premature Death Prediction

A regression study that identifies which county-level health factors most strongly predict premature death across 3,152 U.S. counties. Four models (Linear, Ridge, Lasso, Random Forest) were trained on 24 predictors from the 2025 County Health Rankings dataset to find both the magnitude and direction of each factor's effect on years of potential life lost.

Portfolio · Data Source

Random Forest feature importance across 24 county health predictors.

The Problem

County Health Rankings publishes new data every year. Counties see where they rank but not which factors actually move mortality, and a rank alone gives no guidance on what to change. This project trains four models on the same 24 predictors to find which factors carry the most weight and in which direction, so local policy makers can prioritize concrete interventions instead of guessing from a leaderboard.

Data

  • Source: 2025 County Health Rankings (Robert Wood Johnson Foundation, UW Population Health Institute)
  • Scope: 3,152 U.S. counties, 796 variables per county
  • Target: Premature death raw value (Years of Potential Life Lost before age 75, per 100K population)
  • Features: 24 selected predictors covering healthcare access, child poverty, housing quality, injury, education, and environment
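To make the target concrete, here is a minimal sketch of how a crude YPLL rate before age 75 can be computed from ages at death. The function name, the toy ages, and the population figure are hypothetical, and the published County Health Rankings measure is age-adjusted, which this sketch does not attempt.

```python
def crude_ypll_rate(ages_at_death, population, cutoff=75, per=100_000):
    """Crude years of potential life lost before `cutoff`, per `per` residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Each death before the cutoff contributes (cutoff - age) lost years;
    # deaths at or after the cutoff contribute nothing.
    years_lost = sum(cutoff - age for age in ages_at_death if age < cutoff)
    return years_lost * per / population


# Hypothetical toy county: five deaths in a population of 1,000.
toy_ages = [30, 60, 74, 75, 90]
toy_rate = crude_ypll_rate(toy_ages, population=1_000)
print(f"Crude YPLL-75 rate: {toy_rate:,.0f} per 100K")
```

Only the three deaths before age 75 count toward the total, which is why a handful of early deaths can move a small county's rate substantially.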

Distribution of premature death rates across 3,152 U.S. counties. The mean across counties sits near 10,400 YPLL per 100K. The long right tail captures counties in the rural South and Appalachia.

Approach

I trained four supervised regression models on an 80/20 train/test split, with random_state=42 for reproducibility:

  1. Linear Regression as the interpretable baseline.
  2. Ridge, alpha tuned with GridSearchCV.
  3. Lasso, alpha tuned with GridSearchCV.
  4. Random Forest with 300 trees and no max depth.

StandardScaler ran inside the pipeline for the three linear models. The Random Forest used raw inputs.
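The setup described above can be sketched roughly as follows. This is not the notebook's actual code: the data is a synthetic stand-in generated with make_regression, the alpha grid is a hypothetical choice (the grid actually searched is not stated here), and the 5-fold cross-validation is an assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 24 county-level predictors (assumption: the
# real notebook loads the County Health Rankings file instead).
X, y = make_regression(n_samples=400, n_features=24, n_informative=6,
                       noise=25.0, random_state=42)

# 80/20 split with a fixed seed so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Hypothetical alpha grid; the grid used in the project is not documented.
alpha_grid = np.logspace(-2, 2, 9)


def tuned(estimator, step_name):
    """Scale inside the pipeline, then grid-search alpha with 5-fold CV."""
    pipe = make_pipeline(StandardScaler(), estimator)
    return GridSearchCV(pipe, {f"{step_name}__alpha": alpha_grid}, cv=5)


models = {
    "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge": tuned(Ridge(), "ridge"),
    "Lasso": tuned(Lasso(max_iter=10_000), "lasso"),
    # Trees are insensitive to feature scale, so the forest gets raw inputs.
    "Random Forest": RandomForestRegressor(
        n_estimators=300, max_depth=None, random_state=42),
}

test_r2 = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_r2[name] = model.score(X_test, y_test)  # R^2 on the held-out 20%
    print(f"{name}: R^2 = {test_r2[name]:.3f}")
```

Keeping StandardScaler inside the pipeline means the scaler is refit on each cross-validation training fold, so the validation fold never leaks into the scaling statistics.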

Correlation matrix of the 24 predictors plus the premature death target, computed before model training.
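A correlation matrix of this kind can be produced with pandas in a few lines. The sketch below uses synthetic data and hypothetical column names, not the project's actual columns, and assumes Pearson correlation (the pandas default).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Hypothetical predictors and a target built from them plus noise.
child_poverty = rng.normal(size=n)
injury_deaths = rng.normal(size=n)
df = pd.DataFrame({
    "child_poverty": child_poverty,
    "injury_deaths": injury_deaths,
    "premature_death": 2.0 * child_poverty + injury_deaths + rng.normal(size=n),
})

# Pairwise Pearson correlations between every column, target included.
corr = df.corr()

# Correlation of each predictor with the target, strongest first.
target_corr = (corr["premature_death"]
               .drop("premature_death")
               .sort_values(ascending=False))
print(target_corr)
```

Inspecting the matrix before training helps flag pairs of predictors that are strongly correlated with each other, which matters for how the linear models' coefficients should be read.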

Results

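| Model | R² | RMSE (YPLL per 100K) | MAE (YPLL per 100K) |
|---|---|---|---|
| Random Forest | 0.756 | 1,965 | 1,229 |
| Lasso (α=10) | 0.743 | 2,015 | 1,290 |
| Ridge (α=6.158) | 0.742 | 2,018 | 1,295 |
| Linear Regression | 0.742 | 2,018 | 1,296 |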
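For reference, the three metrics in the table can be computed as in the sketch below. The helper name and the toy values are hypothetical; scikit-learn's r2_score, mean_squared_error, and mean_absolute_error in sklearn.metrics cover the same quantities.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, and MAE for paired actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
    }


# Hypothetical YPLL-per-100K values for four counties.
y_true = [8000, 10000, 12000, 14000]
y_pred = [9000, 10000, 11000, 14000]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE squares errors before averaging, so it is never smaller than MAE and grows faster when a few counties are badly mispredicted. That is consistent with every row of the table, where RMSE exceeds MAE by roughly 700 YPLL per 100K.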

Random Forest predicted vs actual premature death rate (R² = 0.756). The red dashed line is y = x.

Residual analysis for the Random Forest model. The left panel plots residuals against predicted values; the right panel shows the residual distribution. No systematic bias appears at either end of the range.
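A numeric counterpart to the residuals-versus-predicted panel is to average the residuals within bands of predicted value. The sketch below is one way to do that, not the notebook's method; the function name and the synthetic data are hypothetical.

```python
import numpy as np


def mean_residual_by_bin(y_true, y_pred, n_bins=5):
    """Mean residual within equal-sized groups ordered by predicted value."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    # Sort observations by prediction, then split into n_bins groups.
    order = np.argsort(y_pred)
    groups = np.array_split(order, n_bins)
    return np.array([residuals[idx].mean() for idx in groups])


# Synthetic example: predictions with unbiased noise around the truth.
rng = np.random.default_rng(0)
y_pred = rng.uniform(5_000, 20_000, size=1_000)
y_true = y_pred + rng.normal(0, 2_000, size=1_000)

bin_means = mean_residual_by_bin(y_true, y_pred)
print(np.round(bin_means))
```

Bin means that stay close to zero from the lowest to the highest band match the "no systematic bias" reading; a steady drift from positive to negative would suggest the model under-predicts at one end and over-predicts at the other.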

Key Findings

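The five predictors that drove most of the explained variance are ranked here by Lasso coefficient magnitude. Because StandardScaler ran before the Lasso fit, each coefficient reads as the change in YPLL per 100K associated with a one-standard-deviation increase in that predictor: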

| Rank | Predictor | Lasso Coefficient | RF Importance |
|---|---|---|---|
| 1 | Injury Deaths | +1,816 | high |
| 2 | Children in Poverty | +1,413 | 0.475 |
| 3 | Severe Housing Problems | +296 | moderate |
| 4 | Preventable Hospital Stays | +270 | moderate |
| 5 | Income Inequality | +197 | moderate |

Random Forest had the highest R², but tree-based importance only tells you which predictors the model split on most, not whether a predictor pushes mortality up or down. Lasso returns a signed coefficient for each predictor, which is the form a policy maker needs when deciding where to direct funding. For that reason, the policy section below works from the Lasso output.
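The difference between the two outputs is easy to see on a toy problem. In the sketch below, which uses synthetic data and hypothetical feature names, one feature raises the target, one lowers it by the same amount, and one has no effect. The alpha value is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 600
names = ["harmful", "protective", "irrelevant"]  # hypothetical features

X = rng.normal(size=(n, 3))
# "harmful" pushes the target up, "protective" pushes it down equally.
y = 3.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Lasso on standardized inputs: coefficients carry a sign.
X_std = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_std, y)

# Random Forest on raw inputs: importances are non-negative shares.
forest = RandomForestRegressor(n_estimators=300, random_state=42).fit(X, y)

for name, coef, imp in zip(names, lasso.coef_, forest.feature_importances_):
    print(f"{name:<11} lasso coef = {coef:+.2f}   rf importance = {imp:.3f}")
```

The forest assigns similar importance to the harmful and protective features because both are equally useful for splitting, while the Lasso coefficients separate them by sign. The impurity-based importances also sum to 1, so they describe relative share rather than effect size in target units.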

Policy Recommendations

The three recommendations here target Allegheny County (Pittsburgh) as a test case. Each one maps to a top Lasso coefficient.

  1. Child poverty reduction. Expand the Child Tax Credit at the federal level and stack county child care subsidies on top. Children in Poverty had the second-largest Lasso coefficient and an RF importance of 0.475, the single largest in the model.
  2. Preventable hospital stay reduction. Fund community health education and primary care navigation in underserved zip codes. This addresses the fourth-ranked predictor and reduces avoidable admissions by treating conditions before they require hospitalization.
  3. Housing quality. Pair Housing First placements with voucher expansion and rehabilitation of substandard units. Severe Housing Problems was the third-ranked predictor in the Lasso fit.

Tech Stack

Component Technology
Language Python 3.12
Environment Jupyter Notebook
ML scikit-learn (LinearRegression, Ridge, Lasso, RandomForestRegressor)
Data pandas, numpy
Visualization matplotlib, seaborn
Data Source County Health Rankings 2025

Quick Start

git clone https://github.com/adarsh-rai-secure/premature-mortality-prediction.git
cd premature-mortality-prediction
pip install pandas numpy scikit-learn matplotlib seaborn
jupyter notebook mortality_prediction.ipynb

Author

Adarsh Rai. MS Information Security Policy and Management, Carnegie Mellon University.

License

MIT
