A regression study that identifies which county-level health factors most strongly predict premature death across 3,152 U.S. counties. Four models (Linear, Ridge, Lasso, Random Forest) were trained on 24 predictors from the 2025 County Health Rankings dataset to find both the magnitude and direction of each factor's effect on years of potential life lost.
Random Forest feature importance across 24 county health predictors.
County Health Rankings publishes new data every year. Counties see where they rank but not which factors actually move mortality, and a rank alone does not say what to change. This project trains four models on the same 24 predictors to find which factors carry the most weight and in which direction, so local policy makers can prioritize concrete interventions instead of guessing from a leaderboard.
- Source: 2025 County Health Rankings (Robert Wood Johnson Foundation, UW Population Health Institute)
- Scope: 3,152 U.S. counties, 796 variables per county
- Target: Premature death raw value (Years of Potential Life Lost per 100K population before age 75)
- Features: 24 selected predictors covering healthcare access, child poverty, housing quality, injury, education, and environment
Distribution of premature death rates. The mean across counties sits near 10,400 YPLL per 100K. The long right tail reflects counties in the rural South and Appalachia.
I trained four supervised regression models on an 80/20 split with random_state=42 for reproducibility:
- Linear Regression as the interpretable baseline.
- Ridge, alpha tuned with GridSearchCV.
- Lasso, alpha tuned with GridSearchCV.
- Random Forest with 300 trees and no max depth.
StandardScaler ran inside the pipeline for the three linear models. The Random Forest used raw inputs.
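The setup above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the data is a synthetic stand-in generated with `make_regression`, and the alpha grid and the `tuned` helper are assumptions, since the search range used in the project is not stated.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 24 county-level predictors (NOT the real dataset).
X, y = make_regression(
    n_samples=400, n_features=24, n_informative=8, noise=20.0, random_state=42
)

# 80/20 split with a fixed seed, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical alpha grid; the project's real grid is an unknown.
alphas = np.logspace(-2, 2, 9)


def tuned(estimator):
    """Wrap a scaled linear model in a grid search over alpha (assumed helper)."""
    pipe = make_pipeline(StandardScaler(), estimator)
    # make_pipeline names each step after its lowercased class name.
    step = estimator.__class__.__name__.lower()
    return GridSearchCV(pipe, {f"{step}__alpha": alphas}, cv=5, scoring="r2")


models = {
    # Scaling lives inside the pipeline, so it is fit on training folds only.
    "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge": tuned(Ridge()),
    "Lasso": tuned(Lasso(max_iter=10_000)),
    # Trees are insensitive to feature scale, so the forest takes raw inputs.
    "Random Forest": RandomForestRegressor(
        n_estimators=300, max_depth=None, random_state=42
    ),
}

fitted = {name: model.fit(X_train, y_train) for name, model in models.items()}
scores = {name: model.score(X_test, y_test) for name, model in fitted.items()}

for name, r2 in scores.items():
    print(f"{name}: test R2 = {r2:.3f}")
```

Keeping `StandardScaler` inside the pipeline means each cross-validation fold fits its own scaler, which avoids leaking test-fold statistics into the alpha search.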
Correlation matrix of the 24 predictors plus the target, computed before model training.
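A correlation matrix of this kind might be computed along the following lines. The column names and values here are hypothetical placeholders, not fields or figures from the County Health Rankings file.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300

# Synthetic columns with hypothetical names; the target is built so that
# poverty has a stronger linear relationship with it than injury does.
poverty = rng.normal(size=n)
injury = rng.normal(size=n)
df = pd.DataFrame(
    {
        "children_in_poverty": poverty,
        "injury_deaths": injury,
        "premature_death": 2.0 * poverty + 1.0 * injury
        + rng.normal(scale=0.5, size=n),
    }
)

# Pearson correlation between every pair of columns (pandas default).
corr = df.corr()

# Rank predictors by their correlation with the target.
target_corr = (
    corr["premature_death"].drop("premature_death").sort_values(ascending=False)
)
print(target_corr)
```

With seaborn installed, `sns.heatmap(corr)` would render the same matrix as a figure.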
| Model | R² | RMSE | MAE |
|---|---|---|---|
| Random Forest | 0.756 | 1,965 | 1,229 |
| Lasso (α=10) | 0.743 | 2,015 | 1,290 |
| Ridge (α=6.158) | 0.742 | 2,018 | 1,295 |
| Linear Regression | 0.742 | 2,018 | 1,296 |
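For reference, the three metrics in the table can be computed as in the sketch below. The function name `regression_metrics` and the toy values are illustrative assumptions; scikit-learn's `r2_score`, `mean_squared_error`, and `mean_absolute_error` would give the same numbers.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return R², RMSE, and MAE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
    }


# Toy YPLL-style values, invented for illustration only.
y_true = [8000.0, 10000.0, 12000.0, 14000.0]
y_pred = [9000.0, 10000.0, 11000.0, 14000.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE and MAE are in the same units as the target (YPLL per 100K). RMSE penalizes large misses more heavily, which is why it sits above MAE for every model in the table.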
Random Forest predicted vs actual (R² = 0.756). The red dashed line is y = x.
Residual analysis. The left panel plots residuals against predicted values, the right panel shows the residual distribution. No systematic bias appears at either end of the range.
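A numeric counterpart to the left panel is to sort counties by predicted value, split them into equal-sized bins, and check that the mean residual in each bin stays near zero. The sketch below uses synthetic predictions with unbiased noise by construction; `residual_bias_by_bin` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def residual_bias_by_bin(y_true, y_pred, n_bins=5):
    """Mean residual within each bin of sorted predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    order = np.argsort(y_pred)  # low predictions first
    return [float(chunk.mean()) for chunk in np.array_split(residuals[order], n_bins)]


rng = np.random.default_rng(42)
# Synthetic predictions and targets; the noise has zero mean, so no bin
# should show a large systematic offset.
y_pred = rng.uniform(5_000, 20_000, size=1_000)
y_true = y_pred + rng.normal(0, 2_000, size=1_000)

bin_means = residual_bias_by_bin(y_true, y_pred)
for i, mean in enumerate(bin_means, start=1):
    print(f"bin {i}: mean residual = {mean:.1f}")
```

A model that under-predicts high-mortality counties would show a clearly positive mean residual in the top bin.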
The five predictors that drove most of the explained variance, ranked by Lasso coefficient magnitude:
| Rank | Predictor | Lasso Coefficient | RF Importance |
|---|---|---|---|
| 1 | Injury Deaths | +1,816 | high |
| 2 | Children in Poverty | +1,413 | 0.475 |
| 3 | Severe Housing Problems | +296 | moderate |
| 4 | Preventable Hospital Stays | +270 | moderate |
| 5 | Income Inequality | +197 | moderate |
Random Forest had the highest R², but tree-based importance only tells you which predictors the model split on most, not whether a predictor pushes mortality up or down. Lasso returns a signed coefficient for each predictor, which is the form a policy maker needs when deciding where to direct funding. For that reason, the policy section below works from the Lasso output.
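The difference between the two outputs can be seen in a small synthetic example, where one feature raises the target, one lowers it, and one is noise. The data, feature names, and `alpha=0.1` below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
# Feature 0 pushes the target up, feature 1 pushes it down, feature 2 is noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

names = ["raises_target", "lowers_target", "noise"]  # hypothetical labels

# Lasso on standardized inputs: coefficients are signed and comparable in scale.
X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

# Random Forest on raw inputs: importances are non-negative and sum to 1.
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

for name, coef, importance in zip(names, lasso.coef_, forest.feature_importances_):
    print(f"{name:>14}: lasso coef = {coef:+.2f}, RF importance = {importance:.3f}")
```

Both models flag the first two features as important, but only the Lasso coefficients show that they act in opposite directions.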
Three recommendations targeted at Allegheny County (Pittsburgh) as a test case. Each one maps to a top Lasso coefficient.
- Child poverty reduction. Expand the Child Tax Credit at the federal level and stack county child care subsidies on top. Children in Poverty had the second-largest Lasso coefficient and an RF importance of 0.475, the single largest in the model.
- Preventable hospital stay reduction. Fund community health education and primary care navigation in underserved zip codes. This addresses the fourth-ranked predictor by treating manageable conditions before they turn into avoidable admissions.
- Housing quality. Pair Housing First placements with voucher expansion and rehabilitation of substandard units. Severe Housing Problems was the third-ranked predictor in the Lasso fit.
| Component | Technology |
|---|---|
| Language | Python 3.12 |
| Environment | Jupyter Notebook |
| ML | scikit-learn (LinearRegression, Ridge, Lasso, RandomForestRegressor) |
| Data | pandas, numpy |
| Visualization | matplotlib, seaborn |
| Data Source | County Health Rankings 2025 |
git clone https://github.com/adarsh-rai-secure/premature-mortality-prediction.git
cd premature-mortality-prediction
pip install pandas numpy scikit-learn matplotlib seaborn
jupyter notebook mortality_prediction.ipynb

Adarsh Rai. MS Information Security Policy and Management, Carnegie Mellon University.
MIT




