Predicting annual medical insurance charges from patient attributes using a Random Forest regression pipeline
Overview • Dataset • Installation • Usage • Pipeline • App • Results
Health insurers price premiums based on a small set of well-known risk factors: age, body mass index, smoking status, and a handful of demographic details. This project builds an end-to-end machine learning pipeline that learns those relationships from historical billing data and predicts the annual medical insurance charges for a new patient profile.
The repo contains three things:
| Component | What it does |
|---|---|
| Training notebook/script | Loads, explores, visualizes, and models the insurance dataset, then exports a trained pipeline |
| Streamlit app | A lightweight web UI where anyone can enter a patient profile and get an instant cost estimate |
| Reusable artifacts | A serialized scikit-learn pipeline + metadata file so the app never has to retrain |
Disclaimer: This project is for educational and portfolio purposes only. Predictions are estimates derived from a public dataset and must not be used for actual underwriting, billing, or medical/financial decisions.
The model is trained on the classic Medical Cost Personal Dataset (`insurance.csv`), a widely used benchmark for regression practice originally compiled for Machine Learning with R (Brett Lantz) and popularized on Kaggle by Miri Choi.
| Column | Type | Description |
|---|---|---|
| `age` | numeric | Age of the primary beneficiary (years) |
| `sex` | categorical | Biological sex (male / female) |
| `bmi` | numeric | Body Mass Index (kg/m²) |
| `children` | numeric | Number of dependents covered by the plan |
| `smoker` | categorical | Smoking status (yes / no) |
| `region` | categorical | US residential region (northeast, northwest, southeast, southwest) |
| `charges` | numeric (target) | Individual medical costs billed by health insurance ($) |
The canonical version of the dataset has 1,338 rows, 7 columns, and no missing values, though the pipeline still includes imputation as a defensive measure for messier real-world data.
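The schema above can be checked with a few lines of pandas before training. This is a minimal sketch, not code from the notebook: the helper name `summarize_insurance_frame` is hypothetical, and the three-row frame is hand-made for illustration rather than taken from `insurance.csv`.

```python
import pandas as pd

EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]


def summarize_insurance_frame(df: pd.DataFrame) -> dict:
    """Return row count, column count, and total missing cells (hypothetical helper)."""
    missing_cols = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing_cols:
        raise ValueError(f"Missing expected columns: {missing_cols}")
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "n_missing": int(df[EXPECTED_COLUMNS].isna().sum().sum()),
    }


# Hand-made rows for illustration only (NOT taken from insurance.csv).
# One bmi value is deliberately missing to show what the check reports.
sample = pd.DataFrame({
    "age": [25, 40, 61],
    "sex": ["female", "male", "female"],
    "bmi": [22.5, 31.0, None],
    "children": [0, 2, 1],
    "smoker": ["no", "yes", "no"],
    "region": ["northeast", "southeast", "southwest"],
    "charges": [3200.0, 38000.0, 14000.0],
})
summary = summarize_insurance_frame(sample)
print(summary)
```

Calling `summarize_insurance_frame(pd.read_csv("insurance.csv"))` on the real file should report the row, column, and missing-value counts stated above.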
    medical-insurance-cost-prediction/
    │
    ├── insurance.csv                                  # Raw dataset (not included; see Dataset section)
    ├── data_preprocessing_and_model_training.ipynb    # EDA, preprocessing, training, evaluation
    │
    ├── app.py                                         # Streamlit web app
    ├── requirements.txt                               # Python dependencies
    │
    ├── delivery_time_model.pkl                        # Serialized scikit-learn pipeline (generated after training)
    ├── model_metadata.json                            # Valid input ranges/options for the app (generated after training)
    │
    └── README.md                                      # You are here
Note: The exported model file is named `delivery_time_model.pkl` to match the filename produced by the training script in this repo. Feel free to rename it (and update the corresponding path in `app.py`) to something like `insurance_cost_model.pkl` if you'd like the naming to be project-accurate.
1. Clone the repository

       git clone https://github.com/zakir-maswani/Medical-Insurance-Cost-Predictor.git
       cd Medical-Insurance-Cost-Predictor

2. Create a virtual environment (recommended)

       python -m venv venv
       source venv/bin/activate   # Windows: venv\Scripts\activate

3. Install dependencies

       pip install -r requirements.txt

4. Add the dataset
Place `insurance.csv` in the project root. The dataset is available on Kaggle as "Medical Cost Personal Datasets".
Run the notebook (or the equivalent .py script) top to bottom:

    jupyter notebook data_preprocessing_and_model_training.ipynb

This will:

- Load and explore `insurance.csv`
- Generate visualizations (charges by sex, smoking status, region, BMI, age, etc.)
- Build a preprocessing + Random Forest pipeline
- Train, evaluate, and print MSE / MAE / R² metrics
- Export two artifacts to the project root:
  - `delivery_time_model.pkl`: the fitted pipeline
  - `model_metadata.json`: valid input ranges/categories used to build the app's form
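The export step can be sketched roughly as follows, using joblib for the model and JSON for the metadata. The metadata key layout (`numeric` / `categorical`), the helper names, and the `DummyRegressor` stand-in for the fitted pipeline are all assumptions made for illustration; the notebook's actual schema may differ.

```python
import json
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.dummy import DummyRegressor

NUMERIC = ["age", "bmi", "children"]
CATEGORICAL = ["sex", "smoker", "region"]


def build_metadata(df: pd.DataFrame) -> dict:
    """Collect min/max per numeric column and the options per categorical column.

    The key layout is an assumption, not the repo's confirmed schema.
    """
    return {
        "numeric": {c: {"min": float(df[c].min()), "max": float(df[c].max())} for c in NUMERIC},
        "categorical": {c: sorted(df[c].dropna().unique().tolist()) for c in CATEGORICAL},
    }


def export_artifacts(model, df: pd.DataFrame, out_dir):
    """Write the model (joblib) and the metadata (JSON) into out_dir."""
    out_dir = Path(out_dir)
    model_path = out_dir / "delivery_time_model.pkl"
    meta_path = out_dir / "model_metadata.json"
    joblib.dump(model, model_path)
    meta_path.write_text(json.dumps(build_metadata(df), indent=2))
    return model_path, meta_path


# Hand-made rows for illustration only (NOT taken from insurance.csv).
df = pd.DataFrame({
    "age": [25, 40, 61],
    "sex": ["female", "male", "female"],
    "bmi": [22.5, 31.0, 27.3],
    "children": [0, 2, 1],
    "smoker": ["no", "yes", "no"],
    "region": ["northeast", "southeast", "southwest"],
    "charges": [3200.0, 38000.0, 14000.0],
})
# Stand-in estimator; the real project would pass its fitted Pipeline here.
model = DummyRegressor(strategy="mean").fit(df[NUMERIC], df["charges"])

with tempfile.TemporaryDirectory() as tmp:
    model_path, meta_path = export_artifacts(model, df, tmp)
    reloaded = joblib.load(model_path)            # what the app would do at startup
    metadata = json.loads(meta_path.read_text())  # drives the form's ranges and options

print(metadata)
```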
Once the artifacts above exist:

    streamlit run app.py

Then open the local URL Streamlit prints (typically http://localhost:8501).
The model is wrapped in a single scikit-learn Pipeline so preprocessing and inference always stay in sync.
    Raw input (age, sex, bmi, children, smoker, region)
                             │
                             ▼
    ┌──────────────────────────────────────────────────┐
    │                ColumnTransformer                 │
    │                                                  │
    │  Numeric branch           Categorical branch     │
    │  (age, bmi, children)     (sex, smoker, region)  │
    │  └─ Median imputation     ├─ Most-frequent       │
    │                           │  imputation          │
    │                           └─ One-Hot Encode      │
    └──────────────────────────────────────────────────┘
                             │
                             ▼
          RandomForestRegressor (random_state=42)
                             │
                             ▼
                Predicted annual charges ($)
Why these choices?
- Median / most-frequent imputation: robust to outliers and a safe default if production data ever contains nulls, even though the source dataset is complete.
- One-Hot Encoding with `handle_unknown="ignore"`: prevents the pipeline from breaking if it ever sees a category it wasn't trained on.
- Random Forest Regressor: handles non-linear interactions (e.g., the outsized effect of smoking + high BMI together) without manual feature engineering, and is robust to the mild skew in the `charges` target.
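The pipeline described above could be assembled along these lines. This is a sketch under stated assumptions rather than the notebook's exact code: the step names (`impute`, `onehot`, `preprocess`, `model`) and the function `build_pipeline` are illustrative, and the six training rows are hand-made, not drawn from `insurance.csv`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

NUMERIC = ["age", "bmi", "children"]
CATEGORICAL = ["sex", "smoker", "region"]


def build_pipeline() -> Pipeline:
    """Preprocessing + Random Forest in one object (step names are illustrative)."""
    numeric_branch = Pipeline([("impute", SimpleImputer(strategy="median"))])
    categorical_branch = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric_branch, NUMERIC),
        ("cat", categorical_branch, CATEGORICAL),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", RandomForestRegressor(random_state=42)),
    ])


# Hand-made rows for illustration only (NOT taken from insurance.csv).
# One bmi value is NaN so the median imputer has something to fill.
train = pd.DataFrame({
    "age": [19, 33, 45, 52, 28, 60],
    "sex": ["female", "male", "male", "female", "male", "female"],
    "bmi": [27.9, 22.7, float("nan"), 30.8, 33.0, 25.8],
    "children": [0, 1, 2, 0, 3, 0],
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "region": ["southwest", "northwest", "southeast", "northeast", "southeast", "northwest"],
    "charges": [17000.0, 4500.0, 8200.0, 41000.0, 5100.0, 13500.0],
})
pipeline = build_pipeline().fit(train[NUMERIC + CATEGORICAL], train["charges"])

# "midwest" was never seen during fit; handle_unknown="ignore" encodes it as all zeros.
new_patient = pd.DataFrame([{
    "age": 40, "bmi": 29.5, "children": 1,
    "sex": "female", "smoker": "yes", "region": "midwest",
}])[NUMERIC + CATEGORICAL]
prediction = pipeline.predict(new_patient)
print(f"Estimated annual charges: ${prediction[0]:,.2f}")
```

Because the imputers and encoder live inside the same `Pipeline` as the regressor, the app only has to pass a raw one-row DataFrame to `predict`.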
Train/test split: 70% train / 30% test, random_state=42 for reproducibility.
The notebook reports three standard regression metrics on both the train and test sets:
| Metric | What it measures |
|---|---|
| MSE (Mean Squared Error) | Average squared difference between predicted and actual charges; penalizes large errors heavily |
| MAE (Mean Absolute Error) | Average absolute error; easy to interpret directly in dollars |
| R² (Coefficient of Determination) | Proportion of variance in charges explained by the model (closer to 1 is better) |
Exact values depend on your training run; they're printed at the end of the notebook. As a rule of thumb for this dataset, a well-tuned Random Forest typically explains 80–88% of the variance in charges on the held-out test set, with the dominant predictive signal coming from smoker status, age, and bmi.
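The 70/30 split and the three metrics can be reproduced in outline as below. The data is synthetic: `make_synthetic_frame` and its cost formula are invented so the example runs without `insurance.csv`, which means the printed numbers say nothing about the real dataset. Imputation is left out to keep the sketch short, and `regression_report` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

CATEGORICAL = ["sex", "smoker", "region"]


def make_synthetic_frame(n_rows: int = 300, seed: int = 0) -> pd.DataFrame:
    """Generate an invented stand-in with the same columns as insurance.csv."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "age": rng.integers(18, 65, n_rows),
        "sex": rng.choice(["female", "male"], n_rows),
        "bmi": rng.uniform(16.0, 45.0, n_rows).round(1),
        "children": rng.integers(0, 6, n_rows),
        "smoker": rng.choice(["no", "yes"], n_rows, p=[0.8, 0.2]),
        "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n_rows),
    })
    # Made-up cost formula so the model has a signal to learn; not a real finding.
    df["charges"] = (
        250.0 * df["age"]
        + 20000.0 * (df["smoker"] == "yes")
        + rng.normal(0.0, 1000.0, n_rows)
    )
    return df


def regression_report(model, X, y) -> dict:
    """Compute MSE, MAE, and R² for one data split."""
    pred = model.predict(X)
    return {
        "MSE": mean_squared_error(y, pred),
        "MAE": mean_absolute_error(y, pred),
        "R2": r2_score(y, pred),
    }


df = make_synthetic_frame()
X, y = df.drop(columns="charges"), df["charges"]
# 70% train / 30% test with a fixed seed, as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = Pipeline([
    ("preprocess", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL)],
        remainder="passthrough",  # numeric columns pass through unchanged
    )),
    ("model", RandomForestRegressor(random_state=42)),
]).fit(X_train, y_train)

train_report = regression_report(model, X_train, y_train)
test_report = regression_report(model, X_test, y_test)
for name, report in [("train", train_report), ("test", test_report)]:
    print(name, {k: round(v, 3) for k, v in report.items()})
```

Reporting both splits side by side makes overfitting visible: a train R² far above the test R² suggests the forest is memorizing rather than generalizing.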
Improvement idea: `GridSearchCV` is already imported in the notebook but not yet wired up. Hyperparameter tuning over `n_estimators`, `max_depth`, and `min_samples_leaf` is a natural next step to squeeze out additional performance.
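One way the tuning step might look is sketched here. The grid values, the `cv=3` setting, the MAE-based scoring, and the step name `model` (which determines the `model__` parameter prefix) are all assumptions chosen to keep the example fast. The data is synthetic with an invented cost formula, so the selected parameters carry no meaning for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for insurance.csv (invented values, illustration only).
rng = np.random.default_rng(0)
n_rows = 200
X = pd.DataFrame({
    "age": rng.integers(18, 65, n_rows),
    "sex": rng.choice(["female", "male"], n_rows),
    "bmi": rng.uniform(16.0, 45.0, n_rows).round(1),
    "children": rng.integers(0, 6, n_rows),
    "smoker": rng.choice(["no", "yes"], n_rows, p=[0.8, 0.2]),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n_rows),
})
y = 250.0 * X["age"] + 20000.0 * (X["smoker"] == "yes") + rng.normal(0.0, 1000.0, n_rows)

pipeline = Pipeline([
    ("preprocess", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "smoker", "region"])],
        remainder="passthrough",
    )),
    ("model", RandomForestRegressor(random_state=42)),
])

# The "model__" prefix routes each parameter to the pipeline step named "model".
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [None, 5],
    "model__min_samples_leaf": [1, 5],
}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="neg_mean_absolute_error")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated MAE:", round(-search.best_score_, 2))
```

Searching over the whole pipeline rather than the bare regressor keeps preprocessing inside each fold, so the encoder is never fitted on validation rows.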
The app (app.py) provides a friendly interface around the trained pipeline:
- Sidebar "Patient Profile" form: sliders and dropdowns for age, sex, BMI, children, smoking status, and region, auto-populated from `model_metadata.json` so the inputs always stay valid for whatever range the model was trained on.
- Styled estimate card: the prediction is rendered as a custom-styled HTML/CSS "insurance estimate card" rather than a plain number, complete with a monthly-equivalent breakdown.
- Context badges: automatic BMI category (underweight / normal / overweight / obese) and a smoker-risk indicator, so the estimate is easy to interpret at a glance.
- Graceful fallbacks: if the model or metadata files aren't found yet, the app explains exactly what to run first instead of crashing.
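The badge, monthly-breakdown, and fallback logic listed above can be illustrated with a few plain-Python helpers, independent of Streamlit. These are hypothetical functions, not the contents of `app.py`: the BMI thresholds are the standard WHO adult cut-offs, assumed here because the README does not state which boundaries the app uses.

```python
import tempfile
from pathlib import Path

REQUIRED_ARTIFACTS = ("delivery_time_model.pkl", "model_metadata.json")


def bmi_category(bmi: float) -> str:
    """Map a BMI value to a badge label (WHO adult cut-offs, assumed)."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal"
    if bmi < 30.0:
        return "overweight"
    return "obese"


def monthly_equivalent(annual_charges: float) -> float:
    """Spread an annual estimate evenly over 12 months, rounded to cents."""
    return round(annual_charges / 12, 2)


def missing_artifacts(root: str = ".") -> list:
    """Return the names of required artifact files not present under root."""
    return [name for name in REQUIRED_ARTIFACTS if not (Path(root) / name).exists()]


badges = [bmi_category(value) for value in (17.0, 22.0, 27.5, 31.0)]
monthly = monthly_equivalent(12000.0)

# An empty temporary directory simulates a checkout where training has not run yet.
with tempfile.TemporaryDirectory() as empty_dir:
    not_found = missing_artifacts(empty_dir)
if not_found:
    print("Run the training notebook first; missing:", ", ".join(not_found))
```

In a Streamlit app, the `not_found` check would typically run before any call to `joblib.load`, with the message shown in the UI instead of printed.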
- Language: Python 3.9+
- Data handling: pandas, NumPy
- Modeling: scikit-learn (Pipeline, ColumnTransformer, RandomForestRegressor)
- Visualization: Matplotlib, Seaborn
- Serialization: joblib, JSON
- Web app: Streamlit + custom HTML/CSS
- Wire up `GridSearchCV` for hyperparameter tuning (already imported, unused)
- Add cross-validation instead of a single train/test split for more robust metrics
- Compare Random Forest against Gradient Boosting / XGBoost / linear baselines
- Add SHAP-based feature importance to the app for per-prediction explainability
- Containerize the app with Docker for one-command deployment
- Add automated tests for the preprocessing pipeline
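As an illustration of the cross-validation item above, a 5-fold evaluation might look like the following. This is a sketch, not planned project code: the fold count, the R² scoring, and the synthetic data with its invented cost formula are all assumptions, so the scores do not reflect the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for insurance.csv (invented values, illustration only).
rng = np.random.default_rng(0)
n_rows = 200
X = pd.DataFrame({
    "age": rng.integers(18, 65, n_rows),
    "sex": rng.choice(["female", "male"], n_rows),
    "bmi": rng.uniform(16.0, 45.0, n_rows).round(1),
    "children": rng.integers(0, 6, n_rows),
    "smoker": rng.choice(["no", "yes"], n_rows, p=[0.8, 0.2]),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n_rows),
})
y = 250.0 * X["age"] + 20000.0 * (X["smoker"] == "yes") + rng.normal(0.0, 1000.0, n_rows)

pipeline = Pipeline([
    ("preprocess", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "smoker", "region"])],
        remainder="passthrough",
    )),
    ("model", RandomForestRegressor(random_state=42)),
])

# Shuffled 5-fold split; every row is used for validation exactly once.
folds = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=folds, scoring="r2")

print("R² per fold:", np.round(scores, 3))
print("Mean R²:", round(scores.mean(), 3), "Std:", round(scores.std(), 3))
```

The spread across folds gives a rough sense of how much a single 70/30 split could over- or under-state performance.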
Contributions are welcome! Feel free to open an issue or submit a pull request for bug fixes, new features, or documentation improvements.
- Fork the repo
- Create a feature branch (`git checkout -b feature/your-feature`)
- Commit your changes (`git commit -m "Add your feature"`)
- Push to the branch (`git push origin feature/your-feature`)
- Open a Pull Request
This project is licensed under the MIT License; see the LICENSE file for details.
- Dataset: Medical Cost Personal Datasets by Miri Choi on Kaggle, originally from Machine Learning with R by Brett Lantz.
- Built with scikit-learn and Streamlit.