This project aims to predict the premium price of a medical insurance policy based on an individual's health and lifestyle factors. By analyzing features such as age, presence of chronic diseases, height, weight, and family history of cancer, the goal is to develop a regression model that can accurately estimate insurance costs. This is a crucial task for insurance companies in risk assessment and pricing.
- Dataset: Kaggle - Medical Insurance Premium Prediction
- Size: 986 entries, 11 columns
- Key Features:
Age,Diabetes,BloodPressureProblems,AnyTransplants,AnyChronicDiseases,Height,Weight,KnownAllergies,HistoryOfCancerInFamily,NumberOfMajorSurgeries.
- Approach:
- Data Cleaning: The dataset was clean with no missing values or duplicates.
- Exploratory Data Analysis: Histograms, boxplots, and scatter plots were used for visualization to understand data distributions and the relationship between features and the target variable
PremiumPrice. A correlation heatmap was also generated. - Data Standardization: The
Agecolumn was standardized usingStandardScaler, and the rest of the features were also scaled before model training. - Regression Task: The target variable is
PremiumPrice. - Models Used:
- A suite of regression models were trained, including Ridge, XGBoost, Random Forest, AdaBoost, Gradient Boosting, Bagging, Decision Tree, SVR, and K-Nearest Neighbors (KNN).
- Best R² Score:
- 0.860 with Random Forest Regressor.
- 0.855 with Gradient Boosting Regressor.
- 0.824 with Bagging Regressor.
- The high R² scores for the ensemble models indicate that these health metrics are strong predictors of insurance premium prices.
- Enable insurance companies to accurately price medical premiums based on an individual's risk profile.
- Assist in risk assessment for new insurance applicants.
- Support data-driven decision-making in insurance product development.
- Provide a tool for individuals to understand how their health and lifestyle factors influence their insurance costs.
Clone the repository:
git clone https://github.com/BhaveshBhakta/Medical-Insurance-Cost-Prediction-Using-ML.git
cd Medical-Insurance-Cost-Prediction-Using-MLInstall the necessary libraries:
pip install pandas numpy seaborn matplotlib scikit-learn xgboostWe welcome contributions to improve the project. You can help by:
- Performing comprehensive hyperparameter tuning and cross-validation for the top-performing regression models.
- Exploring more advanced feature engineering techniques, such as creating a BMI feature from
HeightandWeight. - Investigating alternative scaling methods or outlier handling strategies.
- Adding explainability (e.g., SHAP or LIME) to understand which health factors are the most significant drivers of premium prices.