Project Description:
An insurance agency, ABC Insurance, has a large dataset containing information about their policyholders and claims. They want to perform exploratory data analysis (EDA) on this dataset to gain insights that can help them make better business decisions and improve their operations.
The agency wants to analyze how body type and environment affect the premium. The effect of a disease, and the cost of treating it, differ depending on the circumstances. For example, a smoker’s medical insurance premium may be higher than a non-smoker’s, because smokers are more likely to develop chronic diseases. The agency wants to use the data to study healthcare premium costs.
Tools Used: Python
Libraries Used: NumPy, Pandas, Matplotlib, Seaborn
Project Report:
Python for Data Analysis
Project: Insurance for Data Analysis
By- Santoshkumar Pandey
Import libraries such as Pandas, matplotlib, NumPy, and seaborn and load the insurance dataset
#Import libraries such as Pandas, matplotlib, NumPy, and seaborn
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#load the insurance dataset
insurance=pd.read_csv("insurance.csv")
#Check the shape of the data along with the data types of the columns
insurance.shape
(1338, 7)
insurance.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
insurance.dtypes
age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object
#Check missing values in the dataset and find the appropriate measures to fill in the missing values
insurance.isna().sum()
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
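Every column reports zero missing values, so no imputation is needed for this dataset. If gaps did appear, one common approach is shown here as a minimal sketch, assuming the median for numeric columns and the most frequent value for text columns. The `toy` frame and its gaps are hypothetical and are not part of insurance.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; the real insurance.csv has none.
toy = pd.DataFrame({
    "age": [19, 18, np.nan, 33],
    "bmi": [27.9, np.nan, 33.0, 22.705],
    "smoker": ["yes", "no", None, "no"],
})

filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Median is less sensitive to outliers than the mean.
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        # Most frequent category for text columns.
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled.isna().sum())
```

Whether median and mode are appropriate depends on why the values are missing, so this is a starting point rather than a rule.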
#Preview the first five rows of the dataset
insurance.head()
| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
Explore the relationship between the feature and target column using a count plot of categorical columns and a scatter plot of numerical columns
Here we have:
Categorical Variables: Smoker, Region, Sex
Numerical Variables: Age, BMI, Children
Dependent Variable: Charges
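The same split can be derived from the column dtypes instead of being typed by hand. This is a minimal sketch: `sample` holds the first three rows shown by insurance.head(), and the variable names `target`, `numerical` and `categorical` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# First three rows of the dataset, as shown by insurance.head().
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 4449.462],
})

target = "charges"
# Numeric columns, minus the dependent variable.
numerical = sample.select_dtypes(include="number").columns.drop(target).tolist()
# Everything that is not numeric is treated as categorical.
categorical = sample.select_dtypes(exclude="number").columns.tolist()

print(numerical)
print(categorical)
```

On the full dataset the same calls would be applied to `insurance` instead of `sample`.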
#Relationship between Age and Charges
plt.scatter(x=insurance['age'],y=insurance['charges'])
plt.xlabel('Age')
plt.ylabel('Charges')
plt.title('Age vs Charges')
plt.show()
Inference: Charges tend to increase as age increases.
#Relationship between BMI and Charges
plt.scatter(x=insurance['bmi'],y=insurance['charges'])
plt.xlabel('BMI')
plt.ylabel('Charges')
plt.title('BMI vs Charges')
plt.show()
Inference: Charges also tend to increase at higher BMI values, i.e. in the overweight and obese categories.
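A visual trend like the ones for age and BMI can be backed by a number. The sketch below computes the Pearson correlation of each feature with the target. The helper name `correlation_with_target` is hypothetical, and the `demo` values are made up for illustration, so the printed figures say nothing about the real dataset.

```python
import pandas as pd

def correlation_with_target(df, features, target="charges"):
    """Return the Pearson correlation of each feature with the target."""
    return df[features].corrwith(df[target])

# Hypothetical values, made up for illustration only.
demo = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "bmi": [22.0, 31.0, 27.0, 35.0, 29.0],
    "charges": [2000.0, 4000.0, 6000.0, 8000.0, 10000.0],
})

corr = correlation_with_target(demo, ["age", "bmi"])
print(corr)
```

With the real data, the call would be `correlation_with_target(insurance, ['age', 'bmi'])`. Pearson correlation only captures linear association, so it complements the scatter plots rather than replacing them.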
#Relationship between Children and Charges
plt.scatter(x=insurance['children'],y=insurance['charges'])
plt.xlabel('No. of Children')
plt.ylabel('Charges')
plt.title('No of Children vs Charges')
plt.show()
Inference: The highest charges appear to thin out as the number of children increases, though this may partly reflect there being fewer policyholders with many children.
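Because children takes only a handful of integer values, a per-group table of counts and averages can be easier to read than the scatter plot. A minimal sketch, using a small made-up frame `demo` (hypothetical values, not taken from insurance.csv):

```python
import pandas as pd

# Hypothetical values, made up for illustration only.
demo = pd.DataFrame({
    "children": [0, 0, 0, 1, 1, 2],
    "charges": [1000.0, 3000.0, 20000.0, 2000.0, 4000.0, 5000.0],
})

# Group size alongside mean and median charges per number of children.
summary = demo.groupby("children")["charges"].agg(["count", "mean", "median"])
print(summary)
```

Showing the count next to the averages makes it clear when a group is too small to support a conclusion, and the median is less affected by a few very high charges than the mean.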
#Relationship between Smoking and Charges
plt.scatter(x=insurance['smoker'],y=insurance['charges'])
plt.xlabel('Smoker')
plt.ylabel('Charges')
plt.title('Smoker vs Charges')
plt.show()
Inference: It is quite evident that if a person is a Smoker, the insurance charges are significantly higher.
#Relationship between Sex and Charges
plt.scatter(x=insurance['sex'],y=insurance['charges'])
plt.xlabel('Sex')
plt.ylabel('Charges')
plt.title('Sex vs Charges')
plt.show()
Inference: The sex of the person does not seem to have any impact on the insurance charges.
#Relationship between Region and Charges
plt.scatter(x=insurance['region'],y=insurance['charges'])
plt.xlabel('Region')
plt.ylabel('Charges')
plt.title('Region vs Charges')
plt.show()
Inference: The region of the person also does not seem to have any significant impact on the insurance charges.
Summary:
The insurance charges appear to be influenced by: Age, BMI, No. of Children, and Smoking Habits.
The insurance charges seem to be independent of: Sex and Region.
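One rough way to check this split for the categorical columns is to compare group medians. In the sketch below, `median_ratio` is a hypothetical helper name and the `demo` values are made up. A ratio close to 1 suggests little difference between groups, but it is a crude heuristic and not a statistical test.

```python
import pandas as pd

def median_ratio(df, column, target="charges"):
    """Ratio of the highest to the lowest group median of the target."""
    medians = df.groupby(column)[target].median()
    return medians.max() / medians.min()

# Hypothetical values, made up for illustration only.
demo = pd.DataFrame({
    "smoker": ["yes", "yes", "no", "no"],
    "sex": ["male", "female", "male", "female"],
    "charges": [30000.0, 34000.0, 8000.0, 8000.0],
})

ratios = {col: median_ratio(demo, col) for col in ["smoker", "sex"]}
print(ratios)
```

On the real data the loop would run over `['smoker', 'sex', 'region']` with `insurance` in place of `demo`.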
#To Ignore Warnings
import warnings
warnings.filterwarnings('ignore')
#To get all Pair Plots
sns.pairplot(insurance, hue='smoker')
plt.show()
#Simplifying the pair plots to only the variables important for analysis
#pairplot creates its own figure, so the size is set through its height argument
sns.pairplot(insurance,x_vars=['age','children','bmi'],y_vars='charges',hue='smoker',kind='scatter',height=5)
plt.show()
Analysis across variables using smoking as a marker. The charges are higher for smokers across all categories.
#Now Analysis of Categorical Variables
sns.set(style='dark')
sns.swarmplot(x= 'region', y='charges', hue='smoker', data=insurance)
plt.show()
sns.set(style='dark')
sns.swarmplot(x= 'sex', y='charges', hue='smoker', data=insurance)
plt.show()
Sex and Region do not have any impact on charges by themselves. But here also we observe higher charges for smokers.
#Create subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 7))
# Plot countplot for 'sex' with hue 'smoker' on the first subplot
sns.countplot(x='sex', data=insurance, hue='smoker', ax=axes[0])
axes[0].set_title('Count Plot of Sex with Smoker Status')
axes[0].set_xlabel('Sex')
axes[0].set_ylabel('Count')
# Plot countplot for 'region' with hue 'smoker' on the second subplot
sns.countplot(x='region', data=insurance, hue='smoker', ax=axes[1])
axes[1].set_title('Count Plot of Region with Smoker Status')
axes[1].set_xlabel('Region')
axes[1].set_ylabel('Count')
plt.tight_layout()
plt.show()
Count plot of smokers and non-smokers by Sex and Region. The number of smokers is much lower than the number of non-smokers in every category.
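The counts can also be expressed as shares, which makes groups of different sizes comparable. A minimal sketch with `pd.crosstab`, using a made-up frame `demo` (hypothetical values):

```python
import pandas as pd

# Hypothetical values, made up for illustration only.
demo = pd.DataFrame({
    "sex": ["male", "male", "male", "male",
            "female", "female", "female", "female"],
    "smoker": ["yes", "yes", "no", "no",
               "yes", "no", "no", "no"],
})

# normalize="index" turns each row into proportions that sum to 1.
share = pd.crosstab(demo["sex"], demo["smoker"], normalize="index")
print(share)
```

For the real data, `pd.crosstab(insurance['region'], insurance['smoker'], normalize='index')` would give the smoker share per region in the same way.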
#Perform Data Visualization using plots of feature vs feature
#Categorical Variables vs Dependent Variable
fig,axes=plt.subplots(1,3,figsize=(21,7))
sns.violinplot(x='sex',y='charges',data=insurance, palette='rainbow', ax=axes[0])
sns.stripplot(x='region',y='charges',data=insurance, ax=axes[1])
sns.swarmplot(x='smoker',y='charges',data=insurance, ax=axes[2])
plt.show()
Check whether premium charges increase with age for smokers and non-smokers
#Check whether premium charges increase with age for smokers and non-smokers
sns.scatterplot(x='age',y='charges', hue='smoker', data=insurance).set(title='Charges vs Age: For Smokers & Non-Smokers')
plt.show()
Inference: Premium charges clearly increase with age for both smokers and non-smokers.
However, charges are higher for smokers than for non-smokers across all age groups.
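The two observations above correspond to the slope and the intercept of a straight-line fit per group. The sketch below fits charges against age separately for smokers and non-smokers with `np.polyfit`. The `demo` values are hypothetical, and a straight line is an assumption that may not suit the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical values, made up for illustration only.
demo = pd.DataFrame({
    "age": [20, 40, 60, 20, 40, 60],
    "smoker": ["no", "no", "no", "yes", "yes", "yes"],
    "charges": [2000.0, 7000.0, 12000.0, 20000.0, 26000.0, 32000.0],
})

trends = {}
for status, group in demo.groupby("smoker"):
    # Degree-1 fit: charges is approximated by slope * age + intercept.
    slope, intercept = np.polyfit(group["age"], group["charges"], 1)
    trends[status] = (slope, intercept)

for status, (slope, intercept) in trends.items():
    print(f"smoker={status}: slope={slope:.1f} per year, intercept={intercept:.1f}")
```

A positive slope in both groups supports the first statement, and a higher fitted line for smokers supports the second.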