Project Description:

An insurance agency, ABC Insurance, has a large dataset containing information about their policyholders and claims. They want to perform exploratory data analysis (EDA) on this dataset to gain insights that can help them make better business decisions and improve their operations.

The agency wants to analyze how policyholder characteristics, such as body type (BMI), and lifestyle or environmental factors affect the premium. The effect of a disease and the cost of treating it differ depending on the circumstances. For example, a smoker's medical insurance premium may be higher than that of a non-smoker, because smokers are more likely to develop chronic diseases. The agency wants to analyze the data to understand what drives healthcare premium costs.

Tools Used: Python

Libraries Used: NumPy, pandas, Matplotlib, Seaborn

Project Report:

Python for Data Analysis
Project: Insurance for Data Analysis
By- Santoshkumar Pandey

Import libraries such as Pandas, matplotlib, NumPy, and seaborn and load the insurance dataset

In [34]:

#Import libraries such as Pandas, matplotlib, NumPy, and seaborn 
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [4]:

#load the insurance dataset
insurance=pd.read_csv("insurance.csv")

In [5]:

#Check the shape of the data along with the data types of the columns
insurance.shape

Out[5]:

(1338, 7)

In [6]:

insurance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

In [7]:

insurance.dtypes

Out[7]:

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

In [8]:

#Check missing values in the dataset and find the appropriate measures to fill in the missing values
insurance.isna().sum()

Out[8]:

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
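
The check above finds no missing values, so nothing needs to be filled in this dataset. If gaps were present, one common approach would be the median for numerical columns and the most frequent value for categorical ones. The sketch below illustrates this on the first five rows of the dataset (as printed by `insurance.head()`), with two gaps injected by hand. The helper name `fill_missing` and the injected gaps are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# First five rows of the dataset, as printed by insurance.head()
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

# Inject two gaps purely for illustration (the real data has none)
sample.loc[1, "bmi"] = np.nan
sample.loc[3, "region"] = np.nan


def fill_missing(df):
    """Fill numeric gaps with the column median and other gaps with the mode."""
    filled = df.copy()
    for col in filled.columns:
        if filled[col].isna().any():
            if pd.api.types.is_numeric_dtype(filled[col]):
                filled[col] = filled[col].fillna(filled[col].median())
            else:
                filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled


filled = fill_missing(sample)
print(filled.isna().sum())
```

The median is used rather than the mean because it is less sensitive to extreme values, which matters for skewed columns.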

In [9]:

#Preview the first five rows before exploring the relationship between the features and the target column
insurance.head()

Out[9]:

	age	sex	bmi	children	smoker	region	charges
0	19	female	27.900	0	yes	southwest	16884.92400
1	18	male	33.770	1	no	southeast	1725.55230
2	28	male	33.000	3	no	southeast	4449.46200
3	33	male	22.705	0	no	northwest	21984.47061
4	32	male	28.880	0	no	northwest	3866.85520

Explore the relationship between the feature and target column using a count plot of categorical columns and a scatter plot of numerical columns

Here we have –
Categorical Variables: Smoker, Region, Sex
Numerical Variables: Age, BMI, Children
Dependent Variable: Charges

In [23]:

#Relationship between Age and Charges
plt.scatter(x=insurance['age'],y=insurance['charges'])
plt.xlabel('Age')
plt.ylabel('Charges')
plt.title('Age vs Charges')
plt.show()


Inference: Charges tend to increase as age increases.
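
A visual trend like this can also be expressed as a number. A minimal sketch, assuming a Pearson correlation coefficient is an acceptable summary of the linear trend: it uses only the age and charges values from the first five rows printed by `insurance.head()`, so the resulting figure says nothing about the full dataset. On the full data the same call would be `insurance["age"].corr(insurance["charges"])`.

```python
import pandas as pd

# Age and charges from the first five rows printed by insurance.head()
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear trend
r = sample["age"].corr(sample["charges"])
print(f"Pearson r between age and charges: {r:.3f}")
```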

In [24]:

#Relationship between BMI and Charges
plt.scatter(x=insurance['bmi'],y=insurance['charges'])
plt.xlabel('BMI')
plt.ylabel('Charges')
plt.title('BMI vs Charges')
plt.show()

Inference: Here too, charges tend to increase as BMI moves to the higher side, i.e. the overweight and obese categories.
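
The overweight and obese categories mentioned above can be made explicit by binning BMI. The sketch below assumes the standard WHO adult cut-offs (18.5, 25 and 30) and applies them to the BMI values from the first five rows printed by `insurance.head()`. The column name `bmi_category` is an illustrative choice, and means computed on five rows are not meaningful in themselves.

```python
import pandas as pd

# BMI and charges from the first five rows printed by insurance.head()
sample = pd.DataFrame({
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

# WHO adult BMI bands: <18.5, 18.5 to <25, 25 to <30, >=30
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["Underweight", "Normal", "Overweight", "Obese"]

# right=False makes each bin include its lower edge, e.g. [25, 30)
sample["bmi_category"] = pd.cut(sample["bmi"], bins=bins, labels=labels, right=False)

# Mean charges per band; observed=True skips bands with no rows
print(sample.groupby("bmi_category", observed=True)["charges"].mean())
```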

In [25]:

#Relationship between Children and Charges
plt.scatter(x=insurance['children'],y=insurance['charges'])
plt.xlabel('No. of Children')
plt.ylabel('Charges')
plt.title('No of Children vs Charges')
plt.show()

Inference: Charges appear to decrease as the number of children increases.
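
Because the number of children takes only a few distinct values, a scatter plot can be hard to read: the visible spread in each column depends on how many policyholders fall into that group. One way to check the impression, sketched below on the first five rows printed by `insurance.head()`, is to tabulate the group size alongside the mean and median charges. The variable name `by_children` is an illustrative choice.

```python
import pandas as pd

# Children and charges from the first five rows printed by insurance.head()
sample = pd.DataFrame({
    "children": [0, 1, 3, 0, 0],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

# Group size, mean and median of charges for each number of children
by_children = sample.groupby("children")["charges"].agg(["count", "mean", "median"])
print(by_children)
```

On the full dataset the same line would be run on `insurance`; groups with very few rows deserve less weight when reading the trend.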

In [27]:

#Relationship between Smoking and Charges
plt.scatter(x=insurance['smoker'],y=insurance['charges'])
plt.xlabel('Smoker')
plt.ylabel('Charges')
plt.title('Smoker vs Charges')
plt.show()

Inference: It is quite evident that if a person is a Smoker, the insurance charges are significantly higher.

In [29]:

#Relationship between Sex and Charges
plt.scatter(x=insurance['sex'],y=insurance['charges'])
plt.xlabel('Sex')
plt.ylabel('Charges')
plt.title('Sex vs Charges')
plt.show()

Inference: Sex of the person does not seem to have any impact on the insurance charges.

In [25]:

#Relationship between Region and Charges
plt.scatter(x=insurance['region'],y=insurance['charges'])
plt.xlabel('Region')
plt.ylabel('Charges')
plt.title('Region vs Charges')
plt.show()

Inference: The region of the person also does not seem to have any significant impact on the insurance charges.
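
Scatter plots over a categorical axis stack all points in a few vertical lines, so differences between groups can be hard to judge by eye. A minimal sketch of a numerical cross-check, assuming the median is a reasonable summary of each group: the helper `median_charges_by` is a hypothetical name, and the data are again only the first five rows printed by `insurance.head()`.

```python
import pandas as pd

# Categorical columns and charges from the first five rows of insurance.head()
sample = pd.DataFrame({
    "sex": ["female", "male", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})


def median_charges_by(df, column):
    """Row count and median charges for each level of a categorical column."""
    return df.groupby(column)["charges"].agg(["count", "median"])


for column in ["smoker", "sex", "region"]:
    print(median_charges_by(sample, column), end="\n\n")

by_smoker = median_charges_by(sample, "smoker")
```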

Summary:
The insurance charges are influenced by: age, BMI, number of children, and smoking habits.

The insurance charges seem to be independent of: sex and region.

In [36]:

#To Ignore Warnings
import warnings
warnings.filterwarnings('ignore')

#To get all Pair Plots
sns.pairplot(insurance, hue='smoker')
plt.show()

In [12]:

#Simplifying the pair plot to only the variables important for the analysis
#pairplot creates its own figure, so the size is set through its height argument
sns.pairplot(insurance,x_vars=['age','children','bmi'],y_vars='charges',hue='smoker',kind='scatter',height=5)
plt.show()

Analysis across variables using smoking as a marker. The charges are higher for smokers across all categories.

In [32]:

#Now Analysis of Categorical Variables
sns.set(style='dark')
sns.swarmplot(x= 'region', y='charges', hue='smoker', data=insurance)
plt.show()

In [34]:

sns.set(style='dark')
sns.swarmplot(x= 'sex', y='charges', hue='smoker', data=insurance)
plt.show()

Sex and region do not have any impact on charges by themselves, but here too we observe higher charges for smokers.

In [11]:

#Create subplots 
fig, axes = plt.subplots(1, 2, figsize=(14, 7)) 
# Plot countplot for 'sex' with hue 'smoker' on the first subplot 
sns.countplot(x='sex', data=insurance, hue='smoker', ax=axes[0]) 
axes[0].set_title('Count Plot of Sex with Smoker Status') 
axes[0].set_xlabel('Sex') 
axes[0].set_ylabel('Count') 

# Plot countplot for 'region' with hue 'smoker' on the second subplot 
sns.countplot(x='region', data=insurance, hue='smoker', ax=axes[1]) 
axes[1].set_title('Count Plot of Region with Smoker Status') 
axes[1].set_xlabel('Region') 
axes[1].set_ylabel('Count') 

plt.tight_layout() 
plt.show()

Count plot of smokers and non-smokers by sex and region. There are far fewer smokers than non-smokers in every sex and region group.
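
The counts in the plot can also be expressed as shares, which makes groups of different sizes easier to compare. The sketch below uses `pd.crosstab` with `normalize="index"` so that each row sums to 1. It runs on the first five rows printed by `insurance.head()`, which contain a single smoker, so the shares shown are not representative of the full data. The variable name `smoker_share` is an illustrative choice.

```python
import pandas as pd

# Sex and smoker status from the first five rows printed by insurance.head()
sample = pd.DataFrame({
    "sex": ["female", "male", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no", "no"],
})

# Share of smokers and non-smokers within each sex (each row sums to 1)
smoker_share = pd.crosstab(sample["sex"], sample["smoker"], normalize="index")
print(smoker_share)
```

Replacing `sample["sex"]` with the region column would give the same breakdown by region.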

In [37]:

#Perform Data Visualization using plots of feature vs feature
#Categorical Variables vs Dependent Variable 
fig,axes=plt.subplots(1,3,figsize=(21,7))

sns.violinplot(x='sex',y='charges',data=insurance, palette='rainbow', ax=axes[0])
sns.stripplot(x='region',y='charges',data=insurance, ax=axes[1])
sns.swarmplot(x='smoker',y='charges',data=insurance, ax=axes[2])
plt.show()

Check whether premium charges for smokers and non-smokers increase with age

In [33]:

#Check whether premium charges for smokers and non-smokers increase with age
sns.scatterplot(x='age',y='charges', hue='smoker', data=insurance).set(title='Charges vs Age: For Smokers & Non-Smokers')
plt.show()

Inference: It is evident that premium charges increase with age for both smokers and non-smokers.
However, charges are higher for smokers than for non-smokers across all age groups.
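
The rate at which charges rise with age can be estimated separately for each group by fitting a straight line. A minimal sketch, assuming a simple least-squares line per group is adequate: the helper `age_slope_by_group` is a hypothetical name, and the `toy` numbers are invented for illustration only and are not taken from `insurance.csv`.

```python
import numpy as np
import pandas as pd


def age_slope_by_group(df, group_col="smoker"):
    """Least-squares slope of charges on age within each group."""
    slopes = {}
    for name, group in df.groupby(group_col):
        if group["age"].nunique() < 2:
            continue  # a line needs at least two distinct ages
        slope, _intercept = np.polyfit(group["age"], group["charges"], deg=1)
        slopes[name] = slope
    return pd.Series(slopes, name="charges_per_year_of_age")


# Invented toy numbers for illustration only, NOT taken from insurance.csv
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 20, 30, 40, 50],
    "smoker": ["no"] * 4 + ["yes"] * 4,
    "charges": [7000, 9500, 12000, 14500, 26000, 29000, 32000, 35000],
})

slopes = age_slope_by_group(toy)
print(slopes)
```

On the real data the call would be `age_slope_by_group(insurance)`. A positive slope in both groups would support the first part of the inference; the gap between the two fitted lines relates to the second.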
