Project Description:
Cardiovascular diseases are the leading cause of death globally, so identifying their causes and developing a system that predicts heart attacks effectively is necessary. The presented
data contains information on the relevant factors that might impact heart health, and it needs to be explored in detail before any further analysis.
Tools Used: Python, Tableau
Libraries Used: Pandas, Matplotlib, Seaborn, Scikit-learn
Analysis Summary:
Analyzed an open-source dataset of 303 patient records across 14 attributes to develop a system that predicts heart attacks effectively. Performed data inspection, data treatment, EDA, hypothesis testing, logistic regression modelling, confusion matrix validation, and dashboarding of results. Used Python with the Pandas, Matplotlib, Seaborn, and Scikit-learn libraries, and Tableau for dashboarding.
Data Source: Link
Project Report:
Determine and examine the factors that play a significant role in increasing the Rate of Heart Attacks
By Santoshkumar Pandey
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings('ignore')
Preliminary Data Inspection
data = pd.read_excel('Data Analyst Healthcare Project.xlsx')
data.head()
|  | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
data.shape
(303, 14)
Check Missing Values
data.isnull().sum()
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64
Checking Duplicates
data.duplicated().any()
True
data.drop_duplicates(subset=None, inplace=True)
data.duplicated().any()
False
data.shape
(302, 14)
We see that 1 duplicate row was removed.
# Exploring the data: preliminary statistical summary and measures of central tendency
data.describe()
|  | age | sex | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | ca | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 302.00000 | 302.000000 | 302.000000 | 302.000000 | 302.000000 | 302.000000 | 302.000000 | 302.000000 | 302.000000 | 302.000000 | 302.000000 |
| mean | 54.42053 | 0.682119 | 131.602649 | 246.500000 | 0.149007 | 0.526490 | 149.569536 | 0.327815 | 1.043046 | 0.718543 | 0.543046 |
| std | 9.04797 | 0.466426 | 17.563394 | 51.753489 | 0.356686 | 0.526027 | 22.903527 | 0.470196 | 1.161452 | 1.006748 | 0.498970 |
| min | 29.00000 | 0.000000 | 94.000000 | 126.000000 | 0.000000 | 0.000000 | 71.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 48.00000 | 0.000000 | 120.000000 | 211.000000 | 0.000000 | 0.000000 | 133.250000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 55.50000 | 1.000000 | 130.000000 | 240.500000 | 0.000000 | 1.000000 | 152.500000 | 0.000000 | 0.800000 | 0.000000 | 1.000000 |
| 75% | 61.00000 | 1.000000 | 140.000000 | 274.750000 | 0.000000 | 1.000000 | 166.000000 | 1.000000 | 1.600000 | 1.000000 | 1.000000 |
| max | 77.00000 | 1.000000 | 200.000000 | 564.000000 | 1.000000 | 2.000000 | 202.000000 | 1.000000 | 6.200000 | 4.000000 | 1.000000 |
# Exploring the data distribution visually
print('Data Distribution')
data.hist(layout=(3,5), figsize=(16,10), color='g')
plt.show()
There are several categorical variables. Now we will explore the variables with respect to the target, which indicates whether the patient had a heart attack.
print('This looks like a fairly balanced dataset, as the distribution of the majority and minority classes is around 55:45')
sns.countplot(x="target", data=data, palette="mako_r")
plt.show()
This looks like a fairly balanced dataset, as the distribution of the majority and minority classes is around 55:45
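To put a number on the balance claim above, the class proportions can be computed directly (a quick check, not part of the original write-up):
# Share of each target class; this is where the roughly 55:45 split comes from
print(data['target'].value_counts(normalize=True).round(2))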
print('Composition of patients with respect to gender')
sns.countplot(x='sex', data=data, palette="bwr")
plt.xlabel("Sex (0 = female, 1= male)")
plt.show()
Composition of patients with respect to gender
pd.crosstab(data.sex,data.target).plot(kind="bar", color = ['g','r'])
plt.title('Distribution of target and sex (0-female 1-male)')
plt.xlabel("Sex (0 = female, 1= male)")
plt.ylabel('Counts')
plt.show()
print('The proportion of heart attacks among females appears much higher than among males in this dataset')
The proportion of heart attacks among females appears much higher than among males in this dataset
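As a rough check of this observation (a small sketch, not in the original analysis), the heart-attack rate within each sex can be computed by normalizing the crosstab by row:
# Share of target=1 within each sex (0 = female, 1 = male)
print(pd.crosstab(data.sex, data.target, normalize='index').round(2))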
print('Exploring the occurrence of CVD (Heart Attacks) across different ages')
pd.crosstab(data.age,data.target).plot(kind="bar",figsize=(20,6), color = ['g','r'])
plt.title('Heart Disease Distribution by Patient Age')
plt.xlabel('Age')
plt.ylabel('Counts')
plt.show()
Exploring the occurrence of CVD (Heart Attacks) across different ages
The occurrence of heart attacks (as a percentage) appears to be much higher in the 40 to 60 year age bracket.
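To quantify this (an illustrative sketch; the band boundaries are an assumption, not taken from the report), the heart-attack rate can be compared across coarse age bands:
# Heart-attack rate per age band
age_bands = pd.cut(data.age, bins=[29, 40, 50, 60, 77], include_lowest=True)
print(data.groupby(age_bands)['target'].mean().round(2))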
plt.bar(data.age[data.target==1], data.thalach[(data.target==1)], color="red")
plt.bar(data.age[data.target==0], data.thalach[(data.target==0)], color="grey")
plt.legend(["Diseased", "Not Diseased"])
plt.xlabel("Age")
plt.ylabel("Maximum Heart Rate")
plt.show()
The higher the maximum heart rate, the greater the occurrence of heart attacks across all age categories.
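Because the overlapping bars above are hard to read, a scatter plot of the same variables may be clearer (a presentation alternative, not a change to the analysis):
# Same data as the bar chart above, drawn as a scatter plot
plt.scatter(data.age[data.target==1], data.thalach[data.target==1], color="red", label="Diseased")
plt.scatter(data.age[data.target==0], data.thalach[data.target==0], color="grey", label="Not Diseased")
plt.legend()
plt.xlabel("Age")
plt.ylabel("Maximum Heart Rate")
plt.show()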
print('Can we detect Heart Attack risk based on anomalies in Resting Blood Pressure?')
pd.crosstab(data.trestbps,data.target).plot(kind="bar",figsize=(20,6), color = ['g','r'])
plt.title('Heart Disease Distribution by Resting Blood Pressure')
plt.xlabel('Resting Blood Pressure')
plt.ylabel('Counts')
plt.show()
The incidence of heart attacks increases rapidly once the resting blood pressure reaches 120 mm Hg and above.
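A simple split at that threshold (a sketch; the 120 mm Hg cut-off is taken from the observation above) puts a number on the claim:
# Heart-attack rate for resting blood pressure below vs at/above 120
print(data.groupby(data.trestbps >= 120)['target'].mean().round(2))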
print('Analyzing the Correlation between various variables')
%matplotlib inline
plt.figure(figsize=(10,10))
sns.heatmap(data.corr(),annot=True,fmt='.1f')
plt.show()
Analyzing the Correlation between various variables
data_corr=data.corr()['target'][:-1]
feature_list=data_corr[abs(data_corr)>0.1].sort_values(ascending=False)
feature_list
cp          0.432080
thalach     0.419955
slope       0.343940
restecg     0.134874
trestbps   -0.146269
age        -0.221476
sex        -0.283609
thal       -0.343101
ca         -0.408992
oldpeak    -0.429146
exang      -0.435601
Name: target, dtype: float64
From the above correlation plot we see that cp (chest pain), thalach, and slope are the features most strongly positively correlated with the target.
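For a quick visual summary of these correlations (optional, a small presentation sketch), the filtered feature list can be plotted directly:
# Horizontal bar chart of features with |correlation| > 0.1 against the target
feature_list.sort_values().plot.barh(title='Correlation with target')
plt.show()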
Understanding the Relationship Between these strongly correlated variables and Target (Heart Attacks)
pd.crosstab(data.cp,data.target).plot(kind="bar", color = ['g','r'])
plt.title('Heart Disease Distribution by Chest Pain Type')
plt.xlabel('Chest Pain')
plt.ylabel('Counts')
plt.show()
We observe that those who have chest pain type 1 or type 2 are more likely to be affected by heart disease.
pd.crosstab(data.slope,data.target).plot(kind="bar", color = ['g','r'])
plt.title('Heart Disease Distribution by Slope (Peak Exercise) Type')
plt.xlabel('Slope (Peak Exercise)')
plt.ylabel('Counts')
plt.show()
We see that the rate of heart attacks is much higher for Slope 2 than for Slope 0 and Slope 1.
pd.crosstab(data.thal,data.target).plot(kind="bar", color = ['g','r'])
plt.title('Heart Disease Distribution by Thalassemia')
plt.xlabel('Thalassemia')
plt.ylabel('Counts')
plt.show()
From this data we observe that those who have thalassemia type 2 (fixed defect) are much more likely to be affected by heart disease.
# Exploring the relationship between the three high-impact variables
sns.lmplot(x="trestbps", y="chol",data=data,hue="cp")
plt.show()
# Using a pair plot to understand the relationship between the given variables
sns.pairplot(data, x_vars=['chol','trestbps','thalach'], y_vars=['age'], hue='target', kind='scatter')
plt.show()
Creating Dummy Variables
In this dataset, 'cp', 'thal', and 'slope' are categorical variables, so we'll turn them into dummy variables.
chest_pain=pd.get_dummies(data['cp'],prefix='cp',drop_first=True)
data=pd.concat([data,chest_pain],axis=1)
data.drop(['cp'],axis=1,inplace=True)
sp=pd.get_dummies(data['slope'],prefix='slope')
th=pd.get_dummies(data['thal'],prefix='thal')
frames=[data,sp,th]
data=pd.concat(frames,axis=1)
data.drop(['slope','thal'],axis=1,inplace=True)
data.head(5)
|  | age | sex | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | ca | … | cp_1 | cp_2 | cp_3 | slope_0 | slope_1 | slope_2 | thal_0 | thal_1 | thal_2 | thal_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | … | False | False | True | True | False | False | False | True | False | False |
| 1 | 37 | 1 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | … | False | True | False | True | False | False | False | False | True | False |
| 2 | 41 | 0 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 0 | … | True | False | False | False | False | True | False | False | True | False |
| 3 | 56 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 0 | … | True | False | False | False | False | True | False | False | True | False |
| 4 | 57 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 0 | … | False | False | False | False | False | True | False | False | True | False |
5 rows × 21 columns
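As a side note, the same encoding can be expressed in a single call via the columns parameter of pd.get_dummies (a sketch, assuming data_raw is a hypothetical copy of the dataframe taken before the three get_dummies calls above; note that drop_first=True here drops the first level of every encoded column, whereas the code above drops it only for 'cp'):
# One-step alternative to the three separate get_dummies calls
data_encoded = pd.get_dummies(data_raw, columns=['cp', 'slope', 'thal'], drop_first=True)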
Feature selection
X = data.drop(['target'], axis = 1)
y = data.target.values
Splitting the dataset: 80% into training data and 20% into test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)
We use StandardScaler. It transforms the data so that each feature has a mean of 0 and a standard deviation of 1: for every feature, the mean learned from the training set is subtracted from each value, which is then divided by that feature's standard deviation.
from sklearn.preprocessing import StandardScaler
sc_X=StandardScaler()
X_train=sc_X.fit_transform(X_train)
X_test=sc_X.transform(X_test)
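A quick sanity check (not in the original) confirms what the scaler does: after fitting on the training data, each feature has a mean of approximately 0 and a standard deviation of approximately 1.
# Per-feature mean and standard deviation of the scaled training set
print(X_train.mean(axis=0).round(2))
print(X_train.std(axis=0).round(2))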
Creating a Machine Learning Model Using Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
lr_c=LogisticRegression(random_state=0)
lr_c.fit(X_train,y_train)
lr_pred=lr_c.predict(X_test)
lr_cm=confusion_matrix(y_test,lr_pred)
lr_ac=accuracy_score(y_test, lr_pred)
Validating the Results Using Confusion Matrix
plt.figure(figsize=(5,4))
plt.title("LogisticRegression_cm")
sns.heatmap(lr_cm,annot=True,cmap="Blues",fmt="d",cbar=False)
plt.show()
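For reference (a small addition, not in the original notebook), the four cells of the confusion matrix can be unpacked to see how many predictions fall into each category:
# Unpack the 2x2 confusion matrix: rows are actual classes, columns are predictions
tn, fp, fn, tp = lr_cm.ravel()
print('True negatives:', tn, '| False positives:', fp)
print('False negatives:', fn, '| True positives:', tp)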
Checking the Accuracy of the Model
print('LogisticRegression_accuracy:\t',lr_ac)
model_accuracy = pd.Series(data=[lr_ac],
index=['LogisticRegression'])
fig= plt.figure(figsize=(6,2))
model_accuracy.sort_values().plot.barh()
plt.title('Model Accuracy')
plt.show()
LogisticRegression_accuracy: 0.8852459016393442
The accuracy of our model is quite high, at about 88.5% on the test set.