Project Description:
Cardiovascular diseases are the leading cause of death globally, so identifying their causes and developing a system that predicts heart attacks effectively is necessary. The presented
data contains information on the relevant factors that might impact heart health, and it needs to be explored in detail before any further analysis.
Tools Used: Python, Tableau
Libraries Used: Pandas, Matplotlib, Seaborn, Scikit-learn
Analysis Summary:
Analyzed an open-source dataset of 303 patient records across 14 attributes to develop a system that predicts heart attacks effectively. Performed data inspection, data treatment, EDA, hypothesis testing, logistic regression modelling, confusion matrix validation, and dashboarding of results. Used Python with the Pandas, Matplotlib, Seaborn, and Scikit-learn libraries, and Tableau for dashboarding.
Data Source: Link
Project Report:
Determine and examine the factors that play a significant role in increasing the Rate of Heart Attacks
By Santoshkumar Pandey
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings('ignore')
Preliminary Data Inspection
data = pd.read_excel('Data Analyst Healthcare Project.xlsx')
data.head()
|  | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
data.shape
(303, 14)
Check Missing Values
data.isnull().sum()
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64
Checking Duplicates
data.duplicated().any()
True
data.drop_duplicates(subset=None, inplace=True)
data.duplicated().any()
False
data.shape
(302, 14)
We see that 1 duplicate row was removed.
# Exploring the data: preliminary statistical summary and measures of central tendency
data.describe()
|  | age | sex | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | ca | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 302.00000 | 302.000000 | 302.000000 | 302.000000 | 302.000000 | 302.000000 | 302.000000 | 302.000000 | 302.000000 | 302.000000 | 302.000000 |
| mean | 54.42053 | 0.682119 | 131.602649 | 246.500000 | 0.149007 | 0.526490 | 149.569536 | 0.327815 | 1.043046 | 0.718543 | 0.543046 |
| std | 9.04797 | 0.466426 | 17.563394 | 51.753489 | 0.356686 | 0.526027 | 22.903527 | 0.470196 | 1.161452 | 1.006748 | 0.498970 |
| min | 29.00000 | 0.000000 | 94.000000 | 126.000000 | 0.000000 | 0.000000 | 71.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 48.00000 | 0.000000 | 120.000000 | 211.000000 | 0.000000 | 0.000000 | 133.250000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 55.50000 | 1.000000 | 130.000000 | 240.500000 | 0.000000 | 1.000000 | 152.500000 | 0.000000 | 0.800000 | 0.000000 | 1.000000 |
| 75% | 61.00000 | 1.000000 | 140.000000 | 274.750000 | 0.000000 | 1.000000 | 166.000000 | 1.000000 | 1.600000 | 1.000000 | 1.000000 |
| max | 77.00000 | 1.000000 | 200.000000 | 564.000000 | 1.000000 | 2.000000 | 202.000000 | 1.000000 | 6.200000 | 4.000000 | 1.000000 |
# Exploring the data distribution visually
print('Data Distribution')
data.hist(layout=(3,5), figsize=(16,10), color='g')
plt.show()
There are several categorical variables. Now we will explore the variables with respect to the target, which indicates whether the patient had a heart attack.
print('This looks like a fairly balanced dataset, as the distribution of the majority and minority classes is around 55:45')
sns.countplot(x="target", data=data, palette="mako_r")
plt.show()
This looks like a fairly balanced dataset, as the distribution of the majority and minority classes is around 55:45
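To put a number on the balance claim above, the class proportions can be computed directly (a quick check, not part of the original write-up):
# Share of each target class; this is where the roughly 55:45 split comes from
print(data['target'].value_counts(normalize=True).round(2))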
print('Composition of patients with respect to gender')
sns.countplot(x='sex', data=data, palette="bwr")
plt.xlabel("Sex (0 = female, 1= male)")
plt.show()
Composition of patients with respect to gender
pd.crosstab(data.sex,data.target).plot(kind="bar", color = ['g','r'])
plt.title('Distribution of target and sex (0-female 1-male)')
plt.xlabel("Sex (0 = female, 1= male)")
plt.ylabel('Counts')
plt.show()
print('The proportion of heart attacks among females appears much higher than among males in this dataset')
The proportion of heart attacks among females appears much higher than among males in this dataset
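As a rough check of this observation (a small sketch, not in the original analysis), the heart-attack rate within each sex can be computed by normalizing the crosstab by row:
# Share of target=1 within each sex (0 = female, 1 = male)
print(pd.crosstab(data.sex, data.target, normalize='index').round(2))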
print('Exploring the occurrence of CVD (Heart Attacks) across different ages')
pd.crosstab(data.age,data.target).plot(kind="bar",figsize=(20,6), color = ['g','r'])
plt.title('Heart Disease Distribution by Patient Age')
plt.xlabel('Age')
plt.ylabel('Counts')
plt.show()
Exploring the occurrence of CVD (Heart Attacks) across different ages
The occurrence of heart attacks (as a percentage) appears to be much higher in the 40 to 60 year age bracket.
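To quantify this (an illustrative sketch; the band boundaries are an assumption, not taken from the report), the heart-attack rate can be compared across coarse age bands:
# Heart-attack rate per age band
age_bands = pd.cut(data.age, bins=[29, 40, 50, 60, 77], include_lowest=True)
print(data.groupby(age_bands)['target'].mean().round(2))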
plt.bar(data.age[data.target==1], data.thalach[(data.target==1)], color="red")
plt.bar(data.age[data.target==0], data.thalach[(data.target==0)], color="grey")
plt.legend(["Diseased", "Not Diseased"])
plt.xlabel("Age")
plt.ylabel("Maximum Heart Rate")
plt.show()
The higher the maximum heart rate, the greater the occurrence of heart attacks across all age categories.
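Because the overlapping bars above are hard to read, a scatter plot of the same variables may be clearer (a presentation alternative, not a change to the analysis):
# Same data as the bar chart above, drawn as a scatter plot
plt.scatter(data.age[data.target==1], data.thalach[data.target==1], color="red", label="Diseased")
plt.scatter(data.age[data.target==0], data.thalach[data.target==0], color="grey", label="Not Diseased")
plt.legend()
plt.xlabel("Age")
plt.ylabel("Maximum Heart Rate")
plt.show()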
print('Can we detect Heart Attack risk based on anomalies in Resting Blood Pressure?')
pd.crosstab(data.trestbps,data.target).plot(kind="bar",figsize=(20,6), color = ['g','r'])
plt.title('Heart Disease Distribution by Resting Blood Pressure')
plt.xlabel('Resting Blood Pressure')
plt.ylabel('Counts')
plt.show()
The incidence of heart attacks increases rapidly once the resting blood pressure reaches 120 mm Hg and above.
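A simple split at that threshold (a sketch; the 120 mm Hg cut-off is taken from the observation above) puts a number on the claim:
# Heart-attack rate for resting blood pressure below vs at/above 120
print(data.groupby(data.trestbps >= 120)['target'].mean().round(2))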
print('Analyzing the Correlation between various variables')
%matplotlib inline
plt.figure(figsize=(10,10))
sns.heatmap(data.corr(),annot=True,fmt='.1f')
plt.show()
Analyzing the Correlation between various variables
data_corr=data.corr()['target'][:-1]
feature_list=data_corr[abs(data_corr)>0.1].sort_values(ascending=False)
feature_list
cp          0.432080
thalach     0.419955
slope       0.343940
restecg     0.134874
trestbps   -0.146269
age        -0.221476
sex        -0.283609
thal       -0.343101
ca         -0.408992
oldpeak    -0.429146
exang      -0.435601
Name: target, dtype: float64
From the above correlation plot we see that cp (chest pain), thalach, and slope are the features most strongly positively correlated with the target.
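For a quick visual summary of these correlations (optional, a small presentation sketch), the filtered feature list can be plotted directly:
# Horizontal bar chart of features with |correlation| > 0.1 against the target
feature_list.sort_values().plot.barh(title='Correlation with target')
plt.show()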
Understanding the Relationship Between these strongly correlated variables and Target (Heart Attacks)
pd.crosstab(data.cp,data.target).plot(kind="bar", color = ['g','r'])
plt.title('Heart Disease Distribution by Chest Pain Type')
plt.xlabel('Chest Pain')
plt.ylabel('Counts')
plt.show()
We observe that those who have chest pain type 1 or type 2 are more likely to be affected by heart disease.
pd.crosstab(data.slope,data.target).plot(kind="bar", color = ['g','r'])
plt.title('Heart Disease Distribution by Slope (Peak Exercise) Type')
plt.xlabel('Slope (Peak Exercise)')
plt.ylabel('Counts')
plt.show()
We see that the rate of heart attacks is much higher for Slope 2 than for Slope 0 and Slope 1.
pd.crosstab(data.thal,data.target).plot(kind="bar", color = ['g','r'])
plt.title('Heart Disease Distribution by Thalassemia')
plt.xlabel('Thalassemia')
plt.ylabel('Counts')
plt.show()
From this data we observe that those who have thalassemia type 2 (fixed defect) are much more likely to be affected by heart disease.
# Exploring the relationship between the three high-impact variables
sns.lmplot(x="trestbps", y="chol",data=data,hue="cp")
plt.show()
# Using a pair plot to understand the relationship between the given variables
sns.pairplot(data, x_vars=['chol','trestbps','thalach'], y_vars=['age'], hue='target', kind='scatter')
plt.show()
Creating Dummy Variables
In this dataset, 'cp', 'thal', and 'slope' are categorical variables, so we'll turn them into dummy variables.
chest_pain=pd.get_dummies(data['cp'],prefix='cp',drop_first=True)
data=pd.concat([data,chest_pain],axis=1)
data.drop(['cp'],axis=1,inplace=True)
sp=pd.get_dummies(data['slope'],prefix='slope')
th=pd.get_dummies(data['thal'],prefix='thal')
frames=[data,sp,th]
data=pd.concat(frames,axis=1)
data.drop(['slope','thal'],axis=1,inplace=True)
data.head(5)
|  | age | sex | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | ca | … | cp_1 | cp_2 | cp_3 | slope_0 | slope_1 | slope_2 | thal_0 | thal_1 | thal_2 | thal_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | … | False | False | True | True | False | False | False | True | False | False |
| 1 | 37 | 1 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | … | False | True | False | True | False | False | False | False | True | False |
| 2 | 41 | 0 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 0 | … | True | False | False | False | False | True | False | False | True | False |
| 3 | 56 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 0 | … | True | False | False | False | False | True | False | False | True | False |
| 4 | 57 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 0 | … | False | False | False | False | False | True | False | False | True | False |
5 rows × 21 columns
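As a side note, the same encoding can be expressed in a single call via the columns parameter of pd.get_dummies (a sketch, assuming data_raw is a hypothetical copy of the dataframe taken before the three get_dummies calls above; note that drop_first=True here drops the first level of every encoded column, whereas the code above drops it only for 'cp'):
# One-step alternative to the three separate get_dummies calls
data_encoded = pd.get_dummies(data_raw, columns=['cp', 'slope', 'thal'], drop_first=True)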
Feature selection
X = data.drop(['target'], axis = 1)
y = data.target.values
Splitting the dataset: 80% into training data and 20% into test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)
We use StandardScaler. It transforms the data so that each feature has a mean of 0 and a standard deviation of 1: for every feature, the mean learned from the training set is subtracted from each value, which is then divided by that feature's standard deviation.
from sklearn.preprocessing import StandardScaler
sc_X=StandardScaler()
X_train=sc_X.fit_transform(X_train)
X_test=sc_X.transform(X_test)
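A quick sanity check (not in the original) confirms what the scaler does: after fitting on the training data, each feature has a mean of approximately 0 and a standard deviation of approximately 1.
# Per-feature mean and standard deviation of the scaled training set
print(X_train.mean(axis=0).round(2))
print(X_train.std(axis=0).round(2))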
Creating a Machine Learning Model Using Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
lr_c=LogisticRegression(random_state=0)
lr_c.fit(X_train,y_train)
lr_pred=lr_c.predict(X_test)
lr_cm=confusion_matrix(y_test,lr_pred)
lr_ac=accuracy_score(y_test, lr_pred)
Validating the Results Using Confusion Matrix
plt.figure(figsize=(5,4))
plt.title("LogisticRegression_cm")
sns.heatmap(lr_cm,annot=True,cmap="Blues",fmt="d",cbar=False)
plt.show()
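For reference (a small addition, not in the original notebook), the four cells of the confusion matrix can be unpacked to see how many predictions fall into each category:
# Unpack the 2x2 confusion matrix: rows are actual classes, columns are predictions
tn, fp, fn, tp = lr_cm.ravel()
print('True negatives:', tn, '| False positives:', fp)
print('False negatives:', fn, '| True positives:', tp)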
Checking the Accuracy of the Model
print('LogisticRegression_accuracy:\t',lr_ac)
model_accuracy = pd.Series(data=[lr_ac],
index=['LogisticRegression'])
fig= plt.figure(figsize=(6,2))
model_accuracy.sort_values().plot.barh()
plt.title('Model Accuracy')
plt.show()
LogisticRegression_accuracy: 0.8852459016393442
The accuracy of our model is quite high, at about 88.5% on the test set.