Project Description:
A nationwide survey of hospital costs conducted by the US Agency for Healthcare consists of hospital records of inpatient samples. The given data is restricted to the city of Wisconsin and relates to patients in the age group 0-17 years. The agency wants to analyze the data to research healthcare costs and their utilization.
Date: 5 June, 2024
Tools Used: R
Libraries Used: Base R, dplyr, ggplot2, kableextra, Markdown
Dataset: Link
Dataset Description:
AGE : Age of the patient discharged
FEMALE : Binary variable that indicates if the patient is female
LOS : Length of stay, in days
RACE : Race of the patient (specified numerically)
TOTCHG : Hospital discharge costs
APRDRG : All Patient Refined Diagnosis Related Groups
Project Solution:
Import dataset
setwd("D:/Study Material/R/Project/Projects for Submission/Healthcare")
HospitalData = read.csv("HospitalCosts.csv", header=TRUE)
head(HospitalData)
## AGE FEMALE LOS RACE TOTCHG APRDRG
## 1 17 1 2 1 2660 560
## 2 17 0 2 1 1689 753
## 3 17 1 7 1 20060 930
## 4 17 1 1 1 736 758
## 5 17 1 1 1 1194 754
## 6 17 0 0 1 3305 347
names(HospitalData)
## [1] "AGE" "FEMALE" "LOS" "RACE" "TOTCHG" "APRDRG"
1. Record patient statistics:
The agency wants to find the age category of people who frequently visit the hospital and has the maximum expenditure.
Age: Age of the patient discharged
Totchg: Hospital discharge costs
summary(HospitalData)
## AGE FEMALE LOS RACE
## Min. : 0.000 Min. :0.000 Min. : 0.000 Min. :1.000
## 1st Qu.: 0.000 1st Qu.:0.000 1st Qu.: 2.000 1st Qu.:1.000
## Median : 0.000 Median :1.000 Median : 2.000 Median :1.000
## Mean : 5.086 Mean :0.512 Mean : 2.828 Mean :1.078
## 3rd Qu.:13.000 3rd Qu.:1.000 3rd Qu.: 3.000 3rd Qu.:1.000
## Max. :17.000 Max. :1.000 Max. :41.000 Max. :6.000
## NA's :1
## TOTCHG APRDRG
## Min. : 532 Min. : 21.0
## 1st Qu.: 1216 1st Qu.:640.0
## Median : 1536 Median :640.0
## Mean : 2774 Mean :616.4
## 3rd Qu.: 2530 3rd Qu.:751.0
## Max. :48388 Max. :952.0
##
#Get No of Hospital Visits based on age
summary(as.factor(HospitalData$AGE))
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 307 10 1 3 2 2 2 3 2 2 4 8 15 18 25 29 29 38
Total number of hospital for 0-1 age group is 307
hist(HospitalData$AGE, main="Histogram of Age Group and their hospital visits",
xlab="Age group", border="black", col=c("light blue", "dark blue"), xlim=c(0,20), ylim=c(0,350))
As can be seen here, the maximum number of hospital visits are for age group is 0-1 years
#Summarize expenditure based on age group and Get the maximum expense and its age group
ExpenseBasedOnAge = aggregate(TOTCHG ~ AGE, FUN=sum, data=HospitalData)
which.max(tapply(ExpenseBasedOnAge$TOTCHG, ExpenseBasedOnAge$TOTCHG, FUN=sum))
## 678118
## 18
barplot((tapply(ExpenseBasedOnAge$TOTCHG, ExpenseBasedOnAge$AGE, FUN=sum)),main="Expenses based on Age Group", xlab="Age group",ylab="Expenses", border="black", col=c("light blue", "dark blue"), xlim=c(0,20), ylim=c(0,700000))
Maximum expenditure is for 0-1 yr age category. The aggregate expenditure of this categiry is 678118
3. Race vs Hospitalization costs
To make sure that there is no malpractice, the agency needs to analyze if the race of the patient is related to the hospitalization costs.
Ho (Null hypothesis): Independent variable (RACE) is not influencing dependent variable (COSTS)
H0: there is no correlation among residuals, # p-value > 0.5
#i.e. they are independent in case of regression, we need high p value so that we cannot reject the null
summary(as.factor(HospitalData$RACE))
## 1 2 3 4 5 6 NA's
## 484 6 1 3 3 2 1
HospitalData = na.omit(HospitalData)
summary(as.factor(HospitalData$RACE))
## 1 2 3 4 5 6
## 484 6 1 3 3 2
484 patients out of 499 fall under group 1. The data is heavily skewed.
We will use a linear regression model to check if Race has any influence on costs.
racefit <- lm(TOTCHG ~ RACE, data=HospitalData)
summary(racefit)
##
## Call:
## lm(formula = TOTCHG ~ RACE, data = HospitalData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2256 -1560 -1227 -258 45600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2925.7 405.0 7.224 1.92e-12 ***
## RACE -137.3 339.1 -0.405 0.686
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3895 on 497 degrees of freedom
## Multiple R-squared: 0.0003299, Adjusted R-squared: -0.001681
## F-statistic: 0.164 on 1 and 497 DF, p-value: 0.6856
The pValue is 0.69 it is much higher than acceptable 0.5. Thus, we can say that in this dataset race does not affect the hospitalization costs
Verify using ANOVA
raceAOV <- aov(TOTCHG ~ RACE, data=HospitalData)
summary(raceAOV)
## Df Sum Sq Mean Sq F value Pr(>F)
## RACE 1 2.488e+06 2488459 0.164 0.686
## Residuals 497 7.540e+09 15170268
- We see residual variance is very high indicating minimal influence of RACE variable.
- the f-value is 0.16 which is less than 0.5 , whereas p value is 0.69 which is higher than 0.5. Both these values confirm that RACE does not influence hospitalization costs.
4. To properly utilize the costs, the agency has to analyze the severity of the hospital costs by age and gender for the proper allocation of resources.
summary(as.factor(HospitalData$FEMALE))
## 0 1
## 244 255
Analysis using ANOVA and Linear Regression
a <- aov(TOTCHG ~ AGE+FEMALE,data=HospitalData)
summary(a)
## Df Sum Sq Mean Sq F value Pr(>F)
## AGE 1 1.297e+08 129749266 8.759 0.00323 **
## FEMALE 1 6.522e+07 65219972 4.403 0.03638 *
## Residuals 496 7.347e+09 14812787
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
b <- lm(TOTCHG ~ AGE+FEMALE,data=HospitalData)
summary(b)
##
## Call:
## lm(formula = TOTCHG ~ AGE + FEMALE, data = HospitalData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3403 -1444 -873 -156 44950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2719.45 261.42 10.403 < 2e-16 ***
## AGE 86.04 25.53 3.371 0.000808 ***
## FEMALE -744.21 354.67 -2.098 0.036382 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3849 on 496 degrees of freedom
## Multiple R-squared: 0.02585, Adjusted R-squared: 0.02192
## F-statistic: 6.581 on 2 and 496 DF, p-value: 0.001511
- The P values in both the models suggest Age has very high influence.
- The gender also has influence on costs.
- The model has good statistical significance.
5. Since the length of stay is the crucial factor for inpatients, the agency wants to find if the length of stay can be predicted from age, gender, and race.
Using ANOVA and Linear Regression to build a prediction model
len_stayaov <- aov(LOS ~ AGE+FEMALE+RACE,data=HospitalData)
summary(len_stayaov)
## Df Sum Sq Mean Sq F value Pr(>F)
## AGE 1 27 26.907 2.378 0.124
## FEMALE 1 17 16.510 1.459 0.228
## RACE 1 1 1.165 0.103 0.748
## Residuals 495 5600 11.313
len_staylm <- lm(LOS ~ AGE+FEMALE+RACE,data=HospitalData)
summary(len_staylm)
##
## Call:
## lm(formula = LOS ~ AGE + FEMALE + RACE, data = HospitalData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.22 -1.22 -0.85 0.15 37.78
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.94377 0.39318 7.487 3.25e-13 ***
## AGE -0.03960 0.02231 -1.775 0.0766 .
## FEMALE 0.37011 0.31024 1.193 0.2334
## RACE -0.09408 0.29312 -0.321 0.7484
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.363 on 495 degrees of freedom
## Multiple R-squared: 0.007898, Adjusted R-squared: 0.001886
## F-statistic: 1.314 on 3 and 495 DF, p-value: 0.2692
- The p-value for all factors ie. Age, Gender, Race is significantly higher than 0.5 .
- This indicates no correlation of Age, Gender, Race with Length of stay. And thus we cannot use these factors to predict the length of stay.
6. Complete analysis
The agency wants to find the variable that mainly affects hospital costs.
Significance method – build a model using all independent variables vs dependent variable
We will use Linear Regression to build a model.
hcm <- lm(TOTCHG ~ .,data=HospitalData)
summary(hcm)
##
## Call:
## lm(formula = TOTCHG ~ ., data = HospitalData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6377 -700 -174 122 43378
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5218.6769 507.6475 10.280 < 2e-16 ***
## AGE 134.6949 17.4711 7.710 7.02e-14 ***
## FEMALE -390.6924 247.7390 -1.577 0.115
## LOS 743.1521 34.9225 21.280 < 2e-16 ***
## RACE -212.4291 227.9326 -0.932 0.352
## APRDRG -7.7909 0.6816 -11.430 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2613 on 493 degrees of freedom
## Multiple R-squared: 0.5536, Adjusted R-squared: 0.5491
## F-statistic: 122.3 on 5 and 493 DF, p-value: < 2.2e-16
It is apparent from the coefficient values that Age, Length of stay (LOS) and patient refined diagnosis related groups(APRDRG) have statistical significance on hospital costs.
Since Gender and Race are the least significant we build a model eliminating these variables.
hcm1 <- lm(TOTCHG ~ AGE + LOS + APRDRG,data=HospitalData)
summary(hcm1)
##
## Call:
## lm(formula = TOTCHG ~ AGE + LOS + APRDRG, data = HospitalData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6603 -719 -169 124 43350
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4960.1705 433.6579 11.44 < 2e-16 ***
## AGE 128.5519 17.0946 7.52 2.59e-13 ***
## LOS 740.8057 34.9161 21.22 < 2e-16 ***
## APRDRG -8.0055 0.6643 -12.05 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2617 on 495 degrees of freedom
## Multiple R-squared: 0.5506, Adjusted R-squared: 0.5479
## F-statistic: 202.2 on 3 and 495 DF, p-value: < 2.2e-16
The t value for APRDRG is -ve. Let us check by eliminating it as well.
hcm2 <- lm(TOTCHG ~ AGE + LOS ,data=HospitalData)
summary(hcm2)
##
## Call:
## lm(formula = TOTCHG ~ AGE + LOS, data = HospitalData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4783 -1103 -458 -133 41382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 200.66 203.48 0.986 0.325
## AGE 97.96 19.21 5.101 4.83e-07 ***
## LOS 734.27 39.66 18.512 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2973 on 496 degrees of freedom
## Multiple R-squared: 0.4188, Adjusted R-squared: 0.4164
## F-statistic: 178.7 on 2 and 496 DF, p-value: < 2.2e-16
Model Comparison
models <- list(hcm, hcm1, hcm2)
model_names <- c("hcm", "hcm1", "hcm2")
details <- c("signifi, all independent variables", "-RACE -FEMALE(gender)", "AGE + LOS")
extract_model_stats <- function(model) {
summary_model <- summary(model)
r2 <- summary_model$r.squared
adj_r2 <- summary_model$adj.r.squared
std_err <- sqrt(deviance(model)/df.residual(model))
p_value <- pf(summary_model$fstatistic[1],
summary_model$fstatistic[2],
summary_model$fstatistic[3], lower.tail = FALSE)
return(c(r2, adj_r2, std_err, p_value))
}
stats <- t(sapply(models, extract_model_stats))
colnames(stats) <- c("R2", "adj_R2", "std_err", "p_value")
# Combine all information into a data frame
comparison_table <- data.frame(
Data = rep("HospitalData", 3),
Approach = rep("Ap1:significance", 3),
Model_Name = model_names,
Detail = details,
R2 = round(stats[, "R2"], 3),
adj_R2 = round(stats[, "adj_R2"], 3),
std_err = round(stats[, "std_err"], 0),
`R2 - adj R2` = round(stats[, "R2"] - stats[, "adj_R2"], 3),
p_value = format(stats[, "p_value"], scientific = TRUE, digits = 2)
)
# Create the table
library(kableExtra)
kable(comparison_table, format = "markdown", escape = F, align = "c") %>%
kable_styling(full_width = F) %>%
column_spec(1:4, bold = TRUE) %>%
add_header_above(c("Comparing Models" = 9))
| Data | Approach | Model_Name | Detail | R2 | adj_R2 | std_err | R2…adj.R2 | p_value |
|---|---|---|---|---|---|---|---|---|
| HospitalData | Ap1:significance | hcm | signifi, all independent variables | 0.554 | 0.549 | 2613 | 0.005 | 5.6e-84 |
| HospitalData | Ap1:significance | hcm1 | -RACE -FEMALE(gender) | 0.551 | 0.548 | 2617 | 0.003 | 1.4e-85 |
| HospitalData | Ap1:significance | hcm2 | AGE + LOS | 0.419 | 0.416 | 2973 | 0.002 | 3.6e-59 |
- Removing Race and gender doesn’t change the R2 value. It doesn’t impact cost
- Removing APRDRG in model hcm3 increases the standard error. Hence model hcm1 seems to be better.
Analysis Conclusion:
- As is evident in the multiple models above, health care costs is dependent on age, length of stay and the diagnosis type.
- Healthcare cost is the most for patients in the 0-1 yrs age group category
- Maximum expenditure for 0-1 yr is 678118
- Length of Stay increases the hospital cost
- All Patient Diagnosis Related Groups has an influence on healthcare costs
- 640 diagnosis related group had a max cost of 437978
- Race or gender doesn’t have that much impact on hospital cost
–End–