Business Analytics Project: Healthcare cost and Utilization in Wisconsin Hopsitals using R

Project Description:

A nationwide survey of hospital costs conducted by the US Agency for Healthcare consists of hospital records of inpatient samples. The given data is restricted to the city of Wisconsin and relates to patients in the age group 0-17 years. The agency wants to analyze the data to research healthcare costs and their utilization.

Date: 5 June, 2024

Tools Used: R

Libraries Used: Base R, dplyr, ggplot2, kableextra, Markdown

Dataset: Link

Dataset Description:
AGE : Age of the patient discharged
FEMALE : Binary variable that indicates if the patient is female
LOS : Length of stay, in days
RACE : Race of the patient (specified numerically)
TOTCHG : Hospital discharge costs
APRDRG : All Patient Refined Diagnosis Related Groups

Project Solution:

Import dataset

setwd("D:/Study Material/R/Project/Projects for Submission/Healthcare")
HospitalData = read.csv("HospitalCosts.csv", header=TRUE)
head(HospitalData)
##   AGE FEMALE LOS RACE TOTCHG APRDRG
## 1  17      1   2    1   2660    560
## 2  17      0   2    1   1689    753
## 3  17      1   7    1  20060    930
## 4  17      1   1    1    736    758
## 5  17      1   1    1   1194    754
## 6  17      0   0    1   3305    347
names(HospitalData)
## [1] "AGE"    "FEMALE" "LOS"    "RACE"   "TOTCHG" "APRDRG"

1. Record patient statistics:

The agency wants to find the age category of people who frequently visit the hospital and has the maximum expenditure.

Age: Age of the patient discharged

Totchg: Hospital discharge costs

summary(HospitalData)
##       AGE             FEMALE           LOS              RACE      
##  Min.   : 0.000   Min.   :0.000   Min.   : 0.000   Min.   :1.000  
##  1st Qu.: 0.000   1st Qu.:0.000   1st Qu.: 2.000   1st Qu.:1.000  
##  Median : 0.000   Median :1.000   Median : 2.000   Median :1.000  
##  Mean   : 5.086   Mean   :0.512   Mean   : 2.828   Mean   :1.078  
##  3rd Qu.:13.000   3rd Qu.:1.000   3rd Qu.: 3.000   3rd Qu.:1.000  
##  Max.   :17.000   Max.   :1.000   Max.   :41.000   Max.   :6.000  
##                                                    NA's   :1      
##      TOTCHG          APRDRG     
##  Min.   :  532   Min.   : 21.0  
##  1st Qu.: 1216   1st Qu.:640.0  
##  Median : 1536   Median :640.0  
##  Mean   : 2774   Mean   :616.4  
##  3rd Qu.: 2530   3rd Qu.:751.0  
##  Max.   :48388   Max.   :952.0  
## 

#Get No of Hospital Visits based on age

summary(as.factor(HospitalData$AGE))
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17 
## 307  10   1   3   2   2   2   3   2   2   4   8  15  18  25  29  29  38

Total number of hospital for 0-1 age group is 307

hist(HospitalData$AGE, main="Histogram of Age Group and their hospital visits",
     xlab="Age group", border="black", col=c("light blue", "dark blue"), xlim=c(0,20), ylim=c(0,350))

As can be seen here, the maximum number of hospital visits are for age group is 0-1 years

#Summarize expenditure based on age group and Get the maximum expense and its age group

ExpenseBasedOnAge = aggregate(TOTCHG ~ AGE, FUN=sum, data=HospitalData)
which.max(tapply(ExpenseBasedOnAge$TOTCHG, ExpenseBasedOnAge$TOTCHG, FUN=sum))
## 678118 
##     18
barplot((tapply(ExpenseBasedOnAge$TOTCHG, ExpenseBasedOnAge$AGE, FUN=sum)),main="Expenses based on Age Group", xlab="Age group",ylab="Expenses", border="black", col=c("light blue", "dark blue"), xlim=c(0,20), ylim=c(0,700000))

Maximum expenditure is for 0-1 yr age category. The aggregate expenditure of this categiry is 678118

3. Race vs Hospitalization costs

To make sure that there is no malpractice, the agency needs to analyze if the race of the patient is related to the hospitalization costs.

Ho (Null hypothesis): Independent variable (RACE) is not influencing dependent variable (COSTS)

H0: there is no correlation among residuals, # p-value > 0.5
#i.e. they are independent in case of regression, we need high p value so that we cannot reject the null

summary(as.factor(HospitalData$RACE))
##    1    2    3    4    5    6 NA's 
##  484    6    1    3    3    2    1
HospitalData = na.omit(HospitalData)
summary(as.factor(HospitalData$RACE))
##   1   2   3   4   5   6 
## 484   6   1   3   3   2

484 patients out of 499 fall under group 1. The data is heavily skewed.

We will use a linear regression model to check if Race has any influence on costs.

racefit <- lm(TOTCHG ~ RACE, data=HospitalData)
summary(racefit)
## 
## Call:
## lm(formula = TOTCHG ~ RACE, data = HospitalData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -2256  -1560  -1227   -258  45600 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2925.7      405.0   7.224 1.92e-12 ***
## RACE          -137.3      339.1  -0.405    0.686    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3895 on 497 degrees of freedom
## Multiple R-squared:  0.0003299,  Adjusted R-squared:  -0.001681 
## F-statistic: 0.164 on 1 and 497 DF,  p-value: 0.6856

The pValue is 0.69 it is much higher than acceptable 0.5. Thus, we can say that in this dataset race does not affect the hospitalization costs

Verify using ANOVA

raceAOV <- aov(TOTCHG ~ RACE, data=HospitalData)
summary(raceAOV)
##              Df    Sum Sq  Mean Sq F value Pr(>F)
## RACE          1 2.488e+06  2488459   0.164  0.686
## Residuals   497 7.540e+09 15170268
  • We see residual variance is very high indicating minimal influence of RACE variable.
  • the f-value is 0.16 which is less than 0.5 , whereas p value is 0.69 which is higher than 0.5. Both these values confirm that RACE does not influence hospitalization costs.

4. To properly utilize the costs, the agency has to analyze the severity of the hospital costs by age and gender for the proper allocation of resources.

summary(as.factor(HospitalData$FEMALE))
##   0   1 
## 244 255

Analysis using ANOVA and Linear Regression

a <- aov(TOTCHG ~ AGE+FEMALE,data=HospitalData)
summary(a)
##              Df    Sum Sq   Mean Sq F value  Pr(>F)   
## AGE           1 1.297e+08 129749266   8.759 0.00323 **
## FEMALE        1 6.522e+07  65219972   4.403 0.03638 * 
## Residuals   496 7.347e+09  14812787                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
b <- lm(TOTCHG ~ AGE+FEMALE,data=HospitalData)
summary(b)
## 
## Call:
## lm(formula = TOTCHG ~ AGE + FEMALE, data = HospitalData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3403  -1444   -873   -156  44950 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2719.45     261.42  10.403  < 2e-16 ***
## AGE            86.04      25.53   3.371 0.000808 ***
## FEMALE       -744.21     354.67  -2.098 0.036382 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3849 on 496 degrees of freedom
## Multiple R-squared:  0.02585,    Adjusted R-squared:  0.02192 
## F-statistic: 6.581 on 2 and 496 DF,  p-value: 0.001511
  • The P values in both the models suggest Age has very high influence.
  • The gender also has influence on costs.
  • The model has good statistical significance.

5. Since the length of stay is the crucial factor for inpatients, the agency wants to find if the length of stay can be predicted from age, gender, and race.

Using ANOVA and Linear Regression to build a prediction model

len_stayaov <- aov(LOS ~ AGE+FEMALE+RACE,data=HospitalData)
summary(len_stayaov)
##              Df Sum Sq Mean Sq F value Pr(>F)
## AGE           1     27  26.907   2.378  0.124
## FEMALE        1     17  16.510   1.459  0.228
## RACE          1      1   1.165   0.103  0.748
## Residuals   495   5600  11.313
len_staylm <- lm(LOS ~ AGE+FEMALE+RACE,data=HospitalData)
summary(len_staylm)
## 
## Call:
## lm(formula = LOS ~ AGE + FEMALE + RACE, data = HospitalData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3.22  -1.22  -0.85   0.15  37.78 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.94377    0.39318   7.487 3.25e-13 ***
## AGE         -0.03960    0.02231  -1.775   0.0766 .  
## FEMALE       0.37011    0.31024   1.193   0.2334    
## RACE        -0.09408    0.29312  -0.321   0.7484    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.363 on 495 degrees of freedom
## Multiple R-squared:  0.007898,   Adjusted R-squared:  0.001886 
## F-statistic: 1.314 on 3 and 495 DF,  p-value: 0.2692
  • The p-value for all factors ie. Age, Gender, Race is significantly higher than 0.5 .
  • This indicates no correlation of Age, Gender, Race with Length of stay. And thus we cannot use these factors to predict the length of stay.

6. Complete analysis

The agency wants to find the variable that mainly affects hospital costs.

Significance method – build a model using all independent variables vs dependent variable

We will use Linear Regression to build a model.

hcm <- lm(TOTCHG ~ .,data=HospitalData)
summary(hcm)
## 
## Call:
## lm(formula = TOTCHG ~ ., data = HospitalData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6377   -700   -174    122  43378 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5218.6769   507.6475  10.280  < 2e-16 ***
## AGE          134.6949    17.4711   7.710 7.02e-14 ***
## FEMALE      -390.6924   247.7390  -1.577    0.115    
## LOS          743.1521    34.9225  21.280  < 2e-16 ***
## RACE        -212.4291   227.9326  -0.932    0.352    
## APRDRG        -7.7909     0.6816 -11.430  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2613 on 493 degrees of freedom
## Multiple R-squared:  0.5536, Adjusted R-squared:  0.5491 
## F-statistic: 122.3 on 5 and 493 DF,  p-value: < 2.2e-16

It is apparent from the coefficient values that Age, Length of stay (LOS) and patient refined diagnosis related groups(APRDRG) have statistical significance on hospital costs.

Since Gender and Race are the least significant we build a model eliminating these variables.

hcm1 <- lm(TOTCHG ~ AGE + LOS + APRDRG,data=HospitalData)
summary(hcm1)
## 
## Call:
## lm(formula = TOTCHG ~ AGE + LOS + APRDRG, data = HospitalData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6603   -719   -169    124  43350 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4960.1705   433.6579   11.44  < 2e-16 ***
## AGE          128.5519    17.0946    7.52 2.59e-13 ***
## LOS          740.8057    34.9161   21.22  < 2e-16 ***
## APRDRG        -8.0055     0.6643  -12.05  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2617 on 495 degrees of freedom
## Multiple R-squared:  0.5506, Adjusted R-squared:  0.5479 
## F-statistic: 202.2 on 3 and 495 DF,  p-value: < 2.2e-16

The t value for APRDRG is -ve. Let us check by eliminating it as well.

hcm2 <- lm(TOTCHG ~ AGE + LOS ,data=HospitalData)
summary(hcm2)
## 
## Call:
## lm(formula = TOTCHG ~ AGE + LOS, data = HospitalData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4783  -1103   -458   -133  41382 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   200.66     203.48   0.986    0.325    
## AGE            97.96      19.21   5.101 4.83e-07 ***
## LOS           734.27      39.66  18.512  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2973 on 496 degrees of freedom
## Multiple R-squared:  0.4188, Adjusted R-squared:  0.4164 
## F-statistic: 178.7 on 2 and 496 DF,  p-value: < 2.2e-16

Model Comparison

models <- list(hcm, hcm1, hcm2)
model_names <- c("hcm", "hcm1", "hcm2")
details <- c("signifi, all independent variables", "-RACE -FEMALE(gender)", "AGE + LOS")

extract_model_stats <- function(model) {
  summary_model <- summary(model)
  r2 <- summary_model$r.squared
  adj_r2 <- summary_model$adj.r.squared
  std_err <- sqrt(deviance(model)/df.residual(model))
  p_value <- pf(summary_model$fstatistic[1],
                summary_model$fstatistic[2],
                summary_model$fstatistic[3], lower.tail = FALSE)
return(c(r2, adj_r2, std_err, p_value))
}

stats <- t(sapply(models, extract_model_stats))
colnames(stats) <- c("R2", "adj_R2", "std_err", "p_value")
# Combine all information into a data frame
comparison_table <- data.frame(
  Data = rep("HospitalData", 3),
  Approach = rep("Ap1:significance", 3),
  Model_Name = model_names,
  Detail = details,
  R2 = round(stats[, "R2"], 3),
  adj_R2 = round(stats[, "adj_R2"], 3),
  std_err = round(stats[, "std_err"], 0),
  `R2 - adj R2` = round(stats[, "R2"] - stats[, "adj_R2"], 3),
  p_value = format(stats[, "p_value"], scientific = TRUE, digits = 2)
)
# Create the table
library(kableExtra)
kable(comparison_table, format = "markdown", escape = F, align = "c") %>%
  kable_styling(full_width = F) %>%
  column_spec(1:4, bold = TRUE) %>%
  add_header_above(c("Comparing Models" = 9))
Comparing Models
Data Approach Model_Name Detail R2 adj_R2 std_err R2…adj.R2 p_value
HospitalData Ap1:significance hcm signifi, all independent variables 0.554 0.549 2613 0.005 5.6e-84
HospitalData Ap1:significance hcm1 -RACE -FEMALE(gender) 0.551 0.548 2617 0.003 1.4e-85
HospitalData Ap1:significance hcm2 AGE + LOS 0.419 0.416 2973 0.002 3.6e-59
  • Removing Race and gender doesn’t change the R2 value. It doesn’t impact cost
  • Removing APRDRG in model hcm3 increases the standard error. Hence model hcm1 seems to be better.

Analysis Conclusion:

  1. As is evident in the multiple models above, health care costs is dependent on age, length of stay and the diagnosis type.
  2. Healthcare cost is the most for patients in the 0-1 yrs age group category
    • Maximum expenditure for 0-1 yr is 678118

  1. Length of Stay increases the hospital cost
  2. All Patient Diagnosis Related Groups has an influence on healthcare costs
    • 640 diagnosis related group had a max cost of 437978

  1. Race or gender doesn’t have that much impact on hospital cost

 

–End–

Leave a Comment

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply

Your email address will not be published. Required fields are marked *