Effect of median income on PSI-90

Introduction

The goal of this post is to see if the income in the area immediately around a hospital is correlated with the quality of outcomes from these hospitals.

Also quick disclaimer, do not use any information from this post when making decisions about your medical care. I am not a healthcare professional.

Data Sources

Income data was taken from the 2016 5 year American Community survey data, specifically the detail table. Median income was used as it is simple to access and avoids being skewed by a smaller group of high income earners. I was not sure what to use as a predictor for quality outcomes. Some quick research brought up PSI-90, colloquially known as the Patient Safety and Adverse Events Composite. Effectively this value combines the incidence of several preventable factors into one metric for hospital quality. Data for PSI-90 values was taken from this website. The start date for this data is Oct 2015 through Jun 2017, so it covers a similar range to the American Community Survey data.

Data Import

The following libraries will all eventually get used. The APIkey.txt file will be used to load the API key for census data. The csv file was downloaded from the website linked above.

library(tidyverse)
library (RJSONIO)
library(modelr)

fileName <- "~/Projects/medicareCompare/APIkey.txt"
APIkey = readChar(fileName, file.info(fileName)$size-1)
psi90 <- read_csv("~/Projects/medicareCompare/PSI_90_andComponent_Measures.csv")

Census Median Income Data

This function will be used to import data from the U.S. Census API. More info on this may be found in this blog post.

getACSDataDetail <- function(ACSYear,varname,year,APIkey){
  # This function pulls data from the ACS API. Disclaimer: I have only tested 5 year data.
  #
  # Args:
  #   ACSYear: determine if accessing 1, 3, or 5 year data.
  #   year: Date of Census data
  #   varname: String of Census variables. They should be comma separated with no spaces.
  #   APIkey: API key requested from the US Census Bureau
  
  if(year=="2016" | year=="2015" | year=="2011"){
    resURL <- paste("http://api.census.gov/data/",year,"/acs/acs",ACSYear,"/profile?get=NAME,",varname,
                    "&for=zip+code+tabulation+area:*","&key=",APIkey,sep="")
  }
  else{
    resURL <- paste("http://api.census.gov/data/",year,"/acs",ACSYear,"/profile?get=NAME,",varname,
                    "&for=zip+code+tabulation+area:*","&key=",APIkey,sep="")
  }
  lJSON <- fromJSON(resURL)
  s <- gsub(",","",varname)
  count <- nchar(varname) - nchar(s) + 4
  lJSON.st <- sapply(lJSON, function(x) x[count-1])
  df <- data.frame(lJSON.st)
  for (i in 1:(count-2)){
    lJSON.temp <- sapply(lJSON, function(x) x[i])
    tempdf <- data.frame(lJSON.temp)
    df <- bind_cols(df, tempdf)
    next
  }
  colnames(df) <- as.character(unlist(df[1,]))
  df = df[-1,]
  ix <- 4:count-1
  df[ix] <- lapply(df[ix], as.character)
  df[ix] <- lapply(df[ix], as.numeric)
  return(df)
}

The census data is imported here. “ZCTA5” is the census variable for zip code and “DP03_0062E” is the census variable for median income. The name fields from the Census API are dropped to keep only the needed info. A quick summary and graph of the data shows nothing eye popping.

varname <- paste("ZCTA5","DP03_0062E", sep=",")
zipdata <- getACSDataDetail("5",varname,"2016",APIkey) %>% 
  filter(DP03_0062E > 0) %>% 
  select(ZCTA5, DP03_0062E)
colnames(zipdata) <- c("Zip_Code", "Median_Income")
head(zipdata)
##   Zip_Code Median_Income
## 1     1001         56714
## 2     1002         48923
## 3     1003          2499
## 4     1005         70568
## 5     1007         80502
## 6     1008         67250
summary(zipdata)
##     Zip_Code     Median_Income   
##  Min.   :  601   Min.   :  2499  
##  1st Qu.:27048   1st Qu.: 40227  
##  Median :49776   Median : 50354  
##  Mean   :49538   Mean   : 54683  
##  3rd Qu.:71746   3rd Qu.: 63333  
##  Max.   :99929   Max.   :250001
ggplot(zipdata, aes(Zip_Code, Median_Income)) + geom_point()

plot of chunk census_import

PSI-90 Data

Now I will look at the PSI-90 data loaded above. A summary of the data shows that the Measure_Value was loaded in as a character variable, so that is converted to numeric. For now a lot of these columns are not needed, so the select statement is used to get only what is needed. As all of the individual components of the composite are also included, the filter statement selects only PSI-90. Now the summary after this filter shows that roughly 30% of the values are N/A. To forge ahead, these values will be filtered out. It is important to keep in mind with any of the findings ahead that a decent sized portion of the data has been excluded. To get a rough idea of how the data looks, PSI-90 values are graphed by zip code. This is probably a good time to mention that a lower PSI-90 value is indicative of better care.

summary(psi90)
##  Provider_ID        Hospital_Name        Address         
##  Length:50787       Length:50787       Length:50787      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##      City              State              Zip_Code      Measure_ID       
##  Length:50787       Length:50787       Min.   :  603   Length:50787      
##  Class :character   Class :character   1st Qu.:32901   Class :character  
##  Mode  :character   Mode  :character   Median :55008   Mode  :character  
##                                        Mean   :53759                     
##                                        3rd Qu.:75766                     
##                                        Max.   :99929                     
##                                                                          
##  Measure_Name       Measure_Value         Footnote      Start_Date       
##  Length:50787       Length:50787       Min.   : 1.00   Length:50787      
##  Class :character   Class :character   1st Qu.:13.00   Class :character  
##  Mode  :character   Mode  :character   Median :13.00   Mode  :character  
##                                        Mean   :11.32                     
##                                        3rd Qu.:13.00                     
##                                        Max.   :23.00                     
##                                        NA's   :32796                     
##    End_Date           Location        
##  Length:50787       Length:50787      
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 
psi90$Measure_Value <- as.numeric(psi90$Measure_Value)
psivals <- psi90 %>% 
  select(Hospital_Name, Measure_ID, Measure_Value, Zip_Code) %>% 
  filter(Measure_ID == "PSI-90")
summary(psivals)
##  Hospital_Name       Measure_ID        Measure_Value       Zip_Code    
##  Length:4617        Length:4617        Min.   :0.5160   Min.   :  603  
##  Class :character   Class :character   1st Qu.:0.8991   1st Qu.:32901  
##  Mode  :character   Mode  :character   Median :0.9708   Median :55008  
##                                        Mean   :0.9948   Mean   :53759  
##                                        3rd Qu.:1.0553   3rd Qu.:75766  
##                                        Max.   :4.2098   Max.   :99929  
##                                        NA's   :1372
psivals <- psivals %>% filter(!is.na(Measure_Value))
ggplot(psivals, aes(Zip_Code, Measure_Value)) + geom_point()

plot of chunk PSI_Explore

Model

Now that we have both datasets with zipcodes, they are simply joined using the left_join function. Plotting the data shows no obvious relationship between income and outcomes.

calcset <- left_join(psivals, zipdata,by="Zip_Code")

summary(calcset)
##  Hospital_Name       Measure_ID        Measure_Value       Zip_Code    
##  Length:3245        Length:3245        Min.   :0.5160   Min.   : 1040  
##  Class :character   Class :character   1st Qu.:0.8991   1st Qu.:30180  
##  Mode  :character   Mode  :character   Median :0.9708   Median :48764  
##                                        Mean   :0.9948   Mean   :51992  
##                                        3rd Qu.:1.0553   3rd Qu.:76104  
##                                        Max.   :4.2098   Max.   :99801  
##                                                                        
##  Median_Income   
##  Min.   : 11899  
##  1st Qu.: 39249  
##  Median : 48466  
##  Mean   : 53803  
##  3rd Qu.: 63697  
##  Max.   :187670  
##  NA's   :167
ggplot(calcset, aes(Median_Income, Measure_Value)) + geom_point()

plot of chunk combine

We will now look at the relationship between these two values with a simple linear model. From the summary of the model, there is a low p-value and also a low R2. What this is effectively showing is that the relationship shown by the model exists, but it does not explain most of the variability in the data. So what is the effect overall? The linear model shows that for every $1000 increase in median income, the PSI-90 value decreases by about 0.00028. Overall this is a very minor effect, and analyzing other sources of variability will be more useful.

m1 <- lm(Measure_Value ~ Median_Income, data=calcset)
m1eval <- calcset %>% add_residuals(m1) %>% add_predictions(m1)
summary(m1)
## 
## Call:
## lm(formula = Measure_Value ~ Median_Income, data = calcset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4804 -0.0941 -0.0245  0.0616  3.2142 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.008e+00  8.344e-03 120.850   <2e-16 ***
## Median_Income -2.808e-07  1.436e-07  -1.956   0.0505 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1752 on 3076 degrees of freedom
##   (167 observations deleted due to missingness)
## Multiple R-squared:  0.001243,	Adjusted R-squared:  0.000918 
## F-statistic: 3.827 on 1 and 3076 DF,  p-value: 0.05051
ggplot(m1eval, aes(Median_Income, Measure_Value)) + 
  geom_point() + geom_line(aes(Median_Income, pred), color="red") 

plot of chunk model A quick look at the residuals of the model confirm what we saw above. The frequency graph does not look normally distributed, even after limiting the x-axis manually to ignore the long tail on the positive side. Looking at the residuals when compared to income, it further illustrates the issue that the data is very clustered in this model, which means the model likely won’t be accurate at the high end of income.

ggplot(m1eval, aes(resid)) + 
  geom_freqpoly(binwidth = 0.05) + 
  ggtitle("Residual Frequency") + xlim(-1, 1)

plot of chunk residuals

ggplot(m1eval, aes(Median_Income, resid)) + 
  geom_point() + ggtitle("Residuals")

plot of chunk residuals

Findings

In summary, overall this data does not show a strong relationship between income around a hospital and the quality of care (PSI-90) of that hospital. Most of the variability in the PSI-90 values will be found in other factors not covered in this quick overview.

Shortcomings

This analysis is extremely cursory and has several ways in which it may be misleading. Some of the major points are, in no particular order:

  1. Medicare PSI-90 values may not be a good indicator for overall hospital quality of care.
  2. The populations served by these hospitals are not well characterized by the median income of the zip code that it is located in. A number of factors determine which hospitals people use, and the overall area served may be larger than the individual zipcode.
  3. Some effects are obscured by the composite calculations for the PSI-90 value.
  4. The most obvious point from this quick overview is that most of the variability in the data is not explained by median income.

Areas for further analysis

This is an area that if very interesting to me, so I will likely be looking to dive deeper into this dataset. Here are some of the areas I am hoping to evaluate further:

  1. Increase dimensionality by looking at a larger range of data from the American Community Survey. This may help explain the rest of the variability in the data.
  2. Expand characterization of the populations served by these hospitals. This could include analyzing the region the hospital is located in, not just the zip code it is located in.
  3. Utilize clustering to group zip codes together by a number of factors.

If you have any thoughts or questions feel free to email me at ([email protected]) or message me on LinkedIn.