Effect of median income on PSI-90
Introduction
The goal of this post is to see if the income in the area immediately around a hospital is correlated with the quality of outcomes from these hospitals.
Also quick disclaimer, do not use any information from this post when making decisions about your medical care. I am not a healthcare professional.
Data Sources
Income data was taken from the 2016 5 year American Community survey data, specifically the detail table. Median income was used as it is simple to access and avoids being skewed by a smaller group of high income earners. I was not sure what to use as a predictor for quality outcomes. Some quick research brought up PSI-90, colloquially known as the Patient Safety and Adverse Events Composite. Effectively this value combines the incidence of several preventable factors into one metric for hospital quality. Data for PSI-90 values was taken from this website. The start date for this data is Oct 2015 through Jun 2017, so it covers a similar range to the American Community Survey data.
Data Import
The following libraries will all eventually get used. The APIkey.txt file will be used to load the API key for census data. The csv file was downloaded from the website linked above.
library(tidyverse)
library (RJSONIO)
library(modelr)
fileName <- "~/Projects/medicareCompare/APIkey.txt"
APIkey = readChar(fileName, file.info(fileName)$size-1)
psi90 <- read_csv("~/Projects/medicareCompare/PSI_90_andComponent_Measures.csv")
Census Median Income Data
This function will be used to import data from the U.S. Census API. More info on this may be found in this blog post.
getACSDataDetail <- function(ACSYear,varname,year,APIkey){
# This function pulls data from the ACS API. Disclaimer: I have only tested 5 year data.
#
# Args:
# ACSYear: determine if accessing 1, 3, or 5 year data.
# year: Date of Census data
# varname: String of Census variables. They should be comma separated with no spaces.
# APIkey: API key requested from the US Census Bureau
if(year=="2016" | year=="2015" | year=="2011"){
resURL <- paste("http://api.census.gov/data/",year,"/acs/acs",ACSYear,"/profile?get=NAME,",varname,
"&for=zip+code+tabulation+area:*","&key=",APIkey,sep="")
}
else{
resURL <- paste("http://api.census.gov/data/",year,"/acs",ACSYear,"/profile?get=NAME,",varname,
"&for=zip+code+tabulation+area:*","&key=",APIkey,sep="")
}
lJSON <- fromJSON(resURL)
s <- gsub(",","",varname)
count <- nchar(varname) - nchar(s) + 4
lJSON.st <- sapply(lJSON, function(x) x[count-1])
df <- data.frame(lJSON.st)
for (i in 1:(count-2)){
lJSON.temp <- sapply(lJSON, function(x) x[i])
tempdf <- data.frame(lJSON.temp)
df <- bind_cols(df, tempdf)
next
}
colnames(df) <- as.character(unlist(df[1,]))
df = df[-1,]
ix <- 4:count-1
df[ix] <- lapply(df[ix], as.character)
df[ix] <- lapply(df[ix], as.numeric)
return(df)
}
The census data is imported here. “ZCTA5” is the census variable for zip code and “DP03_0062E” is the census variable for median income. The name fields from the Census API are dropped to keep only the needed info. A quick summary and graph of the data shows nothing eye popping.
varname <- paste("ZCTA5","DP03_0062E", sep=",")
zipdata <- getACSDataDetail("5",varname,"2016",APIkey) %>%
filter(DP03_0062E > 0) %>%
select(ZCTA5, DP03_0062E)
colnames(zipdata) <- c("Zip_Code", "Median_Income")
head(zipdata)
## Zip_Code Median_Income
## 1 1001 56714
## 2 1002 48923
## 3 1003 2499
## 4 1005 70568
## 5 1007 80502
## 6 1008 67250
summary(zipdata)
## Zip_Code Median_Income
## Min. : 601 Min. : 2499
## 1st Qu.:27048 1st Qu.: 40227
## Median :49776 Median : 50354
## Mean :49538 Mean : 54683
## 3rd Qu.:71746 3rd Qu.: 63333
## Max. :99929 Max. :250001
ggplot(zipdata, aes(Zip_Code, Median_Income)) + geom_point()
PSI-90 Data
Now I will look at the PSI-90 data loaded above. A summary of the data shows that the Measure_Value was loaded in as a character variable, so that is converted to numeric. For now a lot of these columns are not needed, so the select statement is used to get only what is needed. As all of the individual components of the composite are also included, the filter statement selects only PSI-90. Now the summary after this filter shows that roughly 30% of the values are N/A. To forge ahead, these values will be filtered out. It is important to keep in mind with any of the findings ahead that a decent sized portion of the data has been excluded. To get a rough idea of how the data looks, PSI-90 values are graphed by zip code. This is probably a good time to mention that a lower PSI-90 value is indicative of better care.
summary(psi90)
## Provider_ID Hospital_Name Address
## Length:50787 Length:50787 Length:50787
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## City State Zip_Code Measure_ID
## Length:50787 Length:50787 Min. : 603 Length:50787
## Class :character Class :character 1st Qu.:32901 Class :character
## Mode :character Mode :character Median :55008 Mode :character
## Mean :53759
## 3rd Qu.:75766
## Max. :99929
##
## Measure_Name Measure_Value Footnote Start_Date
## Length:50787 Length:50787 Min. : 1.00 Length:50787
## Class :character Class :character 1st Qu.:13.00 Class :character
## Mode :character Mode :character Median :13.00 Mode :character
## Mean :11.32
## 3rd Qu.:13.00
## Max. :23.00
## NA's :32796
## End_Date Location
## Length:50787 Length:50787
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
psi90$Measure_Value <- as.numeric(psi90$Measure_Value)
psivals <- psi90 %>%
select(Hospital_Name, Measure_ID, Measure_Value, Zip_Code) %>%
filter(Measure_ID == "PSI-90")
summary(psivals)
## Hospital_Name Measure_ID Measure_Value Zip_Code
## Length:4617 Length:4617 Min. :0.5160 Min. : 603
## Class :character Class :character 1st Qu.:0.8991 1st Qu.:32901
## Mode :character Mode :character Median :0.9708 Median :55008
## Mean :0.9948 Mean :53759
## 3rd Qu.:1.0553 3rd Qu.:75766
## Max. :4.2098 Max. :99929
## NA's :1372
psivals <- psivals %>% filter(!is.na(Measure_Value))
ggplot(psivals, aes(Zip_Code, Measure_Value)) + geom_point()
Model
Now that we have both datasets with zipcodes, they are simply joined using the left_join function. Plotting the data shows no obvious relationship between income and outcomes.
calcset <- left_join(psivals, zipdata,by="Zip_Code")
summary(calcset)
## Hospital_Name Measure_ID Measure_Value Zip_Code
## Length:3245 Length:3245 Min. :0.5160 Min. : 1040
## Class :character Class :character 1st Qu.:0.8991 1st Qu.:30180
## Mode :character Mode :character Median :0.9708 Median :48764
## Mean :0.9948 Mean :51992
## 3rd Qu.:1.0553 3rd Qu.:76104
## Max. :4.2098 Max. :99801
##
## Median_Income
## Min. : 11899
## 1st Qu.: 39249
## Median : 48466
## Mean : 53803
## 3rd Qu.: 63697
## Max. :187670
## NA's :167
ggplot(calcset, aes(Median_Income, Measure_Value)) + geom_point()
We will now look at the relationship between these two values with a simple linear model. From the summary of the model, there is a low p-value and also a low R2. What this is effectively showing is that the relationship shown by the model exists, but it does not explain most of the variability in the data. So what is the effect overall? The linear model shows that for every $1000 increase in median income, the PSI-90 value decreases by about 0.00028. Overall this is a very minor effect, and analyzing other sources of variability will be more useful.
m1 <- lm(Measure_Value ~ Median_Income, data=calcset)
m1eval <- calcset %>% add_residuals(m1) %>% add_predictions(m1)
summary(m1)
##
## Call:
## lm(formula = Measure_Value ~ Median_Income, data = calcset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4804 -0.0941 -0.0245 0.0616 3.2142
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.008e+00 8.344e-03 120.850 <2e-16 ***
## Median_Income -2.808e-07 1.436e-07 -1.956 0.0505 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1752 on 3076 degrees of freedom
## (167 observations deleted due to missingness)
## Multiple R-squared: 0.001243, Adjusted R-squared: 0.000918
## F-statistic: 3.827 on 1 and 3076 DF, p-value: 0.05051
ggplot(m1eval, aes(Median_Income, Measure_Value)) +
geom_point() + geom_line(aes(Median_Income, pred), color="red")
A quick look at the residuals of the model confirm what we saw above. The frequency graph does not look normally distributed, even after limiting the x-axis manually to ignore the long tail on the positive side. Looking at the residuals when compared to income, it further illustrates the issue that the data is very clustered in this model, which means the model likely won’t be accurate at the high end of income.
ggplot(m1eval, aes(resid)) +
geom_freqpoly(binwidth = 0.05) +
ggtitle("Residual Frequency") + xlim(-1, 1)
ggplot(m1eval, aes(Median_Income, resid)) +
geom_point() + ggtitle("Residuals")
Findings
In summary, overall this data does not show a strong relationship between income around a hospital and the quality of care (PSI-90) of that hospital. Most of the variability in the PSI-90 values will be found in other factors not covered in this quick overview.
Shortcomings
This analysis is extremely cursory and has several ways in which it may be misleading. Some of the major points are, in no particular order:
- Medicare PSI-90 values may not be a good indicator for overall hospital quality of care.
- The populations served by these hospitals are not well characterized by the median income of the zip code that it is located in. A number of factors determine which hospitals people use, and the overall area served may be larger than the individual zipcode.
- Some effects are obscured by the composite calculations for the PSI-90 value.
- The most obvious point from this quick overview is that most of the variability in the data is not explained by median income.
Areas for further analysis
This is an area that if very interesting to me, so I will likely be looking to dive deeper into this dataset. Here are some of the areas I am hoping to evaluate further:
- Increase dimensionality by looking at a larger range of data from the American Community Survey. This may help explain the rest of the variability in the data.
- Expand characterization of the populations served by these hospitals. This could include analyzing the region the hospital is located in, not just the zip code it is located in.
- Utilize clustering to group zip codes together by a number of factors.
If you have any thoughts or questions feel free to email me at ([email protected]) or message me on LinkedIn.