Current biostatistics graduate student at the University of Minnesota. Previously a med device engineer. All opinions are my own. Available for engineering or statistical consulting.
Have you ever wanted to access US Census data so you could run your own analysis on it, rather than relying on the much more qualified data journalists and researchers to tell you what is in there? Me too! So I wrote a script to do just that. Check it out on GitHub if you want to follow along. You will need to request an API key from the US Census Bureau and save it in APIkey.txt for the script to work. The script shown below is the PopulationGraph.R file.
This post walks through a simple script that pulls age data from the American Community Survey (ACS) 5-year data and plots it on a couple of line graphs.
Full disclosure: when I was googling around on how to do this, I mostly used this post from the University of Virginia Library to get up and running. From there I updated the functions it outlines for my own purposes. Alright, let’s jump in!
Library and Source Declarations
We start with the boring stuff. This first section declares all of our libraries and also loads in the API key.
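As a rough sketch of the key-loading step, the snippet below reads a one-line key file using base R only. The helper name `read_api_key` is my own invention for illustration and is not taken from the original script; the demo writes a throwaway file so it runs without a real key.

```r
# Hypothetical helper (not from the original script): read a Census API key
# stored as the first line of a plain text file.
read_api_key <- function(path = "APIkey.txt") {
  if (!file.exists(path)) {
    stop("API key file not found: ", path)
  }
  # Keep only the first line and strip surrounding whitespace.
  key <- trimws(readLines(path, n = 1, warn = FALSE))
  if (length(key) == 0 || !nzchar(key)) {
    stop("API key file is empty: ", path)
  }
  key
}

# Demonstration with a temporary file, so no real key is needed.
demo_path <- tempfile(fileext = ".txt")
writeLines("  not-a-real-key  ", demo_path)
demo_key <- read_api_key(demo_path)
```

Keeping the key in its own file (and out of version control) means the script can be shared without exposing it.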
Variable Definitions
Here we define our inputs to the census function. Since I am currently living in Tennessee, let’s start with that for our state. We will use every year available in the ACS 5-year data to get a larger dataset. We then assign the list of all Census data variables we want. Here we are pulling all of the simple “SEX by AGE” variables for the TN population. More details about these variables can be found in the ACS 5-year documentation here. We format these variables as a single string separated by commas, as this is how the Census API expects them.
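The inputs described above might look something like the following. This is a sketch under assumptions: the variable names are mine, the year range is only a placeholder (check which ACS 5-year vintages the API currently serves), and the variable IDs assume the "SEX BY AGE" table is B01001 with 49 estimate cells.

```r
# Assumed names and values for illustration only.
state_fips <- "47"        # FIPS code for Tennessee
years <- 2009:2019        # placeholder range, not necessarily what the script uses

# Build B01001_001E ... B01001_049E (assumed layout of the SEX BY AGE table).
acs_vars <- sprintf("B01001_%03dE", 1:49)

# The API takes the requested variables as one comma-separated string.
var_string <- paste(c("NAME", acs_vars), collapse = ",")
```

Generating the IDs with `sprintf` avoids typing 49 names by hand and keeps the zero padding consistent.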
Function Call and Definition
We are getting to the good stuff. Here we loop over the array of years that we created above and store all of the results in the ACS data frame.
Pretty straightforward, but what is the getACSDataState function actually doing? It is defined in the ACSReqs.R file that we sourced at the top of the script and is shown below. The function builds a request for the Census API, submits it, and pulls the results out using the fromJSON function. The results are converted into a data frame (for now, every column except the state number and state name is converted to numeric).
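The loop-and-collect pattern can be sketched as follows. To keep it runnable offline, the real request function is replaced by a stub, `fetch_year_stub`, which is hypothetical and returns made-up placeholder numbers, not Census figures.

```r
# Stand-in for the real request function so this sketch runs without
# network access. The population values are invented placeholders.
fetch_year_stub <- function(year) {
  data.frame(
    NAME = "Tennessee",
    state = "47",
    year = year,
    B01001_001E = 6000000 + 50000 * (year - 2009),
    stringsAsFactors = FALSE
  )
}

years <- 2009:2012

# Fetch one data frame per year, then stack them row-wise.
yearly <- lapply(years, fetch_year_stub)
acs_data <- do.call(rbind, yearly)
```

Collecting the pieces in a list and binding once at the end avoids growing a data frame inside the loop, which gets slow as the number of years increases.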
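A minimal sketch of the two halves of such a function is below: building the request URL and converting the parsed response to a data frame. The helper names are my own, not the author's. It assumes the usual ACS 5-year endpoint form and that the parsed JSON arrives as a character matrix whose first row holds the column names, which is how `jsonlite::fromJSON` typically simplifies an array of equal-length arrays. The network call itself is left out so the example runs offline.

```r
# Hypothetical helper: assemble the request URL for one state and year.
build_acs_url <- function(year, state, var_string, key) {
  paste0("https://api.census.gov/data/", year, "/acs/acs5",
         "?get=", var_string,
         "&for=state:", state,
         "&key=", key)
}

# Hypothetical helper: turn the parsed response (character matrix with a
# header row) into a data frame, converting all but NAME and state to numeric.
matrix_to_df <- function(m) {
  df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
  names(df) <- m[1, ]
  num_cols <- setdiff(names(df), c("NAME", "state"))
  df[num_cols] <- lapply(df[num_cols], as.numeric)
  df
}

# Made-up response used only to exercise the conversion.
raw <- matrix(c("NAME", "B01001_001E", "state",
                "Tennessee", "1000", "47"),
              nrow = 2, byrow = TRUE)
parsed <- matrix_to_df(raw)
demo_url <- build_acs_url(2019, "47", "NAME,B01001_001E", "KEY")
```

In a live script, the matrix would come from something like `jsonlite::fromJSON(url)` rather than being typed in by hand.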
Manipulating the Data
Now that we have the data in the data frame, we will calculate the subsets we want by summing the relevant variables and dividing by the total population (the B01001_001E value) to get the percentage of the population each group makes up. Let’s split this data out into the classic marketing age groups: 18-24, 25-34, 35-44, 45-54, 55-64, and 65 and over. We will also include under 18, so no one is left out.
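One way to compute the group percentages is sketched below. The cell-to-age mapping reflects my reading of the B01001 layout (total in B01001_001E, male age cells in 003 to 025, and the matching female cells 24 positions later); verify it against the ACS documentation before relying on it. The function and column names are hypothetical, and the toy row is artificial.

```r
acs_var <- function(i) sprintf("B01001_%03dE", i)

# Male cell indices per age group (assumed layout); the corresponding
# female cells are offset by 24.
male_cells <- list(
  under18  = 3:6,
  age18_24 = 7:10,
  age25_34 = 11:12,
  age35_44 = 13:14,
  age45_54 = 15:16,
  age55_64 = 17:19,
  age65_up = 20:25
)

group_percentages <- function(df) {
  out <- data.frame(year = df$year)
  for (g in names(male_cells)) {
    cols <- acs_var(c(male_cells[[g]], male_cells[[g]] + 24))
    # Sum male and female cells, then express as a share of the total.
    out[[g]] <- 100 * rowSums(df[, cols, drop = FALSE]) / df$B01001_001E
  }
  out
}

# Artificial row: every age cell holds 1 person, so the total is 46.
toy <- as.data.frame(matrix(1, nrow = 1, ncol = 49,
                            dimnames = list(NULL, acs_var(1:49))))
toy$B01001_001E <- 46
toy$year <- 2019
pct <- group_percentages(toy)
```

A quick sanity check on real data is that the seven percentages in each row sum to 100.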
Plotting the Results
Now that we have calculated our groups, let’s convert the data to a plottable format and give it a look.
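As an illustration of the reshaping step, the sketch below converts a wide table (one column per age group) into long format (one row per year and group) and defines a simple line plot. It uses base R only, which may differ from the plotting library the original script uses, and the percentages are made-up placeholders rather than Census results.

```r
# Placeholder values for illustration; not real Census figures.
wide <- data.frame(
  year = 2017:2019,
  under18 = c(23.0, 22.8, 22.5),
  age65_up = c(15.0, 15.5, 16.0)
)
group_cols <- setdiff(names(wide), "year")

# Long format: stack the group columns and label each value with its group.
long <- data.frame(
  year = rep(wide$year, times = length(group_cols)),
  group = rep(group_cols, each = nrow(wide)),
  percent = unlist(wide[group_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# One line per age group, drawn with base graphics.
plot_groups <- function(wide, group_cols) {
  cols <- seq_along(group_cols)
  matplot(wide$year, as.matrix(wide[group_cols]), type = "l", lty = 1,
          col = cols, xlab = "Year", ylab = "Percent of population")
  legend("topright", legend = group_cols, lty = 1, col = cols)
}
# plot_groups(wide, group_cols)  # uncomment to draw the graph
```

The long layout is what most grouped plotting functions expect, while `matplot` happens to accept the wide matrix directly.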
The code above gives us this graph:
Hmmm, it is kind of hard to see any trends with such a busy graph. To fix this, we will combine the age groups and plot them again:
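Combining groups amounts to summing columns. The sketch below shows one possible coarser grouping; the post does not say which groups were merged, so the three buckets here are purely my own assumption, and the numbers are placeholders rather than Census results.

```r
# Placeholder percentages for illustration; each row sums to 100.
wide <- data.frame(
  year = 2018:2019,
  under18 = c(22, 21), age18_24 = c(9, 9), age25_34 = c(14, 14),
  age35_44 = c(13, 13), age45_54 = c(13, 13), age55_64 = c(13, 13),
  age65_up = c(16, 17)
)

# Hypothetical coarser buckets, each defined by the columns it absorbs.
broad_groups <- list(
  under25  = c("under18", "age18_24"),
  age25_54 = c("age25_34", "age35_44", "age45_54"),
  age55_up = c("age55_64", "age65_up")
)

combined <- data.frame(year = wide$year)
for (g in names(broad_groups)) {
  combined[[g]] <- rowSums(wide[broad_groups[[g]]])
}
```

With fewer lines on the graph, year-over-year shifts in each bucket become easier to pick out.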
Now that is much easier to see. And that wraps up the basics of accessing Census data and outputting it as a simple graph.