MN Senate Election Targeting 2020

Note

The majority of this blog post was written before the murder of George Floyd. While this post seeks to explore the electoral impact of demographics in Minnesota broadly, I am not going to try to predict how these events may affect the coming election. If you support data-driven solutions to real-world problems, consider donating to Campaign Zero.

Introduction

This analysis focuses on which districts should see the most investment in a Democratic strategy for the Minnesota State Senate, and presents a first draft of a predictive model for the 2020 Minnesota State Senate elections. Data should certainly inform decision making around elections, but in cases of high uncertainty, like elections, data should not be a stand-in for sound strategy based on reasonable assumptions. This is important to understanding this model and post, as the model currently incorporates demographic data but does not include current estimates of public sentiment or a likely-voter turnout model for the 2020 election.

The code for this project can be found in the following GitHub repository: https://github.com/tajubenv/Minnesota_Election_Analysis.

Data Sources

Election results for this post were pulled from the Minnesota Secretary of State elections website, https://www.sos.state.mn.us/elections-voting/election-results. Demographic data was pulled from the 5-year American Community Survey (ACS) data, https://www.census.gov/programs-surveys/acs/data.html, using the tidycensus package in R. The ACS data has varying levels of geographic depth between the 1-, 3-, and 5-year datasets. The 5-year data is the only dataset that contains State Senate district-level data for Minnesota. The 5-year estimates are calculated from data collected over the 5 years ending in the year given on the data. Thus the 2016 ACS estimate includes data from 2012-2016. This poses the obvious issue that it is desirable to understand the demographics on election day, rather than an estimate for the previous 5 years. This analysis is focused more on strategy and prediction than on inference, so it is most appropriate to use resources that will be available before the election in November. As a result, each ACS estimate is treated as describing the year in which its window ends, rather than being offset in any manner.
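To make the 5-year window concrete, here is a minimal Python sketch (the analysis in this post was done in R; the function name acs_window is hypothetical and not part of tidycensus). It assumes an estimate is labeled by the last year of its collection window, as described above.

```python
def acs_window(end_year, span=5):
    """Return (first_year, last_year) covered by an ACS estimate.

    Assumption: the estimate is labeled by the final year of its window.
    """
    if span < 1:
        raise ValueError("span must be at least 1 year")
    return (end_year - span + 1, end_year)


# The election years used in this post, paired with their 5-year windows.
for year in (2010, 2012, 2016):
    first, last = acs_window(year)
    print(f"{year} ACS 5-year estimate covers {first}-{last}")
```

The same helper shows why the 2020 election cannot simply reuse a matching estimate: a 5-year estimate labeled 2020 would depend on data collected through the end of that year.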

Methods

Data was collected from the ACS 5-year survey for 2010, 2012, and 2016 for all state senate districts that had a Democratic-Farmer-Labor Party (DFL) candidate running in those same years. Specific demographic data included median income, median age, and the proportions of the district that were White, Black, Native American, Asian, or Other as defined by the ACS data. The proportion of Hispanic descent was also included. All proportions for this model are proportions of the voting-age population.

After data collection and cleaning, a logistic regression was performed to predict the likelihood of a Democratic victory within a Minnesota senate district, based on the demographic characteristics of the district. In addition, linear models were used to project the same demographic information forward to the 2020 election across the state. The model was then used to analyze potential districts for Democratic campaigns to target in 2020.
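The linear projection step can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the R code from the repository: it fits an ordinary least-squares line through three years of a single series and extrapolates to 2020. For input it uses the statewide mean of district median incomes from the summary table in the Results section, whereas the actual projections were presumably made district by district. The helper names fit_line and project are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def project(xs, ys, target_x):
    """Extrapolate the fitted line to target_x."""
    intercept, slope = fit_line(xs, ys)
    return intercept + slope * target_x


# Illustrative input: mean of district median incomes (2010, 2012, 2016).
years = [2010, 2012, 2016]
mean_median_income = [58263.58, 61079.57, 65153.85]

projected_2020 = project(years, mean_median_income, 2020)
print(f"Projected 2020 value: {projected_2020:,.0f}")
```

With only three points per series, a line like this carries a lot of uncertainty, which is worth keeping in mind when reading the 2020 projections.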

Results

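Summaries of the demographic data included in the regression model are shown in the table below. The winner variable is coded 1 for a DFL win and 0 otherwise, so its mean is the share of these races that the DFL won.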

Statistic         2010 (N = 65)                     2012 (N = 67)                     2016 (N = 67)
winner
   minimum        0.00                              0.00                              0.00
   median (IQR)   0.00 (0.00, 1.00)                 1.00 (0.00, 1.00)                 0.00 (0.00, 1.00)
   mean (sd)      0.46 ± 0.50                       0.58 ± 0.50                       0.49 ± 0.50
   maximum        1.00                              1.00                              1.00
median_income
   minimum        32,779.00                         36,227.00                         40,444.00
   median (IQR)   57,115.00 (44,799.00, 69,189.00)  59,081.00 (47,921.50, 74,187.50)  63,232.00 (52,185.50, 77,299.50)
   mean (sd)      58,263.58 ± 15,159.17             61,079.57 ± 15,698.15             65,153.85 ± 16,239.96
   maximum        91,156.00                         94,043.00                         101,286.00
median_age
   minimum        27.30                             27.60                             27.30
   median (IQR)   37.10 (35.10, 40.30)              37.90 (35.10, 41.35)              38.70 (35.95, 41.55)
   mean (sd)      37.37 ± 4.48                      37.77 ± 4.42                      38.37 ± 4.36
   maximum        46.70                             46.50                             47.00
white_prop
   minimum        0.47                              0.49                              0.48
   median (IQR)   0.92 (0.86, 0.95)                 0.92 (0.85, 0.96)                 0.90 (0.84, 0.95)
   mean (sd)      0.88 ± 0.11                       0.88 ± 0.10                       0.87 ± 0.11
   maximum        0.98                              0.97                              0.97
black_prop
   minimum        0.00                              0.00                              0.00
   median (IQR)   0.02 (0.01, 0.06)                 0.02 (0.01, 0.06)                 0.03 (0.01, 0.06)
   mean (sd)      0.05 ± 0.07                       0.04 ± 0.06                       0.05 ± 0.06
   maximum        0.35                              0.33                              0.34
native_prop
   minimum        0.00                              0.00                              0.00
   median (IQR)   0.00 (0.00, 0.01)                 0.00 (0.00, 0.01)                 0.00 (0.00, 0.01)
   mean (sd)      0.01 ± 0.02                       0.01 ± 0.02                       0.01 ± 0.02
   maximum        0.11                              0.11                              0.12
asian_prop
   minimum        0.00                              0.00                              0.00
   median (IQR)   0.02 (0.01, 0.05)                 0.03 (0.01, 0.06)                 0.03 (0.01, 0.06)
   mean (sd)      0.04 ± 0.04                       0.04 ± 0.04                       0.04 ± 0.04
   maximum        0.19                              0.19                              0.25
other_prop
   minimum        0.00                              0.00                              0.00
   median (IQR)   0.01 (0.01, 0.02)                 0.01 (0.00, 0.01)                 0.01 (0.00, 0.02)
   mean (sd)      0.01 ± 0.01                       0.01 ± 0.01                       0.01 ± 0.01
   maximum        0.06                              0.05                              0.10
hispanic_prop
   minimum        0.01                              0.01                              0.01
   median (IQR)   0.04 (0.02, 0.06)                 0.03 (0.02, 0.06)                 0.04 (0.03, 0.06)
   mean (sd)      0.05 ± 0.04                       0.05 ± 0.04                       0.05 ± 0.04
   maximum        0.24                              0.23                              0.23
year
   minimum        2,010.00                          2,012.00                          2,016.00
   median (IQR)   2,010.00 (2,010.00, 2,010.00)     2,012.00 (2,012.00, 2,012.00)     2,016.00 (2,016.00, 2,016.00)
   mean (sd)      2,010.00 ± 0.00                   2,012.00 ± 0.00                   2,016.00 ± 0.00
   maximum        2,010.00                          2,012.00                          2,016.00

There are clearly some areas of concern with modeling this data. The demographic proportion variables are highly collinear, likely because the race proportions sum to nearly one in every district, which can make the coefficient estimates unreliable or unstable (note the very large intercept and race coefficients in the table below). In this case, removing several of the proportions did not have a large effect on model predictions, so they were all included. Results for the logistic regression are shown in the table below. The only variable without a statistically significant effect is the proportion of Hispanic descent within the district.

term           estimate  std.error  statistic  p.value  exp_estimate   CI_lower       CI_upper
(Intercept)    329.66    75.33      4.38       0.00     1.484003e+143  2.858017e+110  7.705571e+175
median_income  0.00      0.00       -3.83      0.00     1.000000e+00   1.000000e+00   1.000000e+00
median_age     0.13      0.06       2.29       0.02     1.140000e+00   1.080000e+00   1.210000e+00
white_prop     -334.59   75.84      -4.41      0.00     0.000000e+00   0.000000e+00   0.000000e+00
black_prop     -334.56   85.30      -3.92      0.00     0.000000e+00   0.000000e+00   0.000000e+00
native_prop    -369.58   88.29      -4.19      0.00     0.000000e+00   0.000000e+00   0.000000e+00
asian_prop     -305.23   75.00      -4.07      0.00     0.000000e+00   0.000000e+00   0.000000e+00
other_prop     -387.25   91.32      -4.24      0.00     0.000000e+00   0.000000e+00   0.000000e+00
hispanic_prop  15.03     13.69      1.10       0.27     3.377433e+06   3.830000e+00   2.978202e+12
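For readers who want to check the last three columns, the Python sketch below exponentiates a coefficient and its interval bounds. The exp_estimate column is exp(estimate). The CI_lower and CI_upper values printed above look consistent with exp(estimate ± 1 standard error), up to rounding of the printed inputs, rather than with a 95% interval. That reading is my inference from the numbers, not something produced by the original code, so treat it as an assumption. The function name odds_ratio_interval is hypothetical.

```python
import math

# (estimate, std.error) pairs copied from the regression table above.
coefficients = {
    "median_age": (0.13, 0.06),
    "hispanic_prop": (15.03, 13.69),
}


def odds_ratio_interval(estimate, std_error, z=1.96):
    """Return (odds ratio, lower, upper) for a Wald interval of z * SE."""
    return (
        math.exp(estimate),
        math.exp(estimate - z * std_error),
        math.exp(estimate + z * std_error),
    )


for term, (estimate, std_error) in coefficients.items():
    ratio, low95, high95 = odds_ratio_interval(estimate, std_error)
    _, low1, high1 = odds_ratio_interval(estimate, std_error, z=1.0)
    print(
        f"{term}: OR={ratio:.3g}, "
        f"95% Wald ({low95:.3g}, {high95:.3g}), "
        f"+/- 1 SE ({low1:.3g}, {high1:.3g})"
    )
```

Under the 95% Wald interval, the range for hispanic_prop spans 1, which agrees with its p-value of 0.27.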

Strategy

A predictive model is useful for putting specific races in context, and it provides a lens for targeting races. The most obvious strategy, however, is simply to examine the close races from 2016.

Vulnerable Democratic Districts

The following 10 districts are those that Democrats won most narrowly in 2016, ranked by the raw vote difference between the top two candidates.

District  Democratic Candidate  Democratic Votes  Percent DFL  Republican Candidate  Republican Votes  Percent R  Margin of Victory
58        Matt Little           22833             50.38        Tim Pitcher           22446             49.53      387
53        Susan Kent            23035             50.38        Sharna Wahlgren       22636             49.51      399
36        John Hoffman          21793             51.00        Jeffrey Lunde         20840             48.77      953
48        Steve Cwodzinski      24303             51.10        David Hann            23205             48.79      1098
37        Jerry Newton          22129             51.41        Brad Sanford          20838             48.41      1291
54        Dan Schoen            22162             53.13        Leilani Holmstadt     19480             46.70      2682
57        Greg Clausen          24519             53.06        Cory Campbell         21633             46.81      2886
11        Tony Lourey           20519             54.50        Michael Cummins       17079             45.36      3440
27        Dan Sparks            20540             54.76        Gene Dornink          16944             45.17      3596
51        Jim Carlson           24358             54.04        Victor Lake           20662             45.84      3696
Total:    -                     226191            NA           -                     205763            NA         20428
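The ranking in the table above can be reproduced with a few lines. The Python sketch below uses three rows copied from the table and assumes the top two candidates in each race are the DFL and Republican candidates; the function name closest_races is hypothetical.

```python
# (district, DFL votes, Republican votes) for three rows of the table above.
results_2016 = [
    (36, 21793, 20840),
    (58, 22833, 22446),
    (53, 23035, 22636),
]


def closest_races(results, limit=10):
    """Rank races by raw vote margin between the two listed candidates."""
    margins = [(district, abs(dfl - gop)) for district, dfl, gop in results]
    return sorted(margins, key=lambda row: row[1])[:limit]


for district, margin in closest_races(results_2016):
    print(f"District {district}: margin {margin}")
```

Ranking by raw votes rather than percentage points favors districts where a fixed number of additional votes would flip the result, which is a natural measure when thinking about field investment.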

Vulnerable Republican Districts

The following table shows vulnerable Republican districts using the same methodology as the table above.

District  Democratic Candidate  Democratic Votes  Percent DFL  Republican Candidate  Republican Votes  Percent R  Margin of Victory
14        Dan Wolgamott         17378             47.02        Jerry Relph           17519             47.40      141
44        Deb Calvert           25114             49.74        Paul Anderson         25309             50.13      195
5         Tom Saxhaug           19687             49.21        Justin Eichorn        20240             50.59      553
20        Kevin L. Dahle        20577             47.95        Rich Draheim          22274             51.91      1697
21        Matt Schmit           19282             45.67        Mike Goggin           22901             54.24      3619
56        Phillip M. Sterner    19178             44.75        Dan Hall              23602             55.07      4424
26        Rich Wright           18317             43.95        Carla Nelson          23325             55.96      5008
2         Rod Skoe              17002             43.29        Paul Utke             22232             56.60      5230
32        Tim Nelson            18388             43.33        Mark Koran            23992             56.53      5604
17        Lyle Koenen           16713             42.67        Andrew Lang           22421             57.25      5708
Total:    -                     191636            NA           -                     223815            NA         32179

Projections

The model was used to create predictions from the projected 2020 demographics. A summary of the predictions versus the actual results is shown in the table below. Based purely on demographic factors, Democrats underperformed the model in 2016, winning 33 seats against a predicted 40. Interestingly, the model predicts the same number of seats in 2020 as in 2016. There are obviously many electoral factors that will be different in 2020 compared to 2016. Actually predicting the results will clearly not be possible with this dataset, but it can be used to guide decision making.

year  Predicted Democratic Seats  Actual Democratic Seats
2010  28                          30
2012  33                          39
2016  40                          33
2020  40                          NA
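As a quick check on how far the seat predictions miss, the Python sketch below computes predicted minus actual seats for the three years with known results. Since the model was fit on these same elections, these are presumably in-sample errors (an assumption on my part), so they likely understate the error to expect for 2020.

```python
# year -> (predicted Democratic seats, actual Democratic seats), from the table.
seats = {
    2010: (28, 30),
    2012: (33, 39),
    2016: (40, 33),
}

# Positive error: the model predicted more seats than Democrats actually won.
errors = {
    year: predicted - actual for year, (predicted, actual) in seats.items()
}
mean_abs_error = sum(abs(e) for e in errors.values()) / len(errors)

for year, error in errors.items():
    print(f"{year}: predicted minus actual = {error:+d}")
print(f"Mean absolute error: {mean_abs_error:.1f} seats")
```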

To use the model for district targeting, the districts that Democratic candidates lost in 2016 but that had favorable demographics are shown below, ranked by modeled win probability:

District  Percent DFL  Percent R  Margin of Victory  Modeled Win Probability (%)  Voter Turnout (%)
14        47.02        47.40      141                87.67                        57.42
26        43.95        55.96      5008               65.65                        66.50
56        44.75        55.07      4424               63.69                        69.06
55        31.24        68.53      15850              57.27                        70.63
2         43.29        56.60      5230               54.69                        66.44
5         49.21        50.59      553                54.38                        63.38
10        35.56        64.31      12483              54.01                        69.72
1         38.54        61.41      8607               53.37                        62.22
9         28.71        71.19      16558              50.31                        65.46
22        29.72        70.20      14859              41.36                        62.19

The most important information that can be gleaned from this table is the set of districts that did not appear in the Vulnerable Republican Districts table above. These are districts that would not have been identified as winnable simply from vote totals: districts 1, 9, 10, 22, and 55. There are likely other factors that make these districts outliers, but they are still areas that are demographically favorable to Democrats.
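The comparison between the two lists is a simple set difference. The Python sketch below uses the district numbers copied from the two tables above; the variable names are my own.

```python
# Districts from the modeled-win-probability table.
modeled_targets = [14, 26, 56, 55, 2, 5, 10, 1, 9, 22]
# Districts from the Vulnerable Republican Districts table (close 2016 losses).
close_losses = [14, 44, 5, 20, 21, 56, 26, 2, 32, 17]

# Flagged only by the demographic model.
model_only = sorted(set(modeled_targets) - set(close_losses))
# Flagged only by a narrow 2016 margin.
margin_only = sorted(set(close_losses) - set(modeled_targets))
# Flagged by both approaches.
both = sorted(set(modeled_targets) & set(close_losses))

print("Model only: ", model_only)
print("Margin only:", margin_only)
print("Both:       ", both)
```

Districts that appear in both lists are arguably the strongest targets, since they were close in 2016 and are also demographically favorable.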

Visualizations

While working through this data, I thought it was important to visualize the results to challenge my assumptions and ensure the results appeared consistent. I may work this into a full Shiny app in the future, but for now the visuals remain separate, as shown below:

[Figure: four plots (images not available)]

Limitations

There are many clear limitations with the current data. To begin with, the data does not include incumbency information or information on the presence of third-party candidates. The ACS estimates cover 5-year periods, so they are likely not truly representative of a district on election day, particularly in districts undergoing rapid demographic change. There is also no polling data, which could be used to analyze likely voters and general trends that do not show up in demographic data.

Regarding the modeling, there are several key issues. The projections are simple linear projections from the ACS data, and thus have a high degree of uncertainty. In addition, there is no voter turnout model in the current iteration of this model. Multicollinearity could also be a problem, given all of the demographic proportions included in the model. Finally, logistic regression may not be the most appropriate model for this problem, and other methods may have better predictive power.

Future Work

There is clearly a lot of room to expand this analysis. To begin with, I plan on branching this project out in two ways. First, I want to expand the analysis to include the Minnesota House of Representatives. Combining the two analyses should allow for a greater level of detail and, theoretically, better projections. Second, I would like to spend another post going into greater depth on building the most predictive model possible for Minnesota elections. To do this, I will address as many of the aforementioned limitations as possible and explore other modeling techniques with this data. Finally, once a more complete dataset and model are in place, I would like to develop a Shiny app for interactive visualization.