MN Senate Election Targeting 2020
Note
The majority of this post was written before the murder of George Floyd. While this post explores the electoral impact of demographics in Minnesota broadly, I am not going to try to predict how these events may affect the coming election. If you support data-driven solutions to real-world problems, consider donating to Campaign Zero.
Introduction
This analysis focuses on which districts should see the most investment in a Democratic strategy for the Minnesota State Senate, and presents a first draft of a predictive model for the 2020 Minnesota State Senate elections. Data should certainly inform decision making around elections, but in cases of high uncertainty, like elections, data should not be a stand-in for sound strategy based on reasonable assumptions. This is important for understanding this model and post: it currently incorporates demographic data, but does not include current estimates of public sentiment or a likely-voter turnout model for the 2020 election.
The code for this project can be found in the following GitHub repository: https://github.com/tajubenv/Minnesota_Election_Analysis.
Data Sources
Election results were pulled from the Minnesota Secretary of State elections website, https://www.sos.state.mn.us/elections-voting/election-results. Demographic data was pulled from the 5-year American Community Survey (ACS) data, https://www.census.gov/programs-surveys/acs/data.html, using the tidycensus package in R. The ACS data has varying levels of depth between the 1-, 3-, and 5-year datasets. The 5-year data is the only survey that contains State Senate district level data for Minnesota. The 5-year estimates are calculated from data collected during the 5 years ending in the year given on the dataset. Thus the 2016 ACS estimate includes data from 2012-2016. This poses the obvious issue that it is desirable to understand the demographics on election day, rather than an estimate covering the previous 5 years. This analysis is focused more on strategy and prediction than on inference, so it is most appropriate to use resources that will be available before the election in November. As a result, the ACS estimates are matched to the election in the year they end, rather than offset in any manner.
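To make the pooling window concrete, here is a minimal sketch of which survey years feed each estimate used in this post. The project itself is written in R; this is a standalone Python illustration, and the helper name `acs_window` is hypothetical.

```python
from typing import Tuple


def acs_window(end_year: int, span: int = 5) -> Tuple[int, int]:
    """Return the (first, last) survey years pooled into a multi-year ACS estimate."""
    return end_year - span + 1, end_year


# Each election is matched to the ACS release ending in the same year.
for election_year in (2010, 2012, 2016):
    first, last = acs_window(election_year)
    print(f"{election_year} election <- ACS {first}-{last}")
```

The windows for 2010 and 2012 overlap heavily, which is one reason the demographic summaries for those two years look so similar.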
Methods
Data was collected from the ACS 5-year survey for 2010, 2012, and 2016 for all state senate districts that had a Democratic-Farmer-Labor (DFL) Party candidate running in those same years. Specific demographic data included median income, median age, and the proportions of the district that were White, Black, Native American, Asian, or Other as defined by the ACS data. The proportion of Hispanic descent was also included. All proportions for this model are proportions of the voting age population.
After data collection and cleaning, a logistic regression was fit to predict the likelihood of a Democratic victory within a Minnesota senate district, based on the demographic characteristics of the district. In addition, linear models were used to project the same demographic information forward to the 2020 election across the state. The model was then used to identify potential districts for Democratic campaigns to target in 2020.
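The post does not spell out the exact specification of the linear projections, so the following is only a sketch of the general idea: fit a least-squares line through the three ACS years for one variable and extend it to 2020. It is written in Python rather than the project's R, the helper `fit_line` is hypothetical, and the statewide mean of `median_income` from the summary table in the Results section stands in for a single district's series.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Statewide mean of district median income for each ACS year (from the summary table).
years = [2010, 2012, 2016]
mean_median_income = [58263.58, 61079.57, 65153.85]

slope, intercept = fit_line(years, mean_median_income)
projected_2020 = slope * 2020 + intercept
print(f"trend: {slope:,.0f} per year, projected 2020 value: {projected_2020:,.0f}")
```

With only three points per series, a projection like this carries a lot of uncertainty, which is worth keeping in mind when reading the 2020 numbers.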
Results
Summaries for the demographic data included in the regression model are shown in the table below.
statistic | year: 2010 (N = 65) | year: 2012 (N = 67) | year: 2016 (N = 67) |
---|---|---|---|
winner | |||
minimum | 0.00 | 0.00 | 0.00 |
median (IQR) | 0.00 (0.00, 1.00) | 1.00 (0.00, 1.00) | 0.00 (0.00, 1.00) |
mean (sd) | 0.46 ± 0.50 | 0.58 ± 0.50 | 0.49 ± 0.50 |
maximum | 1.00 | 1.00 | 1.00 |
median_income | |||
minimum | 32,779.00 | 36,227.00 | 40,444.00 |
median (IQR) | 57,115.00 (44,799.00, 69,189.00) | 59,081.00 (47,921.50, 74,187.50) | 63,232.00 (52,185.50, 77,299.50) |
mean (sd) | 58,263.58 ± 15,159.17 | 61,079.57 ± 15,698.15 | 65,153.85 ± 16,239.96 |
maximum | 91,156.00 | 94,043.00 | 101,286.00 |
median_age | |||
minimum | 27.30 | 27.60 | 27.30 |
median (IQR) | 37.10 (35.10, 40.30) | 37.90 (35.10, 41.35) | 38.70 (35.95, 41.55) |
mean (sd) | 37.37 ± 4.48 | 37.77 ± 4.42 | 38.37 ± 4.36 |
maximum | 46.70 | 46.50 | 47.00 |
white_prop | |||
minimum | 0.47 | 0.49 | 0.48 |
median (IQR) | 0.92 (0.86, 0.95) | 0.92 (0.85, 0.96) | 0.90 (0.84, 0.95) |
mean (sd) | 0.88 ± 0.11 | 0.88 ± 0.10 | 0.87 ± 0.11 |
maximum | 0.98 | 0.97 | 0.97 |
black_prop | |||
minimum | 0.00 | 0.00 | 0.00 |
median (IQR) | 0.02 (0.01, 0.06) | 0.02 (0.01, 0.06) | 0.03 (0.01, 0.06) |
mean (sd) | 0.05 ± 0.07 | 0.04 ± 0.06 | 0.05 ± 0.06 |
maximum | 0.35 | 0.33 | 0.34 |
native_prop | |||
minimum | 0.00 | 0.00 | 0.00 |
median (IQR) | 0.00 (0.00, 0.01) | 0.00 (0.00, 0.01) | 0.00 (0.00, 0.01) |
mean (sd) | 0.01 ± 0.02 | 0.01 ± 0.02 | 0.01 ± 0.02 |
maximum | 0.11 | 0.11 | 0.12 |
asian_prop | |||
minimum | 0.00 | 0.00 | 0.00 |
median (IQR) | 0.02 (0.01, 0.05) | 0.03 (0.01, 0.06) | 0.03 (0.01, 0.06) |
mean (sd) | 0.04 ± 0.04 | 0.04 ± 0.04 | 0.04 ± 0.04 |
maximum | 0.19 | 0.19 | 0.25 |
other_prop | |||
minimum | 0.00 | 0.00 | 0.00 |
median (IQR) | 0.01 (0.01, 0.02) | 0.01 (0.00, 0.01) | 0.01 (0.00, 0.02) |
mean (sd) | 0.01 ± 0.01 | 0.01 ± 0.01 | 0.01 ± 0.01 |
maximum | 0.06 | 0.05 | 0.10 |
hispanic_prop | |||
minimum | 0.01 | 0.01 | 0.01 |
median (IQR) | 0.04 (0.02, 0.06) | 0.03 (0.02, 0.06) | 0.04 (0.03, 0.06) |
mean (sd) | 0.05 ± 0.04 | 0.05 ± 0.04 | 0.05 ± 0.04 |
maximum | 0.24 | 0.23 | 0.23 |
year | |||
minimum | 2,010.00 | 2,012.00 | 2,016.00 |
median (IQR) | 2,010.00 (2,010.00, 2,010.00) | 2,012.00 (2,012.00, 2,012.00) | 2,016.00 (2,016.00, 2,016.00) |
mean (sd) | 2,010.00 ± 0.00 | 2,012.00 ± 0.00 | 2,016.00 ± 0.00 |
maximum | 2,010.00 | 2,012.00 | 2,016.00 |
There are some areas of concern in modeling this data. The demographic proportion variables are highly collinear, which can make the coefficient estimates unreliable or unstable. In this case, removing several of the proportions did not have a large effect on model predictions, so they were all included. Results for the logistic regression are shown in the table below. The only variable without a statistically significant effect is the proportion of Hispanic descent within the district.
term | estimate | std.error | statistic | p.value | exp_estimate | CI_lower | CI_upper |
---|---|---|---|---|---|---|---|
(Intercept) | 329.66 | 75.33 | 4.38 | 0.00 | 1.484003e+143 | 2.858017e+110 | 7.705571e+175 |
median_income | 0.00 | 0.00 | -3.83 | 0.00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 |
median_age | 0.13 | 0.06 | 2.29 | 0.02 | 1.140000e+00 | 1.080000e+00 | 1.210000e+00 |
white_prop | -334.59 | 75.84 | -4.41 | 0.00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
black_prop | -334.56 | 85.30 | -3.92 | 0.00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
native_prop | -369.58 | 88.29 | -4.19 | 0.00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
asian_prop | -305.23 | 75.00 | -4.07 | 0.00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
other_prop | -387.25 | 91.32 | -4.24 | 0.00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
hispanic_prop | 15.03 | 13.69 | 1.10 | 0.27 | 3.377433e+06 | 3.830000e+00 | 2.978202e+12 |
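The exponentiated columns are hard to read at this scale, so here is a small Python sketch (the project itself uses R) that checks how they relate to the estimates. From the printed values, `CI_lower` and `CI_upper` appear to equal exp(estimate ± std.error), about one standard error on either side, which is narrower than a conventional 95% interval. That reading is inferred from the numbers in the table, not from the project code.

```python
import math

# (estimate, std.error, exp_estimate, CI_lower, CI_upper), copied from the table above.
rows = {
    "(Intercept)": (329.66, 75.33, 1.484003e143, 2.858017e110, 7.705571e175),
    "hispanic_prop": (15.03, 13.69, 3.377433e06, 3.83, 2.978202e12),
}

for term, (est, se, exp_est, ci_lower, ci_upper) in rows.items():
    # Compare on the log scale, where the values are small enough to read.
    print(term)
    print("  log(exp_estimate) =", round(math.log(exp_est), 2), "vs estimate", est)
    print("  log(CI_lower)     =", round(math.log(ci_lower), 2), "vs estimate - se", round(est - se, 2))
    print("  log(CI_upper)     =", round(math.log(ci_upper), 2), "vs estimate + se", round(est + se, 2))

# A conventional 95% Wald interval uses 1.96 standard errors instead of 1.
est, se = rows["hispanic_prop"][:2]
wald_95_log = (est - 1.96 * se, est + 1.96 * se)
print("hispanic_prop 95% interval on the log-odds scale:", wald_95_log)

# exp(coefficient) is the odds multiplier for a change of 1.0 in the predictor,
# which for a proportion means going from 0% to 100%. Rescale to one percentage point.
white_coef = -334.59
odds_ratio_per_point = math.exp(white_coef * 0.01)
print("odds multiplier per +1 point of white_prop:", round(odds_ratio_per_point, 4))
```

The 95% interval for `hispanic_prop` spans zero on the log-odds scale, which agrees with its p-value of 0.27. The per-point odds multiplier should not be read in isolation: because the race proportions move together, one cannot rise without another falling.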
Strategy
Developing a model to help predict outcomes is useful for putting specific races in context, and it provides a lens for targeting races. The most obvious strategy, however, is to simply examine the close races from 2016.
Vulnerable Democratic Districts
The following 10 districts are those that Democrats won most narrowly in 2016, ranked by the raw vote difference between the top two candidates.
District | Democratic Candidate | Democratic Votes | Percent DFL | Republican Candidate | Republican Votes | Percent R | Margin of Victory |
---|---|---|---|---|---|---|---|
58 | Matt Little | 22833 | 50.38 | Tim Pitcher | 22446 | 49.53 | 387 |
53 | Susan Kent | 23035 | 50.38 | Sharna Wahlgren | 22636 | 49.51 | 399 |
36 | John Hoffman | 21793 | 51.00 | Jeffrey Lunde | 20840 | 48.77 | 953 |
48 | Steve Cwodzinski | 24303 | 51.10 | David Hann | 23205 | 48.79 | 1098 |
37 | Jerry Newton | 22129 | 51.41 | Brad Sanford | 20838 | 48.41 | 1291 |
54 | Dan Schoen | 22162 | 53.13 | Leilani Holmstadt | 19480 | 46.70 | 2682 |
57 | Greg Clausen | 24519 | 53.06 | Cory Campbell | 21633 | 46.81 | 2886 |
11 | Tony Lourey | 20519 | 54.50 | Michael Cummins | 17079 | 45.36 | 3440 |
27 | Dan Sparks | 20540 | 54.76 | Gene Dornink | 16944 | 45.17 | 3596 |
51 | Jim Carlson | 24358 | 54.04 | Victor Lake | 20662 | 45.84 | 3696 |
Total: | - | 226191 | NA | - | 205763 | NA | 20428 |
Vulnerable Republican Districts
The following table shows vulnerable Republican districts using the same methodology as the table above.
District | Democratic Candidate | Democratic Votes | Percent DFL | Republican Candidate | Republican Votes | Percent R | Margin of Victory |
---|---|---|---|---|---|---|---|
14 | Dan Wolgamott | 17378 | 47.02 | Jerry Relph | 17519 | 47.40 | 141 |
44 | Deb Calvert | 25114 | 49.74 | Paul Anderson | 25309 | 50.13 | 195 |
5 | Tom Saxhaug | 19687 | 49.21 | Justin Eichorn | 20240 | 50.59 | 553 |
20 | Kevin L. Dahle | 20577 | 47.95 | Rich Draheim | 22274 | 51.91 | 1697 |
21 | Matt Schmit | 19282 | 45.67 | Mike Goggin | 22901 | 54.24 | 3619 |
56 | Phillip M. Sterner | 19178 | 44.75 | Dan Hall | 23602 | 55.07 | 4424 |
26 | Rich Wright | 18317 | 43.95 | Carla Nelson | 23325 | 55.96 | 5008 |
2 | Rod Skoe | 17002 | 43.29 | Paul Utke | 22232 | 56.60 | 5230 |
32 | Tim Nelson | 18388 | 43.33 | Mark Koran | 23992 | 56.53 | 5604 |
17 | Lyle Koenen | 16713 | 42.67 | Andrew Lang | 22421 | 57.25 | 5708 |
Total: | - | 191636 | NA | - | 223815 | NA | 32179 |
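The ranking used in both tables can be sketched in a few lines. This is a Python illustration rather than the project's R code, using the first four rows of the Republican table; the variable names are hypothetical.

```python
# (district, DFL votes, Republican votes) for 2016, from the table above.
results_2016 = [
    (20, 20577, 22274),
    (5, 19687, 20240),
    (44, 25114, 25309),
    (14, 17378, 17519),
]

# Margin of victory is the raw vote gap between the top two candidates.
margins = [(district, abs(dfl - rep)) for district, dfl, rep in results_2016]

# Rank from the narrowest race to the widest.
ranked = sorted(margins, key=lambda row: row[1])
for district, margin in ranked:
    print(f"District {district}: decided by {margin} votes")
```

Ranking on raw votes rather than percentage points means that low-turnout districts tend to look closer, which is one reason turnout appears alongside the margin in the targeting table later in the post.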
Projections
The model was used to create predictions from the projected 2020 demographics. A summary of the projections versus actual results is shown in the table below. Based purely on demographic factors, the model indicates that Democrats underperformed in 2016, winning 33 seats against a predicted 40. Interestingly, the model predicts the same number of seats in 2020 as in 2016. There are obviously many electoral factors that will be different in 2020 compared to 2016. Actually predicting the results will not be possible with this dataset, but it can be used to guide decision making.
year | Predicted Democratic Seats | Actual Democratic Seats |
---|---|---|
2010 | 28 | 30 |
2012 | 33 | 39 |
2016 | 40 | 33 |
2020 | 40 | NA |
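As a quick check on how far the demographic model has been from reality, the seat-count errors from the table above can be computed directly. This is a small Python sketch using only the numbers in the table.

```python
# year -> (predicted Democratic seats, actual Democratic seats), from the table above.
seats = {2010: (28, 30), 2012: (33, 39), 2016: (40, 33)}

# Positive error means the model expected more Democratic seats than were won.
errors = {year: predicted - actual for year, (predicted, actual) in seats.items()}
mean_abs_error = sum(abs(e) for e in errors.values()) / len(errors)

for year, error in errors.items():
    print(f"{year}: predicted minus actual = {error:+d}")
print("mean absolute error (seats):", mean_abs_error)
```

An average miss of several seats in a 67-seat chamber is large relative to the size of a typical majority, which supports treating the 2020 figure as a guide for targeting rather than a forecast.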
To use the model for district targeting, districts that Democratic candidates lost in 2016 but that had favorable demographics are shown below:
District | Percent DFL | Percent R | Margin of Victory | Modeled Win Probability (%) | Voter Turnout (%) |
---|---|---|---|---|---|
14 | 47.02 | 47.40 | 141 | 87.67 | 57.42 |
26 | 43.95 | 55.96 | 5008 | 65.65 | 66.50 |
56 | 44.75 | 55.07 | 4424 | 63.69 | 69.06 |
55 | 31.24 | 68.53 | 15850 | 57.27 | 70.63 |
2 | 43.29 | 56.60 | 5230 | 54.69 | 66.44 |
5 | 49.21 | 50.59 | 553 | 54.38 | 63.38 |
10 | 35.56 | 64.31 | 12483 | 54.01 | 69.72 |
1 | 38.54 | 61.41 | 8607 | 53.37 | 62.22 |
9 | 28.71 | 71.19 | 16558 | 50.31 | 65.46 |
22 | 29.72 | 70.20 | 14859 | 41.36 | 62.19 |
The most important information in this table is the set of districts that did not appear in the vulnerable Republican districts table above. These are districts that would not have been identified as winnable from vote totals alone: districts 1, 9, 10, 22, and 55. There are likely other factors that play an important part in these outlier districts, but they are still areas that are demographically favorable to Democrats.
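The comparison between the two tables amounts to a set difference, sketched here in Python with the district numbers and modeled probabilities copied from the tables in this post. The 50% cutoff at the end is an illustrative assumption, not a threshold used by the model.

```python
# Modeled win probability (%) for districts Democrats lost in 2016, from the table above.
modeled_win_probability = {
    14: 87.67, 26: 65.65, 56: 63.69, 55: 57.27, 2: 54.69,
    5: 54.38, 10: 54.01, 1: 53.37, 9: 50.31, 22: 41.36,
}

# Districts in the vulnerable Republican table, identified from vote margins alone.
close_losses = {14, 44, 5, 20, 21, 56, 26, 2, 32, 17}

# Districts surfaced by the model that vote totals alone would have missed.
model_only = sorted(d for d in modeled_win_probability if d not in close_losses)
print("model-only targets:", model_only)

# Illustrative cutoff: keep those where the model gives Democrats at least even odds.
at_least_even = [d for d in model_only if modeled_win_probability[d] >= 50]
print("with modeled probability >= 50%:", at_least_even)
```

District 22 falls out under the 50% cutoff, so it is the weakest of the five model-only candidates on demographics.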
Visualizations
While working through this data, I thought it was important to visualize the results to challenge my assumptions and ensure the results appeared consistent. I may work this into a full Shiny app in the future, but for now the visuals remain separate, as shown below:
Limitations
There are a number of clear limitations with the current data. To begin with, the data does not include incumbent information or information on the presence of third-party candidates. The ACS estimates cover 5-year periods, so they are likely not truly representative of the district on election day, particularly in districts that undergo rapid demographic change. There is also no polling data, which could help identify likely voters as well as general trends that do not show up in demographic data.
Regarding the modeling, there are several key issues. The projections are simple linear projections from the ACS data, and thus have a high degree of uncertainty. In addition, there is no voter turnout model in the current iteration of this model. Multicollinearity could also be a problem, since all of the demographic proportions are included in the model. Finally, logistic regression may not be the most appropriate model for this problem. Other methods may result in better predictive power.
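The multicollinearity concern can be made concrete with the fitted coefficients. If the five race proportions sum to (nearly) one, a constant can be moved between the intercept and the race coefficients without changing any prediction, so the individual values are poorly determined. The Python sketch below uses the coefficients from the regression table and a hypothetical district whose proportions sum to exactly one; it covers only the intercept and race terms of the linear predictor.

```python
# Intercept and race coefficients from the logistic regression table.
intercept = 329.66
race_coefs = {
    "white_prop": -334.59,
    "black_prop": -334.56,
    "native_prop": -369.58,
    "asian_prop": -305.23,
    "other_prop": -387.25,
}

# Hypothetical district (not from the data) whose race proportions sum to 1.
district = {
    "white_prop": 0.90,
    "black_prop": 0.05,
    "native_prop": 0.01,
    "asian_prop": 0.03,
    "other_prop": 0.01,
}


def race_part(intercept_value, coefs, props):
    """Intercept plus race terms of the linear predictor (log-odds scale)."""
    return intercept_value + sum(coefs[name] * props[name] for name in coefs)


# Move the whole intercept into the race coefficients.
shift = intercept
shifted_intercept = intercept - shift
shifted_coefs = {name: value + shift for name, value in race_coefs.items()}

original = race_part(intercept, race_coefs, district)
shifted = race_part(shifted_intercept, shifted_coefs, district)
print("original:", round(original, 4), "shifted:", round(shifted, 4))
print("shifted coefficients:", {k: round(v, 2) for k, v in shifted_coefs.items()})
```

Both parameterizations give the same log-odds for this district, which would explain why an intercept near 330 is paired with race coefficients near -330: the large values mostly cancel each other. Dropping one race category as a reference level is a common way to remove this ambiguity.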
Future Work
There is clearly a lot of room to expand this analysis. To begin with, I plan on branching this project out in two ways. First, I want to expand this analysis to include the Minnesota House of Representatives. Combining the two analyses should allow for a greater level of detail and, theoretically, better projections. Second, I would like to spend another post going into greater depth on building the most predictive model possible for Minnesota elections. To do this, I will address as many of the aforementioned limitations as possible and explore other modeling techniques with this data. Finally, once a more complete dataset and model are in place, I would like to develop a Shiny app for interactive visualization.