Could San Francisco have predicted whether Kansas City was going to run or pass using only the NFL's own play-by-play data?
With the Super Bowl a month in the past, this analysis may be a little late, but the timing does give us the opportunity to see how accurate a model would have been, using the actual Super Bowl plays as the test set.
Implementation
Libraries
Here I will be using the nflscrapR package. There are a couple of quirks that we will get into later, but overall I found it quite simple to use.
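A minimal setup chunk, assuming nflscrapR (installed from GitHub) and the tidyverse are the only packages needed:

```r
# nflscrapR is not on CRAN; install it from GitHub if needed:
# devtools::install_github("maksimhorowitz/nflscrapR")
library(nflscrapR)  # NFL play-by-play scraping
library(tidyverse)  # data manipulation (dplyr, tidyr) and piping
```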
Data Import
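The import below is a sketch using nflscrapR's scrape_season_play_by_play(); 2019 is the season that ended with this Super Bowl.

```r
# Scrape play-by-play for the 2019 regular season and postseason
# (the season that ended with Super Bowl LIV). These calls scrape the
# NFL's feeds game by game and can take a long time to run.
reg_pbp  <- scrape_season_play_by_play(2019, type = "reg")
post_pbp <- scrape_season_play_by_play(2019, type = "post")

# Combine into one data frame covering the whole season.
pbp <- bind_rows(reg_pbp, post_pbp)
```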
Kansas City Predictions
To start with, I am going to predict only Kansas City's offensive play calls, using only data from previous Kansas City games.
Data Capture and Cleaning
The data capture and cleaning are pretty straightforward for this analysis, as nflscrapR returns fairly clean data. The code chunk below walks through the preparation and modeling, with comments explaining each step.
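Here is a sketch of that chunk, given the pbp frame from the import step. One quirk worth flagging: the current scraper returns lowercase column names (play_type, game_seconds_remaining, score_differential), so they are renamed below to match the TimeSecs / ScoreDiff names that appear in the model output.

```r
# The Super Bowl is the last postseason game, so it carries the largest
# (date-based) game_id; identify it now so it can be held out as the test set.
sb_game_id <- max(post_pbp$game_id)

# Keep only Kansas City offensive snaps that were actual run or pass plays
# (dropping kneels, spikes, kicks, penalties, etc.), excluding the Super Bowl.
kc_data <- pbp %>%
  filter(posteam == "KC",
         play_type %in% c("run", "pass"),
         game_id != sb_game_id) %>%
  transmute(
    pass      = as.numeric(play_type == "pass"),  # outcome: 1 = pass, 0 = run
    TimeSecs  = game_seconds_remaining,           # seconds left in the game
    down      = factor(down),                     # down as a categorical variable
    ydstogo,                                      # yards to go for a first down
    ScoreDiff = score_differential                # offense score minus defense score
  ) %>%
  drop_na()

# Logistic regression: model the probability of a pass from the situation.
kc_model <- glm(pass ~ TimeSecs + down + ydstogo + ScoreDiff,
                data = kc_data, family = binomial)
```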
Modeling and Evaluation
I went with a logistic regression for this prediction, since several situational variables are being used to predict a binary outcome (run vs. pass). Now we can take a look at the model coefficients and p-values.
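Assuming the kc_model fit from the sketch above, summary() prints those:

```r
summary(kc_model)  # coefficient estimates, standard errors, and p-values
```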
From the output, it appears that all of the included variables are statistically significant given the other variables in the model, with the exception of the 4th-down indicator. The coefficients are not directly interpretable in their current form, however. Since logistic regression coefficients are log odds ratios, they need to be exponentiated before they can be interpreted.
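With the model object from above, that is a one-liner:

```r
exp(coef(kc_model))  # convert log-odds coefficients to odds ratios
```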
| Variable | Odds ratio |
|:---|---:|
| (Intercept) | 0.1290350 |
| TimeSecs | 1.0002012 |
| down2 | 2.5221472 |
| down3 | 8.6059214 |
| down4 | 1.3607891 |
| ydstogo | 1.2314592 |
| ScoreDiff | 0.9679319 |
Now the interpretation of the coefficients is clearer. For example, for every additional point of score differential, the odds that the team passes are multiplied by 0.97. This means that a team down by a touchdown (a score differential of -7) has roughly 0.97^-7 ≈ 1.26 times the odds of passing compared to a tied game. In addition, on 2nd down, the odds of a pass are about 2.5 times what they are on 1st down. These results basically match how you would expect this to play out intuitively, so that is an encouraging sign (confirmation bias aside).
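A quick check of that arithmetic in R:

```r
or <- exp(coef(kc_model))["ScoreDiff"]  # ~0.968 per point of score differential
or^(-7)  # odds multiplier when trailing by a touchdown: about 1.26
```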
Model Analysis
Next we will evaluate how this model performs using the Super Bowl as the test set. It would probably make more sense to use cross-validation with random training and test splits, but I'm trying to show the 49ers that if they hired me they could win the next Super Bowl. Since the logistic model outputs a probability between 0 and 1 that a given play is a pass, a threshold needs to be selected for when to predict a pass. As we are simply trying to predict which of the two options will happen, and we have an equal tolerance for false positives and false negatives, we will set the threshold at 0.5.
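Here is a sketch of that evaluation, again under the assumptions above; the confusion_table() helper is a name of my own, not part of nflscrapR:

```r
# Kansas City's Super Bowl offensive plays, prepared exactly like the
# training data.
sb_data <- pbp %>%
  filter(game_id == sb_game_id,
         posteam == "KC",
         play_type %in% c("run", "pass")) %>%
  transmute(
    pass      = as.numeric(play_type == "pass"),
    TimeSecs  = game_seconds_remaining,
    down      = factor(down),
    ydstogo,
    ScoreDiff = score_differential
  ) %>%
  drop_na()

# Call a play a predicted pass when the fitted probability clears the
# threshold, then cross-tabulate predictions against what actually happened.
confusion_table <- function(model, newdata, threshold = 0.5) {
  prob <- predict(model, newdata = newdata, type = "response")
  table(
    Actual    = ifelse(newdata$pass == 1, "Pass", "Run"),
    Predicted = ifelse(prob > threshold, "Pass", "Run")
  )
}

confusion_table(kc_model, sb_data)
```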
| | Predicted Run | Predicted Pass |
|:---|---:|---:|
| Actual Run | 9 | 17 |
| Actual Pass | 2 | 44 |
The output table shows the predictions against the actual results of the plays. Calculating the proportion of correct calls, (9 + 44) / 72, shows that the model was 73.6% accurate at predicting Kansas City's play calls in the Super Bowl. What does that really mean, though, and is it even good? To find out, the model can be compared to a more generalized model that uses all NFL data, to see whether evaluating the Chiefs' games by themselves is even worthwhile.
Generalized Model
The previous steps are repeated with some minor changes to include data from the entire NFL regular season and postseason.
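Sketching that change under the same assumptions, the team filter is simply dropped before refitting:

```r
# Same preparation as before, but keeping every team's offensive snaps
# (the Super Bowl is still held out as the test set).
nfl_data <- pbp %>%
  filter(play_type %in% c("run", "pass"),
         game_id != sb_game_id) %>%
  transmute(
    pass      = as.numeric(play_type == "pass"),
    TimeSecs  = game_seconds_remaining,
    down      = factor(down),
    ydstogo,
    ScoreDiff = score_differential
  ) %>%
  drop_na()

# Refit the same logistic regression on the league-wide data.
nfl_model <- glm(pass ~ TimeSecs + down + ydstogo + ScoreDiff,
                 data = nfl_data, family = binomial)
```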
The exponentiated model coefficients are shown below so they may again be interpreted as odds ratios.
| Variable | Odds ratio |
|:---|---:|
| (Intercept) | 0.2554199 |
| TimeSecs | 0.9999454 |
| down2 | 2.2972596 |
| down3 | 7.1469689 |
| down4 | 3.6359752 |
| ydstogo | 1.1452022 |
| ScoreDiff | 0.9659611 |
The coefficients are roughly in line with those from the Kansas City model, which is to be expected. We will now use this model to predict the Super Bowl plays.
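Reusing the (hypothetical) helper from earlier on the same held-out Super Bowl plays:

```r
confusion_table(nfl_model, sb_data)
```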
| | Predicted Run | Predicted Pass |
|:---|---:|---:|
| Actual Run | 10 | 16 |
| Actual Pass | 14 | 32 |
The new model resulted in an accuracy of (10 + 32) / 72 ≈ 58.3%. That is a drop of 15.3 percentage points, which shows the value of ensuring that the data being modeled is representative of the situations the model is intended to predict. There are other ways to judge the accuracy as well: for example, Kansas City passed on 46 of its 72 Super Bowl plays, so a naive baseline that always predicts a pass would have been right about 63.9% of the time. The team-specific model beats that baseline; the general model does not.
Conclusions
This model makes it clear that there are several readily available factors in NFL play-by-play data that can be used, in practice, to predict how a team will call plays. In addition, focusing only on the data from the team in question is more effective than using the general play-by-play data for all teams, at least in this one case.
Limitations
While interesting, this model is fairly limited in its application, as there is a lot of other information that could be included. Some NFL datasets include factors such as the number of players in the box, personnel data, and the formation the offense is in. If this data were included, the accuracy and applicability of the model might improve. In addition, the prediction here is relatively simple, since the actual play is not being predicted. If this model were applied in practice to determine which defensive personnel should be used, the offense might adjust and simply run a different play. Including defensive formations could help improve the applicability of the model.