Superbowl Pass Prediction

Problem Statement

Could San Francisco have predicted whether Kansas City was going to run or pass using only the NFL’s own play by play data?

With the Superbowl a month in the past, this analysis may be a little late, but it does give the opportunity to see how accurate a model would have been using the actual Superbowl data as the testing set.

Implementation

Libraries

Here I will be using the nflscrapR package, which I found to be quite simple to use. There are a couple of quirks that we will get into later, but overall it was straightforward.
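The code below also leans on a few general purpose packages. The original post does not show its library calls, so this is my assumed setup:

## Assumed package setup
library(nflscrapR) # play by play scraping functions
library(dplyr)     # data wrangling and the %>% pipe
library(modelr)    # add_predictions()
## knitr and kableExtra are called via :: where needed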

Data Import

## Import play by play data for the 2019 season
pbp_2019 <- season_play_by_play(2019)

Kansas City Predictions

To start with, I am going to look at predicting only Kansas City's offense, using only data from previous Kansas City games.

Data Capture and Cleaning

The data capture and cleaning are fairly straightforward for this analysis, as nflscrapR returns quite clean data. The code chunk below walks through the preparation and modeling using comments.

## Select unique GameIDs from the play by play data for Kansas City
needed_game_id_kc <- pbp_2019 %>% 
  filter(DefensiveTeam == "KC") %>% 
  select(GameID) %>% unique()

## Select only games for KC and remove the Season column.
## Season is removed because it is not included in the post season data.
kc_pbp <- pbp_2019 %>% 
  filter(GameID %in% unlist(needed_game_id_kc)) %>% 
  select(-Season)


## This is a quirk of nflscrapR: the post season data can't be read
## in directly alongside the regular season data. Instead, we first
## scrape the IDs of all post season games.
post_season <- scrape_game_ids(2019, type = "post")

## The game list is then filtered for KC, and the Superbowl is removed
## since it will serve as the test set rather than training data.
post_season_kc_ids <- post_season %>% 
  filter((home_team == "KC" | away_team=="KC") & game_id != "2020020200") %>% 
  mutate(GameID = as.character(game_id)) %>% select(GameID) %>% unlist()

## Declared before loop to allow assignment within
kc_pbp_pre_and_post <- kc_pbp

## Combine all of the post season games and regular season games
for (each in post_season_kc_ids) {
  temp <- game_play_by_play(GameID = each)
  kc_pbp_pre_and_post <- rbind(kc_pbp_pre_and_post, temp)
}

## Filter for only offensive plays
kc_season_offense_pbp <- kc_pbp_pre_and_post %>% 
  filter(posteam == "KC")

## There are multiple types of plays listed within the PlayType variable.
## The only ones that are needed are Run, Pass, and Sack. Plays like
## extra points and punts were excluded since they are not relevant to the 
## final predictions (i.e. we are not looking to predict fake punts).
desired_plays <- c("Run", "Pass", "Sack")

## Play Data is filtered to match only the desired plays.
## The data is then converted to a binary prediction variable 
## where sacks and passes are considered pass attempts
kc_model_data <- kc_season_offense_pbp %>% 
  filter(PlayType %in% desired_plays) %>% 
  mutate(pass_model_var = case_when(PlayType == "Run" ~ 0,
                                    PlayType == "Pass" ~ 1,
                                    PlayType == "Sack" ~ 1))

## Convert the binary variable to a factor, and treat down as a factor
## so the model estimates a separate effect for each down (this is what
## produces the down2/down3/down4 terms in the summary below)
kc_model_data$pass_model_var <- as.factor(kc_model_data$pass_model_var)
kc_model_data$down <- as.factor(kc_model_data$down)

## Only the needed variables are included in the final data.
## I went through some variable selection experimentation, but did not
## include it within this blog post. For example, field position was
## included in the analysis, but did not offer much predictive power 
## over the other variables that were included.
kc_model_data <- kc_model_data %>% 
  select(pass_model_var, TimeSecs, down, ydstogo, ScoreDiff)

## Fit a logistic regression predicting pass (1) vs run (0)
pass_mod_kc <- glm(pass_model_var ~ TimeSecs + down + ydstogo + ScoreDiff, 
                family = "binomial", data = kc_model_data)

## Add prediction values back into data so we can evaluate this later
kc_model_data <- kc_model_data %>% 
  add_predictions(pass_mod_kc, type = "response")

Modeling and Evaluation

I went with a logistic regression for this prediction, since we are using a variety of variables to predict a binary outcome (run vs. pass); the model estimates the log odds of a pass as a linear function of those variables. Now we can take a look at the model coefficients and p-values.

## Output model summary
summary(pass_mod_kc)
## 
## Call:
## glm(formula = pass_model_var ~ TimeSecs + down + ydstogo + ScoreDiff, 
##     family = "binomial", data = kc_model_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2874  -1.1116   0.5801   0.9733   2.0399  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.048e+00  3.222e-01  -6.355 2.08e-10 ***
## TimeSecs     2.011e-04  7.804e-05   2.578  0.00995 ** 
## down2        9.251e-01  1.737e-01   5.325 1.01e-07 ***
## down3        2.152e+00  2.536e-01   8.487  < 2e-16 ***
## down4        3.081e-01  7.277e-01   0.423  0.67205    
## ydstogo      2.082e-01  2.743e-02   7.589 3.21e-14 ***
## ScoreDiff   -3.259e-02  8.559e-03  -3.808  0.00014 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1420.2  on 1074  degrees of freedom
## Residual deviance: 1252.6  on 1068  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 1266.6
## 
## Number of Fisher Scoring iterations: 5

From the output, it appears that all of the included variables are statistically significant given the presence of the other variables, except for the 4th down indicator. The coefficients are not directly interpretable in their current form, however. Since logistic regression coefficients are the log of the odds ratio, they need to be exponentiated to see what each coefficient means.

## Output exponentiated coefficients
exp(coef(pass_mod_kc)) %>% knitr::kable() %>% 
  kableExtra::kable_styling(bootstrap_options = "striped") %>%
  kableExtra::row_spec(0, bold = T)
term          exp(coef)
(Intercept)   0.1290350
TimeSecs      1.0002012
down2         2.5221472
down3         8.6059214
down4         1.3607891
ydstogo       1.2314592
ScoreDiff     0.9679319

Now the interpretation of the coefficients is more clear. For example, each additional point of score differential multiplies the odds of a pass by about 0.97. That means a team down by a touchdown has pass odds roughly 0.97^(-7) ≈ 1.26 times those of a tied game. In addition, on 2nd down a team is about 2.5 times as likely to pass as on 1st down. These results basically match how you would expect this to play out intuitively, so that is an encouraging sign (confirmation bias aside).
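As a quick check, that touchdown multiplier can be computed directly from the fitted model (a worked example, not part of the original output):

## Odds multiplier for trailing by a touchdown (ScoreDiff = -7),
## relative to a tied game; works out to about 1.26
exp(coef(pass_mod_kc))["ScoreDiff"]^(-7)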

Model Analysis

Next we will evaluate how this model performs using the Superbowl as the test set. It would probably make more sense to use cross validation with random training and testing splits (a sketch of what that might look like follows below), but I'm trying to show the 49ers that if they hired me they could win the next Superbowl. As the logistic model outputs a probability between 0 and 1 that a play is a pass, a threshold needs to be selected for when to predict a pass. Since we are simply trying to predict which of the two options will happen, and we have an equal tolerance for false positives and false negatives, we will set the threshold at 0.5.
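For reference, here is a minimal sketch of what a cross validated accuracy estimate could look like for the KC model. This is illustrative only; the fold count and seed are my own arbitrary choices and were not part of the original analysis.

## Illustrative 5-fold cross validation of the KC model
set.seed(42)
folds <- sample(rep(1:5, length.out = nrow(kc_model_data)))
cv_acc <- sapply(1:5, function(k) {
  ## Refit on everything outside fold k
  fit <- glm(pass_model_var ~ TimeSecs + down + ydstogo + ScoreDiff,
             family = "binomial", data = kc_model_data[folds != k, ])
  ## Predict pass probabilities on fold k
  p <- predict(fit, newdata = kc_model_data[folds == k, ],
               type = "response")
  ## Accuracy at the 0.5 threshold
  mean((p >= 0.5) == (kc_model_data$pass_model_var[folds == k] == "1"),
       na.rm = TRUE)
})
mean(cv_acc)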

## Pull play by play data from the Superbowl
sb_pbp <- game_play_by_play(GameID = '2020020200')

## Filter SB data to match the model data
kc_sb_pbp <- sb_pbp %>% filter(posteam == "KC") %>% 
  filter(PlayType %in% desired_plays) %>% 
  mutate(pass_model_var = case_when(PlayType == "Run" ~ 0,
                                    PlayType == "Pass" ~ 1,
                                    PlayType == "Sack" ~ 1)) %>% 
  select(pass_model_var, TimeSecs, down, ydstogo, ScoreDiff)

## Convert to factors, matching the training data
kc_sb_pbp$pass_model_var <- as.factor(kc_sb_pbp$pass_model_var)
kc_sb_pbp$down <- as.factor(kc_sb_pbp$down)

## Add prediction values from model back to data.
kc_sb_play_predictions <- kc_sb_pbp %>% 
  add_predictions(pass_mod_kc, type = "response") %>%
  mutate(pass_predicted = case_when(pred >= 0.5 ~ 1,
                                    pred < 0.5 ~ 0))

## Calculate prediction accuracy using the real data vs predictions
accuracy <- table(kc_sb_play_predictions$pass_model_var, 
                  kc_sb_play_predictions$pass_predicted)

## Change Row and Column Names
row.names(accuracy) <- c("Actual Run", "Actual Pass")
colnames(accuracy) <- c("Predicted Run", "Predicted Pass")

## Output Table
accuracy %>% knitr::kable()
              Predicted Run   Predicted Pass
Actual Run                9               17
Actual Pass               2               44
## Calculate Proportion
accuracy_prop_kc <- sum(diag(accuracy))/sum(accuracy)

The output table shows the predictions versus the actual results of the plays. Calculating the proportion shows that the model was 73.6% accurate at predicting when Kansas City would pass in the Superbowl. What does that really mean, though, and is it even good? To understand that, the model can be compared to a more generalized model that uses all NFL data, to see whether it is even worth modeling the Chiefs' games by themselves.

Generalized Model

The previous steps are repeated with some minor changes to include data for the entire NFL season and post season.

## Select all play by play data
all_pbp_2019 <- pbp_2019 %>% select(-Season)

## Scrape the post season game IDs (same quirk as before)
post_season <- scrape_game_ids(2019, type = "post")

## Keep all post season games except the Superbowl.
post_season_ids <- post_season %>% 
  filter(game_id != "2020020200") %>% 
  mutate(GameID = as.character(game_id)) %>% select(GameID) %>% unlist()

## Initialize so the for loop can combine pre and post pbp
pbp_pre_and_post <- all_pbp_2019

## Add all game data together.
for (each in post_season_ids) {
  temp <- game_play_by_play(GameID = each)
  pbp_pre_and_post <- rbind(pbp_pre_and_post, temp)
}

## Clean data in the same way as the KC data
model_data <- pbp_pre_and_post %>% filter(PlayType %in% desired_plays) %>% 
  mutate(pass_model_var = case_when(PlayType == "Run" ~ 0,
                                    PlayType == "Pass" ~ 1,
                                    PlayType == "Sack" ~ 1))

## Convert to factors for modeling (down as a factor again gives a
## separate effect for each down)
model_data$pass_model_var <- as.factor(model_data$pass_model_var)
model_data$down <- as.factor(model_data$down)

## Prepare data in the same manner as the KC data
model_data <- model_data %>% 
  select(pass_model_var, TimeSecs, down, ydstogo, ScoreDiff)

## Model in the same manner as the KC data
pass_mod <- glm(pass_model_var ~ TimeSecs + down + ydstogo + ScoreDiff, 
                family = "binomial", data = model_data)

model_data <- model_data %>% 
  add_predictions(pass_mod, type = "response")

The model parameters are shown below.

## Output model summary
summary(pass_mod)
## 
## Call:
## glm(formula = pass_model_var ~ TimeSecs + down + ydstogo + ScoreDiff, 
##     family = "binomial", data = model_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3390  -1.1093   0.5983   1.0183   2.2355  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.365e+00  4.736e-02 -28.820  < 2e-16 ***
## TimeSecs    -5.463e-05  1.133e-05  -4.822 1.42e-06 ***
## down2        8.317e-01  2.842e-02  29.262  < 2e-16 ***
## down3        1.967e+00  3.989e-02  49.308  < 2e-16 ***
## down4        1.291e+00  9.314e-02  13.859  < 2e-16 ***
## ydstogo      1.356e-01  3.942e-03  34.392  < 2e-16 ***
## ScoreDiff   -3.463e-02  1.159e-03 -29.882  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 45101  on 33391  degrees of freedom
## Residual deviance: 40774  on 33385  degrees of freedom
##   (124 observations deleted due to missingness)
## AIC: 40788
## 
## Number of Fisher Scoring iterations: 4

The coefficients are again exponentiated so they can be interpreted.

## Output interpretable coefficients
exp(coef(pass_mod)) %>% knitr::kable() %>% 
  kableExtra::kable_styling(bootstrap_options = "striped") %>%
  kableExtra::row_spec(0, bold = T)
term          exp(coef)
(Intercept)   0.2554199
TimeSecs      0.9999454
down2         2.2972596
down3         7.1469689
down4         3.6359752
ydstogo       1.1452022
ScoreDiff     0.9659611

The coefficients are roughly in line with those we saw in the Kansas City model, which is to be expected. We will now use this model to predict the results in the Superbowl.

## Calculate projected Pass vs Run outcomes
sb_play_predictions <- kc_sb_pbp %>% 
  add_predictions(pass_mod, type = "response") %>%
  mutate(pass_predicted = case_when(pred >= 0.5 ~ 1,
                                    pred < 0.5 ~ 0))
## Calculate accuracy again
accuracy <- table(sb_play_predictions$pass_model_var, 
                  sb_play_predictions$pass_predicted)
## Change Row and Column Names
row.names(accuracy) <- c("Actual Run", "Actual Pass")
colnames(accuracy) <- c("Predicted Run", "Predicted Pass")

## Output Table
accuracy %>% knitr::kable()
              Predicted Run   Predicted Pass
Actual Run               10               16
Actual Pass              14               32
## Calculate Proportion
accuracy_prop <- sum(diag(accuracy))/sum(accuracy)

The new model resulted in an accuracy of 58.3%. That is a difference of 15.3 percentage points, which shows the value of ensuring that the data being modeled is representative of the situations it is intended to predict. There are other ways to judge the accuracy as well. For example, these results could be compared to a naive baseline based on the overall proportion of runs to passes for Kansas City, as sketched below.
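As a rough point of reference, that baseline can be read straight off the Superbowl confusion table above: Kansas City passed on 46 of its 72 plays, so always predicting a pass would have been right about 63.9% of the time. A quick (hypothetical) way to compute it:

## Naive baseline: always predict the more common play (pass).
## Its accuracy is simply the proportion of actual passes.
mean(kc_sb_play_predictions$pass_model_var == "1")

By that yardstick the Kansas City specific model clears the naive baseline, while the generalized model falls short of it.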

Conclusions

This model makes it clear that there are several readily available factors in NFL play by play data that can be used to predict how a team will call plays. In addition, focusing only on the data that represents that team is more effective than using the general play by play data for all teams, at least in this one case.

Limitations

While interesting, this model is fairly limited in its application, as there is a lot of other information that could be included. Some NFL datasets include factors such as the number of players in the box, personnel data, and the formation the team is in. If this data were included, the accuracy and applicability of the model might be improved. In addition, the prediction here is relatively simple, since the actual play call is not being predicted. If this model were applied in practice to determine which defensive personnel should be used, the offense might adjust and simply run a different play. Including defensive formations could help improve the applicability of the model.
