College Football Winners: A Binary Case of Winning and Losing

Introduction

This post is an extension of previous work done in College Football Scores: What Determines Them? Where that post looked at what causes points, this one looks at what causes wins.

So, what are the major determinants of winning or losing a college football game? Rather than predicting scoring as a discrete variable, which alumni, coaches, fans, and the public generally care less about, here we examine winning and losing only. The magnitude of the win or loss therefore makes no difference in the analysis.

The usual story is that sports in general, and football in particular, are difficult to quantify. As explained and reiterated in (Terrell 2008), baseball is the sport that has received the brunt of attention from statisticians and other empiricists (Lewis 2003). This is because credit or blame is easier to assign in baseball than in football: an RBI, home run, or error (in baseball) is easier to credit to one person than an incomplete pass, interception, or TD run (in football). Cause and effect is simpler with the former than with the latter. There are many reasons (from many people) for incomplete passes, interceptions, and TD runs; there are far fewer for RBIs, home runs, and errors (Lewis 2003, Terrell 2008). Some have even begun to systematically analyze the entire endeavor (Mosteller 1997).

This paper is, in some respects, an extension of (Willoughby 2002), which used binary data to explain the winners and losers of Canadian football games. Willoughby looked closely at the effect of turnovers on the win/loss probability and took the time to separate out the different types of turnovers. He concludes that interceptions, on average, have more of an effect on winning and losing than do fumbles. Here, turnovers are not separated. Although this analysis does not account for the magnitude of the win, the easy reply is: so what? Again, the desired outcome is a “win”; for all practical purposes, the magnitude is irrelevant. However, looking at magnitude can be useful in ranking teams and in comparing predicted wins and losses with actual observations (Harville 1977).

In this same vein, it is common knowledge that, especially in college football, home field advantage can be the deciding factor in a game (Terrell 2008). This is confirmed by more general studies of home field advantage in a variety of sports (Schwarz and Barsky 1977). Further prediction has been done in this manner looking at NFL scores (Harville 1980).

In using past data to predict the future there is, especially in sports models, a tendency to weight recent performance too much (P. and S. Gray 1997). Basically, this argument favors a model more akin to a simple moving average than a weighted moving average that gives more importance to more recent games. Sports predictions may do better by assuming that teams will, on average, play close to their average, i.e. reversion to the mean and the law of large numbers.
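The difference between the two averaging schemes can be sketched in a few lines of Python. The scores and weights below are hypothetical and are not taken from the sample; the function names are illustrative only.

```python
def simple_moving_average(values):
    """Every past game counts equally."""
    return sum(values) / len(values)


def weighted_moving_average(values, weights):
    """Each game is scaled by its weight; larger weights on later games favor recency."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical points scored in a team's last four games, oldest first.
recent_scores = [21, 35, 28, 14]

sma = simple_moving_average(recent_scores)
# Hypothetical linearly increasing weights: the latest game counts four times the oldest.
wma = weighted_moving_average(recent_scores, [1, 2, 3, 4])

print(sma, wma)
```

With these made-up numbers the weighted average is pulled toward the weak final game, which is the kind of overreaction to recent form the argument above warns against.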

Description of Data and Methods

Standard OLS is not appropriate for this type of data. A standard multiple linear regression (estimated by OLS) tries to fit a binary dependent variable with a model built for a continuous one, so its fitted values are not constrained to lie between 0 and 1 and cannot be read cleanly as probabilities. For this analysis it is much more appropriate to use a binary model from the beginning.
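A small Python sketch of the problem: a straight-line fit can produce a "probability" above 1, while a logistic transform of a linear index cannot. The intercepts and slopes here are hypothetical values chosen for illustration, not estimates from this dataset.

```python
import math


def linear_fit(rush_yards, intercept=0.1, slope=0.002):
    """Straight-line fitted value, as OLS on a 0/1 outcome would produce (hypothetical coefficients)."""
    return intercept + slope * rush_yards


def logistic_fit(rush_yards, intercept=-3.0, slope=0.018):
    """Logistic transform of a linear index (hypothetical coefficients); always strictly between 0 and 1."""
    index = intercept + slope * rush_yards
    return 1.0 / (1.0 + math.exp(-index))


for yards in (0, 150, 300, 500):
    print(yards, round(linear_fit(yards), 3), round(logistic_fit(yards), 3))
```

At 500 rushing yards the straight line returns a value above 1, which has no meaning as a probability, whereas the logistic value stays inside the unit interval for any input.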

For the sample, I have chosen games played by top 25 NCAA Division I teams during 2008, towards the end of the regular season. In this sample 112 games were played over three weeks. (The sample is small enough to be reported in its entirety and is in the Appendix.)

In this model: RUSH = number of rushing yards, PASS = number of passing yards, TURNO = number of turnovers, PENALT = number of yards penalized, H = home game (1 if played at home, 0 otherwise), and WIN denotes a win or loss (win = 1, loss = 0). H is used as a dummy variable to account for the effect of home field advantage.

This dataset is the same as in the previous post, except that it is extended to include four times as many observations and the column WIN is added.

From this we are given the binary model:
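Assuming a standard logit specification with an intercept and the variables defined above, the model can be written as follows. The coefficient labels \beta_0 through \beta_5 and the game index i are notation introduced here for illustration.

```latex
\Pr(\mathrm{WIN}_i = 1) = \frac{1}{1 + e^{-z_i}},
\qquad
z_i = \beta_0 + \beta_1\,\mathrm{RUSH}_i + \beta_2\,\mathrm{PASS}_i
      + \beta_3\,\mathrm{TURNO}_i + \beta_4\,\mathrm{PENALT}_i + \beta_5\,H_i
```

Equivalently, the log-odds of a win, \ln[\Pr(\mathrm{WIN}_i = 1) / \Pr(\mathrm{WIN}_i = 0)], equal the linear index z_i.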

Accordingly, a logit model is used in estimation, as the dependent variable is categorical and discrete rather than measured and continuous.

As in the previous analysis, and in line with intuition, RUSH, PASS, and H should have positive coefficients, while TURNO and PENALT should have negative ones. This makes sense if more rushing yards, more passing yards, and home-field advantage lead to a higher probability of winning, while more turnovers and penalties lead to a higher probability of losing.
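To make the estimation step concrete, here is a minimal Python sketch of fitting a logit by maximum likelihood with Newton-Raphson. It uses one regressor and a tiny synthetic dataset invented for illustration; it is not the post's sample and not the R routine used for the results, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear index to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit_1d(x, y, iterations=25):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton-Raphson on the Bernoulli log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Score (gradient of the log-likelihood) and information matrix (negative Hessian).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve the 2x2 system (information matrix) * delta = score.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# SYNTHETIC data: rushing yards in hundreds and a 0/1 win flag, made up for this sketch.
x = [0.5, 0.8, 1.0, 1.2, 1.5, 1.8, 2.0, 2.2, 2.5, 3.0]
y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

b0, b1 = fit_logit_1d(x, y)
print("intercept:", b0, "slope:", b1)
```

In this synthetic example wins are more common at higher rushing totals, so the fitted slope comes out positive, matching the sign expected for RUSH above. The regressor is scaled to hundreds of yards only to keep the numbers well behaved in such a small example.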

Results

Using R and estimating a binomial logit model:

The output results are:

Statistically speaking, RUSH, PASS, and TURNO are all significant at the 5% level, while PENALT and H are not. The signs of the coefficients are as expected: more rushing yards, more passing yards, and home-field advantage are associated with a higher probability of a win, while turnovers and penalties are associated with a higher probability of a loss. Further, the McFadden pseudo r-squared is .3811.
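For reference, McFadden's pseudo r-squared compares the log-likelihood of the fitted model with that of an intercept-only model: one minus their ratio. A minimal Python sketch, using made-up outcomes and fitted probabilities rather than the post's data:

```python
import math


def log_likelihood(outcomes, probabilities):
    """Bernoulli log-likelihood of 0/1 outcomes given predicted win probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(outcomes, probabilities)
    )


def mcfadden_r2(outcomes, fitted_probabilities):
    """1 - lnL(model) / lnL(null), where the null model predicts the overall win rate."""
    win_rate = sum(outcomes) / len(outcomes)
    ll_null = log_likelihood(outcomes, [win_rate] * len(outcomes))
    ll_model = log_likelihood(outcomes, fitted_probabilities)
    return 1.0 - ll_model / ll_null


# Hypothetical outcomes and fitted probabilities, for illustration only.
outcomes = [1, 1, 0, 1, 0, 0]
fitted = [0.9, 0.8, 0.2, 0.7, 0.3, 0.1]
print(mcfadden_r2(outcomes, fitted))
```

A value of 0 means the regressors add nothing beyond the overall win rate; values closer to 1 mean the fitted probabilities sit closer to the observed outcomes. The sketch assumes the sample contains both wins and losses, since otherwise the null log-likelihood is zero and the ratio is undefined.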

Specifically, the estimated coefficients are 0.018 for RUSH, 0.011 for PASS, -0.622 for TURNO, -0.00098 for PENALT, and 0.259 for H. Because these are logit coefficients, each one is strictly the change in the log-odds of a win for a one-unit change in the variable, not the change in the probability itself: an additional rushing yard raises the log-odds of a win by 0.018, an additional passing yard raises them by 0.011, one turnover lowers them by 0.622, one penalty yard lowers them by 0.00098, and playing at home raises them by 0.259. The slope coefficients evaluated at the means, which translate these figures into changes in probability, are the ones reported in the Conclusion. Comparing the two sets of figures, turnovers and home-field advantage stand out: when evaluated at the means, both matter less in determining the probability of a win or loss than their raw coefficients suggest.
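The step from a logit coefficient to a change in probability can be sketched as follows. In a logit model the marginal effect of a regressor is its coefficient multiplied by p(1 - p), where p is the predicted win probability at the point of evaluation. The coefficients below are the ones reported in this post; the baseline probability of 0.79 is an assumption made purely for illustration, since the post does not report the predicted probability at the means.

```python
def marginal_effect(coefficient, win_probability):
    """Approximate change in P(win) per one-unit change in a regressor of a logit model."""
    return coefficient * win_probability * (1.0 - win_probability)


# Logit coefficients as reported in the post.
coefficients = {
    "RUSH": 0.018,
    "PASS": 0.011,
    "TURNO": -0.622,
    "PENALT": -0.00098,
    "H": 0.259,
}

# ASSUMPTION: baseline win probability, illustrative only, not reported in the post.
assumed_p = 0.79

effects = {name: marginal_effect(b, assumed_p) for name, b in coefficients.items()}
for name, effect in effects.items():
    print(f"{name}: {effect:+.5f}")
```

The scale factor p(1 - p) is largest, at 0.25, when p = 0.5 and shrinks as the baseline probability moves toward 0 or 1, so the same coefficient implies a smaller probability change for a heavy favorite. Under the assumed baseline the results land close to the percentage-point figures listed in the Conclusion. For a dummy variable such as H the derivative is only an approximation; the exact effect is the difference in predicted probability between H = 1 and H = 0.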

All this is to say that teams would, on average, primarily do better by making sure they play their tougher opponents at home (which is harder to control) and by not turning the ball over (which they have more control over). It is, however, interesting, and consistent with (Terrell 2008), that penalties play little to no role in the probability of a win or loss. The two best ways for a college football coach to improve the chances of winning a game are to lower turnovers per game and to develop a potent running game.

Not addressed in the analysis above is the time-series component: the games in the dataset were played over three separate weeks. The issue that then arises is: over this short period, is there any difference in the data or in team performance from one week to the next? Since the time frame under consideration is so small, with only three periods observed over three weeks, it is probable that there will be no significant results. Running the data again using a time component with the new model of:

Shows:

Clearly, the additional time variables are insignificant. Again, this is to be expected. To be more accurate, future analysis should include data over an entire season or a number of seasons, so that any marked change in the observations or coefficients from one time period to the next can be seen. Said differently: after controlling for time, how do the coefficients of interest change, and do they do so significantly? This is most likely not the case, but it is nearly impossible to determine over as small a time frame as is used here. However, although the time parameters are insignificant, the time-varying model does have a higher pseudo r-squared (hinting at a better model fit) and a higher AIC (hinting at a worse model fit). Given that the time variables are statistically insignificant, that the AIC of the time-varying model is higher, and that the simplest model is usually best, the second model should most likely be rejected in favor of the first.
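A higher pseudo r-squared together with a higher AIC is not a contradiction: the pseudo r-squared can only rise when regressors are added, while AIC charges two points per extra parameter. A short Python sketch with hypothetical log-likelihoods, and assuming the time component adds two parameters (for example, two week indicators), shows the pattern. None of these numbers come from the post's output.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2 lnL. Lower is better."""
    return 2 * n_parameters - 2 * log_likelihood


def mcfadden_r2(ll_model, ll_null):
    """McFadden pseudo r-squared from model and intercept-only log-likelihoods."""
    return 1.0 - ll_model / ll_null


# HYPOTHETICAL log-likelihoods, chosen only to illustrate the pattern.
ll_null = -60.0
ll_base = -40.0   # intercept + 5 regressors = 6 parameters
ll_time = -39.2   # assumed 2 extra time parameters = 8 parameters

print("pseudo r-squared:", mcfadden_r2(ll_base, ll_null), mcfadden_r2(ll_time, ll_null))
print("AIC:", aic(ll_base, 6), aic(ll_time, 8))
```

The extra parameters raise the log-likelihood by 0.8, which nudges the pseudo r-squared up, but they cost 4 AIC points while returning only 1.6, so the larger model's AIC is worse. In general, k added parameters must raise the log-likelihood by more than k to lower the AIC.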

Further research, not constrained by time or budget, should look at this data over the last 20 to 30 years. This would allow a complete analysis of time variation. Additionally, the turnovers component should be broken down into fumbles and interceptions to examine whether the two differ significantly in their contribution to a win or loss (as was done in Willoughby 2002). Other interesting factors that could be incorporated into the analysis include temperature on the field, grass vs. turf, and the winning traditions of both the individual teams and head coaches.

Conclusion

From this analysis, a number of conclusions can be drawn. Many of them are similar to results that have been seen before in studies of a similar type, but different nature. In relation to increasing the probability of winning a college football game:

On average, penalties play little part in increasing or decreasing the probability of a win or loss – 100 penalty yards decrease the probability of a win by 1.6 percentage points.
On average, 100 passing yards increases the probability of a win by 17.9 percentage points.
On average, 100 rushing yards increases the probability of a win by 29.6 percentage points.
On average, playing at home increases the probability of a win by 4.2 percentage points.
Finally, on average, turning the ball over decreases the probability of a win by 10.2 percentage points.

So: Play at home. Develop a running game. And don’t turn the ball over!
