
The time change has destroyed most of my brain this week, but I do have some vague and pretentious thoughts on two different approaches to football analytics and the promise and limits of both.

The last major panel I attended at the MIT Sloan Sports Analytics Conference was on betting analytics. The panel itself was kind of a young old boys’ club of inveterate ex-poker-players-turned-sports-bettors picking on Matthew Holt, the Director of Race & Sports Data for Cantor Gaming, for how ‘sharp’-friendly his company was, and for some shifty moves on the betting lines from time to time. It was gloriously inside-baseball stuff.

But this panel got my head racing about two distinct approaches to understanding how to get the most from predictive stats in soccer. I mentioned them briefly last week.

The first involves the traditional Moneyball approach—using stats to uncover inefficiencies either on the pitch or in player recruitment to give teams a market advantage (for a time at least). This is what many in the field are most interested in, because they can sell this information or their expertise to clubs.

The second involves studying football as it is in order to predict winners and losers. This field is of obvious interest both to potential soccer “sharps” and to betting firms interested in setting accurate betting lines.

It’s this second option that I think currently carries the most utility in soccer analytics. Those working in this area seem to prefer beginning with simple models and adding further layers of complexity to tweak their approach. They aren’t interested in improving football but in understanding how football works. As Chris Anderson wrote in a blog post last year, one of the most intriguing questions in all of football is the “enormous slippage from one stage of the goal production process [shots, shots on target, goals] to the next.”

This isn’t a trivial question, particularly in light of this year’s Premier League season, in which Manchester United have overcome a low Total Shots Ratio to run away with the league title. Man United are simply scoring more goals on fewer shots. Once we establish this, however, it becomes the basis for a more in-depth picture.
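That slippage can be made concrete with two simple rates. Here is a minimal sketch, using hypothetical season totals rather than real Premier League figures, showing how a team with the worse Total Shots Ratio can still out-score a shot-dominant side through a higher conversion rate:

```python
def total_shots_ratio(shots_for, shots_against):
    """TSR: the share of all shots in a team's matches taken by that team."""
    return shots_for / (shots_for + shots_against)

def conversion_rate(goals, shots_for):
    """Goals scored per shot taken."""
    return goals / shots_for

# Hypothetical season totals: Team A has a modest TSR but hot finishing,
# Team B dominates the shot count but converts poorly.
teams = {
    "Team A": {"shots_for": 520, "shots_against": 480, "goals": 78},
    "Team B": {"shots_for": 640, "shots_against": 360, "goals": 65},
}

for name, t in teams.items():
    tsr = total_shots_ratio(t["shots_for"], t["shots_against"])
    conv = conversion_rate(t["goals"], t["shots_for"])
    print(f"{name}: TSR={tsr:.3f}, conversion={conv:.3f}, goals={t['goals']}")
```

Team A’s TSR is only 0.52 against Team B’s 0.64, but its conversion rate is nearly half again as high, so it finishes with more goals; the interesting question is whether that conversion gap persists or regresses.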

This is precisely what this Wooly Jumpers for Goalposts post attempted to do in making sense of Gareth Bale’s impact for Tottenham this season. By starting from the macro and moving meaningfully toward the micro, we can get a better picture of just what exactly may be going on.

And in some cases the micro may not even be necessary. The Wallpapering Fog author recently shopped his predictive Premier League results model at Opta’s HQ, a model that appears to have been pretty successful over the course of this season. Perhaps counter-intuitively, he is not a fan of ‘Big Data models’:

For the record, even with all the different things that happen in a football game, you can still fairly effectively predict results (and win at the bookies for the last three weeks in a row) using only…

Pass completion rates
Goalscoring rates
Player dispossession rates
A measure of how good the opposition are at winning the ball back

And essentially, that’s it.

Of course the model isn’t perfect and there are tons of improvements to be made, but the crucial point is that if I’d started with Opta’s event-level statistics, I’d be nowhere. I’d probably still be trying to pull that feed into a useful database and understand any underlying relationships in the data at all.

Again, this is common sense at play. Big Data may be useful over the long run in understanding why certain players develop into senior players, but the blogger’s approach reflects Simon Gleave’s favourite dictum: we need to walk before we can run.

Finally, predictive analytics is rabidly interested in appropriate sample sizes. Theodore Knutson recently wrote a post on the importance of looking abroad to further improve your predictive models, for example. What I find interesting, though, is the care taken in this approach to use samples that fit your model:

That said, picking your leagues is important. First, you obviously need correct inputs for whatever model you have constructed for each league you want to use. Beyond that, look at the league profiles you have data for and figure out which ones seem to fit a “normal” profile. Examining the data, it looks like EPL, Bundesliga, La Liga, and Serie A are comparatively similar. Ligue 1 in France is an outlier due to much lower totals and average goal expectations than the other leagues, while Eredivisie is an outlier in the other direction. Your model MIGHT work there, but it also might be that those leagues are so different that it might cause problems, so be aware of that.

This is empiricism ‘by accident.’ If you don’t want to mess up your gambling model, you’re going to want to know specifically why Ligue 1 and Eredivisie are outliers, as Knutson has done here. Again, this is all in the name of making money, but the empirical knowledge remains just as valuable.
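The outlier check Knutson describes can be run with nothing more than goals-per-game averages. A quick sketch with illustrative (not real) figures, flagging any league more than one standard deviation from the cross-league mean:

```python
import statistics

# Illustrative goals-per-game averages, not actual season data.
goals_per_game = {
    "EPL": 2.80,
    "Bundesliga": 2.93,
    "La Liga": 2.87,
    "Serie A": 2.64,
    "Ligue 1": 2.37,      # low-scoring outlier in the quoted post
    "Eredivisie": 3.27,   # high-scoring outlier
}

mean = statistics.mean(goals_per_game.values())
sd = statistics.stdev(goals_per_game.values())

for league, gpg in goals_per_game.items():
    flag = "OUTLIER" if abs(gpg - mean) > sd else "ok"
    print(f"{league:<11} {gpg:.2f}  {flag}")
```

With these numbers, Ligue 1 and Eredivisie fall outside one standard deviation while the other four leagues cluster together, which is exactly the “normal profile” screening the quote recommends before pointing a model at a new league.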

The more distance between me and Boston, the more I think this aspect of analytics in soccer is the most valuable.

Comments (2)

  1. Hi Richard,

    I think you raise some very interesting points for everyone to consider. I’m especially thrilled (upon reflection, thrilled might be a bit strong) that you’re bringing up the differences between predictive and other models. It’s an important distinction, and one a lot of people get wrong (like me, for instance).

    I do want to push back a bit on the idea that predictive models are different from others in that they are concerned with understanding how football works vs improving it. Predictive models are NOT concerned with how football works AT ALL (at least, they shouldn’t be), they are concerned only with prediction. Where people get in trouble is in thinking that the best predictions come from understanding every causal factor and including them in the model. Usually predictive models are based on “what happened before, plus some small set of adjustments (these would be drawn from those causal factors)”.

    This distinction is what makes your question of what fans want so interesting. What DO fans want from analytics? Do they want to know who is more likely to win a game, or a league, or score the most goals? As you note, they certainly do if they’re betting, and those people are looking for predictive models. But if people want to understand and test theories about how the game works (i.e. is a team better off defensively if its center backs have played together longer), predictive models aren’t going to help with that. Of course, the answer probably is “both”, but I’m guessing they’re more interested in the latter than the article suggests.

    The take home is that the first step in modeling football is determining the questions you want to answer. Predictive models aren’t a solution to the problems of complex data in football analysis, they are a tool for predicting results. If you are interested in explanation, rather than prediction, they aren’t going to help you.

    There IS a whole separate, but related, set of issues implied in here about model specification and complexity. Suffice it to say that overcomplexity is a problem in either model type. So researchers should be careful about overspecifying any type of model, for instance. But that is not a question that is solved by deciding whether or not to use a predictive model. It’s worth reading about parsimony if you haven’t yet.

    Keep up the good work!


    PS. You don’t really get into this, but the article raises other questions about the outside pressures that drive participation in football analytics, and especially public football analytics. My one big question would be: is it likely that gambling, where proprietary models are extremely valuable, will ever be a meaningful driver of public soccer analysis? These people have a lot of incentive to keep their knowledge private. I’d guess that the big pressure driving public analysis is, as you note, that people want to work for teams or media outlets, and public blogging is a good way to audition for those positions. Given that teams would seem more interested in analysis than in prediction (with some notable exceptions, especially injuries), I’d guess we’ll still see a lot of effort in those areas.

    PPS. You pay commenters by the word, right? To whom do I send my bill?

    • Ha ha! Can pay in beer.

      On your central point, I would agree. What I wanted to get across I suppose was not that predictive models are a better empirical approach per se, but that they tend to provide a better foundation for those interested in how football actually works. For the beginner football empiricist (if there were such a thing), the basic understanding of relative anchors that are popular in predictive models might be a better place to begin to get a grasp of the broad, cross-historical factors that make football football. But you’re right…predictive models aren’t going to take them very far on their own—they’re not quite the same thing.

      As for your PS, you raise a point I had meant to include in this piece and forgot (blame the blogging mindset). You’re absolutely right in that betting analytics is just as proprietary as club-centered analytics. That is a significant drawback, to be sure. Funnily enough, however, I do believe there are many bloggers really interested in propagating this stuff for its own sake. Sometimes I read certain betting analytics sites and am surprised they would publish such interesting information so freely. Ultimately I think a lot of the drive for a better model, even one ostensibly designed to make money from betting, is out of love of knowledge and good old-fashioned curiosity.

      But I’m an idealist…
