
The time change has destroyed most of my brain this week, but I do have some vague and pretentious thoughts on two different approaches to football analytics and the promise and limits of both.

The last major panel I attended at the MIT Sloan Sports Analytics Conference was on betting analytics. The panel itself was kind of a young old boys' club of inveterate ex-poker-players-turned-sports-bettors picking on Matthew Holt, the Director of Race & Sports Data for Cantor Gaming, for how ‘sharp’-friendly his company was, and for some shifty moves on the betting lines from time to time. It was gloriously inside-baseball stuff.

But this panel got my head racing about two distinct approaches to understanding how to get the most from predictive stats in soccer. I mentioned them briefly last week.

The first involves the traditional Moneyball approach—using stats to uncover inefficiencies either on the pitch or in player recruitment to give teams a market advantage (for a time at least). This is what many in the field are most interested in, because they can sell this information or their expertise to clubs.

The second involves studying football as it is in order to predict winners and losers. This field is of obvious interest both to potential soccer “sharps” and to betting firms interested in drawing sharply accurate betting lines.

It’s this second option that I think currently carries the most utility in soccer analytics. Those working in this area seem to prefer beginning with simple models and adding further layers of complexity to tweak their approach. They aren’t interested in improving football but in understanding how football works. As Chris Anderson wrote in a blog post last year, one of the most intriguing questions in all of football is the “enormous slippage from one stage of the goal production process [shots, shots on target, goals] to the next.”

This isn’t a trivial question, particularly in light of this year’s Premier League season, in which Manchester United have overcome a low Total Shots Ratio to run away with the league title. Man United are simply scoring more goals on fewer shots. Once we establish this, however, it becomes the basis for a more in-depth picture.
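For anyone who wants to see the arithmetic, here is a rough sketch in Python. It assumes the usual definition of Total Shots Ratio (a team's shots divided by all shots in its matches) and treats the "slippage" between stages as simple ratios. The function names are my own, and any numbers you feed in are up to you; nothing here is real league data.

```python
def total_shots_ratio(shots_for, shots_against):
    """Share of all shots in a team's matches that the team itself took."""
    total = shots_for + shots_against
    if total == 0:
        raise ValueError("no shots recorded")
    return shots_for / total


def stage_rates(shots, shots_on_target, goals):
    """Ratios between stages of the goal production process.

    shots -> shots on target -> goals
    """
    if shots == 0 or shots_on_target == 0:
        raise ValueError("need at least one shot and one shot on target")
    return {
        "on_target_rate": shots_on_target / shots,             # shots that hit the target
        "on_target_conversion": goals / shots_on_target,       # on-target shots that score
        "goals_per_shot": goals / shots,                       # overall conversion
    }
```

A team with a middling `total_shots_ratio` but an unusually high `goals_per_shot` is the kind of case described above: fewer shots, more goals. The ratios only flag that something is going on; they don't say why.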

This is precisely what this Wooly Jumpers for Goalposts post attempted to do in examining Gareth Bale’s impact for Tottenham this season. By starting from the macro and moving meaningfully toward the micro, we can get a better picture of just what exactly may be going on.

And in some cases the micro may not even be necessary. The Wallpapering Fog author recently shopped his predictive PL results model at Opta’s HQ, one that appears to have been pretty successful over the course of this season. Perhaps counter-intuitively, he is not a fan of ‘Big Data models’:

For the record, even with all the different things that happen in a football game, you can still fairly effectively predict results (and win at the bookies for the last three weeks in a row) using only…

Pass completion rates
Goalscoring rates
Player dispossession rates
A measure of how good the opposition are at winning the ball back

And essentially, that’s it.

Of course the model isn’t perfect and there are tons of improvements to be made, but the crucial point is that if I’d started with Opta’s event-level statistics, I’d be nowhere. I’d probably still be trying to pull that feed into a useful database and understand any underlying relationships in the data at all.

Again, this is common sense at play. Big Data may be useful over the long run in understanding why certain players develop into senior players, but the blogger’s approach reflects Simon Gleave’s favourite dictum: we need to walk before we can run.

Finally, predictive analytics is rabidly interested in getting its samples right. Theodore Knutson wrote a recent post on the importance of looking abroad in further improving your predictive models, for example. What I find interesting, though, is the care this approach takes to use samples that fit your model:

That said, picking your leagues is important. First, you obviously need correct inputs for whatever model you have constructed for each league you want to use. Beyond that, look at the league profiles you have data for and figure out which ones seem to fit a “normal” profile. Examining the data, it looks like EPL, Bundesliga, La Liga, and Serie A are comparatively similar. Ligue 1 in France is an outlier due to much lower totals and average goal expectations than the other leagues, while Eredivisie is an outlier in the other direction. Your model MIGHT work there, but it also might be that those leagues are so different that it might cause problems, so be aware of that.
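One crude way to run the kind of league-profile check described in that passage is to compare each league's average goals per game with the median across leagues and flag the ones that sit too far away. The sketch below is my own illustration, not Knutson's method: the function name, the 0.25-goal threshold, and the sample figures are all invented placeholders.

```python
from statistics import median


def flag_outlier_leagues(avg_goals_per_game, threshold=0.25):
    """Flag leagues whose average goals per game sits far from the median.

    avg_goals_per_game: dict mapping league name -> average goals per game.
    threshold: allowed distance from the median (an arbitrary assumption).
    Returns a dict mapping each flagged league to "low" or "high".
    """
    mid = median(avg_goals_per_game.values())
    flagged = {}
    for league, avg in avg_goals_per_game.items():
        if avg < mid - threshold:
            flagged[league] = "low"
        elif avg > mid + threshold:
            flagged[league] = "high"
    return flagged


# Invented placeholder figures, NOT real league averages.
sample = {
    "League A": 2.70,
    "League B": 2.80,
    "League C": 2.75,
    "League D": 2.65,
    "League E": 2.30,
    "League F": 3.20,
}
print(flag_outlier_leagues(sample))
```

A flag like this only tells you which leagues to be wary of. The more valuable step is the one that follows: working out why they differ before trusting a model there.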

This is empiricism ‘by accident.’ If you don’t want to mess up your gambling model, you’re going to want to know specifically why Ligue 1 and Eredivisie are outliers, as Knutson has done here. Again, this is all in the name of making money, but the empirical knowledge remains just as valuable.

The more distance between me and Boston, the more I think this aspect of analytics in soccer is the most valuable.