[Photo: Everton v Tottenham Hotspur - Barclays Premier League]

So a few days ago Dustin Parkes sent this article my way, Colin Wyers’ farewell speech of sorts on his leaving Baseball Prospectus for a dream job with the Houston Astros as a “mathematical modeler”. Wyers made a couple of remarks I think are prescient with regard to the state of football analytics at the moment:

A lot of sabermetrics these days likes to focus on how well a model can predict things. Now, predictive models are great and good, partly because there’s a lot of utility in predicting things, and there’s a lot of intrinsic value in using prediction to validate a model. But there’s also a lot of value in how a model can explain something.

Part of that is because explanatory models are going to be better predictive models in the long run. We can talk about overfitting, and how models based on a limited amount of data (so pretty much every model ever—some models are less limited than others, but there’s always less data than there is life) can come to some pretty odd and incorrect conclusions that break down when applied to additional data. There are statistical tools that you can use to avoid such problems, but creating a model that has explanatory power in addition to predictive power is another way to avoid such problems.

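Wyers’ overfitting point is easy to make concrete. Here’s a minimal sketch (my own illustration with made-up data, nothing from his piece): an over-flexible model and a simple one are fit to the same small, noisy sample, then both are scored on data they haven’t seen.

```python
import numpy as np

rng = np.random.default_rng(0)

# A simple underlying relationship plus noise -- a stand-in for the fact that
# there is always more "life" out there than there is data in the sample.
def noisy_line(x, rng):
    return 2.0 * x + rng.normal(0.0, 1.0, size=x.shape)

x_train = np.linspace(0.0, 1.0, 12)    # the limited data the model gets to see
y_train = noisy_line(x_train, rng)
x_test = np.linspace(0.0, 1.0, 200)    # the additional data it will face later
y_test = noisy_line(x_test, rng)

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")

# The degree-9 fit hugs the 12 training points, but it typically does worse on
# the held-out data -- the "odd and incorrect conclusions" that break down when
# applied to additional data.
```

Cross-validation and regularization are the usual statistical tools for catching this; a model that also explains something is, as Wyers says, another line of defence.
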
I kind of think it’s not necessary to make this distinction explicit, at least in the football world. That’s because, despite the open talk in many quarters about building better betting models, the real work seems to consist of two crucial phases:

  • isolation of some solid predictive metrics (and that can include both counter-intuitive measurements that predict points or next-season goal tallies or whatever AND numbers people think are meaningful but in fact regress to the mean; the split-half sketch after this list is one quick way to tell the two apart)
  • working out a reasonable explanation when those metrics fail to explain an outcome above and beyond the needle swings of random variation.
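
One quick, purely illustrative way to separate the first kind of number from the second is a split-half check: a stat that regresses hard to the mean won’t correlate with itself across two halves of a season. A minimal sketch, with invented figures rather than real Premier League data:

```python
import numpy as np

rng = np.random.default_rng(1)
n_teams = 20
shots_per_half = 250

# Hypothetical setup: every team has the same "true" conversion ability, so any
# spread we observe over a half-season is pure noise -- the kind of number that
# looks meaningful but regresses to the mean.
true_conversion = np.full(n_teams, 0.10)

first_half = rng.binomial(shots_per_half, true_conversion) / shots_per_half
second_half = rng.binomial(shots_per_half, true_conversion) / shots_per_half

r = np.corrcoef(first_half, second_half)[0, 1]
print(f"split-half correlation of conversion rate: {r:.2f}")  # hovers around zero
```

A genuinely predictive metric, by contrast, should hold up reasonably well from one half to the other.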

I can’t imagine a way of approaching advanced stats that doesn’t involve this second phase, even if it’s as simple as pointing to random variation. This is kind of what happened when Phil Birnbaum recently questioned the notion that shot percentages are predictive, based on a couple of seasons in which shot volume had a negative correlation with win percentage in the NHL. Nick Emptage reasonably countered that even good predictive metrics are sometimes themselves subject to “natural variation,” which shouldn’t come as a big surprise.
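
A rough simulation makes Emptage’s point (invented numbers, nothing to do with actual NHL data): give a league a metric that genuinely drives results, pile noise on top, and see how often a short sample points the wrong way.

```python
import numpy as np

rng = np.random.default_rng(2)
n_teams, n_seasons = 30, 1000

# Hypothetical league: a shot-volume style metric that really does drive
# results, buried under plenty of noise (finishing luck, goaltending, schedule).
metric = rng.normal(0.0, 1.0, size=(n_seasons, n_teams))
results = 0.4 * metric + rng.normal(0.0, 1.0, size=(n_seasons, n_teams))

season_r = np.array([np.corrcoef(metric[i], results[i])[0, 1]
                     for i in range(n_seasons)])

print(f"average within-season correlation: {season_r.mean():.2f}")
print(f"seasons where it comes out negative: {(season_r < 0).sum()} of {n_seasons}")

# Even a metric that is predictive on average throws up the odd season where
# the correlation points the wrong way -- natural variation, not a broken metric.
```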

It doesn’t even have to be as abstract as all that, either. This season in the Premier League, Spurs’ high shot ratio (good!) and low shot conversion rate (bad!) have led to various explanations (tactical, statistical, whatever) to fill the gap (it’s not clear to me that we can necessarily discount random variation in Spurs’ ability to convert, for what it’s worth). And if it appears in the end that the variation has to do with the predictive limits of the metric, well then, back to the drawing board for a better number! And so we get final third touches, or goal differentials, or whatever. And then when those fail to predict some important results, we go back to looking for a reasonable explanation.
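
The two numbers in question are trivial to compute, for what it’s worth; the hard part is deciding which one to trust. A minimal sketch with made-up totals in the rough shape of the Spurs situation, not actual season figures:

```python
# Illustrative counts only -- roughly the shape of the situation described
# above, not real season totals.
shots_for, shots_against = 380, 240
goals_for = 28

total_shot_ratio = shots_for / (shots_for + shots_against)  # volume: the "good!" bit
conversion_rate = goals_for / shots_for                     # finishing: the "bad!" bit

print(f"total shot ratio: {total_shot_ratio:.3f}")
print(f"conversion rate:  {conversion_rate:.3f}")

# Shot ratios tend to be the more repeatable of the two, which is exactly why a
# low conversion rate over part of a season is where random variation is
# hardest to rule out.
```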

You see? It’s like a nerdy little dance. The point to be made here, however, is that predictive models provide the perfect entry point for analysts to approach raw data and find something meaningful. Perfecting said models is a long and arduous process, but the better the models, the better the explanations, and eventually, the more useful they become for clubs, the media and supporters. You can’t simply jump the queue and let the raw data guide you on its own terms. Predictive models are like a wonderful filter for noise. And right now that filter is leading to major progress in soccer analytics.