Permit me yet another meta moment.

I’ve been reading a lot of Twitter angst lately over the inaccessibility of football analytics to a wider football audience. While I generally think it’s a good idea to write on the subject in such a way as to make it understandable for as many football people as possible, I get a little apprehensive on the subject of “primers.”

It brings to mind an old That Mitchell and Webb Sound sketch, in which David Mitchell hosts a science radio show that is so dumbed down it’s insulting. I don’t want to insult anybody, but at the same time I think it would be of some value to at least talk a little about a way to appreciate analytics without becoming overwhelmed.

Part of the problem for me is that “analytics” is often defined so widely that it includes anything that involves either a number or a visualization. I think this may be a sneaky workaround: “Don’t worry, it’s not all math, analytics is infographics too!” And that leads to a host of problems, which I’ll get to in a moment.

So here’s the textbook definition (by which I mean the wiki definition):

“Analytics is the discovery and communication of meaningful patterns in data.”

That’s it. That’s exactly what you should think of when you hear the word “analytics.” But I want to unpack this a little, particularly that seemingly innocuous but vitally important word: meaningful.

To give you an idea of what I’m getting at, think of any article you’ve ever read that makes use of the latest and coolest stats Opta/Infostrada/Bloomberg/Prozone has to offer. Perhaps it focuses on a single game. The author might note that one side had more possession of the ball, or that a single player completed X% of their passes. The author will then use this data to draw a conclusion about the state of the team as a whole.

Is this analytics? To me at least, the answer is no.

This might seem counter-intuitive. “Hey you idiot, the author looked at the data, and noticed a few things that pointed to a poor performance. Doesn’t that fit the description? Why do you have to be so depressing?”

The data used in the article might say something about an individual game perhaps, but remember: our (semi) fictional author is using it to say something about the quality of the team or of a single player for the foreseeable future. That’s a very important distinction, one that is glossed over almost every day by someone writing about football, somewhere, much to my extreme annoyance, er, interest. And it leads to repeated errors of judgment.

Those errors arise because we don’t actually know whether these awful, no good, very bad numbers in fact mean anything in the long term. For example, it’s possible that low possession or pass completion rates within one or several games mean only that a player or team had a bad day out, or a pair of bad games, or a bad month, or played against superior opposition, and is otherwise in good shape.

The obvious problem is: how do we tell the difference between a team or player experiencing a run of “bad luck”, and a team or player that is in serious trouble?

This is where analytics comes in.

Careful statistical analysis of large data samples can help bridge the gap between raw information (match possession stats, individual player stats) and knowledge (is a team lucky or talented?). The metaphor used by many analysts for this, including Nate Silver, is finding the signal in the noise. This is a concept taken from radio and the signal-to-noise ratio—to what degree does a deliberate, meaningful signal cut through the meaningless background noise or static? Too often in football punditry what is cited as signal may in fact be just background noise. Some work has to be done in many cases before we can assume isolated numbers tell us anything meaningful.

Analytics has several mathematical and statistical tools at its disposal to do this. A football analyst might, for instance, use linear regression analysis to determine whether there is a relationship between two variables over a big sample of data (sort of). For example: is possession positively linked to a higher points total over the long term? And if so, how many games would a team have to post these numbers for before we could reasonably say they’re good at keeping possession?
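To make the regression idea a little more concrete, here is a minimal sketch in Python. The possession and points figures are invented purely for illustration (they are not real league data), and fit_line is a hypothetical helper name of my own. It fits an ordinary least-squares line and reports the correlation coefficient, which is one simple way of asking "is there a relationship here?"

```python
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept.

    Returns (slope, intercept, r), where r is the Pearson correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r

# Made-up season averages for six hypothetical teams (NOT real data).
possession_pct = [42.0, 46.0, 50.0, 53.0, 57.0, 61.0]
season_points = [38, 41, 52, 49, 66, 71]

slope, intercept, r = fit_line(possession_pct, season_points)
print(f"points ~ {slope:.2f} * possession + {intercept:.2f} (r = {r:.2f})")
```

With only six invented teams this proves nothing, of course. The point is the shape of the question: a slope and a correlation over a large sample, rather than a hunch from one match.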

This is, for example, what James Grayson did with total shots ratio, a measurement of shot dominance—whether teams are consistently taking more shots than they’re conceding. We know through Grayson’s work that TSR becomes pretty consistent for individual teams over 4-6 matches, and that it correlates pretty well with end of season points totals. That’s why you see it around a lot these days.
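For the curious, TSR as it is usually defined is just a team's shots divided by all shots in its matches (its own plus its opponents'). A minimal sketch, assuming that definition; the match figures are hypothetical and total_shots_ratio is my own name for the helper:

```python
def total_shots_ratio(shots_for, shots_against):
    """Share of all shots in a team's matches that the team itself took."""
    total = shots_for + shots_against
    if total == 0:
        raise ValueError("no shots recorded")
    return shots_for / total

# Hypothetical five-match run: (shots taken, shots conceded) per match.
matches = [(14, 9), (11, 12), (17, 6), (9, 10), (15, 8)]
shots_for = sum(f for f, _ in matches)
shots_against = sum(a for _, a in matches)
tsr = total_shots_ratio(shots_for, shots_against)
print(f"TSR over {len(matches)} matches: {tsr:.3f}")
```

A value above 0.5 means the team is outshooting its opponents over the sample; below 0.5 means it is being outshot.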

As I’ve said countless times before, however, this doesn’t mean teams should simply shoot more often to run up their end-of-season points total. Remember: this kind of statistical relationship is a correlation. The one doesn’t necessarily directly cause the other. It just tells us that one thing that is highly valued in football (league points) correlates well over the long term with a specific kind of in-game behaviour (shooting more than opponents). It suggests that teams which control the ball well shoot more than their opposition on a regular basis. In other words, TSR points to something far more important than shots: dominance.

That’s why these models are more like waypoints than final destinations in the search for meaning. It’s the same principle with PDO, which simply adds team shot conversion rates and save percentage rates. We know from statistical analysis that shot and save percentages regress quickly to the mean. This means that while they can spike up or down in the near term, they tend to move closer to the average over time. In other words, by themselves, they seem to be on the whole more a function of random variation (or, more crudely, luck) than intention (or talent, skill, whatever you want to call it).
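One way to see why small samples of PDO are so noisy is to simulate teams that are all exactly equal and watch how much their PDO still varies. The sketch below is an illustration under loud assumptions, not a model of real football: rates are measured per shot on target, every team converts 30% of them and saves 70%, and each side gets five shots on target per match. The function names are hypothetical.

```python
import random
from statistics import pstdev

def pdo(goals_for, shots_on_target_for, goals_against, shots_on_target_against):
    """Shot conversion rate plus save rate, both as fractions (average is about 1.0)."""
    conversion = goals_for / shots_on_target_for
    save_rate = 1 - goals_against / shots_on_target_against
    return conversion + save_rate

def simulate_pdo(rng, n_matches, shots_per_match=5, true_conversion=0.30):
    """PDO for a team whose results are pure chance at a fixed conversion rate."""
    n_shots = n_matches * shots_per_match
    # Each shot on target is an independent coin flip (an assumption).
    goals_for = sum(rng.random() < true_conversion for _ in range(n_shots))
    goals_against = sum(rng.random() < true_conversion for _ in range(n_shots))
    return pdo(goals_for, n_shots, goals_against, n_shots)

rng = random.Random(42)  # fixed seed so the run is repeatable
short_run = [simulate_pdo(rng, n_matches=5) for _ in range(1000)]
full_season = [simulate_pdo(rng, n_matches=38) for _ in range(1000)]
print(f"spread of PDO after 5 matches:  {pstdev(short_run):.3f}")
print(f"spread of PDO after 38 matches: {pstdev(full_season):.3f}")
```

Even though no simulated team is better than any other, the five-match PDO values are spread much more widely than the full-season ones. That shrinking spread is what "moving closer to the average over time" looks like when there is nothing but chance at work.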

Sometimes these models fail to predict certain outcomes in ways that are so far off they’re not likely to be a freak occurrence. This doesn’t mean they’re “wrong” or even “flawed,” but perhaps in need of further refinement. A good analyst will think about possible reasons for the failure, and then repeat the process with a narrower set of variables. So instead of checking to see if there is a correlation between total shots ratio and final points totals, for example, you check for a correlation between total shots ratio when teams are tied and final points totals. And maybe it correlates even better. This kind of thing happens all the time.

In other words, you can’t just cite numbers to make your case and call it a day. Analytics demands context, it demands interpretation, it requires evidence (but not necessarily absolute proof). It works on the basis of skepticism and constant revision. It rarely works in absolutes, and it never takes good as good enough.

And that’s it really. “The discovery and communication of meaningful patterns in data.” Simple. Now get reading…

Recommended Reading

If there’s any proof you can enjoy reading and learning about statistics without a postgraduate degree in math, I’m it. To get a grasp of the statistical concepts involved, however, I heartily recommend Thinking, Fast and Slow by Daniel Kahneman, The Success Equation by Michael Mauboussin, and The Black Swan by Nassim Nicholas Taleb. They’re all very accessible and are great resources for understanding these concepts, many of which are thoroughly counter-intuitive.

For people to follow, the list is long and difficult to maintain. Colin Trainor has a great start-up resource though, so you can subscribe to his Twitter list, and expand from there. There are many more besides, most of whom are included in Simon Gleave’s followees. I might curate my own list or lists later on if I can be bothered…