## The State of Analytics: A Soccer Statistics Style Guide

Last week I wrote a primer of sorts on soccer analytics, but I think more is needed in order to stop stats abuse while at the same time not warding off writers from using stats completely. So here is a very basic statistics style guide in the form of questions you might ask yourself before citing any soccer statistic. This can hopefully be used by regular fans, bloggers and beat writers alike.

Is my statistic repeatable?

You see a player with an incredible conversion rate over several games. You get really excited! This guy must be amazing! Worth a bajillion pounds! Everything is finally going to be okay! Well, it might be. But if you’re going on conversion rates alone, we know they tend to regress to the mean. On their own, they’re not always the best marker of a talented player.

The point is there are many available statistics on footballers at the moment, but we have a very poor idea of which ones are repeatable, i.e. which ones indicate future quality and which are just random noise, and, for those that do carry a signal, how far they’re inflated or deflated by luck (random variation).

Because of this, it is a bad idea to point to a single player statistic, be it pass accuracy or take-ons won or whatever, as “proof” of their future worth as a player. It might be better just to point out that the numbers alone are impressive/bad at the moment and rejoice/wail.
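If regression to the mean feels abstract, a toy simulation can make it concrete. The sketch below is not a model of real football: it assumes every player has exactly the same true conversion rate, and the names and numbers (`simulate_conversion`, 200 players, 30 shots per spell, a 10% rate) are made up for illustration. It picks the ten "hottest" finishers from one spell of shots and checks how the same players do in the next spell.

```python
import random


def simulate_conversion(n_players=200, shots=30, true_rate=0.10, seed=1):
    """Simulate two spells of shots for players who all share one true rate.

    The shared rate is an assumption for illustration: any gap between
    players in this simulation is pure luck.
    """
    rng = random.Random(seed)

    def spell_rate():
        goals = sum(rng.random() < true_rate for _ in range(shots))
        return goals / shots

    first = [spell_rate() for _ in range(n_players)]
    second = [spell_rate() for _ in range(n_players)]
    return first, second


def top_group_before_after(first, second, top_n=10):
    """Average rate of the top_n players in spell one, then in spell two."""
    ranked = sorted(range(len(first)), key=lambda i: first[i], reverse=True)
    top = ranked[:top_n]
    before = sum(first[i] for i in top) / top_n
    after = sum(second[i] for i in top) / top_n
    return before, after


first, second = simulate_conversion()
before, after = top_group_before_after(first, second)
print(f"top 10 in spell one: {before:.3f}")
print(f"same players in spell two: {after:.3f}")
```

Because nobody in this simulation is actually better than anybody else, the standout group from the first spell should drift back towards the shared rate in the second. Real players do differ in skill, so the real-world effect is a partial pull towards the average rather than a total one.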

Is my statistic meaningful?

There is a famous story (which I kind of think may be apocryphal) in which Alex Ferguson sold Man United’s Dutch defender Jaap Stam because the data showed he didn’t tackle enough. On the surface this makes a lot of sense; what does a defender do if not tackle opposition players? Except that Stam was such a brilliant defender, with such good positional sense, that he rarely needed to resort to what is generally a last-gasp defensive measure.

This is a gross simplification; after all, certain teams might arguably play in a style that requires expert, last ditch defending from the centrebacks. But the general point remains: you should never cite a statistic unless you know fairly well it means what you think it means.

Considering analysts argue about this kind of thing on a daily basis, particularly when it comes to goals and assists, you should be very careful about assuming that a good number on a particular measurement equals ‘good player’ or a bad number equals ‘bad player.’

One question you might ask yourself to find out is:

What is the in-game context of the statistic?

To give you an idea of what I mean, let’s consider the pass accuracy percentage for a single player. There is a tendency for example for some writers to scan down a team stats sheet from a single match, notice a high pass completion percentage, and then cite it as evidence of a good performance from that player.

Within a single match a 92% pass accuracy stat looks great, but it needs context. As in: how many passes did the player attempt? If they only tried eleven passes as a sub for ten minutes, that stat won’t be very useful. Where on the pitch did the player attempt their passes? Perhaps the vast majority of passes were square, or back to the central defender or the goalkeeper. That’s great for keeping possession perhaps, but not necessarily if the player’s job was to assist in attack. How was the opposition set up, and were they inferior? Superior? Equal? Perhaps a player’s pass completion rate was impressive because they weren’t properly marked, or because the opposition allowed them more space in a particular area of the pitch.
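One way to put a rough number on the "how many passes?" question is a confidence interval around the completion rate. The sketch below uses the Wilson score interval, which is my choice of method rather than anything standard in football writing, and the pass counts are hypothetical: 10 of 11 for the ten-minute sub (the closest you can get to 92% on eleven passes) and 55 of 60 for a player who lasted the full match.

```python
import math


def wilson_interval(completed, attempted, z=1.96):
    """Wilson score interval for a completion rate (z=1.96 is roughly 95%)."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    p = completed / attempted
    z2 = z * z
    denom = 1 + z2 / attempted
    centre = (p + z2 / (2 * attempted)) / denom
    half = z * math.sqrt(
        p * (1 - p) / attempted + z2 / (4 * attempted ** 2)
    ) / denom
    return centre - half, centre + half


# Hypothetical pass counts, chosen only to illustrate the point.
sub_low, sub_high = wilson_interval(10, 11)
starter_low, starter_high = wilson_interval(55, 60)

print(f"10 of 11 passes: {sub_low:.0%} to {sub_high:.0%}")
print(f"55 of 60 passes: {starter_low:.0%} to {starter_high:.0%}")
```

Both players sit at about 91 or 92% on the stats sheet, but the plausible range for the substitute is far wider than for the starter. Note that this only speaks to the volume question. It says nothing about where the passes went or how the opposition was set up.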

So before you cite a statistic to make your point, try to consider it in the wider context of the football match, the player role, the opposition formation, etc. Just think about it from a common sense perspective. You might end up finding the particular statistic is in fact irrelevant to your argument, and that old fashioned observations might do the trick instead.

Is my sample size big enough to fit my argument?

You watch three Chelsea matches. You notice that shot volume of the forwards is woeful, the shot ratios are bad, and the key pass rate from the attacking midfielders is very low. So you pen a solid op-ed citing these numbers and declare that the team is in dire straits and in need of a major personnel change.

You may very well turn out to be correct, but the reader/listener needs more information with such a small sample of games. What were the opposition sides? Were they playing at home? Away from home? What was the game state (the score line for the match)? In a three game sample, these things can hugely affect the data one way or another.

As a rule, the smaller the sample, the more prone the numbers are to being skewed either through random variation, or from other factors listed above, or things as banal and unnoticed as the weather or the travel time or the gap between matches.
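To get a feel for how jumpy a three-match sample can be, here is a small simulation. It is a sketch under loud assumptions: the team's underlying shot rate never changes (14 shots a match on average, a number I picked for illustration), each of the 90 minutes is treated as an independent chance to shoot, and the function names are hypothetical. Opposition, venue, game state and weather are all ignored, so any swing you see here comes from random variation alone.

```python
import random


def simulate_season(matches=38, mean_shots=14, seed=7):
    """Shots per match for a team whose underlying rate never changes.

    Each of the 90 minutes is an independent chance to take a shot,
    which is a simplification made purely for illustration.
    """
    rng = random.Random(seed)
    p = mean_shots / 90
    return [sum(rng.random() < p for _ in range(90)) for _ in range(matches)]


def rolling_means(values, window):
    """Average of every run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


season = simulate_season()
three_match = rolling_means(season, 3)

print(f"season average: {sum(season) / len(season):.1f} shots per match")
print(f"3-match averages range from {min(three_match):.1f} "
      f"to {max(three_match):.1f}")
```

The spread between the best and worst three-match stretch is the thing to look at. If a team that has not changed at all can look that different from one stretch to the next, an op-ed built on three matches needs a lot of supporting context.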

And, as we learned earlier, there are some stats which seem more luck-driven over the long term anyway. There are in fact very few relatively short term statistics that tell us anything about quality over the long term (the ones we do have are precious, truly precious). It’s for this reason I would generally avoid making any declarative, long-term judgements about a team based on data from a small sample.

Am I using statistics to provide context, or to provide a cause or explanation?

With all the available match data these days, it can be tempting to go immediately to the game statistics, pick out any obvious patterns, and then use them to reverse engineer an ‘explanation’ for the result. For example, you might be stumped as to why Manchester City steamrolled Man United, so you stare at an “average position” chart for both teams on a stats website, compare them, and then use the visual as the basis for some tactical theory as to why City dominated (if you think this is implausible, I read one author attempt to do this today).

If you do this, you’re going to end up in a heap of trouble. It’s like trying to give someone highway directions just by citing the number of rest stops there are along the way. In a game where luck counts for a lot in a single match and in which goals are relatively rare, it is a fool’s errand to rely on statistics alone to “explain” an isolated result. Remember: statistics, even ones as definitive as a score line, don’t often speak for themselves, and should never be used without reference to the lived match played in real time.

It’s generally a far better idea to use match statistics sparingly and to provide context. For example, you might be convinced Aston Villa played the long ball against Chelsea, and so you check to see whether the most frequent pass combination in that particular match was between Brad Guzan and Christian Benteke. You should never cite it as absolute proof or without further thought and care (for example, were these cross field passes?), and you should never, EVER cherry pick only those stats that support your argument and not mention others that seem to meaningfully contradict it, but here the statistic in the context of a match might help your case.

Am I using a stat as a potential pathway for further discussion or as an argumentative cul-de-sac?

By now, you might be reluctant to use any stats to make any claims at all. And yes, I often think stats are like salt: used in moderation, they bring out the flavour of an argument. However, I think there is in fact another way to use statistics without running into the many traps above.

I once asked Michael Cox about his use of match data in his tactical analyses, and he said he liked certain elements of match data not because they said anything with certainty, but because they were interesting.

In the analytics world, this is often how stats are treated. There is no perfect model, no perfect explanatory metric. Each measurement must be used in context and compared against the others. There is also room, however, for discussion of potential trends, and for the airing of theories that the data might refute in a few matches’ time.

Once you abandon the need for total certainty in your arguments, the inclination to use stats as defensive brick walls rather than possible pathways to open up further discussion diminishes. You learn over time which pitfalls to avoid and when the use of certain numbers is inappropriate, yes, but you also learn the power of using the data not as a period, but a question mark.