This past weekend in Germany, Hoffenheim defeated Schalke 3-2. An unremarkable result, perhaps, but Hoffenheim enjoyed only 27% possession, according to

You might wonder whether it’s possible Schalke didn’t manage to do anything with their possession. Perhaps they took a defensive posture and passed the ball near or in their own half.

Yet Schalke managed 23 shots to Hoffenheim’s 6. If we calculate Schalke’s total shots ratio (James Grayson’s beautifully simple predictive metric explained here), which is Schalke’s shots divided by all shots in the match, it comes out to 23/29, or .79. Sustained over the course of an entire season, that figure is commensurate with Champions League qualification. (TSR, as Grayson details, is a good measure of ball control, and of all the available metrics it performs best in relation to table position over the long term.) If we take Schalke’s shots on target ratio (SOTR), it’s an equally impressive .71.
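For anyone who wants to check the arithmetic, here is a minimal sketch of the TSR calculation using the shot counts from this match. The function name `total_shots_ratio` is my own label for illustration, not anything taken from Grayson's work.

```python
def total_shots_ratio(shots_for: int, shots_against: int) -> float:
    """Share of all shots in a match (or season) taken by one team."""
    total = shots_for + shots_against
    if total == 0:
        raise ValueError("no shots recorded, ratio is undefined")
    return shots_for / total


# Shot counts from the Hoffenheim v Schalke match discussed above.
schalke_tsr = total_shots_ratio(23, 6)
hoffenheim_tsr = total_shots_ratio(6, 23)

print(f"Schalke TSR:    {schalke_tsr:.2f}")     # 23 / 29
print(f"Hoffenheim TSR: {hoffenheim_tsr:.2f}")  # 6 / 29
```

The two ratios always sum to 1, so Hoffenheim's TSR for the match is simply whatever Schalke's leaves over.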

You and I both know why Schalke lost despite dominating in every meaningful way: this is one game out of 34 for the season. Schalke need not panic (and they won’t); if they perform as well over the course of the whole season, chances are they will finish closer to the top of the table.

But this isn’t entirely a matter of small sample size (although it mostly is). There is also the ‘goal problem’ detailed today in a post from Chris Andersen at Soccer by the Numbers. In keeping with simple, lovely metrics, Andersen carefully points out the importance of both sample size and distribution. He notes that while shots are generally evenly distributed over a large sample of games, shots on target are slightly less evenly distributed, and goals are “rare and certainly not ‘normal’ (in the statistical sense).” Andersen goes on from this example to write what could be the two most important paragraphs in soccer analytics this year:

When we take these individual match numbers of shots, accurate shots, and goals – of which there were 32,789, 10,396, and 2,954 across the three seasons we collected data for – and put them in relation to each other, it turns out that the odds of any one shot actually being on target was 32%, while the odds of an accurate shot finding the back of the net was similarly around 30 % (28% to be exact). Plenty of teams shoot enough to score, but very few of them consistently score.

Clearly, “normal” football isn’t always normally distributed. As a general rule, the more common an event on the pitch is, the more the distribution looks like a bell-shaped curve (graph the frequency of passes per match and you’ll see what I mean). This has important implications: using some of the most common statistical techniques to deal with these data may be problematic, standard (canned) versions of techniques like correlations and linear regression assume normally distributed data. The stuff we care about the most – goals – is the least “normal” of all the events above. But as importantly, think about what the picture above tells us: there is enormous slippage from one stage of the goal production process to the next. Understanding why and how this slippage occurs should be important questions for any budding analyst.
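The "slippage" in the quoted passage can be reproduced directly from the three totals it cites. This is only a sketch of the arithmetic; the variable names are mine, and the goals-per-shot figure is derived from the quoted totals rather than stated in the quote itself.

```python
# Totals quoted above, across the three seasons of data.
shots = 32_789
shots_on_target = 10_396
goals = 2_954

# Stage-by-stage conversion through the "goal production process".
on_target_rate = shots_on_target / shots     # shot -> accurate shot
goal_rate_on_target = goals / shots_on_target  # accurate shot -> goal
goals_per_shot = goals / shots               # shot -> goal, end to end

print(f"Shots on target per shot: {on_target_rate:.1%}")
print(f"Goals per shot on target: {goal_rate_on_target:.1%}")
print(f"Goals per shot overall:   {goals_per_shot:.1%}")
```

The last line is the sobering one: on these totals, roughly nine shots in a hundred end up in the net.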

We know a bit of the answer to this question: a team in better control of the ball is likely to finish higher in the table, in part because ball control makes scoring more likely. It may not matter how this translates to a single, ninety-minute game, particularly if a high table finish broadly correlates with consistently high TSR. Obviously the ideal would be figuring out consistent in-game means of increasing shot-to-goal ratios to prevent a freak result, but that may be a pipe dream (the EPL Index believes it may be onto something, although their sample is a mere 9 games).

In any case, Schalke’s loss to Hoffenheim is a true outlier. It was statistically unlikely: Schalke controlled the ball, used it well, and still lost. If you were to plot the likelihood of a Hoffenheim win based on the post-game possession and shots statistics alone, chances are it would be low, perhaps (pulling this out of my arse here) in the 20% range. But, as with any weather forecast, that doesn’t denote certainty, only an empirically sound probability. The romantics can still shout, “Football, bloody hell!”, and the nerds can still feel confident in their methods.
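For the curious, here is one rough way to put a number on that guess. It is a toy model and very much an assumption of mine, not an established method. It treats every shot as an independent attempt with the same chance of going in, taken from the league-wide totals quoted earlier (2,954 goals from 32,789 shots), and it ignores shot quality, game state and everything else that makes football football. So don't expect it to land on my 20% figure.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k goals from n independent shots."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def outcome_probabilities(shots_a: int, shots_b: int, p: float):
    """Return (P(A wins), P(draw), P(B wins)) under the toy model."""
    win_a = draw = win_b = 0.0
    for goals_a in range(shots_a + 1):
        prob_a = binom_pmf(goals_a, shots_a, p)
        for goals_b in range(shots_b + 1):
            joint = prob_a * binom_pmf(goals_b, shots_b, p)
            if goals_a > goals_b:
                win_a += joint
            elif goals_a == goals_b:
                draw += joint
            else:
                win_b += joint
    return win_a, draw, win_b


# Assumption: one flat conversion rate for every shot, from the quoted totals.
P_GOAL_PER_SHOT = 2_954 / 32_789

schalke_win, draw, hoffenheim_win = outcome_probabilities(23, 6, P_GOAL_PER_SHOT)
print(f"Schalke win:    {schalke_win:.1%}")
print(f"Draw:           {draw:.1%}")
print(f"Hoffenheim win: {hoffenheim_win:.1%}")
```

Whatever number comes out, the point stands: the side with 6 shots wins some of the time, and a single match is exactly where that sort of thing shows up.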