
As I mentioned on yesterday’s podcast, before I went to the MIT Sloan Sports Analytics Conference in Boston, I felt pretty good about the state of ‘amateur’ analytics. Sure, football isn’t a game of discrete events, but an endlessly complex combination of individual skill and team dynamics where goals follow a Poisson distribution. By its nature, the sport is not amenable to easy analysis.

But as Infostrada analyst Simon Gleave has wisely said time and time again, we need to learn to walk before we learn to run. And so in the last few months I’ve become a major fan of the work of analysts like 11tegen11 and James Grayson, who demonstrated the beautiful simplicity in a metric as plain as Total Shots Ratio. This kind of thing gets at the way football, from a statistical standpoint, works. It isn’t concerned with telling coaches to put out a certain formation or scouts to buy a certain player.

In other words, amateur soccer analytics tends to be more empirical than deontological, which is a pretentious way of saying it’s about what is rather than what should be. The former approach is, for now at least, the only one open to those of us outside the proprietary data wall thrown up by firms and clubs to protect their competitive advantage.
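For anyone who hasn’t met Total Shots Ratio, it is just a team’s share of all the shots taken in its matches: shots for, divided by shots for plus shots against. Here is a minimal sketch of that arithmetic. The function name and the shot counts in the usage note are made up for illustration and are not taken from 11tegen11’s or Grayson’s own work.

```python
def total_shots_ratio(shots_for: int, shots_against: int) -> float:
    """Return a team's share of all shots taken in its matches.

    TSR = shots_for / (shots_for + shots_against)
    The result runs from 0.0 (the team never shoots) to 1.0
    (the opposition never shoots).
    """
    total = shots_for + shots_against
    if total == 0:
        # No shots by either side, so the ratio is undefined.
        raise ValueError("TSR is undefined when no shots were taken")
    return shots_for / total
```

With hypothetical season totals of 540 shots for and 360 against, `total_shots_ratio(540, 360)` comes out at 0.6, meaning the team took 60% of the shots in its games.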

The thing about the MIT Sloan Sports Analytics Conference, though, is that it draws people very much interested in analytics as a means to improve elements of a particular sport based on various unseen statistical probabilities. And in this realm, football lags well behind sports like baseball, basketball and hockey. Venture capitalist Mitch Lasky wrote this on his blog Bizpunk yesterday, which sums up the current state of soccer analytics well:

When Bill James and the SABR-metricians were making their contributions to baseball in the late 70′s and 80′s, they were a fringe group working with public data. They worked for decades without the taint of team sponsorship, before Billy Beane and Moneyball dragged them into public view. Now that sports analytics has gone mainstream, a lot of important work is being done by in-house number-crunchers, working in secret, in order to provide clubs with competitive advantage. The lack of community and lack of broad data access may be retarding analytics innovation in sports like basketball (as this Slate piece discusses) and almost certainly in soccer. I’ll discuss the soccer problem in another post, but it was remarkable how little progress there has been developing “game models” for soccer, from the public perspective.

Lasky’s stance was one I held when I started writing about soccer analytics, and one I have slowly moved away from since. After this weekend in Boston, however, I’m beginning to see the limitations of working with simple data.

For example, I attended several paper presentations that demonstrated what could be done with a rich data set in other sports. In one, MLB consultant Adam Guttridge presented an ‘automated prospect model’ which used historical data to identify individual player metrics from A and AA baseball that tend to project well and improve over a certain age range, say from 18 to 21. The results cleaved closely to the top 100 prospects offered up by publications like Baseball Prospectus, and even predicted the jump in rankings for players like Oscar Taveras.

In another, Michael Schuckers and James Curro presented a Total Hockey Rating model. Their method is as simple as it is intriguing:

Our approach considers every event recorded by the NHL and assigns value to those events based upon the probability that they will lead to a goal. To evaluate players we determine which players were on the ice for which events and assess the impact of each player adjusting for their teammates and their opponents on the ice with them. Recent work has shown that where a player starts their shift (a shift in hockey is the continuous period when a player is on the ice) has an impact upon the events for which the player will be on the ice. This effect is known as Zone Starts and we explicitly model this effect as part of our ratings. Further, we include a home-ice effect. The result of all of this is the change in probability of a goal per event for each player.

The researchers used a two-season sample with detailed on-ice positioning from the NHL’s Real Time Scoring System. Essentially, they allowed for a twenty-second window following a particular action in a particular area before a goal is scored, and then determined the probability that these isolated events precede the puck going in the net.

This approach works well in hockey, of course, because the rink is relatively small and the goal is under near-constant threat, so the probabilities are relatively high. But one could easily imagine an X,Y positioning graph of the final third of a football pitch and a calculation of what kinds of events are likely to precede a goal (Sam Green, who also attended SSAC, has done some work on this with Opta).
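To make the windowing idea concrete, here is a rough sketch of how one might count how often a given kind of event in a given zone is followed by a goal within twenty seconds. This is only the counting step as I understood it, not the Schuckers and Curro model itself, which also adjusts for teammates, opponents, zone starts and home ice. The function name, the event tuples and the zone labels are all assumptions of mine for illustration.

```python
from bisect import bisect_right
from collections import defaultdict

WINDOW_SECONDS = 20  # the twenty-second window described above


def goal_probability_by_event(events, goal_times, window=WINDOW_SECONDS):
    """Estimate P(goal within `window` seconds) for each (kind, zone) pair.

    events     -- iterable of (time_in_seconds, kind, zone) tuples
    goal_times -- iterable of times (in seconds) at which goals were scored
    Teams are ignored here to keep the sketch short.
    """
    goal_times = sorted(goal_times)
    totals = defaultdict(int)    # how often each (kind, zone) occurred
    followed = defaultdict(int)  # how often a goal followed inside the window

    for time, kind, zone in events:
        key = (kind, zone)
        totals[key] += 1
        # Index of the first goal scored strictly after this event.
        i = bisect_right(goal_times, time)
        if i < len(goal_times) and goal_times[i] - time <= window:
            followed[key] += 1

    return {key: followed[key] / totals[key] for key in totals}
```

The same counting would work for football if each event carried an X,Y zone in the final third, though the probabilities would be far smaller and a much bigger sample would be needed before the numbers meant anything.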

When considering these incredible possibilities, it’s easy to get in a funk about the relative paucity of publicly-available soccer data. I left Boston so interested in exploring these models in a football context that I came away with the only half-flighty idea of taking some statistical modelling courses and pitching a data company or a club to let me try and do some of this myself.

There are some bright lights, however, for those of us working outside the data-collection machine. Simple predictive work like Grayson’s, for example, is good in a betting context, which is less concerned with analytics that improve teams and more concerned with figuring out the probability one team will beat another on any given Saturday. And though football is still at an early stage in analytics development, there were one or two paper presentations that seemed to provide a little more progress, as in the Field Vision study for Premier League players.
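As a rough illustration of the betting-style question, here is one textbook way to turn two goal expectations into win, draw and loss probabilities, leaning on the Poisson assumption about goals mentioned at the top. To be clear, this is not Grayson’s method. It treats the two teams’ goals as independent Poisson counts, and the function names and the scoring rates in the usage note are invented for the example.

```python
from math import exp, factorial


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k goals when goals arrive at the given mean rate."""
    return rate ** k * exp(-rate) / factorial(k)


def match_probabilities(home_rate: float, away_rate: float, max_goals: int = 10):
    """Return (home_win, draw, away_win) under independent Poisson scoring.

    Scorelines above `max_goals` per side are ignored, so the three
    numbers sum to very slightly less than 1.
    """
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win
```

Calling `match_probabilities(1.5, 1.1)` with those made-up rates gives the home side the largest share of the three outcomes. Where the rates come from is the hard part, and that is where a predictor like shots ratio would have to earn its keep.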

Even so, the Soccer Analytics panel, which featured major bright lights in the field like Prozone’s Blake Wooster and Cornell University professor Chris Anderson, sounded some depressing notes about the lack of progress on the club side. The attendance at the panel itself, though, was impressive, and one wonders if some of the students in attendance might have sensed a major opportunity in what is a growth sport in North America. To that end, the number of MLS analysts in attendance was also very encouraging.

My faith in the progress of meaningful soccer analytics was shaken by Sloan, but not broken. I’ll be returning to some major themes I picked up there in the weeks and maybe even months ahead…