## Roto-Relevant Research: Accuracy, Simplicity and the Extremes

The statistician George Box once wrote that “Essentially, all models are wrong, but some are useful.” … Perfect models only occur when severe constraints can be imposed.

Since you can’t have a perfect model, the designer and user must decide what level of accuracy is acceptable for the purposes for which the model will be used. — Patriot, “All Models are Wrong,” at Walk Like a Sabermetrician

If you’ve ever worked in a science or even a social science, some of the practices of the sabermetric community might be surprising. There’s a lower threshold for significant findings, for one. And ‘experiments’ are almost never performed in the same way that you’ll find in a psychology lab. You can’t impose constraints, and there’s no way to test an assertion in a sterile environment. You can’t have an alternate MLB universe where you change one aspect of the game and play it out.

And so, as Patriot affirms, all sabermetric models are wrong. They’ll be wrong somewhere. He goes on to set up a delicate balance between accuracy at the extremes and simplicity.

His example, the brouhaha over Aroldis Chapman having a negative FIP for the month of July, exists at the extremes. In that month, Chapman had 31 strikeouts in 14 1/3 innings, against only two walks and six hits. He didn’t give up a home run. He didn’t give up a run, and one of our most rudimentary models, ERA, just tells you that he kept teams scoreless in July. In a way, the negative FIP he had that month seems more appropriate.

Except that you can’t give up negative runs. So that means that FIP does, in fact, have some issues at the periphery. But it is nice and simple — the inputs are home runs, walks, hit-by-pitches, strikeouts and innings. The idea is simple — the stat attempts to describe only what a pitcher can control without the help of defense behind him. Even the calculation isn’t terrible.
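To see how simple that calculation is, here is a minimal sketch of the standard FIP formula run on Chapman’s July line from above. The league constant (set each season so that league FIP matches league ERA) is assumed to be 3.10 here for illustration, and Chapman’s hit-by-pitch total is assumed to be zero:

```python
# A sketch of the standard FIP formula. The league constant varies by
# season; 3.10 is an illustrative assumption, not the exact 2012 value.
FIP_CONSTANT = 3.10

def fip(hr, bb, hbp, k, ip, constant=FIP_CONSTANT):
    """FIP = (13*HR + 3*(BB+HBP) - 2*K) / IP + constant."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant

# Chapman's July line from the article: 0 HR, 2 BB, 31 K in 14 1/3 IP
# (HBP assumed zero for illustration).
chapman_july = fip(hr=0, bb=2, hbp=0, k=31, ip=14 + 1/3)
print(round(chapman_july, 2))  # -0.81 with the assumed constant
```

With strikeouts dominating the numerator, the subtraction swamps the constant and the model happily reports runs prevented below zero — exactly the edge-case behavior Patriot flags.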

One of the functions of FIP is to accurately model pitching performances (by removing defense-dependent statistics) in a way that can help better predict future pitching performances. Another function is to help spread the word about DIPS, or defense-independent pitching. The first function demands more accuracy, the second more simplicity. There’s an obvious tradeoff as you look at the different pitching estimators available at FanGraphs — xFIP is FIP with a normalized home run rate, SIERA includes skill interactivity with ground-ball rates and interactive expected BABIPs and the like. Less simple, possibly more accurate at the extremes.

What does this mean for your average baseball fan, or a fantasy baseball fanatic? You are faced with a choice.

Not to put too fine a point on it, but you can choose to emphasize the simplicity. If your personal checklist looks different from Patriot’s, that’s fine. You can put your stock in the models whose inputs you understand best, and which might model the majority of baseball best. Perhaps you believe that no model can be completely correct, so you’d rather have a guide for the big chunk in the middle, and use your own eyes to determine what’s missing in extreme cases. In this case, you don’t call FIP broken, you call Aroldis Chapman a unique case. You still believe FIP tells you something, but you’ll watch Ryan Vogelsong yourself to determine how well he’ll do in the future.

Or you can emphasize the accuracy at the extremes. According to Matt Swartz, Chapman’s July SIERA was -.02, which is ‘better’ perhaps than the more negative FIP. SIERA is a complicated skill-interactive estimator born of multivariate regression that has had some of the smoothest fits in the middle, and does okay in this extreme case. Or use a base-runs-based estimator — Patriot’s dRA came in at .68 for Chapman’s July. But you might have to take it on faith unless you’re willing to check all the math that goes into base runs estimators. And you might have some difficulty finding them on a leaderboard unless you use Patriot’s equations to run them yourself.

Most saber-savvy fantasy fans are familiar with this issue on some level. Read across Ryan Vogelsong’s 2011 stats, and it should crystallize quickly. He had a 2.71 ERA, 3.67 FIP, 3.85 xFIP, and a 3.97 SIERA. Vogelsong fans might be sad to hear that Patriot’s dRA had him at 4.18 last season. The same sort of split is happening so far this season — and some of these metrics are even park-adjusted. DIPS says that Vogelsong has regression coming; it’s just a question of how much, depending on which ERA estimator you trust.

So you’re faced with a choice between models. What does your checklist look like? What do you value most? What runs allowed model do you use, and how do you use those numbers?