Revised January 5, 2011
J.C. Bradbury's recent blog postings (here and here) have included histograms showing the distribution of ERA across major league pitchers for the 2009 season. For his analysis, Bradbury omitted pitchers with fewer than 100 batters faced -- in both his blog and his book Hot Stove Economics, he justifies this omission by the wide variance in ERA, much of which is due to the small sample size for each pitcher. (As we saw in my earlier post about Bo Hart, it's possible for an average player to do very well over the short term; the inverse applies too.)
But a few reader comments on Bradbury's blog ask what impact those "missing cases" -- nearly a third (28%) of all individuals who pitched in MLB in 2009 -- would have on the curve.
Here's the answer:
Figure 1: MLB Pitching, 2009 -- Number of Pitchers by ERA, by Number of Batters Faced
Incorporating the <100 BFP pitchers (the black chunks of each bar) adds pitchers across the whole range, although they are skewed to the right (i.e. higher ERAs). While there is a stack on the left with very low ERAs, there's a bigger group of players with an ERA greater than 10. (The highest ERA of this group was 135.00.)
NOTE 1: ERA is a poor measure to use for this type of evaluation -- for pitchers with a low number of batters faced or innings pitched, huge values appear easily. That 135.00 ERA is the equivalent of 15 earned runs allowed with only a single recorded out. These exaggerated values then distort the group mean upward. A better measure would be wOBA, or another measure that, like a probability, is bounded between 0 and 1.
The table below shows the average ERA of this group and of three other groups based on the number of batters faced. What we see is that the <100 BFP pitchers have a higher average ERA than those who pitched more frequently. (This difference is statistically significant.) In spite of the variation in their individual ERAs, this group is, on average, less skilled than the other three groupings of pitchers.
NOTE 2: This is where I went wrong. The math is correct, but there is bias in the sample that I ignored. We can be fairly confident that pitchers who get off to a poor start won't get many opportunities to pitch -- and therefore won't get the opportunity to regress to the mean. Pitchers who do better at the start of their season will continue to pitch, and regress to the mean. This process may take them some time, which may push them over the arbitrary line of 100 batters faced. Thus the statistical significance is an artifact of the bias.
Figure 2: MLB Pitching, 2009 -- Average ERA, by Number of Batters Faced
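As an aside on the significance claim above: a comparison like this is typically a two-sample test on the group means. Here is a minimal sketch of Welch's t statistic (which does not assume equal variances) -- the ERA values below are hypothetical, since the underlying data isn't published here, and the original analysis may well have used a different test:

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for the difference between two group means
    (makes no equal-variance assumption)."""
    m1, m2 = statistics.fmean(a), statistics.fmean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)  # sample variances
    return (m1 - m2) / math.sqrt(v1 / len(a) + v2 / len(b))

# Hypothetical ERA samples for a low-workload and a high-workload group,
# purely to show the mechanics:
t = welch_t([8.0, 9.0, 10.0, 11.0, 12.0], [3.0, 4.0, 5.0, 6.0, 7.0])
```

A large |t| is what licenses a "statistically significant" claim; with real data you would also compute the degrees of freedom and a p-value.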
In a thread on The Book blog that covered this same topic, I made a similar statement (reply #8): "What I’m trying to say is that our best estimate of the “true talent” of this group is an ERA of 8.11 [in the current case, 8.72], and that estimate is quite accurate". That statement got a response from Tango (reply #9) of "That is not accurate. If you look at how those pitchers who faced fewer than 100 batters did in the season preceding or the season following, THAT will give you a much better indicator of the true talent level."
So let me clarify. The average level of skill of the pitchers who faced fewer than 100 batters in 2009 is an ERA of 8.72. Although Tango is correct that the poorest performers would regress upward, by the same token the best pitchers (some of whom managed a 0.00 ERA in their short stint) would get worse. But if we were to let all 188 of them continue to pitch, we could be 95% confident that the "true" ERA of the group would end up somewhere between 6.92 and 10.52.
Even the lower bound of that interval (i.e. the lowest group average we would expect under this more rigorous test) is higher than the upper bound for any of the other groups.
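For reference, an interval like the 6.92-10.52 one above is a plain 95% confidence interval for a group mean. A minimal sketch of the mechanics, with hypothetical values since the underlying list of 188 ERAs isn't reproduced here:

```python
import math
import statistics

def ci95(values):
    """Normal-approximation 95% confidence interval for the mean."""
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))  # standard error
    return mean - 1.96 * se, mean + 1.96 * se

# Hypothetical ERA values, purely to show the mechanics:
lo, hi = ci95([1.0, 2.0, 3.0, 4.0, 5.0])
```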
NOTE 3: My statement above would be correct, if it were not for the bias in the sample. My belief had been that this group would regress not to the MLB average, but to the average of the <100 BFP pitchers. But because of the selection bias, this does not hold true.
Here's a simple example to demonstrate how this works. Think of the probability professor's favourite tool, the coin toss. If we took a penny, tossed it repeatedly -- say, 10,000 times -- and recorded the result each time, the proportion of heads would very accurately reflect the true probability of heads for that penny. And we would need that many tosses to get an accurate measure of a single penny.
But what if instead of one penny we had 188 pennies, and we varied the number of tosses each penny got? Although the average number of tosses would be 50, some pennies might get only one toss, while others would get as many as 100 tosses. Some of those short sequences might come up all heads, while others would heavily favour the tails. On average, though, across the 188 pennies, we would find that the group average was a close reflection of the "true average" of the group.
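This version of the experiment, with no selection at work, is easy to simulate. A minimal sketch using 188 fair coins (the exact toss-count distribution is my own assumption; the post only specifies a range of 1 to 100 with an average near 50):

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

N_COINS = 188  # matching the 188 pitchers in the post
proportions = []
for _ in range(N_COINS):
    n_tosses = random.randint(1, 100)  # varied workload, averaging ~50 tosses
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    proportions.append(heads / n_tosses)

group_avg = sum(proportions) / N_COINS
print(f"group average heads rate: {group_avg:.3f}")
```

Individual coins wander all over the place (a one-toss coin shows 0% or 100% heads), but with no selection the group average lands very close to the true 0.5.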
NOTE 4: The error in the initial assumption causes my coin-flipping analogy to fall apart. If "success" is a head, then a coin whose observed heads rate is above 0.5 will keep being flipped, possibly accumulating enough flips to cross the arbitrary threshold out of the "low flip" group. Meanwhile, a coin that runs tails more often will get pulled from the trials quickly, ending up with an observed rate below 0.5 and few flips. Thus, as a group, the coins with a smaller number of flips will look worse than those that keep getting flipped. Selection bias creates an apparent difference where none really exists.
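This mechanism can be simulated directly. In the sketch below every coin is perfectly fair; the bench rule (a coin stops once tails lead heads by three), the 200-flip "season", and the coin count are my own hypothetical choices, not from the post:

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

N_COINS = 1000   # hypothetical population of fair coins
MAX_FLIPS = 200  # a full "season" of flips
DEFICIT = 3      # a coin is benched once tails lead heads by this much
THRESHOLD = 100  # the arbitrary "low flip" cutoff

results = []  # (total flips, heads) per coin
for _ in range(N_COINS):
    heads = tails = 0
    while heads + tails < MAX_FLIPS:
        if random.random() < 0.5:
            heads += 1
        else:
            tails += 1
        if tails - heads >= DEFICIT:
            break  # cold streak: the coin stops getting chances
    results.append((heads + tails, heads))

low = [h / f for f, h in results if f < THRESHOLD]
high = [h / f for f, h in results if f >= THRESHOLD]
avg_low = sum(low) / len(low)
avg_high = sum(high) / len(high)
print(f"avg heads rate, <{THRESHOLD} flips:  {avg_low:.3f} ({len(low)} coins)")
print(f"avg heads rate, >={THRESHOLD} flips: {avg_high:.3f} ({len(high)} coins)")
```

Every coin is fair, yet the low-flip group's average heads rate comes out well below 0.5 while the heavily flipped group sits near it -- the apparent skill gap is produced entirely by the stopping rule.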
And so it is with the pitchers in question. If they were like the other pitchers in MLB, we would expect that some of the <100 batters faced pitchers would have ERAs above the league average, while others would fall below. What we see, however, is that while there is a wide variation, the average is substantially higher than the other groups of pitchers.
NOTE 5: ...because of selection bias! The lesson: selection bias can crop up anywhere, even if you are not the one doing the selecting.