December 10, 2010

Slugging regression

Tango issued a multi-part challenge, of which the first part is:
1. Take the top 10 in SLG in each of the last 10 years, and tell me what the overall average SLG of these 100 players was in the following year.

The point of the challenge is to demonstrate that top performing players will regress toward the mean in subsequent seasons, and that the year under consideration accounts for, as a rule of thumb, 70% of the next season's performance, and the league average (to which their performance regresses) the other 30%.

In algebraic terms, X is predicted to be 70% when
SLG1 is season 1 slugging average,
SLG2 is season 2 slugging average,
LSLG is the average league slugging average (from season 1)

Leo quickly responded (comment #1 to Tango's post), with his calculation that for slugging, 73.3% was accounted for by the player's average in the first season. To my way of thinking, the challenge has been met -- job well done, Leo!

But as I started to think about it further, I began to wonder how far through the rankings this rule of thumb holds -- as we approach the league average, the player's SLG and the league SLG become one and the same number. And at the opposite end of the scale -- the non-sluggers -- do they regress upwards towards the mean?

So my first step was to simplify the challenge, and only look at two consecutive seasons, 2007 and 2008. Using only those players who had a minimum of 400 at bats each season, I pruned the list down to 129 players in both the NL and AL. Simplifying matters further is the fact that the 2007 SLG for the NL was the same as the AL -- .423. So for my "top sluggers" I then looked at the top 25 across both leagues.

The result: for these 25 players, on average, 66% of their 2008 SLG was accounted for through their 2007 score. A few percentage points from Tango's rule of thumb, but close enough.

Charting the results shows that all but two of the top 25 sluggers regressed downwards towards the mean. And of the two, only one improved dramatically: Albert Pujols (who inched up still further in 2009, before regressing ever-so-slightly in 2010). Were Pujols not in the mix, the 2007 SLG would account for only 62% of the 2008 scores.

Another interesting observation is that of these top performers, not one fell so far in 2008 to end up with a SLG below the league average. That's not to say that it wouldn't happen, but it suggests that at the extreme end of the performance curve, as determined over the course of a full season, top performers really are above average. (NOTE: further testing required!)

But what of the other end of the ranking? I looked at the lowest performing players that I had selected, and the rule of thumb does not work. From the bottom up, the percentage explained was 87%, 84%, -4.4%, -28%, ...

At this point, I started to wonder -- why minus values? A quick check of the numbers, and I saw that these players regressed up, and to a point above the league average.

So what's different about the bottom of the range? It's simple: survivorship bias. My "sample" of 139 players who had 400+ ABs in each of 2007 and 2008, while ensuring I found the top hitters, automatically excluded those weak-slugging players who don't get many plate appearances but who collectively drag down the league average. Thus the "worst" players of the 139 with lots of ABs were not (by and large) far from the league average. The bottom of the list was Jason Kendall, who slugged .309 in 2007 for the A's and the Cubs while catching. Perform much worse than that, and you'll end up playing Triple A. Or in Kendall's case, for the Royals.

On deck: regression toward the mean, SLG with 75+ ABs.


No comments:

Post a Comment