A recent post on Hawkonomics presented a regression analysis of Major League Baseball team performance as a function of payroll. This post has generated some chatter in the sabermetric blogs (Sabermetric Research and The Book). If I may be so bold, the original post wasn't very well articulated, which has led to some critiques. Herein I aim to repeat the original analysis, and provide some elaboration that will aid interpretation.
It is clear that the calibre of the players on the team influences the number of wins. What is less clear is the relationship between team calibre and the total amount the team pays in salaries. We have all heard it said that rich teams "buy a championship" by loading up on highly paid free agents, but how true is it?
This relationship has been analyzed in the past. One such analysis can be found in the book The Wages of Wins by Berri, Schmidt, & Brook, and there are plenty of other sources around the sabermetric blogs. (One interesting visualization tool can be found on Ben Fry’s site.)
One of the most common ways to test a relationship between two variables is through a regression analysis. This is the approach taken by Stacey Brook over at Hawkonomics. (Note: Brook is one of the co-authors of The Wages of Wins.)
I have re-run the regression using the data supplied on his blog. I changed two things to make the results more readily comprehensible. First, I changed the salary figures to be represented as millions; thus the Yankees’ salary is expressed not as $206,333,389 but as $206.3 million. Second, and more dramatically, I used each team’s current winning percentage and projected it out over 162 games, essentially a forecast of where the teams will end up at the end of the 2010 season if they continue at the pace established over the first ~50 games of the season.
NOTE: These transformations alter neither the “goodness of fit” of the model nor the statistical significance.
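As a rough illustration of the two transformations, here is a minimal Python sketch. The function names and the 30-20 record are made up for the example; only the Yankees payroll figure comes from the data discussed above.

```python
GAMES_IN_SEASON = 162

def salary_in_millions(dollars):
    """Express a payroll in millions of dollars, rounded to one decimal."""
    return round(dollars / 1_000_000, 1)

def forecast_wins(wins, losses, games=GAMES_IN_SEASON):
    """Project the current winning percentage over a full season."""
    return wins / (wins + losses) * games

# Yankees payroll as quoted in the post
print(salary_in_millions(206_333_389))

# Hypothetical team with a 30-20 record after 50 games
print(round(forecast_wins(30, 20), 1))
```

Both steps only rescale the numbers (divide by a constant, multiply by a constant), which is why they leave the fit of the model untouched.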
Let’s look in detail at the model that results.
1. The Correlation: The strength of the relationship between team salaries and wins
The correlation coefficient (often identified as the Pearson correlation coefficient, and represented as "R") is a unitless measure that simply tells us how much the two variables vary together in a linear manner. If they both move up in lockstep, the correlation coefficient will be 1 (the temperature outside and the amount of electricity used to run air conditioners); if one moves up while the other moves down in lockstep, the R value will be -1 (the temperature outside and the amount of natural gas burned keeping your house warm). If there is no relationship at all, then R will equal zero (the temperature outside and the amount of energy used to heat the gallons of hot water used by teenagers in the shower).
For MLB salaries and wins so far this season, the R value is 0.224. This is interpreted as being a weak linear relationship. In plainer language, the data do not really follow a linear pattern.
But there is another value, R² or R-squared, that gives us some language to work with. In this case, R-squared is 0.0503 (0.224 squared, allowing for rounding). From this, we can say that salaries account for only about 5% of the variation in teams' winning success, which is not very much at all.
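For readers who want to see the arithmetic behind R, here is a small Python sketch of the Pearson calculation. The temperature figures are invented purely to show the lockstep cases; they are not measurements, and `pearson_r` is just a helper name chosen for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented numbers, for illustration only
temperature = [10, 15, 20, 25, 30]
air_conditioning = [2 * t + 5 for t in temperature]  # rises in lockstep
heating_gas = [100 - 3 * t for t in temperature]     # falls in lockstep

r_up = pearson_r(temperature, air_conditioning)
r_down = pearson_r(temperature, heating_gas)

print(round(r_up, 3), round(r_down, 3))
print(round(r_up ** 2, 3))  # R-squared for the lockstep case
```

Real data such as the salary and wins pairs sit somewhere between these two extremes.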
This is easily seen in the X-Y chart below. Salary is plotted across the bottom, with the forecast wins up the side. (I converted the team win percentages to forecast season wins; more on this later.) Each team is represented by one of the dark blue diamonds scattered about the chart. The predicted values derived from the regression model are shown in the form of the red dots joined by a nice straight line.
[Chart: forecast season wins plotted against team salary, with the regression line]
From this, it is easy to see that the model is not a very good predictor of actual wins. While there are some blue dots that fall close to the line, there are others that are well above or below the line. If there isn’t much difference between the actual and predicted values, we have a good model. That clearly is not the case here.
The model tells us little about what makes a winning team, because a lot of the difference in team success cannot be explained by salaries. In short, this model has no oomph.
(Back to the side note from earlier: the correlation coefficient will remain the same regardless of how we express our variables. We can convert the dollars to a percentage of the average for the season, so that the low-spending Pirates would be said to have a salary that is 39% of the average while the Yankees are at roughly 228% of it, and the correlation coefficient remains at 0.224. Or we could convert the winning percentage to actual wins, with no change in the R value.)
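A quick way to convince yourself of this is to compute R twice, once on the raw figures and once on the rescaled ones. The sketch below does that in Python; the five payroll and winning-percentage pairs are hypothetical stand-ins, not the real 2010 data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical payrolls ($ millions) and winning percentages
payroll = [35.0, 60.0, 90.0, 120.0, 206.0]
win_pct = [0.550, 0.400, 0.520, 0.450, 0.600]

# Rescale: payroll as a percentage of the average, win % as wins over 162 games
mean_payroll = sum(payroll) / len(payroll)
payroll_pct_of_avg = [100 * p / mean_payroll for p in payroll]
season_wins = [162 * w for w in win_pct]

r_raw = pearson_r(payroll, win_pct)
r_rescaled = pearson_r(payroll_pct_of_avg, season_wins)

print(round(r_raw, 6), round(r_rescaled, 6))
```

The two printed values agree because multiplying a variable by a positive constant changes its units, not how it moves with the other variable.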
2a. The regression equation -- how much does a change in salaries influence wins?
The regression equation is expressed as
Y = constant + (beta × X)
where Y is the predicted number of wins and X is the salary. The constant is the value of Y where X is equal to zero, also known as the Y intercept. The beta value is the amount by which the predicted Y value changes when X increases by one. The constant and the beta value are calculated in the model.
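To make "calculated in the model" a little less mysterious, here is a minimal Python sketch of ordinary least squares with a single predictor, which is the standard way a constant and a beta value are obtained. The helper name `fit_line` and the five data points are hypothetical; they are chosen so the arithmetic is easy to follow and are not the 2010 team data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (constant, beta)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx                   # change in Y per one-unit change in X
    constant = mean_y - beta * mean_x  # value of Y when X is zero
    return constant, beta

# Hypothetical payrolls ($ millions) and season wins
payroll = [40.0, 60.0, 80.0, 100.0, 120.0]
wins = [70.0, 85.0, 75.0, 90.0, 80.0]

constant, beta = fit_line(payroll, wins)
print(constant, beta)

# The fitted line always passes through the point (average X, average Y)
mean_payroll = sum(payroll) / len(payroll)
print(constant + beta * mean_payroll)
```

The last line shows a general property of this kind of fit: plug in the average salary and you get back the average number of wins.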
In this model, it becomes
Predicted wins = 73 + (0.087 × salary in $ millions)
The interpretation: starting from a base of 73 wins, each extra $1 million spent adds about 0.087 of a win, so a single extra win costs roughly $11.5 million. A team that spends an average amount on salaries ($90.6 million) will get an average number of wins (81).
When we start to think about this equation, it’s easy to see why the model isn’t very robust. There are some teams that are going to end the season with fewer than 73 wins if they keep on the way they have been. To end up below 73 wins, the model says a team would need a negative payroll; the players should be paying the team!
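The oddity can be checked with a few lines of Python. In this sketch the slope is backed out from two figures quoted in the post (73 wins at zero payroll, 81 wins at the $90.6 million average), so it is only approximate because those inputs are rounded. The function names are invented for the example.

```python
BASE_WINS = 73.0    # predicted wins at zero payroll, from the post
AVG_PAYROLL = 90.6  # average payroll in $ millions, from the post
AVG_WINS = 81.0     # average wins, from the post

# Approximate slope implied by the two points above (rounded inputs)
wins_per_million = (AVG_WINS - BASE_WINS) / AVG_PAYROLL

def predicted_wins(payroll_millions):
    """Wins the straight line predicts for a given payroll."""
    return BASE_WINS + wins_per_million * payroll_millions

def implied_payroll(wins):
    """Payroll ($ millions) at which the line predicts the given wins."""
    return (wins - BASE_WINS) / wins_per_million

print(round(wins_per_million, 3))        # wins added per extra $1 million
print(round(predicted_wins(206.3), 1))   # Yankees-sized payroll
print(round(implied_payroll(65), 1))     # a 65-win season implies a negative payroll
```

Any win total under 73 produces a negative number from `implied_payroll`, which is the nonsense result described above.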
2b. This year’s Moneyball teams
The model does give us a way to see which teams are getting the most production (i.e. wins) for every dollar spent – the gap between the team’s actual performance and the number of wins predicted in the model is the “residual”, and it ranges from a high of 29 wins above what the model predicts for Tampa Bay and 21 for San Diego, to a low of -34 for Baltimore and -23 for Houston.
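The residual calculation itself is simple subtraction. Here is a short Python sketch that computes and ranks residuals; the three teams and their numbers are hypothetical placeholders rather than the actual 2010 figures.

```python
def residuals(actual_wins, predicted_wins):
    """Actual minus predicted wins for each team."""
    return {team: actual_wins[team] - predicted_wins[team] for team in actual_wins}

# Hypothetical forecast wins and model predictions
actual = {"Team A": 95.0, "Team B": 60.0, "Team C": 82.0}
predicted = {"Team A": 78.0, "Team B": 85.0, "Team C": 81.0}

res = residuals(actual, predicted)

# Largest positive residual first: the most wins per dollar relative to the model
ranked = sorted(res.items(), key=lambda item: item[1], reverse=True)
for team, gap in ranked:
    print(team, gap)
```

A positive residual means a team is winning more than its payroll would suggest; a negative one means it is getting less than it paid for.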
3. Statistical significance
This model is NOT statistically significant.
So what? All this means is that if we were to use another group of 30 team wins-team salary pairs, we would likely get a different R value. We could improve the significance of the model with more team salary and wins data pairs.
But if the data points are still as dispersed as they are in this case, more data points might yield a “statistically significant” model that (and this is the important part…) has the same correlation coefficient. The model would still have no oomph. All we would have then is a model that we can be confident tells us that team salary has a small relationship with the number of wins earned.
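To see how sample size alone can push the same weak correlation over the significance line, here is a Python sketch using the usual t statistic for a correlation coefficient. This is an assumption on my part about the test; the original analysis does not say which one was used. As a rule of thumb, a t value beyond about 2 (2.05 for 30 pairs) is significant at the 5% level.

```python
import math

def t_statistic(r, n):
    """t statistic for testing whether a correlation r from n pairs differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

R = 0.224  # correlation between salary and forecast wins, from the post

# Same R, increasing numbers of (salary, wins) pairs
for n in (30, 100, 300):
    print(n, round(t_statistic(R, n), 2))
```

With 30 pairs the statistic falls well short of 2, while with a few hundred pairs it clears that bar comfortably, even though R (and therefore the oomph) has not changed at all.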
Some parting thoughts
So we have arrived at the inescapable conclusion that this model does not tell us much about what influences wins, since there is little relationship between salaries and wins at this point in the 2010 season. The model is both weak and statistically insignificant.
The fact that the model is so weak runs counter to earlier research, which tended to find a stronger relationship. Is 2010 different, or is it just too early in the season to tell?
Comments and questions are always welcome.