May 29, 2011


XKCD on sports.  Is there a quantitative way to test the hypothesis that baseball is the worst offender?


May 6, 2011

When labour market research goes to the ballpark

In a recently issued paper called "Productivity, Wages, and Marriage: The Case of Major League Baseball", economists Francesca Cornaglia and Naomi E. Feldman examine the "marriage premium" -- the fact that, controlling for other influencing factors, married men earn more than unmarried men. In most settings, a variety of confounding factors muddy the waters -- things like geographic location, differences across occupations, and poor productivity measures. Cornaglia and Feldman innovatively use information from MLB to control for those factors.

Derek Jeter, the exception that proves the rule.

The abstract:

Using a sample of professional baseball players from 1871 - 2007, this paper aims at analyzing a longstanding empirical observation that married men earn significantly more than their single counterparts holding all else equal. There are numerous conflicting explanations, some of which reflect subtle sample selection problems (that is, men who tend to be successful in the workplace or have high potential wage growth also tend to be successful in attracting a spouse) and some of which are causal (that is, marriage does indeed increase productivity for men). Baseball is a unique case study because it has a long history of statistics collection and numerous direct measurements of productivity. Our results show that the marriage premium also holds for baseball players, where married players earn up to 20% more than those who are not married, even after controlling for selection. The results are generally robust only for players in the top third of the ability distribution and post 1975 when changes in the rules that govern wage contracts allowed for players to be valued closer to their true market price. Nonetheless, there do not appear to be clear differences in productivity between married and nonmarried players. We discuss possible reasons why employers may discriminate in favor of married men.

You can hear Dr. Cornaglia discuss the research on the BBC programme More or Less (2011-04-29), starting at roughly 10'50".

My initial reaction concerns neither the findings nor the methodology, but the fact that, other than a mention of the Lahman database, the list of references does not include any of the work of the sabermetric research community. At one point in the discussion of productivity measures the authors write "Most modern-day baseball enthusiasts and commentators consider the latter two statistics [OPS and EqA] to be the most accurate measures of a player’s productivity", but the authors neither refer to any authority to support that statement nor discuss the fact that others have critiqued those measures.

This is not the first time that academics have used the contributions of the sabermetric community to support their research (in this case, it provides a vital element in the foundation of the productivity measure) but then failed to acknowledge that work. For a well-reasoned discussion of that topic, please read Phil Birnbaum's "Chopped liver II".


May 1, 2011

Early season standings and Bayes

Early season performance has been a hot topic this year (not that it isn't a topic of discussion every year).  I wrote about it, using a simple approach of assuming that every team is .500, and a more recent addition in the blogosphere is Rob Neyer's take.

Last week Kincaid over at 3-D baseball had a great post that used Boston's 2-10 start to go down a detailed and more sophisticated Bayesian path to estimating the team's true talent.  Tango posted a link to Kincaid's blog, and added a few details that incorporate actual observations.  A key element in this is that the observed spread of team records is wider than it would be if every team were a true .500 team.  (If all teams were .500, the random component alone would result in a standard deviation of 0.039 in winning percentage. In reality, the standard deviation is wider, at 0.071 -- the implication of this is that there are real talent differences between teams, with some teams having a true talent level above .500 and others below.)
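The two standard deviations quoted above can be checked with a few lines of Python. This is only a sketch: the 162-game season length and the assumption that talent and luck are independent (so that their variances add) are mine, and the names are purely illustrative.

```python
import math

GAMES = 162          # assumed length of a full MLB season
OBSERVED_SD = 0.071  # observed spread of team winning percentages (from the post)


def binomial_sd(p: float, n: int) -> float:
    """Standard deviation of winning percentage over n games for a true-p team."""
    return math.sqrt(p * (1 - p) / n)


# Spread expected from luck alone if every team were a true .500 team.
random_sd = binomial_sd(0.5, GAMES)

# Assumption: observed variance = talent variance + luck variance.
talent_sd = math.sqrt(OBSERVED_SD ** 2 - random_sd ** 2)

print(round(random_sd, 3))  # 0.039
print(round(talent_sd, 3))  # 0.059
```

Under those assumptions, roughly 0.059 of the observed 0.071 would be attributable to real differences in talent rather than to chance.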

My modest contribution to this thread is here:  a Google Docs spreadsheet that shows every MLB team's current record (as of 2011-04-30), and then uses two different Bayesian methods to predict each team's final season outcome.

The first set, the yellow columns, replicates Kincaid's "shortcut" approach, with the implied regression of 69 games noted by Tango. The blue columns take a different approach that uses the standard deviations of both the observed performance to date and the long-term record (every MLB team season outcome from 1961-2010), with the latter serving as the prior.
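As I read it, the shortcut amounts to adding 69 games of .500 ball to whatever record a team has so far. A minimal sketch of that arithmetic follows; the function name and parameters are hypothetical, and treating .500 as the mean to regress toward is my assumption.

```python
def regressed_pct(wins: int, losses: int,
                  regression_games: float = 69.0,
                  league_mean: float = 0.5) -> float:
    """Estimate true talent by padding the record with league-average games.

    regression_games=69 is the figure noted by Tango; league_mean=0.5 is
    an assumption of this sketch.
    """
    games = wins + losses
    return (wins + league_mean * regression_games) / (games + regression_games)


# Boston's 2-10 start, the example Kincaid worked from.
print(round(regressed_pct(2, 10), 3))  # 0.451
```

With no games played the estimate is exactly .500, and it moves toward the observed record as games accumulate.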

The difference between the results of the two approaches is relatively modest.  (It's worth noting that the relative position in the standings does not change.) What is apparent is that with roughly 25 games played this season, solid differences are appearing in team performance.  This method forecasts that Cleveland and Philadelphia will regress downward from .692 (a 112-win season) to .573 and end up with 93 wins. At the bottom of the table, it suggests that the Twins will improve from .346 (58 wins on the season) to .444 and a much more respectable 74 wins over the course of the season.
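One way to read the second (blue column) method is as a precision-weighted average of a prior and the record to date. The sketch below is that reading only, not necessarily the spreadsheet's exact formula: the prior mean of .500, the prior standard deviation of 0.071, the binomial variance for the observed record, and the 18-8 and 9-17 records (implied by .692 and .346 over 26 games) are all assumptions, and the function name is hypothetical.

```python
def posterior_pct(wins: int, losses: int,
                  prior_mean: float = 0.5,
                  prior_sd: float = 0.071) -> float:
    """Normal-normal update: weight the prior and the observation by 1/variance."""
    games = wins + losses
    observed = wins / games
    # Assumed sampling variance of the observed winning percentage.
    observed_var = observed * (1 - observed) / games
    prior_precision = 1 / prior_sd ** 2
    observed_precision = 1 / observed_var
    weighted_sum = prior_mean * prior_precision + observed * observed_precision
    return weighted_sum / (prior_precision + observed_precision)


print(round(posterior_pct(18, 8), 3))  # 0.573 (a .692 start)
print(round(posterior_pct(9, 17), 3))  # 0.444 (a .346 start)
```

Under these assumptions the sketch lands on the same .573 and .444 quoted above. Note that it breaks down for a team with no wins or no losses, where the assumed observed variance is zero.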