May 1, 2011

Early season standings and Bayes

Early season performance has been a hot topic this year (as it is every year).  I wrote about it using the simple assumption that every team is a .500 team; a more recent addition in the blogosphere is Rob Neyer's take.

Last week Kincaid over at 3-D baseball had a great post that used Boston's 2-10 start to go down a detailed and more sophisticated Bayesian path to estimating the team's true talent.  Tango posted a link to Kincaid's blog, and added a few details that incorporate actual observations.  A key element in this is that the observed spread of team winning percentages is wider than it would be if every team had a true talent level of .500.  (If all teams were .500, the random component alone would produce a standard deviation of 0.039 in winning percentage over a 162-game season. In reality, the standard deviation is wider, at 0.071. The implication is that there are real talent differences between teams, with some teams having a true talent level above .500 and others below.)
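The 0.039 figure can be checked with a few lines of Python. This is a minimal sketch, assuming a 162-game season and treating each game as an independent coin flip for a .500 team. The last step, which splits the observed variance into a random part and a talent part, assumes the two are independent; the function names are illustrative and not taken from Kincaid's or Tango's posts.

```python
import math

GAMES_IN_SEASON = 162   # assumption: a full MLB season
OBSERVED_SD = 0.071     # observed SD of team winning percentage, as quoted above


def random_sd(games, true_pct=0.5):
    """SD of winning percentage from chance alone (binomial model)."""
    return math.sqrt(true_pct * (1 - true_pct) / games)


def implied_talent_sd(observed_sd, games):
    """SD of true talent, assuming observed variance = talent variance + random variance."""
    return math.sqrt(observed_sd ** 2 - random_sd(games) ** 2)


print(round(random_sd(GAMES_IN_SEASON), 3))
print(round(implied_talent_sd(OBSERVED_SD, GAMES_IN_SEASON), 3))
```

Under these assumptions the spread of true talent works out to a little under 0.06, which is the part of the observed 0.071 that luck alone cannot explain.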

My modest contribution to this thread is here:  a Google doc spreadsheet that shows every MLB team's current record (as of 2011-04-30), and then applies two different Bayesian methods to predict each team's final season outcome.

The first set, in the yellow columns, replicates Kincaid's "shortcut" approach, with the implied regression of 69 games noted by Tango. The blue columns take a different approach that combines the standard deviation of the performance observed to date with that of the long-term observations (every MLB team season outcome from 1961-2010), which serve as the prior.
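The shortcut approach amounts to adding 69 games of .500 ball to a team's current record and then scaling the resulting winning percentage to a full season. Here is a minimal sketch in Python, assuming a 162-game season; the function names are my own illustration and are not part of the spreadsheet.

```python
REGRESSION_GAMES = 69    # implied regression noted by Tango
GAMES_IN_SEASON = 162    # assumption: a full MLB season


def shortcut_pct(wins, games, regression_games=REGRESSION_GAMES):
    """Winning percentage after adding `regression_games` of .500 ball."""
    return (wins + regression_games / 2) / (games + regression_games)


def shortcut_wins(wins, games, season=GAMES_IN_SEASON):
    """Projected full-season win total under the shortcut approach."""
    return round(shortcut_pct(wins, games) * season)


# An 18-8 start regresses to about .553, or roughly 90 wins.
print(round(shortcut_pct(18, 26), 3), shortcut_wins(18, 26))
```

Because the 69 added games are fixed, they matter less as the real sample grows: the same winning percentage over 52 or 78 games is regressed less than it is over 26.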

The difference between the results of these two approaches is relatively modest.  (It's worth noting that the teams' relative positions in the standings do not change.) What is apparent is that with roughly 25 games played this season, there are solid differences appearing in the team performances.  The second (blue-column) method forecasts that Cleveland and Philadelphia will regress downward from .692 (a 112 win season) to .573 and end up with 93 wins. At the bottom of the table, it suggests that the Twins will improve from .346 (58 wins on the season) to .444 and a much more respectable 74 wins over the course of the season.
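For readers who want to experiment, here is one possible reading of the blue-column method, written as a standard normal-normal Bayesian update in Python. It is a sketch under my own assumptions rather than a transcription of the spreadsheet formulas: the prior is centered on .500 with the long-term standard deviation of 0.071, and the variance of the observed record is approximated by p(1 - p) / n. The function name is hypothetical.

```python
PRIOR_MEAN = 0.500   # assumption: prior centered on a league-average team
PRIOR_SD = 0.071     # long-term SD of team winning percentage
GAMES_IN_SEASON = 162


def posterior_pct(wins, games, prior_mean=PRIOR_MEAN, prior_sd=PRIOR_SD):
    """Precision-weighted blend of the prior and the observed winning percentage."""
    observed = wins / games
    # Assumption: binomial approximation to the variance of the observed percentage.
    observed_var = observed * (1 - observed) / games
    prior_var = prior_sd ** 2
    # Weight on the observed record grows as observed_var shrinks (more games played).
    weight = prior_var / (prior_var + observed_var)
    return prior_mean + weight * (observed - prior_mean)


estimate = posterior_pct(18, 26)   # an 18-8 start
print(round(estimate, 3), round(estimate * GAMES_IN_SEASON))
```

With an 18-8 record this reading gives roughly .573, or about 93 wins over 162 games, in line with the figure quoted for Philadelphia. That agreement is suggestive, but it does not confirm that the spreadsheet uses exactly this calculation.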



  1. Do those take into account the games already played? That is, when you say Philadelphia will end up with 93 wins, is that the 18-8 they already have, plus 75-61 for the rest of the season?

    Or are you saying that their TALENT is 93 wins, and they should end up with more because of their 18-8 start?

  2. Phil, this analysis suggests that the Phillies will end up 93-69 for the season. The 18 wins they already have are included in the 93 total.

    The math behind the shortcut approach (which put them at 90 wins) is:
    WINS: current wins + 69/2 = 18 + 34.5 = 52.5
    GAMES: current games + 69 = 26 + 69 = 95

    This gives a W/G percentage of .553, or 90-72 over the 162-game season. (This is slightly different from the result of the more complex approach, but close enough for this purpose.)

    In both methods, as the sample size (number of games played) increases, the impact of the regression is reduced.

    If the Phillies go to 36-16 (the same winning percentage but with twice as many games played), their predicted record for the season will rise to 94-68 under the shortcut approach.

    And if their record goes to 54-24 (same % but three times the games), the predicted record ends up at 98-64.