After a bit of water-cooler chat today at work, I had planned to spend part of my evening working out the probability of the crazy Wimbledon match in which John Isner beat Nicolas Mahut after a fifth set that ran 138 games, with Isner prevailing 70-68.

I got home to find not one but two well-presented analyses that tackle the question, so my work would be redundant. Thus I present the following links for your reading pleasure:

1. Carl Bialik, "Isner Fitting Winner of Marathon Wimbledon Match", Wall Street Journal

2. Phil Birnbaum, "What were the odds of the 70-68 score at Wimbledon?", Sabermetric Research. (Birnbaum graciously acknowledges Bialik's article in a foreword appended after the fact.)

The final summary: this was a one-in-a-million (give or take, depending on some of the assumptions) event.

## June 24, 2010

## June 6, 2010

### Perfectly random?

There has been much discussion about the recent run of perfect and almost-perfect games. A variety of hypotheses have been floated, including

pitching dominance (including a higher strike out ratio), improved defense, and the confluence of expansion, better player evaluation, and a drug-free world.

Perfect games are a rare event, so we run the risk of seeing a random cluster as a trend. There have now been 20 perfect games -- 18 in the "modern era" (since 1900), 14 since the expansion era began in 1961, and two so far in the 2010 season. How can we tell if this "streak" of two perfect games in a single season is simply a random fluctuation?

*Calculating the probability of a perfect game: allowing runners*

One approach is to calculate a theoretical probability based on on-base percentage (OBP). Tango has a blog entry "Perfect Game calculation" that presents one approach. His estimate was 1 perfect game per 15,000.

Another example of this appears in *Mathletics* by Wayne L. Winston, who calculated a probability of 0.0000489883, or 1 game in just over 20,400. Winston noted that at the time the book went to press (before the 2009 season) there had been nearly 173,000 regular season games since 1900, and each game provides 2 opportunities for a perfect game (so we have 346,000 "team games"). Winston then goes on to note that we would therefore expect there to be 16.95 perfect games over that period -- almost perfectly matching the observed total of 17 to that point in time.

A side note: after Mark Buehrle's perfect game in 2009, Sky Andrecheck took a similar approach for individual players. He worked out the individual chances for the 16 modern-era players who had tossed a perfect game, based on the sum of the on-base percentage and reached-on-error percentage they allowed over their careers.
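To make the arithmetic behind these figures concrete, here is a minimal sketch. The `perfect_game_probability` function assumes the simplest possible version of the OBP approach (27 independent batters, each reaching base with the same probability); that simplification and the example OBP of 0.320 are my assumptions, not necessarily the exact method Tango or Winston used. The expected-count calculation uses only Winston's numbers quoted above.

```python
def perfect_game_probability(obp: float, outs: int = 27) -> float:
    """Chance of retiring `outs` batters in a row.

    Assumes every plate appearance is independent and that each batter
    reaches base with probability `obp` (a simplifying assumption).
    """
    return (1.0 - obp) ** outs


def expected_perfect_games(probability: float, team_games: int) -> float:
    """Expected number of perfect games over a number of team games."""
    return probability * team_games


# Winston's figures, as quoted in the text.
winston_p = 0.0000489883
winston_team_games = 346_000

one_in = 1 / winston_p  # "1 game in just over 20,400"
expected = expected_perfect_games(winston_p, winston_team_games)

print(f"1 perfect game in {one_in:,.0f} team games")
print(f"Expected perfect games: {expected:.2f}")

# Hypothetical OBP, for illustration only.
print(f"P(perfect game | OBP = 0.320) = {perfect_game_probability(0.320):.8f}")
```

Small changes in the assumed OBP move the result a great deal, because the out probability is raised to the 27th power.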

*Calculating the probability of a perfect game: observed rate*

A second approach to calculating the probability is to compare the observed number of perfect games to the number of opportunities. I decided to use 1961 as year one. This was a natural point to begin -- this was the first year of baseball's expansion, and it falls roughly mid-way between Don Larsen's 1956 World Series perfecto (which had been the first in 34 years) and Jim Bunning's 90-pitch masterpiece in 1964. Between 1961 and 2009 inclusive, there were 12 perfect games -- and there were 201,506 regular season "team games". This gives us a probability of 0.00005955, or roughly 1 perfect game every 16,790 team games played.
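The observed-rate calculation is a single division; a quick sketch using the two counts given above (the variable names are mine):

```python
# Counts for 1961-2009 inclusive, as given in the text.
perfect_games = 12
team_games = 201_506

observed_rate = perfect_games / team_games        # probability per team game
games_per_perfecto = team_games / perfect_games   # team games per perfect game

print(f"Observed rate: {observed_rate:.8f}")
print(f"One perfect game every {games_per_perfecto:,.0f} team games")
```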

This method yields a result that falls between Tango's and Winston's estimates.

*What are the odds of two perfect games in one season?*

While most statistical analysis makes the assumption that the distribution of the events is "normal", when we are dealing with rare discrete events the distribution does not resemble the normal distribution. The most common distribution used for this is the Poisson distribution.

At the probability of 1 in 16,790, for a season of 4,860 "team games" (the current number per season -- based on 2,430 games and therefore 4,860 perfect game opportunities) and for a season of 4,112 team games (the average number since 1961), the probabilities, expected frequencies, and observed frequencies are as follows:
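For readers who want to reproduce the Poisson figures, here is a minimal sketch. It uses only the rate (1 in 16,790), the two season lengths, and the 50-season span quoted in this post; the function names are mine.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of exactly k events when lam events are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


RATE = 1 / 16_790  # perfect games per team game, from the observed rate
SEASONS = 50       # roughly the span since 1961


def season_table(team_games: int, max_k: int = 2) -> dict:
    """Map k -> (probability of k perfect games in a season,
    expected number of such seasons out of SEASONS)."""
    lam = team_games * RATE  # expected perfect games per season
    return {
        k: (poisson_pmf(k, lam), SEASONS * poisson_pmf(k, lam))
        for k in range(max_k + 1)
    }


current = season_table(4_860)  # current schedule
average = season_table(4_112)  # average season length since 1961

for label, table in (("4,860 team games", current), ("4,112 team games", average)):
    print(label)
    for k, (prob, seasons) in table.items():
        print(f"  {k} perfect games: p = {prob:.4f}, "
              f"expected seasons in {SEASONS} = {seasons:.2f}")
```

The two season lengths bracket the expected counts, which is where the "between 1 and 2" and "between 9 and 11" ranges come from.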

So over 50 seasons, we would predict that there would be between 1 and 2 seasons with 2 perfect games, and between 9 and 11 seasons with 1 perfect game.

So to answer the question posed in the title, the answer is "Yes -- two perfect games in one season is well within the expected distribution." The fact that 2010 has been the first season with 2 perfect games in the 50 years since 1961 fits perfectly with the expected distribution.

In future posts I will repeat the calculation of probabilities and frequencies, with modified probabilities (once the dust settles on the "correct" way to calculate the probabilities...)

*Comments and questions are always welcome.*

## June 4, 2010

### A closer look at payroll and performance

A recent post on Hawkonomics presented a regression analysis of Major League Baseball team performance as a function of payroll. This post has generated some chatter in the sabermetric blogs (Sabermetric Research and The Book). If I may be so bold, the original post wasn't very well articulated, which has led to some critiques. Herein I aim to repeat the original analysis, and provide some elaboration that will aid interpretation.

It is clear that the calibre of the players on the team influences the number of wins. What is less clear is the relationship between team calibre and the total amount the team pays in salaries. We have all heard it said that rich teams "buy a championship" by loading up on highly paid free agents, but how true is it?

This relationship has been analyzed in the past. One such analysis can be found in the book *The Wages of Wins* by Berri, Schmidt, & Brook, and there are plenty of other sources around the sabermetric blogs. (One interesting visualization tool can be found on Ben Fry’s site.)

One of the most common ways to test a relationship between two variables is through a regression analysis. This is the approach taken by Stacey Brook over at Hawkonomics. (Note: Brook is one of the co-authors of *The Wages of Wins*.)

I have re-run the regression using the data supplied on his blog. I changed two things to make the results more readily comprehensible. First, I changed the salary figures to be represented as millions; thus the Yankees’ salary is expressed not as $206,333,389 but $206.3. Second, and more dramatically, I used each team’s current winning percentage and projected it out over 162 games – essentially a forecast of where the teams will end up at the end of the 2010 season if they continue at the pace established over the first ~50 games of the season.

*NOTE: These transformations alter neither the “goodness of fit” of the model nor the statistical significance.*

Let’s look in detail at the model that results.

**1. The Correlation: The strength of the relationship between team salaries and wins**

The correlation coefficient (often identified as the Pearson correlation coefficient, and represented as "R") is a unitless measure that simply tells us how much the two variables vary together in a linear manner. If they both move up in lockstep, the correlation coefficient will be 1 (the temperature outside and the amount of electricity used to run air conditioners); if one moves up while the other moves down in lockstep, the R value will be -1 (the temperature outside and the amount of natural gas burned keeping your house warm). If there is no relationship at all, then R will equal zero (the temperature outside and the amount of energy used to heat the gallons of hot water used by teenagers in the shower).

For MLB salaries and wins so far this season, the R value is 0.224. This is interpreted as being a weak linear relationship. In plainer language, the data do not really follow a linear pattern.
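As a rough illustration of how the correlation coefficient is computed, and why rescaling the variables leaves it unchanged, here is a small sketch. The salary and win figures in it are made up for illustration; they are not the 2010 team data, so the R it prints will not be 0.224.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)


# Hypothetical data: salaries in $ millions and forecast wins.
salaries = [35.0, 60.0, 90.0, 120.0, 206.0]
wins = [70, 88, 75, 92, 85]

r = pearson_r(salaries, wins)

# Re-express salaries in dollars and wins as a winning percentage:
# the units change, the correlation does not.
r_rescaled = pearson_r([s * 1_000_000 for s in salaries],
                       [w / 162 for w in wins])

print(f"R = {r:.3f}, R after rescaling = {r_rescaled:.3f}")
```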

But there is another value -- R² or R-squared -- that gives us some language to work with. In this case, 0.224 squared is 0.0503. From this, we can say that salaries explain 5.03% of the variation in teams' winning success -- not very much at all.

This is easily seen in the X-Y chart below. Salary is plotted across the bottom, with the forecast wins up the side (I converted the team win percentages to forecast season wins -- more on this later.) Each team is represented by one of the dark blue diamonds scattered about the chart. The predicted values derived from the regression model are shown in the form of the red dots joined by a nice straight line.

[Chart: forecast wins (vertical axis) plotted against team salary (horizontal axis), with the regression line]

From this, it is easy to see that the model is not a very good predictor of actual wins. While there are some blue dots that fall close to the line, there are others that are well above or below the line. If there isn’t much difference between the actual and predicted values, we have a good model. That clearly is not the case here.

The model tells us little about what makes a winning team, because a lot of the difference in team success cannot be explained by salaries. In short, this model has no *oomph*.

*(Back to the side note from earlier: the correlation coefficient will remain the same regardless of how we express our variables. We can convert the dollars to a percentage of the average for the season (thus the low-spending Pirates would be said to have a salary that is 39% of the average while the Yankees are spending 228% of the average), and the correlation coefficient remains at 0.224. Or we could convert the winning percentage to actual wins, with no change in the R value.)*

**2a. The regression equation -- how much does a change in salaries influence wins?**

The regression equation is expressed as

Y=constant+(beta*X)

where Y is the predicted number of wins, and X is the salary. The constant is the point on the Y axis where X is equal to zero, also known as the Y intercept. The beta value is the amount by which the predicted Y value changes when X increases by one. The constant and the beta value are calculated in the model.

In this model, it becomes

Y=73.0+(0.087*X)

The interpretation: since salary is expressed in millions, each extra $1 million spent yields an increase of 0.087 wins, starting at a base of 73 wins. Put another way, one extra win costs roughly $11.5 million. A team that spends an average amount on salaries ($90.6 million) will get an average number of wins (81).

When we start to think about this equation, it’s easy to see why the model isn’t very robust. There are some teams that are going to end the season with fewer than 73 wins if they keep on the way they have been. To end up below 73 wins, the model says the players should be paying the team!

**2b. This year’s Moneyball teams**

The model does give us a way to see which teams are getting the most production (i.e. wins) for every dollar spent – the gap between the team’s actual performance and the number of wins predicted in the model is the “residual”, and it ranges from a high of 29 wins above what the model predicts for Tampa Bay and 21 for San Diego, to a low of -34 for Baltimore and -23 for Houston.
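The fitted equation and the residual can be sketched in a few lines. The intercept (73.0) and slope (0.087) come from the model above; the function names are mine, and the 100-win team at the end is hypothetical rather than one of the 2010 teams.

```python
INTERCEPT = 73.0  # predicted wins at zero salary
SLOPE = 0.087     # additional wins per extra $1 million of salary


def predicted_wins(salary_millions: float) -> float:
    """Wins predicted by the fitted line for a salary in $ millions."""
    return INTERCEPT + SLOPE * salary_millions


def residual(actual_wins: float, salary_millions: float) -> float:
    """Actual wins minus predicted wins; positive means outperforming the model."""
    return actual_wins - predicted_wins(salary_millions)


cost_per_win = 1 / SLOPE  # $ millions needed to buy one more predicted win

print(f"Average payroll ($90.6M): {predicted_wins(90.6):.1f} predicted wins")
print(f"Yankees payroll ($206.3M): {predicted_wins(206.3):.1f} predicted wins")
print(f"Cost of one extra predicted win: ${cost_per_win:.1f} million")

# Hypothetical team: average payroll, on pace for 100 wins.
print(f"Residual: {residual(100, 90.6):+.1f} wins")
```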

**3. Statistical significance**

This model is NOT statistically significant.

So what? All this means is that if we were to use another group of 30 team wins-team salary pairs, we would likely get a different R value. We could improve the significance of the model with more team salary and wins data pairs.

But if the data points are still as dispersed as they are in this case, more data points might yield a “statistically significant” model that (and this is the important part…) has the same correlation coefficient – the model would still have no oomph. All we have then is a model that we can be confident tells us that team salary has a small relationship with the number of wins earned.
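One way to see this is with the standard t-test for a correlation coefficient, which I am assuming here for illustration (the post does not say which test produced its significance result). The sketch holds R at 0.224 and changes only the number of data pairs; the 300-pair case is hypothetical.

```python
import math
from statistics import NormalDist


def t_statistic(r: float, n: int) -> float:
    """t statistic for testing whether a correlation r from n pairs differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


R = 0.224

# Two-tailed 5% critical value of Student's t with 28 degrees of freedom.
T_CRIT_DF28 = 2.048

t_30 = t_statistic(R, 30)
print(f"n = 30:  t = {t_30:.2f}, significant: {abs(t_30) > T_CRIT_DF28}")

# With many more pairs the t distribution is close to normal, so the
# normal critical value (about 1.96) is used as an approximation.
z_crit = NormalDist().inv_cdf(0.975)
t_300 = t_statistic(R, 300)
print(f"n = 300: t = {t_300:.2f}, significant: {abs(t_300) > z_crit}")

# Same R, so the same R-squared: significance changes, oomph does not.
print(f"R-squared in both cases: {R * R:.3f}")
```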

**Some parting thoughts**

So we have arrived at the inescapable conclusion that this model does not tell us much about what influences wins, since there is little relationship between salaries and wins at this point in the 2010 season. The model is both weak and statistically insignificant.

The fact that the model is so weak runs counter to earlier research, which tended to find a stronger relationship. Is 2010 different, or is it just too early in the season to tell?

*Comments and questions are always welcome.*