November 29, 2010

Good math, bad statistics

In the past few days, a pair of posts on other blogs caught my attention -- they seem to be coming at the same issue from different directions.
First, William R. Briggs posted "Statistics Is Not Math" (November 16, 2010). Then, Tango over at The Book posted "Detrending: when statisticians attack!" (November 24, 2010). I responded to the Tango post (comment #4), but I would like to elaborate further here.
One of the things that jumped out at me from Briggs' post was the statement that "Statistics rightly belongs to epistemology, the philosophy of how we know what we know. Probability and statistics can even be called quantitative epistemology." In other words, statistics is useful only if we have some understanding of the subject matter at hand. No amount of fancy math will help our understanding if we do not start our research with some knowledge of the topic.
In the "Detrending" post, Tango links to an unpublished (in the academic sense that it's not been published in a peer-reviewed journal) paper, by three physicists, Alexander M. Petersen, Orion Penner, and H. Eugene Stanley, entitled "Detrending career statistics in professional baseball: Accounting for the steroids era and beyond". I may offer a longer critique of this paper at a later date, but the first thing that jumps out is an apparent ignorance of The Literature (i.e. what's been written earlier about the topic -- baseball -- from a statistical perspective). This leads the authors to present conclusions that have already been established elsewhere (for example, that pitcher wins are not a good measure of pitcher performance, or that standardizing allows for inter-season comparisons).
There's lots of fancy maths (some of which isn't as fancy or new-fangled as the authors seem to think) and plenty of Greek letters, but in the end it doesn't add a great deal to our understanding of baseball.
This article serves as a reminder that when we are assessing the quality of any sabermetric writing, we need to consider two factors:
1. Is the author using the appropriate statistical tools and interpreting the mathematical results correctly?
2. Does the author understand the game, including how baseball has evolved and the analytic literature that has been written over the past 50 years?


November 22, 2010

Bo knows probability

Cardinals' second baseman Bo Hart

Over on 3-D Baseball, Kincaid has a nice explanation of regression to the mean in a post titled "On Correlation, Regression, and Bo Hart". The blog entry starts with the story of Bo Hart, who got called up to the Cardinals in June 2003, and promptly hit .412 over his first 75 at-bats. Since Kincaid wrote a regression to the mean article, you can guess where Hart's season went -- he finished with 286 at-bats and a .277 average.

But Kincaid flirts with a few notions that I think are worth following in a bit more detail.

First up, what are the odds that a .277 hitter will break .400 across a string of 75 at-bats?

The answer is roughly 1 in 200.

This is calculated using the fact that the normal distribution approximates the binomial distribution -- in English, if you repeat a set of binomial trials many times, the histogram of the success rates across those sets will look like the normal curve. This leads us to the probability density function; the area under its tail allows us to state the probability that a value lands at least as far from the mean (.277) as the one observed (in this case, a batting average of .412).

Using Bo Hart's season batting average of .277 as his "true talent" (or "population mean") across 75 at-bats, we can calculate the standard deviation of the distribution (0.052). We then determine that .412 lies at 2.60 standard deviations from the mean (2.60=[.412-.277]/.052). The probability of landing 2.60 or more standard deviations above the mean is about 0.5% -- or 1 in 200.
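
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the calculation described above. The function names are my own for illustration, and the exact binomial comparison is an addition of mine rather than part of the calculation above: it assumes that "hitting .412 over 75 at-bats" means at least 31 hits (the smallest whole number of hits that reaches .412). Because the script works with the unrounded standard deviation, its z-score will differ slightly from the 2.60 obtained with the rounded 0.052.

```python
import math


def normal_upper_tail(z):
    """P(Z >= z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def batting_average_tail(true_avg, observed_avg, at_bats):
    """Normal approximation: SD of the average, z-score, and upper-tail probability."""
    sd = math.sqrt(true_avg * (1 - true_avg) / at_bats)
    z = (observed_avg - true_avg) / sd
    return sd, z, normal_upper_tail(z)


def exact_binomial_tail(true_avg, at_bats, min_hits):
    """Exact probability of at least min_hits hits in at_bats binomial trials."""
    return sum(
        math.comb(at_bats, k) * true_avg**k * (1 - true_avg) ** (at_bats - k)
        for k in range(min_hits, at_bats + 1)
    )


# Figures from the post: a .277 "true talent" hitter, a .412 average, 75 at-bats.
sd, z, p_normal = batting_average_tail(0.277, 0.412, 75)

# Assumption: smallest whole number of hits that gives at least a .412 average.
min_hits = math.ceil(0.412 * 75)
p_exact = exact_binomial_tail(0.277, 75, min_hits)

print(f"standard deviation: {sd:.3f}")
print(f"z-score: {z:.2f}")
print(f"normal approximation: {p_normal:.4f} (about 1 in {1 / p_normal:.0f})")
print(f"exact binomial, {min_hits}+ hits: {p_exact:.4f}")
```

The exact figure need not match the normal approximation: hits come in whole numbers and the binomial is slightly skewed when the success rate is well below .500, so treat the "1 in 200" as a ballpark rather than a precise value.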

What was unusual about Bo Hart is that his 1 in 200 string of successful at-bats occurred at the beginning of his Major League career. Calculating that probability is a task for another day.

In my next post I will explore Kincaid's statements about evaluating "true talent" based on a number of observations. Specifically, I'll delve into the following questions: "At what point can we be relatively certain about our inferences of true talent based on observed performance? 75 PAs is not enough, and one million is plenty, but what about 1000?"


November 14, 2010

Sealing the exits?

The Victoria Seals of the Golden Baseball League announced on Wednesday (November 10, 2010) that they were "ceasing operations". While acknowledging that the league has some serious challenges, the club pointed more fingers at the City of Victoria, which owns and operates the Seals' home, Royal Athletic Park (RAP).

Like practically every other pro sports franchise that plays in a publicly owned facility, the Seals wanted a better deal. In the news release and press conference, the Seals stated they had asked for a larger share of the gate and concession revenue, and solutions to a variety of issues relating to the playing field, including the ability to leave the outfield fence up all summer. The City's position is that they are unwilling to have taxpayers subsidize the club, and since RAP is a multi-purpose facility available to many users, their hands are tied on the field issues.

I don't know enough about the details on either side to offer an informed opinion. But I can say both sides seem to have entrenched positions, based on their specific operating requirements (for the City, that includes the political reality as well as the business side) that seem reasonable enough. In short, there may not be a middle ground that is satisfactory to both parties.

The local press has played up the distance between the Seals and the City (the news article is here, but the local daily also weighed in with this opinion piece and this more resigned article). But in so doing, the press has missed a key element that the Seals acknowledge: the Golden Baseball League is itself in a shambles. The team's press release describes the league as being in an "unstable state", but I suspect that understates the troubles.

To my way of thinking, the biggest problem is the distribution of the teams in the league. To expand the league off the mainland of North America to Victoria (on Vancouver Island) was one thing -- it guarantees a higher per-mile travel cost (those ferries aren't cheap) and perhaps an extra hotel night. But what was the league thinking, given the evidence that it was in a tenuous state to start with, when it added clubs in Mexico (Tijuana -- not far from the mothballed San Diego club but across an international border) and in particular Hawaii (Maui)?

I have to wonder if, with their "cease operations" announcement, the management of the Seals is trying to press both the City of Victoria and the management of the Golden Baseball League. Perhaps it is just wishful thinking on my part, but if the Seals can get a more satisfactory arrangement with the City of Victoria (or another municipality in the greater Victoria area -- we have 13, after all), while pressuring the league to make some sensible choices about where the franchises are located, then perhaps we haven't seen the end of this round of professional baseball in Victoria.

Update (2010-11-15): The Globe & Mail also chimes in.

The Bayes Ball Bookshelf, #1

The Numbers Game: Baseball's Lifelong Fascination with Statistics, by Alan Schwarz. 2004, St. Martin's Press.

In The Numbers Game, Alan Schwarz presents a well-written and tidy history of the development and evolution of the statistics that record the history of the game. Or more accurately, it's a history of baseball, and its evolution over the past century and a half, from the perspective of the numerical record and analysis of the game.
Thus Schwarz begins in the mid-nineteenth century, with Henry Chadwick's influence on the information that got recorded. But more importantly, Schwarz points out (and this becomes a recurring theme) that how the game was played was an influence on what got recorded. In the early days, the ball was "pitched" to the batter in a way that facilitated batting it -- and because pitching was secondary to hitting and fielding, there was no record of pitching performance. And as the game evolved, so did the numbers that recorded the game and got used to evaluate the players.
A second recurring theme is the weaving of the technical aspects of the statistics with the personal characters of those who developed and promoted various measures. This is very much a character-driven book -- we hear not only about the "why" of the statistics that were recorded, but the people who developed them and the means of recording them. So we hear about Al Munro Elias, Allan Roth's career with the Dodgers, Hal Richman's development of Strat-O-Matic, and George Lindsey's articles that appeared in academic journals beginning in the late 1950s. We also get an entire chapter devoted to the publication in 1969 of The Baseball Encyclopedia, and another to Bill James.
One of the things that jumps out at me is the impact that computers -- particularly the personal computer -- have had on the volume of statistics available, and the precision of the analysis that is now possible. (And what is perhaps a topic for another day, the proliferation of analysts of varying quality.)
In The Numbers Game, Schwarz has written what may well be the single best introduction to sabermetrics. But it's not a technical manual that will tell you how to calculate any one statistic, or how another measure should be interpreted. Instead it's a lively history of major league baseball, and the numerical record and analysis that accompanies it.

Assessment: home run.

2010 in retrospect

The 2010 MLB season has come to a close, and so begins a time of reflection and resolutions.

a) I started this blog, and rarely posted. I started a few posts, often in response to other blogs, but finished fewer still. I am confounded by the traffic on the blogosphere -- I thought I could respond thoughtfully and add something of value, but I find myself either repeating what gets said elsewhere, or sounding like a condescending pedant. Or both.

So I'll start off on a different tack, starting now.

b) I went to two MLB games in 2010. (There's nothing like living across an international border from the closest team, and 4,300 km from the "national" club.) Two shutouts! Fangraphs has the results here and here.

c) The local pro team just announced they are closing up shop. More on that later.