February 2, 2013

Comparing individual team run production

Or, The 2010 Mariners: How Bad Were They?

In earlier posts, I used the statistical software R to plot the trends in league average run scoring since 1901. This was the first step to answering other questions I had on my mind:
  1. How poor was the offensive performance of the 2010 Seattle Mariners?
  2. Are they showing any signs of improvement?
  3. And how can I use R to tabulate the data to answer these questions?
So, on to Question #1. It is well established that the 2010 Mariners were not very good, at least offensively. (For fans of the team, the well-deserved Cy Young Award won by Felix Hernandez was surely the highlight of the season.) But I wanted a relative measure that would be comparable across time, to accommodate the fluctuations in run scoring that were the subject of those earlier posts.

As I started into this, the first decision was to draw a line in the historical record. I opted to use the eras described in Bill James' "Dividing Baseball History into Eras" article (behind a paywall, but chances are that if you're reading my blog, you're already a Bill James subscriber):
  • Era 1 (The Pioneer Era), 1871-1892 
  • Era 2 (The Spitball Era), 1893-1919 
  • Era 3 (The Landis Era), 1920-1946 
  • Era 4 (The Baby Boomers Era), 1947-1968 
  • Era 5 (The Artificial Turf Era), 1969-1992 
  • Era 6 (The Camden Yards Era), 1993-2012
Based on these groupings, I opted to use the range of seasons 1947-2012 inclusive. This yields 1,580 team seasons of National League and American League baseball.

The second step was to calculate runs per game (RPG) for each team, by year. This corrects for the longer regular season in the post-expansion period and for the strike-shortened seasons, and gives us a common denominator for comparing the results so far in 2012.

To do this, I accessed the 2012 edition of the Lahman database. Once I had downloaded and extracted the comma-delimited version of the files, I read the "teams" file into R.
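Here is a minimal sketch of the read-and-subset step. It is not the code from the Gist: the object names `teams_csv`, `Teams_all`, and `Teams` are my own choices, and the tiny CSV it writes is made-up data standing in for the real Lahman file so that the example runs on its own.

```r
# Made-up stand-in for the Lahman teams file, written to a temporary location.
# Team IDs and all numbers here are hypothetical.
teams_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "yearID,lgID,teamID,G,R,RA",
  "1946,AL,AA1,154,700,650",
  "1947,AL,AA1,154,680,660",
  "2012,NL,NN1,162,720,690"
), teams_csv)

# With the real data, point read.csv at the extracted teams file instead.
Teams_all <- read.csv(teams_csv, stringsAsFactors = FALSE)

# Keep only the seasons from 1947 to 2012 inclusive.
Teams <- subset(Teams_all, yearID >= 1947 & yearID <= 2012)
```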

(Sidebar: My original intention was to incorporate chunks of the R code into this post. But Blogspot seems to be going out of its way to make formatting of that code a nightmare. I located this post at Getting Genetics Done that pointed me to the Gist feature at Github. I have to admit to being a total Github neophyte, but I have managed to create a public Gist that will allow you to access all of the R code I used here.)

With the 1947-2012 data frame constructed, we can calculate the average number of runs each team scored per game. And while we are at it, we can calculate the runs allowed, too (and save them for another day). The variables are R (runs), RA (runs allowed), and G (games).
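The per-game calculation might look something like the sketch below. The column names `RPG` and `RAPG` are my own labels. The runs and games for the two rows are the 1968 White Sox and 2010 Mariners figures quoted in this post; the runs-allowed values are placeholders, not real data.

```r
# Toy stand-in for the Teams data frame.
Teams <- data.frame(
  yearID = c(1968, 2010),
  lgID   = c("AL", "AL"),
  teamID = c("CHA", "SEA"),
  G      = c(162, 162),
  R      = c(463, 513),
  RA     = c(500, 700)   # placeholder values, not real data
)

# Runs scored and runs allowed per game.
Teams$RPG  <- Teams$R  / Teams$G
Teams$RAPG <- Teams$RA / Teams$G

round(Teams$RPG, 2)   # 2.86 3.17
```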

The next step is to calculate the league averages. First, I used "aggregate", which summed the number of runs (variable R) by year (yearID) and league (lgID). Each of the lines of code below creates a stand-alone summary table with three variables: the year, league, and the sum of the variable.
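A sketch of what those three "aggregate" calls could look like follows. I'm assuming the grouping columns are named `Teams.yearID` and `Teams.lgID`; the `groups` helper and the toy data (hypothetical team IDs, made-up numbers) are purely for illustration.

```r
# Toy data: two teams in each league for a single season.
Teams <- data.frame(
  yearID = c(2010, 2010, 2010, 2010),
  lgID   = c("AL", "AL", "NL", "NL"),
  teamID = c("AA1", "AA2", "NN1", "NN2"),
  G      = c(162, 162, 162, 162),
  R      = c(500, 700, 650, 750),
  RA     = c(600, 650, 700, 650)
)

# Grouping variables shared by all three summaries.
groups <- list(Teams.yearID = Teams$yearID, Teams.lgID = Teams$lgID)

# Each call returns three columns: year, league, and the sum (named "x").
RunsLG  <- aggregate(Teams$R,  by = groups, FUN = sum)
RunsALG <- aggregate(Teams$RA, by = groups, FUN = sum)
GamesLG <- aggregate(Teams$G,  by = groups, FUN = sum)
```

Because "aggregate" is given a bare vector here, the summed column comes back with the generic name "x" in every table.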

Now I have three new data frames -- "RunsLG", "RunsALG", and "GamesLG". Each has only 3 variables, but they share "Teams.yearID" and "Teams.lgID". This code, using the self-explanatory "merge" command, builds a single table LG_RPG with all three of the newly created variables.
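One way to do the merge, sketched on made-up league totals. I'm assuming each summary table stores its sum in a column called "x" (what "aggregate" produces for a bare vector), so the key columns are named explicitly; otherwise "merge" would also try to match on "x".

```r
# Made-up league totals standing in for the three summary tables.
RunsLG  <- data.frame(Teams.yearID = c(2010, 2010),
                      Teams.lgID   = c("AL", "NL"),
                      x            = c(1200, 1400))
RunsALG <- data.frame(Teams.yearID = c(2010, 2010),
                      Teams.lgID   = c("AL", "NL"),
                      x            = c(1250, 1350))
GamesLG <- data.frame(Teams.yearID = c(2010, 2010),
                      Teams.lgID   = c("AL", "NL"),
                      x            = c(324, 324))

keys <- c("Teams.yearID", "Teams.lgID")

# First merge: the clashing value columns become x.x (runs) and x.y (runs allowed).
LG_RPG <- merge(RunsLG, RunsALG, by = keys)
# Second merge: the games column no longer clashes, so it stays "x".
LG_RPG <- merge(LG_RPG, GamesLG, by = keys)
```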

Now that we've got the totals of runs, runs allowed, and games for each league's season, the next step is to use those values to calculate the averages of runs and runs allowed per game, followed by a bit of variable name maintenance.
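As a sketch, assuming the merged table arrives with the default column names `x.x`, `x.y`, and `x`, the averages and the renaming could go like this. The new names (`lgR`, `lgRPG`, and so on) are my own picks, and the totals are made up.

```r
# Made-up merged league totals with the column names merge leaves behind.
LG_RPG <- data.frame(
  Teams.yearID = c(2010, 2010),
  Teams.lgID   = c("AL", "NL"),
  x.x          = c(1200, 1400),   # league runs
  x.y          = c(1250, 1350),   # league runs allowed
  x            = c(324, 324)      # league team-games
)

# League averages per team-game.
LG_RPG$lgRPG  <- LG_RPG$x.x / LG_RPG$x
LG_RPG$lgRAPG <- LG_RPG$x.y / LG_RPG$x

# Variable name maintenance: match the key names used in Teams.
names(LG_RPG) <- c("yearID", "lgID", "lgR", "lgRA", "lgG", "lgRPG", "lgRAPG")
```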

Although it's possible to calculate the values I'm looking for from the values in the two separate data frames, I decided to make a single table (ultimately, I will be writing this table as a flat file for later use). There's probably a more elegant way to accomplish this, but it works. [Note to self: this should perhaps be the motto for my coding.] A single line of code is all that's required, since the default for "merge" is to merge based on the columns with shared variable names.
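A small sketch of that one-line merge on toy data. It assumes the league table's key columns have already been renamed to `yearID` and `lgID` so they match the team table; the team IDs and values are hypothetical.

```r
# Toy team-level table.
Teams <- data.frame(
  yearID = c(2010, 2010, 2010, 2010),
  lgID   = c("AL", "AL", "NL", "NL"),
  teamID = c("AA1", "AA2", "NN1", "NN2"),
  RPG    = c(3.2, 5.0, 4.1, 4.6)
)

# Toy league-level table, one row per league season.
LG_RPG <- data.frame(
  yearID = c(2010, 2010),
  lgID   = c("AL", "NL"),
  lgRPG  = c(4.5, 4.3),
  lgRAPG = c(4.4, 4.4)
)

# With no "by" argument, merge joins on the shared names: yearID and lgID.
Teams.merge <- merge(Teams, LG_RPG)
```

Every team row picks up the league averages for its own season and league.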

The table "Teams.merge" is the truncated 1947-2012 version of the Lahman database table "Teams" that was first read into R, with the corresponding league averages for runs and runs allowed added for each team.

The next step in the process is to compare each team's runs scored with the league average for that year, by creating an index value where 100 is equal to the league average and the individual team index is measured relative to this. Thus an index score of 110 indicates that a team scored runs (or, for the runs-allowed index, allowed runs) at a rate 10% higher than the league average.
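The index itself is just a ratio scaled to 100. A minimal sketch, with hypothetical column names (`R_index`, `RA_index`) and made-up values:

```r
# Toy merged table: team rates alongside the league rates.
Teams.merge <- data.frame(
  teamID = c("AA1", "AA2"),
  RPG    = c(4.0, 5.5),
  RAPG   = c(5.5, 4.0),
  lgRPG  = c(5.0, 5.0),
  lgRAPG = c(5.0, 5.0)
)

# 100 = league average; above 100 means a higher rate than the league.
Teams.merge$R_index  <- 100 * Teams.merge$RPG  / Teams.merge$lgRPG
Teams.merge$RA_index <- 100 * Teams.merge$RAPG / Teams.merge$lgRAPG

Teams.merge$R_index   # 80 110
```

Note that the two indexes read in opposite directions: a high runs-scored index is good news, a high runs-allowed index is not.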

I was also curious about the distribution of the index values, so I calculated the minimum, maximum, and standard deviation, and then plotted the distribution.
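For the summary statistics and the histogram, something along these lines would do. The index values below are made up, and the bin width of 10 is an arbitrary choice for the sketch.

```r
# Made-up index values for illustration.
R_index <- c(71, 85, 95, 100, 100, 105, 115, 129)

index_min <- min(R_index)
index_max <- max(R_index)
index_sd  <- sd(R_index)

# Bin the values without drawing; plot(h) would draw the histogram.
h <- hist(R_index, breaks = seq(60, 140, by = 10), plot = FALSE)
h$counts   # 0 1 1 3 1 1 1 0
```

Calling `plot(h, main = "Runs scored index", xlab = "Index (100 = league average)")` afterwards produces the chart.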

So let's take a look at the extremes of the distribution -- those offensive juggernauts that managed to score runs at 120% or more of the league average, and whatever the opposite of a juggernaut might be, with run production below 80% of the league rate. Here, I used two different tools -- the "rank" command, and a sorting function, "order".
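A sketch of how "rank" and "order" can pull out the two tails. The team IDs, seasons, and index values are hypothetical, as are the names `R_rank`, `juggernauts`, and `laggards`.

```r
# Toy table of team seasons with their run-scoring index.
Teams.merge <- data.frame(
  yearID  = c(1950, 1975, 1990, 2005, 2010),
  teamID  = c("AA1", "NN1", "AA2", "NN2", "AA3"),
  R_index = c(131, 99, 71, 124, 78)
)

# rank() on the negated index: 1 = highest-scoring relative to league.
Teams.merge$R_rank <- rank(-Teams.merge$R_index)

# Teams at 120 or more, sorted from highest index down.
juggernauts <- subset(Teams.merge, R_index >= 120)
juggernauts <- juggernauts[order(-juggernauts$R_index), ]

# Teams below 80, sorted from lowest index up.
laggards <- subset(Teams.merge, R_index < 80)
laggards <- laggards[order(laggards$R_index), ]
```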

First, the juggernauts.


Both the 1976 Reds and the 1950 Red Sox scored runs at a rate more than 30% higher than the league average of the time. These two clubs were clearly parts of on-going offensive powerhouses -- the '75 Reds and both the 1948 and '49 Red Sox also make the list of teams with an index score of greater than 120.

And now the equivalent for the low-scoring teams.


There at the top of this list stand the 2010 Seattle Mariners. They plated 513 runs (3.17 per game), which turns out to be more than the 463 (2.86 per game) that were scored by the White Sox in a 162-game season in 1968. But the Sox, and the other 19 teams that had lower runs per game values than the 2010 Mariners, were playing in seasons with very low run scoring.

But by the index measure, the 2010 Seattle Mariners were unprecedented in their inability to score runs. With an index score of 71.1, the 2010 Mariners produced fewer runs relative to the league average than any of the other 1,579 teams that played in the period 1947-2012. They scored nearly 30% fewer runs than the league average, and a Z score of -2.96 indicates that this is roughly a 1 in 1,000 event. (OK, for those of you with a bent for precision, it's 1 in 998.5.)
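For anyone who wants to compute Z scores for the index, here is a minimal sketch on made-up values. Turning a Z score into a probability with `pnorm` assumes the index is roughly normally distributed, which is an assumption of the sketch rather than something shown in this post.

```r
# Made-up index values; the first is the low outlier of interest.
R_index <- c(70, 90, 100, 100, 110, 130)

# Z score: distance from the mean in standard deviations.
z <- (R_index - mean(R_index)) / sd(R_index)
z[1]   # -1.5

# Lower-tail probability under a normal approximation (an assumption).
p_lower <- pnorm(z[1])

# "One in how many" team seasons would be at least this low.
one_in <- 1 / p_lower
```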

It's important to note that 2010 wasn't a one-off fluke of bad luck for the Mariners; it just happens to be the nadir of their run scoring performance. The 2011 Mariners were better than the team in 2010, but not by a whole lot. They produced runs at 76.9% of the American League rate that season -- the 15th poorest in the 1947-2011 period.

For my next post, I'll look at the historic trend for the Mariners (you may have noticed other Mariner teams showing up in the above list, although not the 2012 edition of the team) and then move on to the pitching side of the equation -- runs allowed.



  1. Excellent post! The 1969 Padres weren't that far behind the 2010 Mariners in offensive ineptitude -- 71.13 vs 71.25. At least the Padres had an excuse -- they were an expansion team.

  2. On the R code,

    You asked for a more elegant way. Lines 25 - 41 of your code could be replaced with simply:

    LG_RPG <- aggregate(cbind(R, RA, G) ~ yearID + lgID, data = Teams, sum)

    And then you don't even have to clean up the variable names!

    1. Peter, thanks -- very elegant indeed. I'll edit the Gist to reflect this improvement.

