February 24, 2013

MLB runs allowed by team

Or, How good were the Maddux/Glavine-era Braves?

In this installment of my ongoing series of posts about run scoring in Major League Baseball, I'll turn the equation around and look at runs allowed.  To account for changing run-scoring environments, the runs allowed by each team are compared to the league average for that season, creating an index where 100 is the league average. In this formulation, a score below 100 is a good thing; a team with an index score of 95 allowed runs at a rate 5 percent below the league average.
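As a rough illustration of the index, here is a minimal sketch in base R. It is not the code from the original analysis: the numbers are invented, the column names (teamID, G, RA) simply follow Lahman conventions, and computing the league average as total runs allowed over total games is my assumption.

```r
# Toy data for one league-season; all values are invented for illustration.
teams <- data.frame(
  teamID = c("AAA", "BBB", "CCC"),
  G      = c(162, 162, 162),
  RA     = c(600, 700, 800)
)

# Runs allowed per game for each team
teams$RAPG <- teams$RA / teams$G

# League average runs allowed per game
# (assumption: total runs allowed divided by total games)
lg_RAPG <- sum(teams$RA) / sum(teams$G)

# Index: 100 is league average; below 100 means fewer runs allowed
teams$RA_index <- 100 * teams$RAPG / lg_RAPG

print(teams)
```

In this toy league the middle team lands exactly on 100, the team that allowed the fewest runs falls below it, and the team that allowed the most rises above it.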

Because I had already written the original code in R, it was a simple matter to change a few variable names and produce the equivalent of the earlier runs scored analysis for runs allowed. This is one of the most important benefits of a code/syntax environment, and an option that doesn't exist if you are using a point-and-click GUI.

February 23, 2013

Sabermetrics primer

Phil Birnbaum, author of the Sabermetric Research blog and editor of SABR's "By the Numbers", has written a primer titled "A Guide to Sabermetric Research" that appears on the SABR site. This should be the first stop for anyone who wants to find out more about the field of sabermetrics, and a good read for those already active.


February 17, 2013

Run production, one team at a time

In a previous post, I used R to process data from the Lahman database to calculate index values that compare a team's run production to the league average for that year.  For the purpose of that exercise, I started the sequence at 1947, but for what follows I re-ran the code with the time period 1901-2012.

The R code I used can be found at this Github gist. Instead of boring you here with the ins and outs of what the code is doing, I've embedded that as documentation in the gist. The R code assumes that you already have a data frame called "Teams.merge" in your workspace.  You can create it by running the previous code; if you've done that before, you'll have a csv file named "Teams.merge.csv" that you can read in as the data frame "Teams.merge".
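For anyone starting from the csv file, a minimal sketch of that read step might look like the following. The helper name load_teams_merge is hypothetical, and I'm assuming the file sits in the current working directory; adjust the path to suit.

```r
# Hypothetical helper: returns the data frame if the file exists, else NULL.
# The default path assumes the csv is in the current working directory.
load_teams_merge <- function(path = "Teams.merge.csv") {
  if (!file.exists(path)) {
    message("File not found: ", path, " (run the earlier code to create it)")
    return(NULL)
  }
  read.csv(path, stringsAsFactors = FALSE)
}

# Only read from disk if the data frame is not already in the workspace.
# If the file is missing, Teams.merge will be NULL.
if (!exists("Teams.merge")) {
  Teams.merge <- load_teams_merge()
}
```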

The first step is to choose one of the current teams, and create a data frame that contains just that club's history.  Once this has been done, the code creates trend lines (using the LOESS method, as I did with the leagues in previous posts), and then plots them.
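The subset, smooth, and plot steps can be sketched roughly as below. This is an illustration on invented data rather than the gist itself: the toy data frame stands in for Teams.merge, the index column name R_index is hypothetical, and the span value is an arbitrary choice.

```r
# Toy stand-in for Teams.merge; real data would come from the Lahman database.
# R_index is a hypothetical column name for the run-scoring index.
set.seed(1)
Teams.merge <- data.frame(
  yearID   = rep(1977:2012, times = 2),
  franchID = rep(c("SEA", "TOR"), each = 36),
  R_index  = round(rnorm(72, mean = 100, sd = 8), 1)
)

# Step 1: keep just one club's history
team <- subset(Teams.merge, franchID == "SEA")

# Step 2: fit a LOESS trend line; span controls the amount of smoothing
fit <- loess(R_index ~ yearID, data = team, span = 0.5)
team$trend <- predict(fit)

# Step 3: plot the yearly values, the trend, and the league-average line
plot(team$yearID, team$R_index,
     xlab = "Season", ylab = "Index (100 = league average)",
     main = "Run production index (toy data)")
lines(team$yearID, team$trend, lwd = 2)
abline(h = 100, lty = 2)
```

A smaller span makes the trend line follow year-to-year swings more closely; a larger one gives a smoother curve.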

February 16, 2013

Gist for previous posts

The more I use it, the more I understand the benefits and value of Github as a code-sharing resource. The gist found here is the R code for my posts on run scoring trends by league (found here, here, and here).  I will continue to use Github for the code used in future posts.


February 2, 2013

Comparing individual team run production

Or, The 2010 Mariners: How Bad Were They?

In earlier posts, I used the statistical software R to plot the trends in league average run scoring since 1901. This was the first step to answering other questions I had on my mind:
  1. How poor was the offensive performance of the 2010 Seattle Mariners?
  2. Are they showing any signs of improvement?
  3. And how can I use R to tabulate the data to answer these questions?
So, to answer Question #1: it is well established that the 2010 Mariners were not very good, at least offensively. (For fans of the team, the well-deserved Cy Young award won by Felix Hernandez is surely the highlight of the season.) But I wanted a relative measure that would be comparable across time, to accommodate the various fluctuations in run scoring that were the subject of those earlier posts.

As I started into this, the first decision was to draw a line in the historical record. I opted to use the eras described in Bill James' "Dividing Baseball History into Eras" article (behind a paywall, but chances are if you're reading my blog, you're already a Bill James subscriber):
  • Era 1 (The Pioneer Era), 1871-1892 
  • Era 2 (The Spitball Era), 1893-1919 
  • Era 3 (The Landis Era), 1920-1946 
  • Era 4 (The Baby Boomers Era), 1947-1968 
  • Era 5 (The Artificial Turf Era), 1969-1992 
  • Era 6 (The Camden Yards Era), 1993-2012
Based on these groupings, I opted to use the range of seasons 1947-2012 inclusive. This yields 1,580 team seasons of National League and American League baseball.

The second step was to calculate runs per game (RPG) for each team, by year. This corrects for the longer regular season in the post-expansion period and for the strike-shortened seasons, and gives us a common denominator for comparing the results so far in 2012.
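A minimal sketch of the RPG calculation, using invented rows to show why the per-game rate matters when schedule lengths differ. The column names (yearID, teamID, G, R) follow Lahman conventions; the teams and values are made up.

```r
# Invented team seasons with different schedule lengths
teams <- data.frame(
  yearID = c(1960, 1981, 2010),
  teamID = c("AAA", "BBB", "CCC"),
  G      = c(154, 107, 162),   # games played
  R      = c(693, 428, 513)    # runs scored
)

# Runs per game puts all three seasons on a common scale
teams$RPG <- teams$R / teams$G

print(teams)
```

The raw run totals rank the 162-game season ahead of the 107-game one, but the per-game rates reverse that order.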

To do this, I accessed the 2012 edition of the Lahman database. Once I had downloaded and extracted the comma-delimited version of the files, I read the "teams" file into R.
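The read step might look something like this sketch. The helper read_teams is hypothetical, and I'm assuming the extracted file is named Teams.csv and sits in the working directory; the year range and the restriction to the AL and NL follow the description above.

```r
# Hypothetical helper: read the teams file and keep AL/NL seasons in a range.
# The default path is an assumption about where the file was extracted.
read_teams <- function(path = "Teams.csv", first = 1947, last = 2012) {
  teams <- read.csv(path, stringsAsFactors = FALSE)
  keep <- teams$yearID >= first & teams$yearID <= last &
    teams$lgID %in% c("AL", "NL")
  teams[keep, ]
}

# Run only if the file is actually present
if (file.exists("Teams.csv")) {
  Teams <- read_teams()
  print(nrow(Teams))   # number of team seasons in the selected range
}
```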