February 24, 2013

MLB runs allowed by team

Or, How good were the Maddux/Glavine-era Braves?

For this installment in my ongoing series of posts about run scoring in Major League Baseball, I'll turn the equation around and look at runs allowed.  To account for the changing run scoring environments, the runs allowed by each team are compared to the league average for that season, creating an index where 100 is the league average. In this formulation, a score below 100 is a good thing; a team with an index score of 95 allowed runs at a rate 5 percent below the league average.
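As a rough illustration of the index described above, here is a minimal sketch in R. It is not the code from the original analysis: the team codes and run totals are made up, the column names (yearID, lgID, teamID, G, RA) are assumed to follow the Lahman "Teams" table, and the league average is assumed to be computed separately for each league and season.

```r
# Toy data: made-up values for illustration only, not real team totals.
teams <- data.frame(
  yearID = c(1995, 1995, 1995),
  lgID   = c("NL", "NL", "NL"),
  teamID = c("AAA", "BBB", "CCC"),   # hypothetical team codes
  G      = c(144, 144, 144),         # games played
  RA     = c(540, 648, 756),         # runs allowed
  stringsAsFactors = FALSE
)

# Runs allowed per game for each team
teams$RApG <- teams$RA / teams$G

# League rate for each league-season: total runs allowed / total games
lg_RA <- ave(teams$RA, teams$yearID, teams$lgID, FUN = sum)
lg_G  <- ave(teams$G,  teams$yearID, teams$lgID, FUN = sum)
teams$lgRApG <- lg_RA / lg_G

# Index: 100 = league average; below 100 means fewer runs allowed
teams$RA_index <- round(100 * teams$RApG / teams$lgRApG, 1)

print(teams[, c("teamID", "RApG", "lgRApG", "RA_index")])
```

In this toy league, team AAA allows 3.75 runs per game against a league rate of 4.5, which gives an index of about 83.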

Having written the original code in R, it's now a very simple process to change a few variable names and create the equivalent of the earlier runs scored analysis, but looking at runs allowed. This is one of the most important benefits of a code/syntax environment, an option that doesn't exist if you are using a point-and-click GUI.

February 23, 2013

Sabermetrics primer

Phil Birnbaum, author of the Sabermetric Research blog and editor of SABR's "By the Numbers", has written a primer on the topic with the title "A Guide to Sabermetric Research" that appears at the SABR site. This should be the first stop for anyone who wants to find out more about the field of sabermetrics, and a good read for those already active.

-30-

February 17, 2013

Run production, one team at a time


In a previous post, I used R to process data from the Lahman database to calculate index values that compare a team's run production to the league average for that year.  For the purpose of that exercise, I started the sequence at 1947, but for what follows I re-ran the code with the time period 1901-2012.

The R code I used can be found at this Github gist. Instead of boring you here with the ins and outs of what the code is doing, I've embedded that as documentation in the gist. The R code assumes that you've got a data frame called "Teams.merge" already in your workspace.  This can be achieved by running the previous code, or if you've done that before, you'll have created a csv file with the name "Teams.merge.csv", and now have the option to read that file as a data frame "Teams.merge".

The first step is to choose one of the current teams, and create a data frame that contains just that club's history.  Once this has been done, the code creates trend lines (using the LOESS method, as I did with the leagues in previous posts), and then plots them.
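For readers who want a feel for those two steps without opening the gist, here is a minimal sketch. It is not the gist code: the data frame below is simulated stand-in data, the column name "R_index" and the team codes are hypothetical, and the span value is an arbitrary choice for illustration.

```r
# Simulated stand-in for Teams.merge: random index values, not real data.
set.seed(1)
Teams.merge <- data.frame(
  yearID  = rep(1990:2012, times = 2),
  teamID  = rep(c("AAA", "BBB"), each = 23),   # hypothetical team codes
  R_index = round(rnorm(46, mean = 100, sd = 8), 1),
  stringsAsFactors = FALSE
)

# Step 1: keep a single club's history, in season order
one_team <- subset(Teams.merge, teamID == "AAA")
one_team <- one_team[order(one_team$yearID), ]

# Step 2: fit a LOESS trend line to the index values
fit <- loess(R_index ~ yearID, data = one_team, span = 0.5)
one_team$trend <- predict(fit)

# Step 3: plot the yearly values, the trend, and the league-average line
plot(one_team$yearID, one_team$R_index,
     xlab = "Season", ylab = "Run index (100 = league average)",
     main = "Team AAA")
lines(one_team$yearID, one_team$trend, lwd = 2)
abline(h = 100, lty = 2)
```

A smaller span makes the trend line follow the yearly values more closely; a larger one smooths more heavily.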

February 16, 2013

Gist for previous posts

The more I use it, the more I understand the benefits and value of Github as a code-sharing resource. The gist found here is the R code for my posts on run scoring trends by league (found here, here, and here).  I will continue to use Github for the code used in future posts.

-30-

February 2, 2013

Comparing individual team run production

Or, The 2010 Mariners: How Bad Were They?

In earlier posts, I used the statistical software R to plot the trends in league average run scoring since 1901. This was the first step to answering other questions I had on my mind:
  1. How poor was the offensive performance of the 2010 Seattle Mariners?
  2. Are they showing any signs of improvement?
  3. And how can I use R to tabulate the data to answer these questions?
So, on to Question #1.  It is well established that the 2010 Mariners were not very good, at least offensively. (For fans of the team, the well-deserved Cy Young Award won by Felix Hernandez was surely the highlight of the season.) But I wanted a form of relative measure that would be comparable across time, to accommodate the various fluctuations in run scoring that were the subject of that earlier post.

As I started into this, the first decision was to draw a line in the historical record. I opted to use the eras described in Bill James' "Dividing Baseball History into Eras" article (behind a pay wall, but chances are if you're reading my blog, you're already a Bill James subscriber):
  • Era 1 (The Pioneer Era), 1871-1892 
  • Era 2 (The Spitball Era), 1893-1919 
  • Era 3 (The Landis Era), 1920-1946 
  • Era 4 (The Baby Boomers Era), 1947-1968 
  • Era 5 (The Artificial Turf Era), 1969-1992 
  • Era 6 (The Camden Yards Era), 1993-2012
Based on these groupings, I opted to use the range of seasons 1947-2012 inclusive. This yields 1,580 team seasons of National League and American League baseball.

The second step was to calculate runs per game (RPG) for each team, by year. This corrects for the longer regular season in the post-expansion period and for the strike-shortened seasons, and gives us a common denominator for comparing results across seasons, including 2012.
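To show why the per-game rate matters, here is a minimal sketch with made-up totals for three seasons of different lengths. The column names (yearID, teamID, G, R) are assumed to follow the Lahman "Teams" table, and the team codes are hypothetical.

```r
# Made-up totals for illustration: three seasons of different lengths.
teams <- data.frame(
  yearID = c(1960, 1962, 1994),
  teamID = c("AAA", "AAA", "AAA"),  # hypothetical team code
  G      = c(154, 162, 113),        # games played
  R      = c(616, 729, 565),        # runs scored
  stringsAsFactors = FALSE
)

# Runs per game puts every season on the same footing
teams$RPG <- teams$R / teams$G

print(teams)
```

The shortened 113-game season has the lowest run total of the three but the highest rate, which is exactly what the raw totals would hide.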

To do this, I accessed the 2012 edition of the Lahman database. Once I had downloaded and extracted the comma-delimited version of the files, I read the "teams" file into R.
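The read-and-filter step might look something like the sketch below. So that it runs on its own, it writes a tiny stand-in file with made-up rows to a temporary location; with the real download you would point read.csv() at the extracted teams file instead. The column names are assumed to follow the Lahman "Teams" table.

```r
# Write a tiny stand-in CSV (made-up rows) so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "yearID,lgID,teamID,G,R,RA",
  "1946,AL,AAA,154,600,610",
  "1947,AL,AAA,154,620,590",
  "2012,NL,BBB,162,700,680"
), csv_path)

# Read the file into a data frame
Teams <- read.csv(csv_path, stringsAsFactors = FALSE)

# Keep only the seasons of interest, 1947-2012 inclusive
Teams.sub <- subset(Teams, yearID >= 1947 & yearID <= 2012)

print(Teams.sub)
```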

January 12, 2013

Latest Lahman database now available

One of the most important resources for baseball researchers, the Lahman database, has been updated with the 2012 Major League Baseball data.  It's an amazing compendium, with complete statistics for teams and players from 1871 to 2012.

You can access the dataset (in a variety of formats) here.

And I encourage you to make a financial contribution through the donations page to help support this endeavour.

-30-

December 7, 2012

West Coast League baseball coming to Victoria

We're getting closer to the arrival of summer collegiate baseball in Victoria.  The Victoria HarbourCats are one of two new clubs in the West Coast League, along with the Medford Rogues.  The team has been generating some buzz already, seven months before playing a game. The club posted their 2013 schedule yesterday.  And to top things off, the HarbourCats will be hosting the league All Star Game in July.

The best sources of information about the HarbourCats are the club's official homepage and the fan blog.

Given my own proclivities, I've created a Google map plotting the locations of all the teams in the league.  A zoomed out screenshot is below.



-30-


