June 30, 2013

Victoria baseball history

The history of baseball in Victoria, British Columbia has many interesting stories (as well as a statistical legacy). As Tom Hawthorn notes in this great overview of professional baseball in Victoria, on the surface the city seems much more like a cricket and rugby town.  But baseball goes back a long way, with a lot of colourful characters.

Hawthorn penned a similar piece for the Globe & Mail ten years ago, which contains more details about the single appearance of the subject of "the greatest baseball card ever made".

Bill Murray, Victoria Mussels infielder


June 16, 2013

Annotating select points on an X-Y plot using ggplot2

or, Is the Seattle Mariners outfield a disaster?

The Backstory
Earlier this week (2013-06-10), a blog post by Dave Cameron appeared at USS Mariner under the title “Maybe It's Time For Dustin Ackley To Play Some Outfield”. In the first paragraph, Cameron describes to the Seattle Mariners outfield this season as “a complete disaster” and Raul Ibanez as “nothing short of a total disaster”.

To back up the Ibanez assertion, the article included a link to a Fangraphs table showing the defensive metrics for all MLB outfielders with a minimum of 200 innings played to date, sorted in ascending order of UZR.150 (UZR is generally recognized as the best defensive metric). And there, at the top (or bottom) of the list, Raul Ibanez.

But surely, I thought, Ibanez's offensive production – starting with the 11 home runs he had hit at the time, now up to 13 – off-sets to some degree the lack of defense. So I took a look at a variety of offensive measures, to see how Ibanez stacks up. It quickly struck me that wRAA (Weighted Runs Above Average), the offensive component of WAR (Wins Above Replacement, the best comprehensive measure of a player's overall contribution, which also includes a base running not examined here), would make an interesting scatterplot against UZR. And a great opportunity to use ggplot2.

Manipulating the data
Using this table from Fangraphs (advanced batting stats of all MLB players so far this season), I created a new table “outfield” that appended the advanced hitting stats to the defensive stats in the original table, and then set about creating the plot using the ggplot2 package in R.

Note: once I had downloaded the two Fangraphs tables as csv files (with results through 2013-06-15), I edited the file names slightly.

# load the ggplot2 and grid packages
# read data (note csv files are renamed)
tbl1 = read.csv("FanGraphs_Leaderboard_h.csv")
tbl2 = read.csv("FanGraphs_Leaderboard_d.csv")
# create new table with data from both tbl1 and tbl2 by link on variable
# 'playerid'
outfield = data.frame(merge(tbl1, tbl2, by = "playerid"))
# clean up the variable names of the two Name fields
names(outfield)[2] = paste("Name")
names(outfield)[21] = paste("Name.y")

A quick plot
With the two data sets now merged, we can start plotting the results. First of all, a quick plot using ggplot2's “qplot” needs only one line of code, and three specifications (X axis data, Y axis data, and the name of the source table):

  qplot(UZR.150, wRAA, data = outfield)

So that must be Raul Ibanez over there on the far left. It's clear from this plot that his hitting (represented on the Y axis) is just above the 0 line, and a long way below the outfielders who are hitting up a storm. It's worth keeping in mind that Ibanez's hitting contribution is helped to some degree by the fact that just over one-third of his plate appearances so far this year (126 of 187) have been as a designated hitter or pinch hitter.

In looking at this plot, you might ask the same thing I did: Where are the rest of the Mariners outfielders, and who are the stars of the X and Y axes?

Code to set up the tables for plotting

The next chunk of code takes three approaches to identifying groups and individuals on the chart. We don't want to plot the names of all 110 players, that would be utterly illegible. Instead, we'll focus on three groups: the Seattle Mariners, the top UZR.150 players, and the top wRAA players. The Mariners player points and names will be navy blue, and others in black. The code will label the Mariners players and the top performers on the wRAA axis automatically, and a manual approach will be adopted to create the code necessary to identify the top UZR players.

But before plotting the results, new variables in the “outfield” table are created that have the names of the Mariners players, the UZR stars, and the wRAA stars.

# create new MarinerNames field that contains only the name of Mariners
# players (plagarized from Winston Chang's R Graphics Cookbook Recipe 5.11)
outfield$MarinerNames = outfield$Name
idx = (outfield$Team.x == "Mariners")
outfield$MarinerNames[!idx] = NA
# create a new table, taking a subset that has only the Mariners players
Mariners = subset(outfield, Team.x == "Mariners")
# add the names of the UZR stars to outfield$Table2 sort the table by
# wRAA, then add the names of the top 4 wRAA stars
outfield$wRAAstars = outfield$Name
outfield = outfield[order(-outfield$wRAA), ]
outfield$wRAAstars[5:110] = NA
# sort the table by UZR.150, then copy the first 3 names
outfield$UZRstars = outfield$Name
outfield = outfield[order(-outfield$UZR.150), ]
outfield$UZRstars[4:110] = NA

The final plot code
# the full ggplot verion, creating an object called "WARcht"
WARcht = ggplot(outfield, aes(x=UZR.150, y=wRAA)) + #
   geom_point(colour="gray60", size=2.0) + # set the colour and size of the points
   theme_bw() + # and use the "background white" theme
   ggtitle("Everyday Outfielders, 2013 [to 2013-06-15]") # and put a title on the plot
# start with WARcht, add geom_text() [for auto labels] and annotate() [for manual labels and arrows]
WARcht + # print the chart object
   geom_text(aes(label=MarinerNames), size=4, fontface="bold", colour="navyblue",
      vjust=0, hjust=-0.1) + # add the names of the Mariners players
   geom_text(aes(label=wRAAstars), size=3, fontface="bold",
      vjust=0, hjust=-0.1) + # add the names of the top wRAA players
   annotate("text", label="Shane Victorino", x=40, y=3, size=3, 

      fontface="bold.italic") + # manually place the label for Shane Victorino
   annotate("segment", x=50, y=2, xend=51.7, yend=-0.4, size=0.5,
      arrow=arrow(length=unit(.2, "cm"))) + # manually place the Victorino arrow
   annotate("text", label="Craig Gentry", x=40, y=-7.0, size=3,

      fontface="bold.italic") +
   annotate("segment", x=42, y=-6.6, xend=40.9, yend=-4.0, size=0.5, 

      arrow=arrow(length=unit(.2, "cm"))) +
   annotate("text", label="A.J. Pollock", x=49, y=-2.5, size=3,    

      fontface="bold.italic") +
   geom_point(data=Mariners, aes(x=UZR.150, y=wRAA), colour="navyblue", size=4) # over-plot the points for the Mariners players

The final analysis

In addition to Raul Ibanez, there are four other Mariners outfielders who have logged more than 200 innings. The only one on the plus side of the UZR.150 ledger is Jason Bay, at 5.5. And along with Ibanez, only Michael Morse has a positive wRAA. Put it another way, all five are more or less in the lower right-hand quadrant of the chart. So yes, it's a fair assessment that the Mariners outfield is a disaster.

The Major League outfielders who are the top hitters (the Y axis on the chart) are led by the Rockies' Carlos Gonzalez (at 28.1), ahead of Shin-Soo Choo (21.2) and Mike Trout (19.8). And defensively (the X axis), Shane Victorino leads with 51.9, followed by Craig Gentry (40.9) and A.J. Pollock (39.1).

The only outfielder who shines on both dimensions is the Brewers' Carlos Gomez, who stands in fourth place on both UZR.150 and wRAA. As the chart shows, so far this season he's in a class by himself.

Note: the code above can be found in a gist at github.


June 4, 2013

Major League Baseball run scoring trends with R's Lahman package

The statistical software R has an ever-expanding array of packages that provide pre-programmed functions and datasets. One such package is named Lahman, bundling the contents of the Lahman database into a quick-and-easy resource for R users. In addition to the data tables, the package resources also contain a variety of analyses and graphics undertaken using the package, providing some examples of how the package can be used.

Full disclosure: I am now one of the Lahman package project members.

This is my first blog post using the Lahman package, and as a first step I will simply recreate the league run scoring trends graphs that I generated previously. Originally, I had used data from Baseball Reference, for the simple reason that the Lahman database does not, in its source form, contain any league-level aggregations.

The process for loading the Lahman package is as simple as any other R package; this simplicity is even greater if you are using an IDE such as RStudio. Once loaded, you have access to all the tables in the database, without any of the futzing that is sometimes required in tidying up a raw flat file (I find that variable names are sometimes lost or changed in translation).

The code (available as a gist here, downloadable as an R script file) creates a pair of tables, calculating each league's run scoring rates by year. Then, recycling my earlier code, it calculates a series of trend lines using the loess method, and graphs those trend lines. For simplicity's sake, only the final version of each graph is shown.

Step 1: install the package (if you haven't already), access the library, and load the data table “Teams”.

# load the package into R, and open the data table 'Teams' into the
# workspace

The second step is to use the individual team season results to calculate the aggregate of each league's year. We start with 1901, the year the American League was formed. Once those tables are created, the loess function is used to calculate trend lines for each league's run scoring environment.

# select a sub-set of teams from 1901 [the establishment of the American
# League] forward to 2012
Teams_sub <- as.data.frame(subset(Teams, yearID > 1900))
# calculate each team's average runs and runs allowed per game
Teams_sub$RPG <- Teams_sub$R/Teams_sub$G
Teams_sub$RAPG <- Teams_sub$RA/Teams_sub$G
# create new data frame with season totals for each league
LG_RPG <- aggregate(cbind(R, RA, G) ~ yearID + lgID, data = Teams_sub, sum)
# calculate league + season runs and runs allowed per game
# select a sub-set of teams from 1901 [the establishment of the American
# League] forward to 2012 read the data into separate league tables
ALseason <- (subset(LG_RPG, yearID > 1900 & lgID == "AL"))
NLseason <- (subset(LG_RPG, yearID > 1900 & lgID == "NL"))
# AMERICAN LEAGUE create new object ALRunScore.LO for loess model
ALRunScore.LO <- loess(ALseason$LG_RPG ~ ALseason$yearID)
ALRunScore.LO.predict <- predict(ALRunScore.LO)
# create new objects RunScore.Lo.XX for loess models with 'span' control
# span = 0.25
ALRunScore.LO.25 <- loess(ALseason$LG_RPG ~ ALseason$yearID, span = 0.25)
ALRunScore.LO.25.predict <- predict(ALRunScore.LO.25)
# span = 0.5
ALRunScore.LO.5 <- loess(ALseason$LG_RPG ~ ALseason$yearID, span = 0.5)
ALRunScore.LO.5.predict <- predict(ALRunScore.LO.5)
# NATIONAL LEAGUE create new object RunScore.LO for loess model
NLRunScore.LO <- loess(NLseason$LG_RPG ~ NLseason$yearID)
NLRunScore.LO.predict <- predict(NLRunScore.LO)
# loess models
NLRunScore.LO.25 <- loess(NLseason$LG_RPG ~ NLseason$yearID, span = 0.25)
NLRunScore.LO.25.predict <- predict(NLRunScore.LO.25)
NLRunScore.LO.5 <- loess(NLseason$LG_RPG ~ NLseason$yearID, span = 0.5)
NLRunScore.LO.5.predict <- predict(NLRunScore.LO.5)

Now that we have calculated the league averages and trend lines (using the loess method), we can start the plots. First, a simple plot of the actual values:

# MULTI-PLOT -- MERGING AL AND NL RESULTS plot individual years as lines
ylim <- c(3, 6)
# start with AL line
plot(ALseason$LG_RPG ~ ALseason$yearID, type = "l", lty = "solid", col = "red", 
    lwd = 2, main = "Runs per team per game, 1901-2012", ylim = ylim, xlab = "year", 
    ylab = "runs per game")
# add NL line
lines(NLseason$yearID, NLseason$LG_RPG, lty = "solid", col = "blue", lwd = 2)
# chart additions
legend(1900, 3.5, c("AL", "NL"), lty = c("solid", "solid"), col = c("red", "blue"), 
    lwd = c(2, 2))

Next, comparing the league trends.

# plot multiple loess curves (span=0.50 and 0.25)
ylim <- c(3, 6)
# start with AL line
plot(ALRunScore.LO.5.predict ~ ALseason$yearID, type = "l", lty = "solid", col = "red", 
    lwd = 2, main = "Runs per team per game, 1901-2012", ylim = ylim, xlab = "year", 
    ylab = "runs per game")
# add NL line
lines(NLseason$yearID, NLRunScore.LO.5.predict, lty = "solid", col = "blue", 
    lwd = 2)
# add 0.25 lines
lines(ALseason$yearID, ALRunScore.LO.25.predict, lty = "dashed", col = "red", 
    lwd = 2)
lines(NLseason$yearID, NLRunScore.LO.25.predict, lty = "dashed", col = "blue", 
    lwd = 2)
# chart additions
legend(1900, 3.5, c("AL (span=0.50)", "NL (span=0.50)", "AL (span=0.25)", "NL (span=0.25)"), 
    lty = c("solid", "solid", "dashed", "dashed"), col = c("red", "blue", "red", 
        "blue"), lwd = c(2, 2, 2, 2))

Next, calculate the difference between the two leagues – both the absolute difference and the difference in the loess trend lines.

# 1. absolute
RunDiff <- (ALseason$LG_RPG - NLseason$LG_RPG)
# 2. LOESS span=0.25
RunDiffLO <- (ALRunScore.LO.25.predict - NLRunScore.LO.25.predict)

And plot the differences.

# plot each year absolute difference as bar, difference in trend as line
ylim <- c(-1, 1.5)
plot(RunDiff ~ ALseason$yearID, type = "h", lty = "solid", col = "blue", lwd = 2, 
    main = "Run scoring trend: AL difference from NL, 1901-2012", ylim = ylim, 
    xlab = "year", ylab = "runs per game")
# add RunDiff line
lines(ALseason$yearID, RunDiffLO, lty = "solid", col = "black", lwd = 2)
# add line at zero
abline(h = 0, lty = "dotdash")
# chart additions
legend(1900, 1.5, c("AL difference from NL: absolute", "AL difference from NL, LOESS (span=0.25)"), 
    lty = c("solid", "solid"), col = c("blue", "black"), lwd = c(2, 2))

For the next “using R” post, I'll take a look at the ways to plot the residuals from the loess method.

The one after that: ggplot2 versions of the graphs.