June 16, 2013

Annotating select points on an X-Y plot using ggplot2

or, Is the Seattle Mariners outfield a disaster?

The Backstory
Earlier this week (2013-06-10), a blog post by Dave Cameron appeared at USS Mariner under the title “Maybe It's Time For Dustin Ackley To Play Some Outfield”. In the first paragraph, Cameron describes the Seattle Mariners outfield this season as “a complete disaster” and Raul Ibanez as “nothing short of a total disaster”.

To back up the Ibanez assertion, the article included a link to a Fangraphs table showing the defensive metrics for all MLB outfielders with a minimum of 200 innings played to date, sorted in ascending order of UZR.150 (UZR is generally recognized as the best defensive metric). And there, at the top (or bottom) of the list, Raul Ibanez.

But surely, I thought, Ibanez's offensive production – starting with the 11 home runs he had hit at the time, now up to 13 – offsets to some degree the lack of defense. So I took a look at a variety of offensive measures, to see how Ibanez stacks up. It quickly struck me that wRAA (Weighted Runs Above Average) would make an interesting scatterplot against UZR. wRAA is the offensive component of WAR (Wins Above Replacement), the best comprehensive measure of a player's overall contribution, which also includes a base running component not examined here. And a great opportunity to use ggplot2.

Manipulating the data
Using this table from Fangraphs (advanced batting stats of all MLB players so far this season), I created a new table “outfield” that appended the advanced hitting stats to the defensive stats in the original table, and then set about creating the plot using the ggplot2 package in R.

Note: once I had downloaded the two Fangraphs tables as csv files (with results through 2013-06-15), I edited the file names slightly.

# load the ggplot2 and grid packages
library(ggplot2)
library(grid)
# read data (note csv files are renamed)
tbl1 = read.csv("FanGraphs_Leaderboard_h.csv")
tbl2 = read.csv("FanGraphs_Leaderboard_d.csv")
# create new table with data from both tbl1 and tbl2, linked on the variable
# 'playerid'
outfield = data.frame(merge(tbl1, tbl2, by = "playerid"))
# clean up the variable names of the two Name fields
names(outfield)[2] = "Name"
names(outfield)[21] = "Name.y"
#
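
The later code refers to a column called Team.x, which is a by-product of the merge. As a minimal sketch of where that name comes from, here is the same kind of merge on two toy tables. The player names and values are made up for illustration and are not the Fangraphs data.

```r
# Toy stand-ins for the two Fangraphs tables (all values are made up)
hitting <- data.frame(
  playerid = c(1, 2, 3),
  Name     = c("Player A", "Player B", "Player C"),
  Team     = c("Mariners", "Rockies", "Angels"),
  wRAA     = c(1.5, 20.0, 15.2)
)
fielding <- data.frame(
  playerid = c(1, 2, 4),
  Name     = c("Player A", "Player B", "Player D"),
  Team     = c("Mariners", "Rockies", "Brewers"),
  UZR.150  = c(-30.1, 4.2, 25.0)
)

# Inner join on playerid: only players present in BOTH tables are kept
toy <- merge(hitting, fielding, by = "playerid")

# Columns that share a name but are not join keys get .x / .y suffixes
print(names(toy))
print(nrow(toy))
```

In this toy case, players 3 and 4 drop out because each appears in only one table, and the duplicated Name and Team columns come back as Name.x, Team.x, Name.y and Team.y.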


A quick plot
With the two data sets now merged, we can start plotting the results. First of all, a quick plot using ggplot2's “qplot” needs only one line of code, and three specifications (X axis data, Y axis data, and the name of the source table):

  qplot(UZR.150, wRAA, data = outfield)
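
The same quick plot can also be written with the full ggplot() call, which is worth knowing because more recent ggplot2 releases steer users away from qplot(). This is a sketch on a toy table with made-up values, not the merged Fangraphs data.

```r
library(ggplot2)

# Toy stand-in for the merged "outfield" table (values are made up)
toy <- data.frame(
  UZR.150 = c(-30.1, 4.2, 25.0, 12.3),
  wRAA    = c(1.5, 20.0, -3.1, 8.8)
)

# Same scatterplot as qplot(UZR.150, wRAA, data = toy)
quick <- ggplot(toy, aes(x = UZR.150, y = wRAA)) +
  geom_point()
print(quick)
```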



So that must be Raul Ibanez over there on the far left. It's clear from this plot that his hitting (represented on the Y axis) is just above the 0 line, and a long way below the outfielders who are hitting up a storm. It's worth keeping in mind that Ibanez's hitting contribution is helped to some degree by the fact that just over one-third of his plate appearances so far this year (126 of 187) have been as a designated hitter or pinch hitter.

In looking at this plot, you might ask the same thing I did: Where are the rest of the Mariners outfielders, and who are the stars of the X and Y axes?

Code to set up the tables for plotting

The next chunk of code takes three approaches to identifying groups and individuals on the chart. We don't want to plot the names of all 110 players; that would be utterly illegible. Instead, we'll focus on three groups: the Seattle Mariners, the top UZR.150 players, and the top wRAA players. The Mariners points and names will be navy blue; the other points will be gray, with their labels in black. The code will label the Mariners players and the top performers on the wRAA axis automatically, while the labels for the top UZR players will be placed manually.

But before plotting the results, new variables in the “outfield” table are created that have the names of the Mariners players, the UZR stars, and the wRAA stars.

# create new MarinerNames field that contains only the name of Mariners
# players (plagiarized from Winston Chang's R Graphics Cookbook Recipe 5.11)
outfield$MarinerNames = outfield$Name
idx = (outfield$Team.x == "Mariners")
outfield$MarinerNames[!idx] = NA
# create a new table, taking a subset that has only the Mariners players
Mariners = subset(outfield, Team.x == "Mariners")
# copy the names into a new wRAAstars field, sort the table by wRAA
# (descending), then keep only the names of the top 4 wRAA stars
outfield$wRAAstars = outfield$Name
outfield = outfield[order(-outfield$wRAA), ]
outfield$wRAAstars[5:nrow(outfield)] = NA
# do the same for UZR.150, keeping only the names of the top 3
outfield$UZRstars = outfield$Name
outfield = outfield[order(-outfield$UZR.150), ]
outfield$UZRstars[4:nrow(outfield)] = NA
#
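
The sort-then-blank approach above depends on the row positions after each sort. As a hedged alternative, here is a sketch that ranks the values instead, so the table never has to be re-ordered. The helper name top_n_labels is hypothetical (it is not from the Cookbook recipe or from ggplot2), and the toy table uses made-up values.

```r
# Toy stand-in for the "outfield" table (values are made up)
toy <- data.frame(
  Name    = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  wRAA    = c(1.5, 20.0, 15.2, -4.0, 9.9),
  UZR.150 = c(-30.1, 4.2, 25.0, 40.9, 12.3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the label for the n largest values, NA elsewhere
top_n_labels <- function(labels, values, n) {
  labels <- as.character(labels)  # guard against factor columns
  ifelse(rank(-values, ties.method = "first") <= n, labels, NA)
}

toy$wRAAstars <- top_n_labels(toy$Name, toy$wRAA, 2)
toy$UZRstars  <- top_n_labels(toy$Name, toy$UZR.150, 1)
print(toy)
```

Here only Player B and Player C keep a wRAAstars label and only Player D keeps a UZRstars label; every other row is NA, which geom_text() simply skips (with a warning about removed rows).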


The final plot code
# the full ggplot version, creating an object called "WARcht"
WARcht = ggplot(outfield, aes(x=UZR.150, y=wRAA)) + #
   geom_point(colour="gray60", size=2.0) + # set the colour and size of the points
   theme_bw() + # and use the "background white" theme
   ggtitle("Everyday Outfielders, 2013 [to 2013-06-15]") # and put a title on the plot
#
# start with WARcht, add geom_text() [for auto labels] and annotate() [for manual labels and arrows]
#
#
WARcht + # print the chart object
   geom_text(aes(label=MarinerNames), size=4, fontface="bold", colour="navyblue",
      vjust=0, hjust=-0.1) + # add the names of the Mariners players
   geom_text(aes(label=wRAAstars), size=3, fontface="bold",
      vjust=0, hjust=-0.1) + # add the names of the top wRAA players
   annotate("text", label="Shane Victorino", x=40, y=3, size=3,
      fontface="bold.italic") + # manually place the label for Shane Victorino
   annotate("segment", x=50, y=2, xend=51.7, yend=-0.4, size=0.5,
      arrow=arrow(length=unit(.2, "cm"))) + # manually place the Victorino arrow
   annotate("text", label="Craig Gentry", x=40, y=-7.0, size=3,
      fontface="bold.italic") +
   annotate("segment", x=42, y=-6.6, xend=40.9, yend=-4.0, size=0.5,
      arrow=arrow(length=unit(.2, "cm"))) +
   annotate("text", label="A.J. Pollock", x=49, y=-2.5, size=3,
      fontface="bold.italic") +
   geom_point(data=Mariners, aes(x=UZR.150, y=wRAA), colour="navyblue", size=4) # over-plot the points for the Mariners players
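
The UZRstars field created earlier is not used in the plot code above, since the UZR labels are placed by hand. As a sketch of the automatic alternative, the same field could feed a geom_text() layer, with the text pushed to the left of each point so that labels at the far right of the X axis stay inside the plot area. The toy table and its values are made up, and the placement will be less tidy than hand-tuned annotations.

```r
library(ggplot2)

# Toy stand-in for the "outfield" table (values are made up)
toy <- data.frame(
  Name     = c("Player A", "Player B", "Player C", "Player D"),
  UZR.150  = c(-30.1, 4.2, 25.0, 40.9),
  wRAA     = c(1.5, 20.0, -3.1, -4.0),
  UZRstars = c(NA, NA, NA, "Player D"),
  stringsAsFactors = FALSE
)

# Label only the rows where UZRstars is filled in; hjust = 1.1 puts the
# text to the left of the point
stars <- subset(toy, !is.na(UZRstars))
auto <- ggplot(toy, aes(x = UZR.150, y = wRAA)) +
  geom_point(colour = "gray60", size = 2) +
  geom_text(data = stars, aes(label = UZRstars), size = 3,
            fontface = "bold.italic", hjust = 1.1, vjust = 0) +
  theme_bw()
print(auto)
```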





The final analysis

In addition to Raul Ibanez, there are four other Mariners outfielders who have logged more than 200 innings. The only one on the plus side of the UZR.150 ledger is Jason Bay, at 5.5. And along with Ibanez, only Michael Morse has a positive wRAA. Put another way, all five are more or less in the lower left-hand quadrant of the chart. So yes, it's a fair assessment that the Mariners outfield is a disaster.

The Major League outfielders who are the top hitters (the Y axis on the chart) are led by the Rockies' Carlos Gonzalez (at 28.1), ahead of Shin-Soo Choo (21.2) and Mike Trout (19.8). And defensively (the X axis), Shane Victorino leads with 51.9, followed by Craig Gentry (40.9) and A.J. Pollock (39.1).

The only outfielder who shines on both dimensions is the Brewers' Carlos Gomez, who stands in fourth place on both UZR.150 and wRAA. As the chart shows, so far this season he's in a class by himself.

Note: the code above can be found in a gist at GitHub.

-30-
