June 16, 2013

Annotating select points on an X-Y plot using ggplot2

or, Is the Seattle Mariners outfield a disaster?

The Backstory
Earlier this week (2013-06-10), a blog post by Dave Cameron appeared at USS Mariner under the title “Maybe It's Time For Dustin Ackley To Play Some Outfield”. In the first paragraph, Cameron describes to the Seattle Mariners outfield this season as “a complete disaster” and Raul Ibanez as “nothing short of a total disaster”.

To back up the Ibanez assertion, the article included a link to a Fangraphs table showing the defensive metrics for all MLB outfielders with a minimum of 200 innings played to date, sorted in ascending order of UZR.150 (UZR is generally recognized as the best defensive metric). And there, at the top (or bottom) of the list, Raul Ibanez.

But surely, I thought, Ibanez's offensive production – starting with the 11 home runs he had hit at the time, now up to 13 – off-sets to some degree the lack of defense. So I took a look at a variety of offensive measures, to see how Ibanez stacks up. It quickly struck me that wRAA (Weighted Runs Above Average), the offensive component of WAR (Wins Above Replacement, the best comprehensive measure of a player's overall contribution, which also includes a base running not examined here), would make an interesting scatterplot against UZR. And a great opportunity to use ggplot2.

Manipulating the data
Using this table from Fangraphs (advanced batting stats of all MLB players so far this season), I created a new table “outfield” that appended the advanced hitting stats to the defensive stats in the original table, and then set about creating the plot using the ggplot2 package in R.

Note: once I had downloaded the two Fangraphs tables as csv files (with results through 2013-06-15), I edited the file names slightly.


# load the ggplot2 and grid packages
library(ggplot2)
library(grid)
# read data (note csv files are renamed)
tbl1 = read.csv("FanGraphs_Leaderboard_h.csv")
tbl2 = read.csv("FanGraphs_Leaderboard_d.csv")
# create new table with data from both tbl1 and tbl2 by link on variable
# 'playerid'
outfield = data.frame(merge(tbl1, tbl2, by = "playerid"))
# clean up the variable names of the two Name fields
names(outfield)[2] = paste("Name")
names(outfield)[21] = paste("Name.y")
#

A quick plot
With the two data sets now merged, we can start plotting the results. First of all, a quick plot using ggplot2's “qplot” needs only one line of code, and three specifications (X axis data, Y axis data, and the name of the source table):

qplot(UZR.150, wRAA, data = outfield)

#
So that must be Raul Ibanez over there on the far left. It's clear from this plot that his hitting (represented on the Y axis) is just above the 0 line, and a long way below the outfielders who are hitting up a storm. It's worth keeping in mind that Ibanez's hitting contribution is helped to some degree by the fact that just over one-third of his plate appearances so far this year (126 of 187) have been as a designated hitter or pinch hitter.

In looking at this plot, you might ask the same thing I did: Where are the rest of the Mariners outfielders, and who are the stars of the X and Y axes?

Code to set up the tables for plotting
The next chunk of code takes three approaches to identifying groups and individuals on the chart. We don't want to plot the names of all 110 players, that would be utterly illegible. Instead, we'll focus on three groups: the Seattle Mariners, the top UZR.150 players, and the top wRAA players. The Mariners player points and names will be navy blue, and others in black. The code will label the Mariners players and the top performers on the wRAA axis automatically, and a manual approach will be adopted to create the code necessary to identify the top UZR players.

But before plotting the results, new variables in the “outfield” table are created that have the names of the Mariners players, the UZR stars, and the wRAA stars.

# create new MarinerNames field that contains only the name of Mariners
# players (plagarized from Winston Chang's R Graphics Cookbook Recipe 5.11)
outfield$MarinerNames = outfield$Name
idx = (outfield$Team.x == "Mariners")
outfield$MarinerNames[!idx] = NA
# create a new table, taking a subset that has only the Mariners players
Mariners = subset(outfield, Team.x == "Mariners")
# add the names of the UZR stars to outfield$Table2 sort the table by
# wRAA, then add the names of the top 4 wRAA stars
outfield$wRAAstars = outfield$Name
outfield = outfield[order(-outfield$wRAA), ]
outfield$wRAAstars[5:110] = NA
# sort the table by UZR.150, then copy the first 3 names
outfield$UZRstars = outfield$Name
outfield = outfield[order(-outfield$UZR.150), ]
outfield$UZRstars[4:110] = NA
#

The final plot code

# the full ggplot verion, creating an object called "WARcht"
WARcht = ggplot(outfield, aes(x=UZR.150, y=wRAA)) + #
  geom_point(colour="gray60", size=2.0) +  # set the colour and size of the points
  theme_bw() +  # and use the "background white" theme
  ggtitle("Everyday Outfielders, 2013 [to 2013-06-15]")  # and put a title on the plot
#
# start with WARcht, add geom_text() [for auto labels] and annotate() [for manual labels and arrows]
#
#
WARcht +  # print the chart object
  geom_text(aes(label=MarinerNames), size=4, fontface="bold", colour="navyblue", 
            vjust=0, hjust=-0.1) +  # add the names of the Mariners players
  geom_text(aes(label=wRAAstars), size=3, fontface="bold", 
            vjust=0, hjust=-0.1) +  # add the names of the top wRAA players
  annotate("text", label="Shane Victorino", x=40, y=3, size=3, fontface="bold.italic") +  # manually place the label for Shane Victorino
  annotate("segment", x=50, y=2, xend=51.7, yend=-0.4, size=0.5, 
           arrow=arrow(length=unit(.2, "cm"))) +  # manually place the Victorino arrow
  annotate("text", label="Craig Gentry", x=40, y=-7.0, size=3, fontface="bold.italic") +
  annotate("segment", x=42, y=-6.6, xend=40.9, yend=-4.0, size=0.5, 
           arrow=arrow(length=unit(.2, "cm"))) +
  annotate("text", label="A.J. Pollock", x=49, y=-2.5, size=3, fontface="bold.italic") +
  geom_point(data=Mariners, aes(x=UZR.150, y=wRAA), colour="navyblue", size=4)  # over-plot the points for the Mariners players
## Warning: Removed 105 rows containing missing values (geom_text).
## Warning: Removed 106 rows containing missing values (geom_text).

The final analysis
In addition to Raul Ibanez, there are four other Mariners outfielders who have logged more than 200 innings. The only one on the plus side of the UZR.150 ledger is Jason Bay, at 5.5. And along with Ibanez, only Michael Morse has a positive wRAA. Put it another way, all five are more or less in the lower right-hand quadrant of the chart. So yes, it's a fair assessment that the Mariners outfield is a disaster.

The Major League outfielders who are the top hitters (the Y axis on the chart) are led by the Rockies' Carlos Gonzalez (at 28.1), ahead of Shin-Soo Choo (21.2) and Mike Trout (19.8). And defensively (the X axis), Shane Victorino leads with 51.9, followed by Craig Gentry (40.9) and A.J. Pollock (39.1).

The only outfielder who shines on both dimensions is the Brewers' Carlos Gomez, who stands in fourth place on both UZR.150 and wRAA. As the chart shows, so far this season he's in a class by himself.

Note: the code above can be found in a gist at github.

-30-

June 4, 2013

Major League Baseball run scoring trends with R's Lahman package


The statistical software R has an ever-expanding array of packages that provide pre-programmed functions and datasets. One such package is named Lahman, bundling the contents of the Lahman database into a quick-and-easy resource for R users. In addition to the data tables, the package resources also contain a variety of analyses and graphics undertaken using the package, providing some examples of how the package can be used.

Full disclosure: I am now one of the Lahman package project members.

This is my first blog post using the Lahman package, and as a first step I will simply recreate the league run scoring trends graphs that I generated previously. Originally, I had used data from Baseball Reference, for the simple reason that the Lahman database does not, in its source form, contain any league-level aggregations.

The process for loading the Lahman package is as simple as any other R package; this simplicity is even greater if you are using an IDE such as RStudio. Once loaded, you have access to all the tables in the database, without any of the futzing that is sometimes required in tidying up a raw flat file (I find that variable names are sometimes lost or changed in translation).

The code (available as a gist here, downloadable as an R script file) creates a pair of tables, calculating each league's run scoring rates by year. Then, recycling my earlier code, it calculates a series of trend lines using the loess method, and graphs those trend lines. For simplicity's sake, only the final version of each graph is shown.

Step 1: install the package (if you haven't already), access the library, and load the data table “Teams”.

# load the package into R, and open the data table 'Teams' into the
# workspace
library("Lahman")
data(Teams)
#

The second step is to use the individual team season results to calculate the aggregate of each league's year. We start with 1901, the year the American League was formed. Once those tables are created, the loess function is used to calculate trend lines for each league's run scoring environment.

# ===== CREATE LEAGUE SUMMARY TABLES
# 
# select a sub-set of teams from 1901 [the establishment of the American
# League] forward to 2012
Teams_sub <- as.data.frame(subset(Teams, yearID > 1900))
# calculate each team's average runs and runs allowed per game
Teams_sub$RPG <- Teams_sub$R/Teams_sub$G
Teams_sub$RAPG <- Teams_sub$RA/Teams_sub$G
# create new data frame with season totals for each league
LG_RPG <- aggregate(cbind(R, RA, G) ~ yearID + lgID, data = Teams_sub, sum)
# calculate league + season runs and runs allowed per game
LG_RPG$LG_RPG <- LG_RPG$R/LG_RPG$G
LG_RPG$LG_RAPG <- LG_RPG$RA/LG_RPG$G
# select a sub-set of teams from 1901 [the establishment of the American
# League] forward to 2012 read the data into separate league tables
ALseason <- (subset(LG_RPG, yearID > 1900 & lgID == "AL"))
NLseason <- (subset(LG_RPG, yearID > 1900 & lgID == "NL"))
#
# ===== TRENDS: RUNS SCORED PER GAME
# 
# AMERICAN LEAGUE create new object ALRunScore.LO for loess model
ALRunScore.LO <- loess(ALseason$LG_RPG ~ ALseason$yearID)
ALRunScore.LO.predict <- predict(ALRunScore.LO)
# create new objects RunScore.Lo.XX for loess models with 'span' control
# span = 0.25
ALRunScore.LO.25 <- loess(ALseason$LG_RPG ~ ALseason$yearID, span = 0.25)
ALRunScore.LO.25.predict <- predict(ALRunScore.LO.25)
# span = 0.5
ALRunScore.LO.5 <- loess(ALseason$LG_RPG ~ ALseason$yearID, span = 0.5)
ALRunScore.LO.5.predict <- predict(ALRunScore.LO.5)
# NATIONAL LEAGUE create new object RunScore.LO for loess model
NLRunScore.LO <- loess(NLseason$LG_RPG ~ NLseason$yearID)
NLRunScore.LO.predict <- predict(NLRunScore.LO)
# loess models
NLRunScore.LO.25 <- loess(NLseason$LG_RPG ~ NLseason$yearID, span = 0.25)
NLRunScore.LO.25.predict <- predict(NLRunScore.LO.25)
NLRunScore.LO.5 <- loess(NLseason$LG_RPG ~ NLseason$yearID, span = 0.5)
NLRunScore.LO.5.predict <- predict(NLRunScore.LO.5)
#
Now that we have calculated the league averages and trend lines (using the loess method), we can start the plots. First, a simple plot of the actual values:

# MULTI-PLOT -- MERGING AL AND NL RESULTS plot individual years as lines
ylim <- c(3, 6)
# start with AL line
plot(ALseason$LG_RPG ~ ALseason$yearID, type = "l", lty = "solid", col = "red", 
    lwd = 2, main = "Runs per team per game, 1901-2012", ylim = ylim, xlab = "year", 
    ylab = "runs per game")
# add NL line
lines(NLseason$yearID, NLseason$LG_RPG, lty = "solid", col = "blue", lwd = 2)
# chart additions
grid()
legend(1900, 3.5, c("AL", "NL"), lty = c("solid", "solid"), col = c("red", "blue"), 
    lwd = c(2, 2))
plot of chunk unnamed-chunk-3
#
Next, comparing the league trends.
# plot multiple loess curves (span=0.50 and 0.25)
ylim <- c(3, 6)
# start with AL line
plot(ALRunScore.LO.5.predict ~ ALseason$yearID, type = "l", lty = "solid", col = "red", 
    lwd = 2, main = "Runs per team per game, 1901-2012", ylim = ylim, xlab = "year", 
    ylab = "runs per game")
# add NL line
lines(NLseason$yearID, NLRunScore.LO.5.predict, lty = "solid", col = "blue", 
    lwd = 2)
# add 0.25 lines
lines(ALseason$yearID, ALRunScore.LO.25.predict, lty = "dashed", col = "red", 
    lwd = 2)
lines(NLseason$yearID, NLRunScore.LO.25.predict, lty = "dashed", col = "blue", 
    lwd = 2)
# chart additions
legend(1900, 3.5, c("AL (span=0.50)", "NL (span=0.50)", "AL (span=0.25)", "NL (span=0.25)"), 
    lty = c("solid", "solid", "dashed", "dashed"), col = c("red", "blue", "red", 
        "blue"), lwd = c(2, 2, 2, 2))
grid()
plot of chunk unnamed-chunk-4
#
Next, calculate the difference between the two leagues – both the absolute difference and the difference in the loess trend lines.

# 1. absolute
RunDiff <- (ALseason$LG_RPG - NLseason$LG_RPG)
# 2. LOESS span=0.25
RunDiffLO <- (ALRunScore.LO.25.predict - NLRunScore.LO.25.predict)
#
And plot the differences.

# plot each year absolute difference as bar, difference in trend as line
ylim <- c(-1, 1.5)
plot(RunDiff ~ ALseason$yearID, type = "h", lty = "solid", col = "blue", lwd = 2, 
    main = "Run scoring trend: AL difference from NL, 1901-2012", ylim = ylim, 
    xlab = "year", ylab = "runs per game")
# add RunDiff line
lines(ALseason$yearID, RunDiffLO, lty = "solid", col = "black", lwd = 2)
# add line at zero
abline(h = 0, lty = "dotdash")
# chart additions
grid()
legend(1900, 1.5, c("AL difference from NL: absolute", "AL difference from NL, LOESS (span=0.25)"), 
    lty = c("solid", "solid"), col = c("blue", "black"), lwd = c(2, 2))
plot of chunk unnamed-chunk-6
#
For the next “using R” post, I'll take a look at the ways to plot the residuals from the loess method.
The one after that: ggplot2 versions of the graphs.
-30-
--> -->

April 5, 2013

Strikeout rates - update

Just a quick follow-up from my earlier post:  James Gentile at Beyond the Boxscore has written another great analysis on strikeouts, this one with the title "How can strikeouts be great for pitchers, but not that bad for hitters?" The analysis delves deeper in the question of increased strikeout rates by looking at the asymmetry between pitcher and hitter outcomes.

It boils down to this sentence: "Over time, hitters, managers, and front offices have slowly recognized more and more that they can trade additional strikeouts for an increase in production at the plate with very little repercussions."

-30-

March 29, 2013

On strikeout rates

A couple of recent articles have looked into the increased rate of strikeouts per game.

In the New York Times, Tyler Kepner has an article titled "Swing and a Mystery: Strikeout Rates Are Soaring".  This one has a sidebar article "Strikeouts on the rise", which includes an interactive chart displaying the changes over time (including an optional overlay for a selected team).

Some explanations offered--none conclusively--include increased use of relief pitchers, batters are more likely to swing aggressively with two strikes, better information available to pitchers, and pitchers throwing more strikes (walk rates are the lowest they have been in 20 years).


At Beyond The Boxscore (part of SB Nation), James Gentile takes a look at the rise of the called strike. Gentile notes that pitches per plate appearance (PA) have risen, and it's been called strikes and foul strikes per PA, not swinging strikes, that have risen. Gentile suggests (in common with some of Kepner's ideas) that batters are being less aggressive, and more patient.

So far, the research is only scratching the surface.  I'm sure we will see more in the future.

-30-

February 24, 2013

MLB runs allowed by team

Or, How good were the Maddux/Glavine-era Braves?

In this on-going series of posts about run scoring in Major League Baseball, for this installment I'll turn the equation around and look at runs allowed.  In order to account for the changing run scoring environments, the runs allowed by individual teams is compared to the league average for that season, creating an index where 100 is the league average. In this formulation, a score below 100 is a good thing; a team with an index score of 95 allowed runs at a rate 5 percentage points below the league average.

Having written the original code in R, it's now a very simple process to change a few variable names and create the equivalent of the earlier runs scored analysis, but looking at runs allowed. This is one of the most important benefits of a code/syntax environment, an option that doesn't exist if  you are using a point-and-click GUI interface.

February 23, 2013

Sabermetrics primer

Phil Birnbaum, author of the Sabermetric Research and the editor of SABR's "By the Numbers", has written a primer on the topic with the title "A Guide to Sabermetric Research" that appears at the SABR site. This should be the first stop for anyone who wants to find out more about the field of sabermetrics, and a good read for those already active.

-30-

February 17, 2013

Run production, one team at a time


In a previous post, I used R to process data from the Lahman database to calculate index values that compare a team's run production to the league average for that year.  For the purpose of that exercise, I started the sequence at 1947, but for what follows I re-ran the code with the time period 1901-2012.

The R code I used can be found at this Github gist. Instead of boring you here with the ins and outs of what the code is doing, I've embedded that as documentation in the gist. The R code assumes that you've got a data frame called "Teams.merge" already in your workspace.  This can be achieved by running the previous code, or if you've done that before, you'll have created a csv file with the name "Teams.merge.csv", and now have the option to read that file as a data frame "Teams.merge".

The first step is to choose one of the current teams, and create a data frame that contains just that club's history.  Once this has been done, the code then creates trend lines (using the LOESS method, as I did with the leagues in previous posts), and then plot them.