November 15, 2016

Subtitles and captions with ggplot2 v.2.2.0

Back in March 2016, I wrote about an extension to the R package ggplot2 that allowed subtitles to be added to charts. The process took a bit of fiddling and futzing, but now, with the release of ggplot2 version 2.2.0, it’s easy.

Let’s retrace the steps, and create a chart with a subtitle and a caption, the other nifty feature that has been added.

First, let’s load the packages we’ll be using, ggplot2 and the data carpentry package dplyr:

# package load 
library(ggplot2)
library(dplyr)

Read and summarize the data

For this example, we’ll use the baseball data package Lahman (bundling the Lahman database for R users), and the data table ‘Teams’ in it.

Once it’s loaded, the data are filtered and summarized using dplyr.
  • filter from 1901 [the establishment of the American League] to the most recent year,
  • filter out the Federal League
  • summarise the total number of runs scored, runs allowed, and games played
  • using `mutate`, calculate the league runs (leagueRPG) and runs allowed (leagueRAPG) per game

library(Lahman)

MLB_RPG <- Teams %>%
  filter(yearID > 1900, lgID != "FL") %>%
  group_by(yearID) %>%
  summarise(R=sum(R), RA=sum(RA), G=sum(G)) %>%
  mutate(leagueRPG=R/G, leagueRAPG=RA/G)

A basic plot

You may have heard that run scoring in Major League Baseball has been down in recent years…but what better way to see if that’s true than by plotting the data?

For the first version of the plot, we’ll make a basic X-Y plot, where the X axis has the years and the Y axis has the average number of runs scored. With ggplot2, it’s easy to add a trend line (the geom_smooth() layer).

The scale_x_continuous and scale_y_continuous options set the breaks and limits of the axes.

MLBRPGplot <- ggplot(MLB_RPG, aes(x=yearID, y=leagueRPG)) +
  geom_point() +
  geom_smooth(span = 0.25) +
  scale_x_continuous(breaks = seq(1900, 2015, by = 20)) +
  scale_y_continuous(limits = c(3, 6), breaks = seq(3, 6, by = 1))


So now we have a nice looking dot plot showing the average number of runs scored per game for the years 1901-2015. (The data for the 2016 season, recently concluded, have not yet been added to the Lahman database.)

With the basic plot object now created, we can change the formatting. In the past, the way we would set the title, along with the X and Y axis labels, would be something like this:

MLBRPGplot +
  ggtitle("MLB run scoring, 1901-2015") +
  theme(plot.title = element_text(hjust=0, size=16)) +
  xlab("year") +
  ylab("team runs per game")

Adding a subtitle and a caption: the function

A popular feature of charts, particularly in magazines, is a subtitle that summarizes what the chart shows and/or what the author wants to emphasize.

In this case, we could legitimately say something like any of the following:
  • The peak of run scoring in the 2000 season has been followed by a steady drop
  • Teams scored 20% fewer runs in 2015 than in 2000
  • Team run scoring has fallen to just over 4 runs per game from the 2000 peak of 5 runs
  • Run scoring has been falling for 15 years, reversing a 30 year upward trend
I like this last one, drawing attention not only to the recent decline but also to the longer trend that started with the low-scoring environment of 1968.
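As a quick check on those claims, we can pull the 2000 and 2015 values out of the MLB_RPG summary table built above (a sketch; the exact values depend on the version of the Lahman database installed):

```r
library(dplyr)

# league-wide runs per game at the 2000 peak and in 2015;
# the 2015 figure sits roughly 20% below the peak of about 5 runs per game
MLB_RPG %>%
  filter(yearID %in% c(2000, 2015)) %>%
  select(yearID, leagueRPG)
```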

How can we add a subtitle to our chart that does that, as well as a caption that acknowledges the source of the data? The new labs function, available in ggplot2 version 2.2.0, lets us do that.

Note that labs contains the title, subtitle, caption, as well as the X and Y axis labels.

MLBRPGplot +
  labs(title = "MLB run scoring, 1901-2015",
       subtitle = "Run scoring has been falling for 15 years, reversing a 30 year upward trend",
       caption = "Source: the Lahman baseball database", 
       x = "year", y = "team runs per game") 


Thanks to everyone involved with ggplot2 who made this possible.

The code for this post (as an R markdown file) can be found in my Bayesball github repo.


March 14, 2016

Adding a subtitle to ggplot2

A couple of days ago (2016-03-12), a short blog post by Bob Rudis appeared: "Subtitles in ggplot2". I was intrigued by the idea and what it could mean for my own plotting efforts, and it turned out to be very simple to apply. (Note that Bob's post originally appeared on his own blog.)

In order to see if I could create a plot with a subtitle, I went back to some of my own code drawing on the Lahman database package. The code below summarizes the data using dplyr, and creates a ggplot2 plot showing the annual average number of runs scored by each team in every season from 1901 through 2014, including a trend line using the loess smoothing method.

This is an update to my series of blog posts, most recently 2015-01-06, visualizing run scoring trends in Major League Baseball.

# load the package into R, and open the data table 'Teams' into the
# workspace
# package load 
library(Lahman)
library(dplyr)
library(ggplot2)
data(Teams)
# ====================
# create a new dataframe that
# - filters from 1901 [the establishment of the American League] to the most recent year,
# - filters out the Federal League
# - summarizes the total number of runs scored, runs allowed, and games played
# - calculates the league runs and runs allowed per game 

MLB_RPG <- Teams %>%
  filter(yearID > 1900, lgID != "FL") %>%
  group_by(yearID) %>%
  summarise(R=sum(R), RA=sum(RA), G=sum(G)) %>%
  mutate(leagueRPG=R/G, leagueRAPG=RA/G)

Plot the MLB runs per game trend

Below is the code to create the plot, including the formatting. Note the hjust=0 (for horizontal justification = left) in the plot.title line. This is because the default for the title is to be centred, while the subtitle is to be justified to the left.

MLBRPGplot <- ggplot(MLB_RPG, aes(x=yearID, y=leagueRPG)) +
  geom_point() +
  theme_bw() +
  theme(panel.grid.minor = element_line(colour="gray95")) +
  scale_x_continuous(breaks = seq(1900, 2015, by = 20)) +
  scale_y_continuous(limits = c(3, 6), breaks = seq(3, 6, by = 1)) +
  xlab("year") +
  ylab("team runs per game") +
  geom_smooth(span = 0.25) +
  ggtitle("MLB run scoring, 1901-2014") +
  theme(plot.title = element_text(hjust=0, size=16))


MLB run scoring, 1901-2014

Adding a subtitle: the function

So now we have a nice looking dot plot showing the average number of runs scored per game for the years 1901-2014.

But a popular feature of charts, particularly in magazines, is a subtitle that summarizes what the chart shows and/or what the author wants to emphasize.

In this case, we could legitimately say something like any of the following:
  • The peak of run scoring in the 2000 season has been followed by a steady drop
  • Teams scored 20% fewer runs in 2015 than in 2000
  • Team run scoring has fallen to just over 4 runs per game from the 2000 peak of 5 runs
  • Run scoring has been falling for 15 years, reversing a 30 year upward trend
I like this last one, drawing attention not only to the recent decline but also to the longer trend that started with the low-scoring environment of 1968.

How can we add a subtitle to our chart that does that?

The function Bob Rudis has created allows us to quickly and easily add a subtitle. The following code is taken from his blog post; note that it relies on two additional packages, grid and gtable. Other than the package loads, it is a straight copy/paste from Bob's post.


library(grid)
library(gtable)

ggplot_with_subtitle <- function(gg, label="",
                                 fontfamily=NULL, fontsize=10,
                                 hjust=0, vjust=0, 
                                 bottom_margin=5.5,
                                 newpage=is.null(vp), vp=NULL,
                                 ...) {
  if (is.null(fontfamily)) {
    gpr <- gpar(fontsize=fontsize, ...)
  } else {
    gpr <- gpar(fontfamily=fontfamily, fontsize=fontsize, ...)
  }
  subtitle <- textGrob(label, x=unit(hjust, "npc"), y=unit(hjust, "npc"), 
                       hjust=hjust, vjust=vjust,
                       gp=gpr)
  data <- ggplot_build(gg)
  gt <- ggplot_gtable(data)
  gt <- gtable_add_rows(gt, grobHeight(subtitle), 2)
  gt <- gtable_add_grob(gt, subtitle, 3, 4, 3, 4, 8, "off", "subtitle")
  gt <- gtable_add_rows(gt, grid::unit(bottom_margin, "pt"), 3)
  if (newpage) grid.newpage()
  if (is.null(vp)) {
    grid.draw(gt)
  } else {
    if (is.character(vp)) seekViewport(vp) else pushViewport(vp)
    grid.draw(gt)
    upViewport()
  }
  invisible(data)
}

Adding a subtitle

Now that we've got the function loaded into our R workspace, the steps are easy:
  • Rename the active plot object gg (simply because that's what Bob's code uses)
  • Define the text that we want to be in the subtitle
  • Call the function

# set the name of the current plot object to `gg`
gg <- MLBRPGplot

# define the subtitle text
subtitle <- 
  "Run scoring has been falling for 15 years, reversing a 30 year upward trend"
ggplot_with_subtitle(gg, subtitle,
                     bottom_margin=20, lineheight=0.9)

MLB run scoring, 1901-2014 with a subtitle

Wasn't that easy? Thanks, Bob!

And it's going to get easier; in the few days since his blog post, Bob has taken this into the ggplot2 development environment, working on the code necessary to add this as a simple extension to the package's already extensive functionality. And Jan Schulz has chimed in, adding the ability to add a text annotation (e.g. the data source) under the plot. It's early days, but it's looking great. (See ggplot2 Pull request #1582.) Thanks, Bob and Jan!

And thanks also to the rest of the ggplot2 developers, for making it possible for those of us who use the package to create good-looking and effective data visualizations. Ain't open development great?

The code for this post (as an R markdown file) can be found in my Bayesball github repo.


March 6, 2016

Book review: Storytelling With Data

by Cole Nussbaumer Knaflic (2015, Wiley)

The Sabermetric bookshelf, #4

One of the great strengths of R is that there are some robust (and always improving) packages that facilitate great data visualization and tabular summaries. Beyond the capabilities built into the base version of R, packages such as ggplot2 (my favourite), lattice, and vcd and vcdExtra extend the possibilities for rendering charts and graphs, and a similar variety exists for producing tables. And accompanying these packages have been a variety of fine instruction manuals that delineate the code necessary to produce high-quality and reproducible outputs. (You can’t go wrong by starting with Winston Chang’s R Graphics Cookbook, and the R Graph Catalog based on Naomi Robbins’s Creating More Effective Graphs, created and maintained by Joanna Zhao and Jennifer Bryan at the University of British Columbia.)

Let’s call these the “how” resources; once you’ve determined you want a Cleveland plot (which is sometimes called a “lollipop plot”—please, just stop it), these sources provide the code for that style of chart, including the myriad options available to you.
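For instance, a bare-bones Cleveland dot plot in ggplot2 might look like the following (the team values here are invented purely for illustration):

```r
library(ggplot2)

# hypothetical runs-per-game values, for illustration only
rpg <- data.frame(team = c("Red Sox", "Blue Jays", "Yankees", "Orioles", "Rays"),
                  RPG  = c(5.0, 4.7, 4.5, 4.3, 4.0))

# a Cleveland dot plot: values on the x axis, categories ordered by value on the y axis
ggplot(rpg, aes(x = RPG, y = reorder(team, RPG))) +
  geom_point() +
  labs(x = "team runs per game", y = NULL)
```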

Elsewhere, there has been a similar explosion in the number of books that build on research and examples as to what makes a good graphic. These are the “what” books; the authors include the aforementioned William Cleveland and Naomi Robbins, and also include Stephen Few and Edward R. Tufte. Also making an appearance are books that codify the “why”, written by the likes of Alberto Cairo and Nathan Yau.

The recently published Storytelling with Data: A Data Visualization Guide for Business Professionals by Cole Nussbaumer Knaflic falls into the latter category, and it’s one of the best I’ve seen to date. Although the subtitle indicates the intended audience, I believe that anyone involved in creating data-driven visualizations would benefit from reading and learning from it.

The book is relatively software agnostic, although Nussbaumer Knaflic recognizes the ubiquity of Excel and has used it to produce the charts in the book. She also provides some sidebar commentary and tools via her website specifically using Excel. For R users, this shouldn’t pose a particular challenge or barrier; the worst-case scenario is that it provides an opportunity to learn how to use R to replicate the book’s examples.

One of the strengths of the book is that Nussbaumer Knaflic takes the approach of starting with a chart (often a real-life example published elsewhere), and then iterating through one or more options before arriving at the finished product. One instance is the step-by-step decluttering of a line graph, which is substantially improved through a six-step process. This example re-appears later, first in the chapter on the use of preattentive attributes and then again in the chapter titled “Think like a designer”. This approach reinforces the second of Nussbaumer Knaflic’s tips that close the book, “iterate and seek feedback”.

Nussbaumer Knaflic also introduces the Gestalt Principles of Visual Perception, and provides vivid examples of how these principles play out in data visualizations.

All of the discussion of graphics is wrapped in the context of storytelling. That is to say, the data visualization is always in the service of making a point about what the data tell us. In the context of business, this then translates into influencing decisions. The chapter “Lessons in storytelling” falls almost exactly in the middle of the book; after we’ve been introduced to the principles of making good data visualizations, Nussbaumer Knaflic gives us a way to think about the purpose of the visualization. With all of the pieces in place, the remainder of the book is focussed on the applications of storytelling with data.

The book is supported by Nussbaumer Knaflic’s site, which includes her blog. Check out her blog entry/discussion with Stephen Few (“Is there a single right answer?”), and some makeovers (in the Gallery) where she redraws some problematic charts that have appeared in the public domain.

All in all, Cole Nussbaumer Knaflic’s Storytelling with Data is a succinct and focussed book, one that clearly adopts and demonstrates the enormous value of the principles that it espouses. Highly recommended to anyone, of any skill level, who is interested in making effective data visualizations, and the effective use of those visualizations.

Cole Nussbaumer Knaflic’s Storytelling with Data: A Data Visualization Guide for Business Professionals was published in 2015 by Wiley.

Cross-posted at


Note: the authors of the following books have all published additional books on these topics; I’ve simply selected the ones that most closely fit with the context of this review. All are recommended.

  • Alberto Cairo, The Functional Art: An Introduction to Information Graphics and Visualization (2013) New Riders.
  • Winston Chang, R Graphics Cookbook (2012) O’Reilly.
  • William S. Cleveland, Visualizing Data (1993) Hobart Press.
  • Stephen Few, Signal (2015) Analytics Press.
  • Michael Friendly and David Meyer, Discrete Data Analysis with R: Visualization and Modeling Techniques for Categorical and Count Data (2016) CRC Press.
  • Naomi B. Robbins, Creating More Effective Graphs (2004) Reprinted 2013 by Chart House.
  • Edward R. Tufte, The Visual Display of Quantitative Information (2nd edition, 2001) Graphics Press.
  • Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis (2nd edition, 2016) Springer.
  • Nathan Yau, Visualize This: The FlowingData Guide to Design, Visualization, and Statistics (2011) Wiley.

January 6, 2015

Run scoring trends: using Shiny to create dynamic charts and tables in R

Or, Retracing my steps

As I’ve been learning the functionality of Shiny, the web application framework for R, I have used the helpful tutorials available from the developers at RStudio. At some point, though, one needs to break out and develop one’s own application. My Shiny app, “MLB run scoring trends”, is now available online.

Note: this app is a work in progress! If you have any thoughts on how it might be improved, please leave a comment.
All of the files associated with this app, including the code, can be found on GitHub, at MonkmanMH/MLBrunscoring_shiny.

This Shiny app is a return to my earlier analysis of run scoring trends in Major League Baseball, last seen in my blog post “Major League Baseball run scoring trends with R’s Lahman package” (see the “References” tab in the Shiny app for more). This project gave me the opportunity to update the underlying data, as well as to introduce some of the coding improvements I’ve learned along the way (notably the packages ggplot2 and dplyr).

Some notable changes in the code:
  • In the original version (starting here), I treated each league separately, starting with subsetting (now, with dplyr, filtering) the Lahman “Teams” table on the lgID variable into two separate data frames, which were then used to generate the two charts separately. Now, with ggplot2, I have used faceting to plot the two leagues, and given the reader the option of making that split or not. This is both more flexible from the reader’s point of view and more efficient code.
  • In my original approach, the trend lines were generated using the loess function, embedded in a discrete object, and then added to the plot as a separately plotted line. By using ggplot2, a LOESS trendline can be quickly added to the plot call with the stat_smooth() option, a much more efficient approach.
  • The stat_smooth() option makes it possible to adjust the degree of smoothing of the trend line through changes to the span specification. Originally this was hard-coded, but it is now dynamic, controlled in the Shiny app through a slider widget.
  • The stat_smooth() also includes the option of showing a confidence interval. This is achieved through the level specification. For this, I used a set of radio buttons in the Shiny user interface. (I had initially tried a slider, but was not able to specify a set of pre-defined points for the confidence intervals.)
  • The start and end dates of the league plots are also user-controlled through a slider widget. You will notice that the date in the chart title changes along with the range of the plot.
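The faceting and smoothing changes described above can be sketched like this (a simplified sketch of the approach, not the app's actual server code; in the app, the span and level values come from the slider and radio-button widgets rather than being hard-coded):

```r
library(Lahman)
library(dplyr)
library(ggplot2)

# league-by-league runs per game, 1901 onward, Federal League excluded
league_RPG <- Teams %>%
  filter(yearID > 1900, lgID != "FL") %>%
  group_by(yearID, lgID) %>%
  summarise(leagueRPG = sum(R) / sum(G))

# one plot call, split into league panels with facet_wrap();
# stat_smooth() adds the LOESS trend line and confidence interval
ggplot(league_RPG, aes(x = yearID, y = leagueRPG)) +
  geom_point() +
  stat_smooth(method = "loess", span = 0.25, level = 0.95) +
  facet_wrap(~ lgID)
```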
Other things I learned:
  • Radio buttons return factors, even if they look numeric in the ui.r code. In order to get the values that are input by the user to work in the stat_smooth(), I wrapped them in as.numeric().
  • Tables made with dplyr don't render properly in the Shiny environment; the numbers are all there but the sort function generates an error. My solution to this was to wrap the table created by renderDataTable in the server file with
  • I already knew that I was struggling to keep up with the changes in the R coding environment, but this exercise opened my eyes to even more potential opportunities. The latest version of Shiny (as of 2015-01-06) has added a lot of new functionality, but I hadn’t realized the degree of integration with other visualization tools. This recent blog entry, “Goodbye static graphs, hello shiny, ggvis, rmarkdown” by Simon Jackman, gives some hints as to where an integrated analytic & reporting environment might go. Exciting stuff, indeed.
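The radio-button point above is easy to demonstrate outside Shiny: the selected value comes through as text, so it has to be converted before stat_smooth(level = ...) will accept it (a minimal sketch of the conversion):

```r
# a radio-button selection arrives as text, not as a number
ci_input <- "0.95"
class(ci_input)       # "character"

# wrapping it in as.numeric() makes it usable as stat_smooth(level = ...)
as.numeric(ci_input)  # 0.95
```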


July 26, 2014

Roger Angell and the Baseball Hall of Fame

The great baseball writer Roger Angell is the recipient of the 2014 J.G. Taylor Spink Award, the first time a non-BBWAA member has been given the award.  Angell will be presented the award at the Baseball Hall of Fame at Cooperstown, during the induction weekend July 25 - 28, 2014.

Angell is, for my money, the best writer about baseball.  His accounts of the game are from a fan's perspective, rather than the typical listing of the game's dramatic moments. Indeed, many of his greatest observations are about the fans and their experiences.  One such essay is "The Interior Stadium"; required reading for anyone interested in sports and people's responses to the games.

Much of Angell's writing was published by his employer, the New Yorker, who have recently compiled two summaries of his writing.  The first was offered by David Remnick, whose piece "Roger Angell Heads to Cooperstown" was published when the Spink award was announced in December 2013 and has links to a variety of Angell's best. More recently, "Hall of Fame Weekend: Roger Angell's Baseball Writing" (by Sky Dylan-Robbins) provides a different list of great essays.

One of the best things the New Yorker has made available is Angell's scorecard for Game 6 of the 2011 World Series, when the Cardinals were down to their final strike twice, but managed to come back and win the game (and then, in an anti-climactic game 7, the Series).


July 23, 2014

Left-handed catchers

Benny Distefano – 1985 Donruss #166
We are approaching the twenty-fifth anniversary of the last time a left-handed throwing catcher appeared behind the plate in a Major League Baseball game; on August 18, 1989 Benny Distefano made his third and final appearance as a catcher for the Pirates. Distefano’s accomplishment was celebrated five years ago, in Alan Schwarz’s “Left-Handed and Left Out” (New York Times, 2009-08-15).

Jack Moore, writing on the site Sports on Earth in 2013 (“Why no left-handed catchers?”), points out that the lack of left-handed catchers goes back a long way. One interesting piece of evidence is a 1948 Ripley’s “Believe It Or Not” item featuring a left-handed catcher, Dick Bernard (you can read more about Bernard’s signing in the July 1, 1948 edition of the Tuscaloosa News). Bernard didn’t make the majors, and doesn’t appear in any of the minor league records that are available on-line either.

Dick Bernard in Ripley’s “Believe It or Not”, 1948-12-30

There are a variety of hypotheses why there are no left-handed catchers, all of which are summarized in John Walsh’s “Top 10 Left-Handed Catchers for 2006” (a tongue-in-cheek title if ever there were) at The Hardball Times. A compelling explanation, and one supported by both Bill James and J.C. Bradbury (in his book The Baseball Economist) is natural selection; a left-handed little league player who can throw well will be groomed as a pitcher.

Throwing hand by fielding position as an example of a categorical variable

I was looking for some examples of categorical variables to display visually, and the lack of left-handed throwing catchers, compared to other positions, came to mind. The following uses R, and the Lahman database package.

The analysis requires merging the Master and Fielding tables in the Lahman database – the Master table gives the player's name and his throwing hand, and Fielding tells us how many games at each position they played. For the purpose of this analysis, we’ll look at the seasons 1954 (the first year in the Lahman database that has the outfield positions split into left, centre, and right) through 2012.

You may note that for the merging of the two tables, I used the new dplyr package. I tested the system.time of the basic version of “merge” to combine the two tables, and the “inner_join” in dplyr. The latter is substantially faster: my aging computer ran “merge” in about 5.5 seconds, compared to 0.17 seconds with dplyr.
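The timing comparison can be reproduced along these lines (timings will vary by machine; the 5.5 and 0.17 second figures above came from my aging computer):

```r
library(Lahman)
library(dplyr)

# base R merge vs dplyr's inner_join, on the same key
system.time(MF_base  <- merge(Fielding, Master, by = "playerID"))
system.time(MF_dplyr <- inner_join(Fielding, Master, by = "playerID"))
```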

# load the required packages
library(Lahman)
library(dplyr)

The first step is to create a new data table that merges the Fielding and Master tables, based on the common variable “playerID”. This new table has one row for each player, by position and season; we use the dim function to show the dimensions of the table.

Then, select only the seasons since 1954, and omit the records for Designated Hitter (DH) and the summary of outfield positions (OF) (i.e. retain RF, CF, and LF).

MasterFielding <- inner_join(Fielding, Master, by="playerID")
dim(MasterFielding)
## [1] 164903     52

MasterFielding <- filter(MasterFielding, POS != "OF" & POS != "DH" & yearID > 1953)
dim(MasterFielding)
## [1] 91214    52

This table needs to be summarized one step further – a single row for each player, counting how many games played at each position.

Player_games <- MasterFielding %.%
  group_by(playerID, nameFirst, nameLast, POS, throws) %.%
  summarise(gamecount = sum(G)) %.%
  arrange(desc(gamecount))
dim(Player_games)
## [1] 19501     6

head(Player_games)
## Source: local data frame [6 x 6]
## Groups: playerID, nameFirst, nameLast, POS
##    playerID nameFirst nameLast POS throws gamecount
## 1 robinbr01    Brooks Robinson  3B      R      2870
## 2 bondsba01     Barry    Bonds  LF      L      2715
## 3 vizquom01      Omar  Vizquel  SS      R      2709
## 4  mayswi01    Willie     Mays  CF      R      2677
## 5 aparilu01      Luis Aparicio  SS      R      2583
## 6 jeterde01     Derek    Jeter  SS      R      2531

This table shows the career records for most games played at each position, for 1954-2012. We see that Brooks Robinson leads the way with 2,870 games played at third base, and that Derek Jeter, at the end of the 2012 season, was closing in on Omar Vizquel’s record for career games played at shortstop.

Cross-tab Tables

The next step is to prepare a simple cross-tab table (also known as a contingency or pivot table) showing the number of players, cross-tabulated by position (POS) and throwing hand (throws).

Here, I’ll demonstrate two ways to do this: first with dplyr’s “group_by” and “summarise” (with a bit of help from reshape2), and then with the “table” function from base R (the more elaborate CrossTable, from the gmodels package, follows below).

# first method - dplyr
Player_POS <- Player_games %.%
  group_by(POS, throws) %.%
  summarise(playercount = length(gamecount))
Player_POS
## Source: local data frame [17 x 3]
## Groups: POS
##    POS throws playercount
## 1   1B      L         411
## 2   1B      R        1515
## 3   2B      L           4
## 4   2B      R        1560
## 5   3B      L           4
## 6   3B      R        1889
## 7    C      L           4
## 8    C      R         980
## 9   CF      L         393
## 10  CF      R        1252
## 11  LF      L         544
## 12  LF      R        2161
## 13   P      L        1452
## 14   P      R        3623
## 15  RF      L         520
## 16  RF      R        1893
## 17  SS      R        1296

To transform this long-form table into a traditional cross-tab shape we can use the “dcast” function in reshape2.

require(reshape2)
## Loading required package: reshape2
dcast(Player_POS, POS ~ throws, value.var = "playercount")
##   POS    L    R
## 1  1B  411 1515
## 2  2B    4 1560
## 3  3B    4 1889
## 4   C    4  980
## 5  CF  393 1252
## 6  LF  544 2161
## 7   P 1452 3623
## 8  RF  520 1893
## 9  SS   NA 1296

A second method to get the same result is to use the “table” function from base R.

require(gmodels)
## Loading required package: gmodels
throwPOS <- with(Player_games, table(POS, throws))
throwPOS
##     throws
## POS     L    R
##   1B  411 1515
##   2B    4 1560
##   3B    4 1889
##   C     4  980
##   CF  393 1252
##   LF  544 2161
##   P  1452 3623
##   RF  520 1893
##   SS    0 1296

A more elaborate table can be created using the gmodels package. In this case, we’ll use the CrossTable function to generate a table with row percentages. You’ll note that the format is set to SPSS, so the table output resembles that software’s display style.

CrossTable(Player_games$POS, Player_games$throws, 
           digits=2, format="SPSS",
           prop.r=TRUE, prop.c=FALSE, prop.t=FALSE, prop.chisq=FALSE,  # keeping the row proportions
           chisq=TRUE)                                                 # adding the ChiSquare statistic
##    Cell Contents
## |-------------------------|
## |                   Count |
## |             Row Percent |
## |-------------------------|
## Total Observations in Table:  19501 
##                  | Player_games$throws 
## Player_games$POS |        L  |        R  | Row Total | 
## -----------------|-----------|-----------|-----------|
##               1B |      411  |     1515  |     1926  | 
##                  |    21.34% |    78.66% |     9.88% | 
## -----------------|-----------|-----------|-----------|
##               2B |        4  |     1560  |     1564  | 
##                  |     0.26% |    99.74% |     8.02% | 
## -----------------|-----------|-----------|-----------|
##               3B |        4  |     1889  |     1893  | 
##                  |     0.21% |    99.79% |     9.71% | 
## -----------------|-----------|-----------|-----------|
##                C |        4  |      980  |      984  | 
##                  |     0.41% |    99.59% |     5.05% | 
## -----------------|-----------|-----------|-----------|
##               CF |      393  |     1252  |     1645  | 
##                  |    23.89% |    76.11% |     8.44% | 
## -----------------|-----------|-----------|-----------|
##               LF |      544  |     2161  |     2705  | 
##                  |    20.11% |    79.89% |    13.87% | 
## -----------------|-----------|-----------|-----------|
##                P |     1452  |     3623  |     5075  | 
##                  |    28.61% |    71.39% |    26.02% | 
## -----------------|-----------|-----------|-----------|
##               RF |      520  |     1893  |     2413  | 
##                  |    21.55% |    78.45% |    12.37% | 
## -----------------|-----------|-----------|-----------|
##               SS |        0  |     1296  |     1296  | 
##                  |     0.00% |   100.00% |     6.65% | 
## -----------------|-----------|-----------|-----------|
##     Column Total |     3332  |    16169  |    19501  | 
## -----------------|-----------|-----------|-----------|
## Statistics for All Table Factors
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  1759     d.f. =  8     p =  0 
##        Minimum expected frequency: 168.1

Mosaic Plot

A mosaic plot is an effective way to graphically represent the contents of the summary tables. Note that the length (left to right) dimension of each bar is constant, comparing proportions, while the height of the bar (top to bottom) varies depending on the absolute number of cases. The mosaic plot function is in the vcd package.

require(vcd)
## Loading required package: vcd
## Loading required package: grid
mosaic(throwPOS, highlighting = "throws", highlighting_fill=c("darkgrey", "white"))


The clear result is that it’s not just catchers that are overwhelmingly right-handed throwers, it’s also infielders (except first base). There have been very few southpaws playing second and third base – and there have been absolutely no left-handed throwing shortstops in this period.

As J.G. Preston puts it in the blog post “Left-handed throwing second basemen, shortstops and third basemen”,
While right-handed throwers can be found at any of the nine positions on a baseball field, left-handers are, in practice, restricted to five of them.

So who are these left-handed oddities? Using the filter function, it’s easy to find out:

# catchers
filter(Player_games, POS == "C", throws == "L")
## Source: local data frame [4 x 6]
## Groups: playerID, nameFirst, nameLast, POS
##    playerID nameFirst  nameLast POS throws gamecount
## 1 distebe01     Benny Distefano   C      L         3
## 2  longda02      Dale      Long   C      L         2
## 3 squirmi01      Mike   Squires   C      L         2
## 4 shortch02     Chris     Short   C      L         1

# second base
filter(Player_games, POS == "2B", throws == "L")
## Source: local data frame [4 x 6]
## Groups: playerID, nameFirst, nameLast, POS
##    playerID nameFirst  nameLast POS throws gamecount
## 1 marqugo01   Gonzalo   Marquez  2B      L         2
## 2 crowege01    George     Crowe  2B      L         1
## 3 mattido01       Don Mattingly  2B      L         1
## 4 mcdowsa01       Sam  McDowell  2B      L         1

# third base
filter(Player_games, POS == "3B", throws == "L")
## Source: local data frame [4 x 6]
## Groups: playerID, nameFirst, nameLast, POS
##    playerID nameFirst  nameLast POS throws gamecount
## 1 squirmi01      Mike   Squires  3B      L        14
## 2 mattido01       Don Mattingly  3B      L         3
## 3 francte01     Terry  Francona  3B      L         1
## 4 valdema02     Mario    Valdez  3B      L         1

The code for this entry (as a Markdown file) can be found in my github repo.


December 17, 2013

Book Review: Analyzing Baseball Data with R

by Max Marchi and Jim Albert (2014, CRC Press)

The Sabermetric bookshelf, #3

Here we have the perfect book for anyone who stumbles across this blog--the intersection of R and baseball data. The open source statistical programming environment of R is a great tool for anyone analyzing baseball data, from the robust analytic functions to the great visualization packages. The particular readership niche might be small, but as both R and interest in sabermetrics expand, it's a natural fit.

And one would be hard pressed to find better qualified authors, writers who have feet firmly planted in both worlds. Max Marchi is a writer for Baseball Prospectus, and it's clear from the ggplot2 charts in his blog entries (such as this entry on left-handed catchers) that he's an avid R user.

Jim Albert is a Professor in the Department of Mathematics and Statistics at Bowling Green State University; three of his previous books sit on my bookshelf. Curve Ball, written with Jay Bennett, is pure sabermetrics, and one of the best books ever written on the topic (and winner of SABR's Baseball Research Award in 2002). Albert's two R-focused books, the introductory R by Example (co-authored with Maria Rizzo) and the more advanced Bayesian Computation with R, are intended as supplementary texts for students learning statistical methods. Both employ plenty of baseball examples in their explanations of statistical analysis using R.

In Analyzing Baseball Data with R, Marchi and Albert consolidate this joint expertise, and have produced a book that is simultaneously interesting and useful.

The authors take a very logical approach to the subject at hand. The first chapter concerns the three sources of baseball data that are referenced throughout the book:
- the annual summaries contained within the Lahman database,
- the play-by-play data at Retrosheet, and
- the pitch-by-pitch PITCHf/x data.
The chapter doesn't delve into R, but summarizes the contents of the three data sets, and takes a quick look at the types of questions that can be answered with each.

The reader first encounters R in the second and third chapters, titled "Introduction to R" and "Traditional Graphics". These two chapters cover many of the basic topics that a new R user needs to know, starting with installing R and RStudio, then moving on to data structures like vectors and data frames, objects, functions, and data plots. Some of the key R packages are also covered in these chapters, both functional packages like plyr and data packages, notably Lahman, the data package containing the Lahman database.

The material covered in these early chapters is what I learned early on in my own R experience, but whereas I relied on multiple sources and an unstructured, ad hoc approach, in Analyzing Baseball Data with R a newcomer will find the basics laid out in a straightforward and logical progression. These chapters will most certainly help them climb the steep learning curve faced by every neophyte R user. (It is worth noting that the "Introduction to R" chapter relies heavily on a fourth source of baseball data -- the 1965 Warren Spahn Topps card, from the last season of his storied career. Not all valuable data are big data.)

From that point on, the book tackles some of the core concepts of sabermetrics. These include the relationship between runs and wins, run expectancy, career trajectories, and streaky performances. As the authors work through these and other topics, they weave in information about additional R functions and packages, along with statistical and analytic concepts. As one example, one chapter introduces Markov Chains in the context of using R to simulate half-inning, season, and post-season outcomes.

The chapter "Exploring Streaky Performances" provides the opportunity to take a closer look at how Analyzing Baseball Data with R compares to Albert's earlier work. In this case, the chapter uses moving average and simulation methodologies, providing the data and code to examine recent examples (Ichiro and Raul Ibanez). This is methodologically similar to what is described in Curve Ball, but with the addition of "here's the data and the code so you can replicate the analysis yourself". This approach differs substantially from the much more mathematical content in Albert's text Bayesian Computation with R, where the example of streaky hitters is used to explore beta functions and the laplace R function.

Woven among these later chapters are also ones that put R first, and use baseball data as the examples. A chapter devoted to the advanced graphics capabilities of R splits its time between the packages lattice and ggplot2. The examples in this chapter include visualizations used to analyze variations in Justin Verlander's pitch speed.

Each chapter of the book also includes "Further Reading" and "Exercises", which provide readers with the chance to dig deeper into the topic just covered and to apply their new-found skills. The exercises are consistently interesting and often draw on previous sabermetric research. Here are a couple of examples:
  • "By drawing a contour plot, compare the umpire's strike zone for left-handed and right-handed batters. Use only the rows of the data frame where the pitch type is a four-seam fastball." (Chapter 7)
  • "Using [Bill] James' similarity score measure ..., find the five hitters with hitting statistics most similar to Willie Mays." (Chapter 8)
The closing pages of the book are devoted to technical arcana regarding the data sources, and how-to instructions on obtaining those data.

The authors have established a companion blog, which expands on the analytics presented in the book. For example, the entry from December 12, 2013 goes deeper into ggplot2's capabilities to enhance and refine charts that were described in the book.

Analyzing Baseball Data with R provides readers with an excellent introduction to both R and sabermetrics, using examples that provide nuggets of insight into baseball player and team performance. The examples are clear and consistently interesting, and the R code is well explained and easy to follow. All told, Analyzing Baseball Data with R will be an extremely valuable addition to the practicing sabermetrician's library, and is most highly recommended.

Additional Resources

Jim Albert and Jay Bennett (2003), Curve Ball: Baseball, Statistics, and the Role of Chance in the Game (revised edition), Copernicus Books.

Jim Albert and Maria Rizzo (2011), R by Example, Springer.

Jim Albert (2009), Bayesian Computation with R (2nd edition), Springer.

An interview with Max Marchi, originally posted at MilanoRnet and also available through R-bloggers