November 15, 2016

Subtitles and captions with ggplot2 v.2.2.0

Back in March 2016, I wrote about an extension to the R package ggplot2 that allowed subtitles to be added to charts. The process took a bit of fiddling and futzing, but now, with the release of ggplot2 version 2.2.0, it’s easy.

Let’s retrace the steps, and create a chart with a subtitle and a caption, the other nifty feature that has been added.

First, let’s read the packages we’ll be using, ggplot2 and the data carpentry package dplyr:

# package load 

Read and summarize the data

For this example, we’ll use the baseball data package Lahman (bundling the Lahman database for R users), and the data table ‘Teams’ in it.

Once it’s loaded, the data are filtered and summarized using dplyr.
  • filter from 1901 [the establishment of the American League] to the most recent year,
  • filter out the Federal League
  • summarise the total number of runs scored, runs allowed, and games played
  • using `mutate`, calculate the league runs (leagueRPG) and runs allowed (leagueRAPG) per game

MLB_RPG <- Teams %>%
  filter(yearID > 1900, lgID != "FL") %>%
  group_by(yearID) %>%
  summarise(R=sum(R), RA=sum(RA), G=sum(G)) %>%
  mutate(leagueRPG=R/G, leagueRAPG=RA/G)

A basic plot

You may have heard that run scoring in Major League Baseball has been down in recent years…but what better way to see if that’s true than by plotting the data?

For the first version of the plot, we’ll make a basic X-Y plot, where the X axis has the years and the Y axis has the average number of runs scored. With ggplot2, it’s easy to add a trend line (the geom_smooth option).

The scale_x_continuous options set the limits and breaks of the axes.

MLBRPGplot <- ggplot(MLB_RPG, aes(x=yearID, y=leagueRPG)) +
  geom_point() +
  geom_smooth(span = 0.25) +
  scale_x_continuous(breaks = seq(1900, 2015, by = 20)) +
  scale_y_continuous(limits = c(3, 6), breaks = seq(3, 6, by = 1))


So now we have a nice looking dot plot showing the average number of runs scored per game for the years 1901-2015. (The data for the 2016 season, recently concluded, has not yet been added to the Lahman database.)

With the basic plot object now created, we can make the changes in the format.  In the past, the way we would set the title, along with X and Y axis labels, would be something like this.

MLBRPGplot +
  ggtitle("MLB run scoring, 1901-2014") +
  theme(plot.title = element_text(hjust=0, size=16)) +
  xlab("year") +
  ylab("team runs per game")

Adding a subtitle and a caption: the function

A popular feature of charts–particularly in magazines–is a subtitle that has a summary of what the chart shows and/or what the author wants to emphasize.

In this case, we could legitimately say something like any of the following:
  • The peak of run scoring in the 2000 season has been followed by a steady drop
  • Teams scored 20% fewer runs in 2015 than in 2000
  • Team run scoring has fallen to just over 4 runs per game from the 2000 peak of 5 runs
  • Run scoring has been falling for 15 years, reversing a 30 year upward trend
I like this last one, drawing attention not only to the recent decline but also the longer trend that started with the low-scoring environment of 1968.

How can we add a subtitle to our chart that does that, as well as a caption that acknowledges the source of the data? The new labs function, available in ggplot2 version 2.2.0, lets us do that.

Note that labs contains the title, subtitle, caption, as well as the X and Y axis labels.

MLBRPGplot +
  labs(title = "MLB run scoring, 1901-2015",
       subtitle = "Run scoring has been falling for 15 years, reversing a 30 year upward trend",
       caption = "Source: the Lahman baseball database", 
       x = "year", y = "team runs per game") 


Thanks to everyone involved with ggplot2 who made this possible.

The code for this post (as an R markdown file) can be found in my Bayesball github repo.