Let’s retrace the steps, and create a chart with a subtitle and a caption, the other nifty feature that has been added.
First, let’s read the packages we’ll be using, ggplot2 and the data carpentry package dplyr:
# package load
library(ggplot2)
library(dplyr)
Read and summarize the data
For this example, we’ll use the baseball data package Lahman (bundling the Lahman database for R users), and the data table ‘Teams’ in it.Once it’s loaded, the data are filtered and summarized using dplyr.
- filter from 1901 [the establishment of the American League] to the most recent year,
- filter out the Federal League
- summarise the total number of runs scored, runs allowed, and games played
- using `mutate`, calculate the league runs (
leagueRPG
) and runs allowed (leagueRAPG
) per game
library(Lahman)
data(Teams)
MLB_RPG <- Teams %>%
filter(yearID > 1900, lgID != "FL") %>%
group_by(yearID) %>%
summarise(R=sum(R), RA=sum(RA), G=sum(G)) %>%
mutate(leagueRPG=R/G, leagueRAPG=RA/G)
A basic plot
You may have heard that run scoring in Major League Baseball has been down in recent years…but what better way to see if that’s true than by plotting the data?For the first version of the plot, we’ll make a basic X-Y plot, where the X axis has the years and the Y axis has the average number of runs scored. With ggplot2, it’s easy to add a trend line (the
geom_smooth
option).The
scale_x_continuous
options set the limits and breaks of the axes.MLBRPGplot <- ggplot(MLB_RPG, aes(x=yearID, y=leagueRPG)) +
geom_point() +
geom_smooth(span = 0.25) +
scale_x_continuous(breaks = seq(1900, 2015, by = 20)) +
scale_y_continuous(limits = c(3, 6), breaks = seq(3, 6, by = 1))
MLBRPGplot
So now we have a nice looking dot plot showing the average number of runs scored per game for the years 1901-2015. (The data for the 2016 season, recently concluded, has not yet been added to the Lahman database.)
With the basic plot object now created, we can make the changes in the format. In the past, the way we would set the title, along with X and Y axis labels, would be something like this.
MLBRPGplot +
ggtitle("MLB run scoring, 1901-2014") +
theme(plot.title = element_text(hjust=0, size=16)) +
xlab("year") +
ylab("team runs per game")
Adding a subtitle and a caption: the function
A popular feature of charts–particularly in magazines–is a subtitle that has a summary of what the chart shows and/or what the author wants to emphasize.In this case, we could legitimately say something like any of the following:
- The peak of run scoring in the 2000 season has been followed by a steady drop
- Teams scored 20% fewer runs in 2015 than in 2000
- Team run scoring has fallen to just over 4 runs per game from the 2000 peak of 5 runs
- Run scoring has been falling for 15 years, reversing a 30 year upward trend
How can we add a subtitle to our chart that does that, as well as a caption that acknowledges the source of the data? The new
labs
function, available in ggplot2 version 2.2.0, lets us do that.Note that
labs
contains the title, subtitle, caption, as well as the X and Y axis labels.MLBRPGplot +
labs(title = "MLB run scoring, 1901-2015",
subtitle = "Run scoring has been falling for 15 years, reversing a 30 year upward trend",
caption = "Source: the Lahman baseball database",
x = "year", y = "team runs per game")
Easy.
Thanks to everyone involved with ggplot2 who made this possible.
The code for this post (as an R markdown file) can be found in my Bayesball github repo.
-30-