Full disclosure: I am now one of the Lahman package project members.
This is my first blog post using the Lahman package, and as a first step I will simply recreate the league run scoring trends graphs that I generated previously. Originally, I had used data from Baseball Reference, for the simple reason that the Lahman database does not, in its source form, contain any league-level aggregations.
The process for loading the Lahman package is as simple as any other R package; this simplicity is even greater if you are using an IDE such as RStudio. Once loaded, you have access to all the tables in the database, without any of the futzing that is sometimes required in tidying up a raw flat file (I find that variable names are sometimes lost or changed in translation).
The code (available as a gist here, downloadable as an R script file) creates a pair of tables, calculating each league's run scoring rates by year. Then, recycling my earlier code, it calculates a series of trend lines using the loess method, and graphs those trend lines. For simplicity's sake, only the final version of each graph is shown.
Step 1: install the package (if you haven't already), access the library, and load the data table “Teams”.
# load the package into R, and open the data table 'Teams' into the
# workspace
library("Lahman")
data(Teams)
#
The second step is to use the individual team season results to calculate the aggregate of each league's year. We start with 1901, the year the American League was formed. Once those tables are created, the loess function is used to calculate trend lines for each league's run scoring environment.
# ===== CREATE LEAGUE SUMMARY TABLES
#
# select a sub-set of teams from 1901 [the establishment of the American
# League] forward to 2012
Teams_sub <- as.data.frame(subset(Teams, yearID > 1900))
# calculate each team's average runs and runs allowed per game
Teams_sub$RPG <- Teams_sub$R/Teams_sub$G
Teams_sub$RAPG <- Teams_sub$RA/Teams_sub$G
# create new data frame with season totals for each league
LG_RPG <- aggregate(cbind(R, RA, G) ~ yearID + lgID, data = Teams_sub, sum)
# calculate league + season runs and runs allowed per game
LG_RPG$LG_RPG <- LG_RPG$R/LG_RPG$G
LG_RPG$LG_RAPG <- LG_RPG$RA/LG_RPG$G
# select a sub-set of teams from 1901 [the establishment of the American
# League] forward to 2012 read the data into separate league tables
ALseason <- (subset(LG_RPG, yearID > 1900 & lgID == "AL"))
NLseason <- (subset(LG_RPG, yearID > 1900 & lgID == "NL"))
#
# ===== TRENDS: RUNS SCORED PER GAME
#
# AMERICAN LEAGUE create new object ALRunScore.LO for loess model
ALRunScore.LO <- loess(ALseason$LG_RPG ~ ALseason$yearID)
ALRunScore.LO.predict <- predict(ALRunScore.LO)
# create new objects RunScore.Lo.XX for loess models with 'span' control
# span = 0.25
ALRunScore.LO.25 <- loess(ALseason$LG_RPG ~ ALseason$yearID, span = 0.25)
ALRunScore.LO.25.predict <- predict(ALRunScore.LO.25)
# span = 0.5
ALRunScore.LO.5 <- loess(ALseason$LG_RPG ~ ALseason$yearID, span = 0.5)
ALRunScore.LO.5.predict <- predict(ALRunScore.LO.5)
# NATIONAL LEAGUE create new object RunScore.LO for loess model
NLRunScore.LO <- loess(NLseason$LG_RPG ~ NLseason$yearID)
NLRunScore.LO.predict <- predict(NLRunScore.LO)
# loess models
NLRunScore.LO.25 <- loess(NLseason$LG_RPG ~ NLseason$yearID, span = 0.25)
NLRunScore.LO.25.predict <- predict(NLRunScore.LO.25)
NLRunScore.LO.5 <- loess(NLseason$LG_RPG ~ NLseason$yearID, span = 0.5)
NLRunScore.LO.5.predict <- predict(NLRunScore.LO.5)
#
Now that we have calculated the league averages and trend lines (using the loess method), we can start the plots. First, a simple plot of the actual values:
# MULTI-PLOT -- MERGING AL AND NL RESULTS plot individual years as lines
ylim <- c(3, 6)
# start with AL line
plot(ALseason$LG_RPG ~ ALseason$yearID, type = "l", lty = "solid", col = "red",
lwd = 2, main = "Runs per team per game, 1901-2012", ylim = ylim, xlab = "year",
ylab = "runs per game")
# add NL line
lines(NLseason$yearID, NLseason$LG_RPG, lty = "solid", col = "blue", lwd = 2)
# chart additions
grid()
legend(1900, 3.5, c("AL", "NL"), lty = c("solid", "solid"), col = c("red", "blue"),
lwd = c(2, 2))
# plot multiple loess curves (span=0.50 and 0.25)
ylim <- c(3, 6)
# start with AL line
plot(ALRunScore.LO.5.predict ~ ALseason$yearID, type = "l", lty = "solid", col = "red",
lwd = 2, main = "Runs per team per game, 1901-2012", ylim = ylim, xlab = "year",
ylab = "runs per game")
# add NL line
lines(NLseason$yearID, NLRunScore.LO.5.predict, lty = "solid", col = "blue",
lwd = 2)
# add 0.25 lines
lines(ALseason$yearID, ALRunScore.LO.25.predict, lty = "dashed", col = "red",
lwd = 2)
lines(NLseason$yearID, NLRunScore.LO.25.predict, lty = "dashed", col = "blue",
lwd = 2)
# chart additions
legend(1900, 3.5, c("AL (span=0.50)", "NL (span=0.50)", "AL (span=0.25)", "NL (span=0.25)"),
lty = c("solid", "solid", "dashed", "dashed"), col = c("red", "blue", "red",
"blue"), lwd = c(2, 2, 2, 2))
grid()
# 1. absolute
RunDiff <- (ALseason$LG_RPG - NLseason$LG_RPG)
# 2. LOESS span=0.25
RunDiffLO <- (ALRunScore.LO.25.predict - NLRunScore.LO.25.predict)
#
And plot the differences.
# plot each year absolute difference as bar, difference in trend as line
ylim <- c(-1, 1.5)
plot(RunDiff ~ ALseason$yearID, type = "h", lty = "solid", col = "blue", lwd = 2,
main = "Run scoring trend: AL difference from NL, 1901-2012", ylim = ylim,
xlab = "year", ylab = "runs per game")
# add RunDiff line
lines(ALseason$yearID, RunDiffLO, lty = "solid", col = "black", lwd = 2)
# add line at zero
abline(h = 0, lty = "dotdash")
# chart additions
grid()
legend(1900, 1.5, c("AL difference from NL: absolute", "AL difference from NL, LOESS (span=0.25)"),
lty = c("solid", "solid"), col = c("blue", "black"), lwd = c(2, 2))
#
The one after that: ggplot2 versions of the graphs.
-30-
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.