August 10, 2012

Trends in run scoring - comparing the leagues


My previous two posts looked at using R to create trend lines for the run scoring environments in the American and National leagues.  This time around, I'll plot the two against each other to allow for some comparisons.

(The code below assumes that you've read the data into your workspace and calculated the LOESS trend lines, as I did in the previous two posts.)
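If you haven't worked through those posts, here is a minimal sketch of how a trend object like ALRunScore.LO.25.predict might be built. The data frame demoSeason and its numbers are made-up placeholders (not real run-scoring figures) so the snippet runs on its own, and the object names demo.LO.25 and demo.LO.25.predict are hypothetical; with the real tables the same calls would presumably be applied to ALseason and NLseason.

```r
# Placeholder data only: 40 made-up seasons so the snippet runs on its own.
# These are NOT real run-scoring figures.
demoSeason <- data.frame(Year = 1901:1940)
demoSeason$R <- 4.5 + 0.5 * sin(demoSeason$Year / 5)

# Fit a LOESS curve; span = 0.25 uses the nearest 25% of the points for each
# local fit, so the curve follows the data more closely than the default (0.75).
demo.LO.25 <- loess(R ~ Year, data = demoSeason, span = 0.25)

# Fitted values, one per season, in the same order as the rows of demoSeason
demo.LO.25.predict <- predict(demo.LO.25)
```

Because predict() returns one fitted value per row, the result can be plotted directly against the Year column, which is what the charts in this post rely on.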

One of the things I quickly appreciated about the R environment is the option to quickly compare and manipulate (for example, multiply) data from two different source files without having to cut-and-paste the data together.  For everything in this post, we've got two data tables (one for each league) and they remain separate.



First off, here's a scatter plot with the two leagues given different symbols to tell them apart.  Note that the AL data are referenced as coming from the table ALseason, while the NL points come from the table NLseason.

# plot individual years as points
ylim <- c(3,6)
# start with AL
plot(ALseason$R ~ ALseason$Year,
  type="p", pch=1, col="black",
  main = "Runs per team per game, 1901-2012", 
  ylim = ylim, 
  xlab = "year", ylab = "runs per game")
# add NL points
  points(NLseason$Year, NLseason$R, pch=2, col="blue")
# chart additions
  grid()
  legend(1900, 6, c("AL", "NL"), pch=c(1, 2), col=c("black", "blue"))
#




The result is an indistinguishable soup -- the only clear-cut thing we can see is that between about 1950 and 1985 there was a run scoring valley between the peaks in the 1930s and the 1990s/2000s.

So let's change the code slightly, and plot the leagues as lines instead of points.  The R code uses type="l" (for line) instead of type="p" (for point).


# plot individual years as lines
ylim <- c(3,6)
# start with AL line
plot(ALseason$R ~ ALseason$Year,
  type="l", lty="solid", col="red", lwd=2,
  main = "Runs per team per game, 1901-2012", 
  ylim = ylim, 
  xlab = "year", ylab = "runs per game")
# add NL line
  lines(NLseason$Year, NLseason$R, lty="solid", col="blue", lwd=2)
# chart additions
  grid()
  legend(1900, 3.5, c("AL", "NL"), lty=c("solid", "solid"), col=c("red", "blue"), lwd=c(2, 2))
#



Now we are getting somewhere.  It's a lot easier to see that in most seasons the AL has scored more runs than the NL.  But how much of this is random noise, and how much is a real trend?  Here's where comparing the LOESS curves comes in.  Below is the code to create a chart that compares the two LOESS lines with span=0.25 (the more "sensitive" setting, which follows the points more closely than the default).


# plot loess curves (span=0.25)
ylim <- c(3,6)
# start with AL line
plot(ALRunScore.LO.25.predict ~ ALseason$Year,
  type="l", lty="solid", col="red", lwd=2,
  main = "Runs per team per game, 1901-2012", 
  ylim = ylim, 
  xlab = "year", ylab = "runs per game")
# add NL line
  lines(NLseason$Year, NLRunScore.LO.25.predict, lty="dashed", col="blue", lwd=2)
# chart additions
   legend(1900, 3.5, 
    c("AL (span=0.25)", "NL (span=0.25)"), 
    lty=c("solid", "dashed"), 
    col=c("red", "blue"), 
    lwd=c(2, 2))
  grid()
#



And we can go a step further: rather than plot the two lines, we can calculate the difference between them.  One way is to compare the absolute difference (in runs per game) between the two leagues.  The plot below shows the year-by-year difference in run scoring as vertical bars (type="h"), and overlays the difference between the two LOESS trend lines as a line.


# calculate the difference between the two leagues 
# 1. absolute difference
RunDiff <- (ALseason$R - NLseason$R)
# 2. LOESS span=0.25
RunDiffLO <- (ALRunScore.LO.25.predict - NLRunScore.LO.25.predict)
#
# plot each year absolute difference as bar, trend as line
ylim <- c(-1,1.5)
plot(RunDiff ~ ALseason$Year,
  type="h", lty="solid", col="blue", lwd=2,
  main = "Run scoring trend: AL difference from NL, 1901-2012", 
  ylim = ylim, 
  xlab = "year", ylab = "runs per game")
# add LOESS difference (RunDiffLO) line
  lines(ALseason$Year, RunDiffLO, lty="solid", col="black", lwd=2)
# add line at zero
  abline(h = 0, lty="dotdash") 
# chart additions
  grid()
#
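One caution about the subtraction above: ALseason$R - NLseason$R works element by element, so it quietly assumes that both tables list exactly the same seasons in the same order. If that might not hold, one option is to match the rows on Year with merge() first. The sketch below uses two tiny made-up tables (AL_demo and NL_demo are hypothetical names, and the numbers are placeholders, not real data) in which the second table has an extra season.

```r
# Made-up placeholder tables, only so the snippet runs on its own
AL_demo <- data.frame(Year = 1901:1905, R = c(5.3, 4.9, 4.1, 3.5, 3.7))
NL_demo <- data.frame(Year = 1900:1905, R = c(5.2, 4.6, 4.0, 4.8, 3.9, 4.1))

# The two Year columns do not match, so a plain subtraction would misalign
same_years <- identical(AL_demo$Year, NL_demo$Year)

# Keep only seasons present in both tables, matched on Year;
# the R columns are renamed R.AL and R.NL by the suffixes argument
both <- merge(AL_demo, NL_demo, by = "Year", suffixes = c(".AL", ".NL"))

# Difference for matched seasons only
both$RunDiff <- both$R.AL - both$R.NL
```

If identical() returns TRUE for the real tables, the direct subtraction is fine as written and the merge step is unnecessary.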



One thing that jumps out is the huge spike in the American League before WWII, when AL teams were scoring half a run per game more (on average about 15% more runs) than the teams on the senior circuit. Steve Treder, in an essay titled "A Tale of Two Leagues" in the 2004 Hardball Times, offers up the only explanation I've ever seen: a marked difference in how the baseballs were made for the two leagues. The construction of the ball was standardized because of the limited availability of materials during the war, and AL scoring fell back to the same level as the NL.

After WWII, there was a period of roughly 20 years of parity between the leagues, until the late 1960s when AL run production fell below the NL's. This changed dramatically and consistently with the introduction of the Designated Hitter rule to the American League in the 1973 season. Only once since then -- 1974 -- has the AL scored fewer runs per game than the NL.  After the introduction of the DH, the American League scored about 10% more runs than the National League. I don't recall seeing any analysis on the impact of the DH, but I suspect that 10% is roughly the difference we would expect when we swap a pitcher for a "real hitter" in the batting order. (Note to self: this is a future research/analysis project.)
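The percentages quoted here are relative differences, which can be computed alongside the absolute ones. A minimal sketch follows, assuming the NL figure is the base of the percentage; pct_diff is a hypothetical helper name and the inputs are placeholder values, not actual league averages.

```r
# Relative difference: how much more (in percent) the AL scores than the NL,
# taking the NL figure as the base. pct_diff is a hypothetical helper.
pct_diff <- function(al, nl) {
  100 * (al - nl) / nl
}

# Placeholder values, not real league averages: 4.4 vs 4.0 runs per game
example_pct <- pct_diff(4.4, 4.0)

# With the tables from this post, the same idea would be
#   RunDiffPct <- pct_diff(ALseason$R, NLseason$R)
# which could then be plotted against ALseason$Year like RunDiff.
```

The function is vectorized, so it accepts whole columns as well as single numbers.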

The difference between the two leagues remained at about that level until 2000 (suggesting some level of talent equilibrium, if we accept that the difference between the leagues is simply a matter of not having pitchers [try to] hit).  And in the last decade, the difference has shrunk. The easiest explanation is the impact of interleague play, which now accounts for 11% of the games.  In these games, a National League team uses a DH when playing in an American League park, and AL pitchers are in the batting order when they play in an NL park.

The fundamental changes in how often pitchers make plate appearances in Major League Baseball, introduced with the Designated Hitter and interleague play, are easy explanations for the recent differences in run scoring.  But are there other, more subtle factors that could have contributed to the differences between play in the American and National Leagues?  If you've got any ideas, I would love to hear them.

-30-



2 comments:

  1. Hello, I was wondering if you had a spreadsheet by league of average number of runs per year.
    thanks
    John

  2. Hi John, I've put copies of the National and American league spreadsheets (as .csv files) on Google Docs.
    National League: https://drive.google.com/file/d/0B7t4wpcrwqkBdEt2UG4xNDJTdms/edit?usp=sharing
    American League: https://drive.google.com/file/d/0B7t4wpcrwqkBdEt2UG4xNDJTdms/edit?usp=sharing

    To create the files I used the Lahman package in R, which is only current up to the 2012 season. I've updated the files manually with the 2013 data points.
