Bayes Ball: Comparing individual team run production

February 2, 2013

Comparing individual team run production

Or, The 2010 Mariners: How Bad Were They?

In earlier posts, I used the statistical software R to plot the trends in league average run scoring since 1901. This was the first step to answering other questions I had on my mind:

How poor was the offensive performance of the 2010 Seattle Mariners?
Are they showing any signs of improvement?
And how can I use R to tabulate the data to answer these questions?

So, on to Question #1. It is well established that the 2010 Mariners were not very good, at least offensively. (For fans of the team, the well-deserved Cy Young Award won by Felix Hernandez was surely the highlight of the season.) But I wanted a relative measure that would be comparable across time, to accommodate the fluctuations in run scoring that were the subject of those earlier posts.

As I started into this, the first decision was where to draw a line in the historical record. I opted to use the eras described in Bill James' "Dividing Baseball History into Eras" article (behind a paywall – but chances are if you're reading my blog, you're already a Bill James subscriber):

Era 1 (The Pioneer Era), 1871-1892
Era 2 (The Spitball Era), 1893-1919
Era 3 (The Landis Era), 1920-1946
Era 4 (The Baby Boomers Era), 1947-1968
Era 5 (The Artificial Turf Era), 1969-1992
Era 6 (The Camden Yards Era), 1993-2012

Based on these groupings, I opted to use the range of seasons 1947-2012 inclusive. This yields 1,580 team seasons of National League and American League baseball.

The second step was to calculate a runs per game (RPG) value for each team, by year. This corrects for the longer regular season in the post-expansion period and for the strike-shortened seasons, and gives us a common denominator for comparing results across seasons, including 2012.

To do this, I accessed the 2012 edition of the Lahman database. Once I had downloaded and extracted the comma-delimited version of the files, I read the "teams" file into R.
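Here is a minimal sketch of that read-and-subset step. The inline CSV is a tiny illustrative stand-in for the real "Teams.csv" file (the file name and the numbers are assumptions), so the snippet runs on its own; with the actual download you would point read.csv at the extracted file instead.

```r
# Tiny stand-in for the Lahman "Teams.csv" file (values are illustrative only)
csv_text <- "yearID,lgID,franchID,G,R,RA
1946,AL,BOS,156,792,594
1947,AL,NYY,155,794,568
2010,AL,SEA,162,513,698
2012,NL,COL,162,758,890"

# With the real file this would be something like: Teams <- read.csv("Teams.csv")
Teams <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Keep only the 1947-2012 seasons
Teams <- subset(Teams, yearID >= 1947 & yearID <= 2012)
nrow(Teams)
```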

(Sidebar: My original intention was to incorporate chunks of the R code into this post. But Blogspot seems to be going out of its way to make formatting of that code a nightmare. I located a post at Getting Genetics Done that pointed me to the Gist feature at GitHub. I have to admit to being a total GitHub neophyte, but I have managed to create a public Gist that will allow you to access all of the R code I used here.)

With the 1947-2012 data frame constructed, we can calculate the average number of runs each team scored per game. And while we are at it, we can calculate the runs allowed, too (and save them for another day). The variables are R (runs), RA (runs allowed), and G (games).
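As a rough sketch of the per-game calculation: R, RA, and G are the Lahman variable names, while the new column names RPG and RAPG are my own assumption. The runs and games for the two teams come from figures quoted later in this post; the runs-allowed values are just placeholders.

```r
# Two team seasons; RA values are placeholders for illustration
Teams <- data.frame(
  yearID   = c(2010, 1968),
  franchID = c("SEA", "CHW"),
  R        = c(513, 463),
  RA       = c(698, 527),
  G        = c(162, 162)
)

# Runs scored and runs allowed per game
Teams$RPG  <- Teams$R  / Teams$G
Teams$RAPG <- Teams$RA / Teams$G

round(Teams$RPG, 2)
```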

The next step is to calculate the league averages. First, I used "aggregate", which summed the number of runs (variable R) by year (yearID) and league (lgID). Each of the lines of code below creates a stand-alone summary table with three variables: the year, league, and the sum of the variable.
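A sketch of what those "aggregate" calls might look like, using a four-team toy data frame (the numbers are illustrative). Passing the grouping columns as a data frame is my guess at how the "Teams.yearID" and "Teams.lgID" names mentioned below came about; the summed column comes back with the default name "x".

```r
# Toy data: two AL and two NL teams in one season (illustrative values)
Teams <- data.frame(
  yearID = c(2010, 2010, 2010, 2010),
  lgID   = c("AL", "AL", "NL", "NL"),
  R      = c(513, 859, 790, 587),
  RA     = c(698, 693, 640, 866),
  G      = c(162, 162, 162, 162)
)

# Grouping columns; data.frame() turns "Teams$yearID" into "Teams.yearID"
groups <- data.frame(Teams$yearID, Teams$lgID)

# One stand-alone summary table per variable
RunsLG  <- aggregate(Teams$R,  by = groups, FUN = sum)
RunsALG <- aggregate(Teams$RA, by = groups, FUN = sum)
GamesLG <- aggregate(Teams$G,  by = groups, FUN = sum)

RunsLG
```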

Now I have three new data frames -- "RunsLG", "RunsALG", and "GamesLG". Each has only 3 variables, but they share "Teams.yearID" and "Teams.lgID". This code, using the self-explanatory "merge" command, builds a single table LG_RPG with all three of the newly created variables.
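One way the merge could go, sketched with toy versions of the three summary tables (values illustrative). Since "merge" joins two tables at a time, it takes two calls. The summed columns are given distinct names (R, RA, G) here, which is an assumption; if all three were still called "x", a default merge would try to match on that column too.

```r
by_cols <- c("Teams.yearID", "Teams.lgID")

# Toy league-season totals (illustrative values)
RunsLG  <- data.frame(Teams.yearID = c(2010, 2010), Teams.lgID = c("AL", "NL"),
                      R = c(1372, 1377))
RunsALG <- data.frame(Teams.yearID = c(2010, 2010), Teams.lgID = c("AL", "NL"),
                      RA = c(1391, 1506))
GamesLG <- data.frame(Teams.yearID = c(2010, 2010), Teams.lgID = c("AL", "NL"),
                      G = c(324, 324))

# merge() is pairwise, so chain two calls on the shared key columns
LG_RPG <- merge(RunsLG, RunsALG, by = by_cols)
LG_RPG <- merge(LG_RPG, GamesLG, by = by_cols)

LG_RPG
```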

Now that we've got the totals of runs, runs allowed, and games for each league's season, the next step is to use those values to calculate the averages of runs and runs allowed per game, followed by a bit of variable name maintenance.
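A sketch of the league per-game averages and the renaming, again on toy totals. The new column names (lgRPG, lgRAPG, lgR, lgRA, lgG) are hypothetical; the point of the renaming is that the key columns end up matching the "Teams" table and the league totals no longer share names with the team-level R, RA, and G.

```r
# Toy league-season totals (illustrative values)
LG_RPG <- data.frame(
  Teams.yearID = c(2010, 2010),
  Teams.lgID   = c("AL", "NL"),
  R  = c(1372, 1377),
  RA = c(1391, 1506),
  G  = c(324, 324)
)

# League averages per game
LG_RPG$lgRPG  <- LG_RPG$R  / LG_RPG$G
LG_RPG$lgRAPG <- LG_RPG$RA / LG_RPG$G

# Variable name maintenance: match the Teams keys, de-clash the totals
names(LG_RPG)[names(LG_RPG) == "Teams.yearID"] <- "yearID"
names(LG_RPG)[names(LG_RPG) == "Teams.lgID"]   <- "lgID"
names(LG_RPG)[names(LG_RPG) == "R"]  <- "lgR"
names(LG_RPG)[names(LG_RPG) == "RA"] <- "lgRA"
names(LG_RPG)[names(LG_RPG) == "G"]  <- "lgG"

names(LG_RPG)
```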

Although it's possible to calculate the values I'm looking for from the values in the two separate data frames, I decided to make a single table (ultimately, I will be writing this table as a flat file for later use). There's probably a more elegant way to accomplish this, but it works. [Note to self: this should perhaps be the motto for my coding.] A single line of code is all that's required, since the default for "merge" is to merge based on the columns with shared variable names.
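That single line might look like the "merge" call below. The two toy tables and the league averages in them are illustrative, and the column name lgRPG is an assumption; the only names the tables share are yearID and lgID, so those become the join keys by default.

```r
# Toy team table and league-average table (illustrative values)
Teams <- data.frame(
  yearID   = c(2010, 2010),
  lgID     = c("AL", "NL"),
  franchID = c("SEA", "SDP"),
  R        = c(513, 665),
  G        = c(162, 162)
)
LG <- data.frame(
  yearID = c(2010, 2010),
  lgID   = c("AL", "NL"),
  lgRPG  = c(4.45, 4.33)
)

# Default merge: join on every column name the two tables have in common
Teams.merge <- merge(Teams, LG)

Teams.merge
```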

The table "Teams.merge" is the truncated 1947-2012 version of the Lahman database table "Teams" that was first read into R, with the corresponding league averages for runs and runs allowed added for each team.

The next step in the process is to compare the individual team's runs scored with the league average for that year, by creating an index value where 100 is equal to the league average, and the individual team index is measured relative to this. Thus an index score of 110 indicates that a team scored runs at a rate 10% higher than the league average, or allowed runs at a rate 10% higher than the league average.
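The index itself is just a ratio scaled to 100. Here is a small sketch; the function name run_index is hypothetical, and the inputs are made-up rates chosen to reproduce the 110 example above.

```r
# Index where 100 = league average (function name is hypothetical)
run_index <- function(team_rate, league_rate) {
  100 * team_rate / league_rate
}

# A team scoring 5.5 runs per game in a league averaging 5.0
run_index(5.5, 5.0)

# Works on whole columns too, e.g. team RPG against league RPG
run_index(c(5.5, 4.0), c(5.0, 5.0))
```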

I was also curious about the distribution of the index values, so I calculated the minimum, maximum, and standard deviation, and then plotted the distribution.
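A sketch of those summary statistics on a handful of made-up index values (not the real 1,580 team seasons). The histogram call uses plot = FALSE so it only computes the bin counts; drop that argument to draw the chart.

```r
# Made-up index values for illustration
R_index <- c(132.9, 71.1, 100.4, 95.2, 108.7, 99.3, 88.6, 112.1)

# Minimum, maximum, and standard deviation
c(min = min(R_index), max = max(R_index), sd = sd(R_index))

# Bin the values in steps of 10 index points without drawing anything
h <- hist(R_index, breaks = seq(70, 140, by = 10), plot = FALSE)
h$counts
```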

So let's take a look at the extremes of the distribution -- those offensive juggernauts that managed to score runs at 120% or more of the league average, and whatever the opposite of a juggernaut might be, with run production below 80% of the league rate. Here, I used two different tools -- the "rank" command, and a sorting function, "order".
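Here is roughly how "rank" and "order" fit together, sketched on five rows whose index values are rounded from the tables in this post. The data frame name idx and the 120 and 80 cutoffs written as filters are my own framing of the step.

```r
# Five team seasons, index values rounded from the tables in this post
idx <- data.frame(
  yearID   = c(1976, 2010, 1950, 1969, 1953),
  franchID = c("CIN", "SEA", "BOS", "SDP", "LAD"),
  R_index  = c(132.89, 71.13, 132.25, 71.25, 129.60)
)

# rank() on the negated index gives 1 to the highest-scoring team
idx$R_index_rank <- rank(-idx$R_index)

# order() sorts the rows by that rank; then keep the 120-plus teams
juggernauts <- idx[order(idx$R_index_rank), ]
juggernauts <- juggernauts[juggernauts$R_index >= 120, ]

# The same idea from the other end: lowest index first, below 80 only
laggards <- idx[order(idx$R_index), ]
laggards <- laggards[laggards$R_index < 80, ]

juggernauts
laggards
```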

First, the juggernauts.

ROW	yearID	lgID	franchID	R_index	R_index_rank
565	1976	NL	CIN	132.8853857	1
49	1950	AL	BOS	132.2461322	2
105	1953	NL	LAD	129.6017105	3
1092	1996	NL	COL	126.6497223	4
314	1965	NL	CIN	126.2664769	5
541	1975	NL	CIN	125.650482	6
451	1971	NL	PIT	124.4046836	7
41	1949	NL	LAD	124.0612662	8
118	1954	AL	NYY	123.974382	9
17	1948	AL	BOS	123.8245771	10
1120	1997	NL	COL	123.7739464	11
33	1949	AL	BOS	123.6994013	12
5	1947	AL	NYY	123.6724566	13
297	1964	NL	ATL	123.5204413	14
375	1968	NL	CIN	123.4188181	15
137	1955	NL	LAD	122.9114378	16
713	1982	AL	MIL	122.0939182	17
1283	2003	AL	BOS	122.0507949	18
1409	2007	AL	NYY	121.9362966	19
431	1971	AL	BAL	121.4275066	20
1296	2003	NL	ATL	121.3964208	21
73	1951	NL	LAD	121.2494984	22
323	1966	AL	BAL	121.2018005	23
1239	2001	NL	COL	121.1882488	24
310	1965	AL	MIN	121.1646838	25
1522	2011	AL	BOS	121.0833251	26
282	1963	NL	STL	121.0034335	27
397	1969	NL	CIN	120.7483263	28
1014	1993	NL	PHI	120.5969299	29
344	1967	AL	BOS	120.493992	30
1165	1999	AL	CLE	120.31825	31
368	1968	AL	DET	120.1109289	32

Both the 1976 Reds and the 1950 Red Sox scored runs at a rate more than 30% higher than the league average of the time. These two clubs were clearly part of ongoing offensive powerhouses -- the '75 Reds and both the 1948 and '49 Red Sox also make the list of teams with an index score greater than 120.

And now the equivalent for the low-scoring teams.

ROW	yearID	lgID	franchID	R_index	R_index_rank
1501	2010	AL	SEA	71.13003863	1
404	1969	NL	SDP	71.25193635	2
1256	2002	AL	DET	74.23534967	3
318	1965	NL	NYM	74.83598509	4
113	1954	AL	BAL	74.86764629	5
1286	2003	AL	DET	75.05933378	6
275	1963	NL	HOU	75.16143658	7
637	1979	AL	OAK	75.80085072	8
295	1964	NL	HOU	76.1427378	9
692	1981	AL	TOR	76.17245382	10
1142	1998	AL	TBD	76.37483502	11
18	1948	AL	CHW	76.81081117	12
1302	2003	NL	LAD	76.82640084	13
742	1983	AL	SEA	76.82901532	14
1531	2011	AL	SEA	76.93980429	15
452	1971	NL	SDP	77.20331012	16
8	1947	AL	MIN	77.75801025	17
129	1955	AL	BAL	77.9177115	18
611	1978	AL	OAK	78.11858551	19
861	1988	AL	BAL	78.3863785	20
385	1969	AL	ANA	79.19104726	21
127	1954	NL	PIT	79.23186344	22
943	1991	AL	CLE	79.27644514	23
24	1948	AL	MIN	79.42155431	24
50	1950	AL	CHW	79.44904395	25
95	1952	NL	PIT	79.61825664	26
145	1956	AL	BAL	79.61833163	27
1009	1993	NL	FLA	79.8937472	28

There at the top of this list stand the 2010 Seattle Mariners. They plated 513 runs (3.17 per game), which turns out to be more than the 463 (2.86 per game) scored by the White Sox in a 162-game season in 1968. But the Sox, and the other 19 teams that had lower runs per game values than the 2010 Mariners, were playing in seasons with very low run scoring.

But by the index measure, the 2010 Seattle Mariners were unprecedented in their inability to score runs. With an index score of 71.1, the 2010 Mariners produced fewer runs relative to the league average than any of the other 1,579 teams that played in the period 1947-2012. They scored nearly 30% fewer runs than the league average, and their Z score of -2.96 indicates that this is roughly a 1-in-1,000 event. (OK, for those of you with a bent for precision, it's 1 in 998.5.)
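For anyone who wants to play with the rarity calculation, one common way to turn a Z score into "1 in N" odds is the lower-tail probability of a standard normal distribution. That assumes the index values are roughly normal, which is my assumption here and not necessarily the calculation behind the 998.5 figure. Under that assumption a Z of -2.96 has a tail probability of about 0.0015, so the exact odds are sensitive to the method and to rounding in the Z score.

```r
# Z score quoted above for the 2010 Mariners
z <- -2.96

# Lower-tail probability under a standard normal (an assumed model)
p <- pnorm(z)

# Express the probability as "1 in N"
one_in_n <- 1 / p

p
one_in_n
```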

It's important to note that 2010 wasn't a one-off fluke of bad luck for the Mariners; it just happens to be the nadir of their run scoring performance. The 2011 Mariners were better than the 2010 team, but not by a whole lot. They produced runs at 76.9% of the American League rate that season -- the 15th poorest in the 1947-2011 period.

For my next post, I'll look at the historic trend for the Mariners (you may have noticed other Mariner teams showing up in the above list, although not the 2012 edition of the team) and then move on to the pitching side of the equation -- runs allowed.

-30-

3 comments:

Gus, February 3, 2013 at 9:42 PM
Excellent post! The 1969 Padres weren't that far behind the 2010 Mariners in offensive ineptitude -- 71.13 vs 71.25. At least the Padres had an excuse -- they were an expansion team.

Peter, February 11, 2013 at 4:30 PM
On the R code,

You asked for a more elegant way. Lines 25 - 41 of your code could be replaced with simply:

LG_RPG <- aggregate(cbind(R, RA, G) ~ yearID + lgID, data = Teams, sum)

And then you don't even have to clean up the variable names!