Showing posts with label sabermetrics. Show all posts

March 26, 2017

Updated Shiny app

A short post to alert the world that my modest Shiny application, showing Major League Baseball run scoring trends since 1901, has been updated to include the 2016 season. The application can be found here:
https://monkmanmh.shinyapps.io/MLBrunscoring_shiny/.

In addition to the underlying data, the update moved some of the processing that was happening inside the application into a pre-processing stage. This processing needs to happen only once, and is not related to the reactivity of the application. The change will improve the speed of the app: in addition to reducing the in-app processing, it shrinks the size of the data table loaded into the application.
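To sketch the idea (the table and column names here are hypothetical, not the app's actual schema), the aggregation can run once in a pre-processing script, with the Shiny app loading only the small summary file at startup:

```r
# One-time pre-processing script (run outside the app).
# 'teams' is a stand-in for the raw season-by-season table.
teams <- data.frame(
  yearID = c(1901, 1901, 1902, 1902),
  R = c(700, 650, 620, 680),
  G = c(140, 140, 140, 140)
)

# Collapse to one row per season: league-wide runs per game.
runscoring <- aggregate(cbind(R, G) ~ yearID, data = teams, FUN = sum)
runscoring$RPG <- runscoring$R / runscoring$G

# Save the small summary table; the app calls readRDS() at startup
# instead of re-aggregating on every launch.
saveRDS(runscoring, "runscoring.rds")
```

The app's server code would then begin with something like `runscoring <- readRDS("runscoring.rds")`, leaving only the reactive filtering inside the application.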

The third set of changes was a consequence of the updates to the Shiny and ggplot2 packages in the two years that have passed since I built the app. In Shiny, the "format" argument of the sliderInput widget has been deprecated, and in ggplot2 there was a change in the quotes around the "method" specification in stat_smooth(). Little things that took a few minutes to debug! Next up will be some formatting changes, and a different approach to one of the visualizations.

 -30-

December 17, 2013

Book Review: Analyzing Baseball Data with R

by Max Marchi and Jim Albert (2014, CRC Press)


The Sabermetric bookshelf, #3


Here we have the perfect book for anyone who stumbles across this blog--the intersection of R and baseball data. The open source statistical programming environment of R is a great tool for anyone analyzing baseball data, from the robust analytic functions to the great visualization packages. The particular readership niche might be small, but as both R and interest in sabermetrics expand, it's a natural fit.


And one would be hard pressed to find better qualified authors, writers who have feet firmly planted in both worlds.  Max Marchi is a writer for Baseball Prospectus, and it's clear from the ggplot2 charts in his blog entries (such as this entry on left-handed catchers) that he's an avid R user.

Jim Albert is a Professor in the Department of Mathematics and Statistics at Bowling Green State University; three of his previous books sit on my bookshelf. Curve Ball, written with Jay Bennett, is pure sabermetrics, and one of the best books ever written on the topic (and winner of SABR's Baseball Research Award in 2002).  Albert's two R-focussed  books, the introductory R by Example (co-authored with Maria Rizzo) and the more advanced Bayesian Computation with R, are intended as supplementary texts for students learning statistical methods. Both employ plenty of baseball examples in their explanations of statistical analysis using R.

In Analyzing Baseball Data with R Marchi and Albert consolidate this joint expertise, and have produced a book that is simultaneously interesting and useful.

The authors take a very logical approach to the subject at hand. The first chapter concerns the three sources of baseball data that are referenced throughout the book:
- the annual summaries contained within the Lahman database,
- the play-by-play data at Retrosheet, and
- the pitch-by-pitch PITCHf/x data.
The chapter doesn't delve into R, but summarizes the contents of the three data sets, and takes a quick look at the types of questions that can be answered with each.

The reader first encounters R in the second and third chapters, titled "Introduction to R" and "Traditional Graphics". These two chapters cover many of the basic topics that a new R user needs to know, starting with installing R and RStudio, then moving on to data structures like vectors and data frames, objects, functions, and data plots. Some of the key R packages are also covered in these chapters, both functional packages like plyr and data packages, notably Lahman, the data package containing the Lahman database.

The material covered in these early chapters is what I learned early on in my own R experience, but whereas I relied on multiple sources and an unstructured, ad hoc approach, in Analyzing Baseball Data with R a newcomer to R will find the basics laid out in a straightforward and logical progression. These chapters will most certainly help them climb the steep learning curve faced by every neophyte R user. (It is worth noting that the "Introduction to R" chapter relies heavily on a fourth source of baseball data -- the 1965 Warren Spahn Topps card, from the last season of his storied career. Not all valuable data are big data.)

From that point on, the book tackles some of the core concepts of sabermetrics. This includes the relationship between runs and wins, run expectancy, career trajectories, and streaky performances. As the authors work through these and other topics, they weave in information about additional R functions and packages, along with statistical and analytic concepts. For example, one chapter introduces Markov Chains in the context of using R to simulate half-inning, season, and post-season outcomes.

The chapter "Exploring Streaky Performances" provides an opportunity to take a closer look at how Analyzing Baseball Data with R compares to Albert's earlier work. In this case, the chapter uses moving average and simulation methodologies, providing the data and code to examine recent examples (Ichiro and Raul Ibanez). This is methodologically similar to what is described in Curve Ball, but with the addition of "here's the data and the code so you can replicate the analysis yourself". This approach differs substantially from the much more mathematical content in Albert's text Bayesian Computation with R, where the example of streaky hitters is used to explore beta functions and the laplace R function.

Woven among these later chapters are also ones that put R first, and use baseball data as the examples. A chapter devoted to the advanced graphics capabilities of R splits its time between the lattice and ggplot2 packages. The examples used in this chapter include visualizations that analyze variations in Justin Verlander's pitch speed.

Each chapter of the book also includes "Further Reading" and "Exercises" sections, which give readers the chance to dig deeper into the topic just covered and to apply their new-found skills. The exercises are consistently interesting and often draw on previous sabermetric research. Here are a couple of examples:
  • "By drawing a contour plot, compare the umpire's strike zone for left-handed and right-handed batters. Use only the rows of the data frame where the pitch type is a four-seam fastball." (Chapter 7)
  • "Using [Bill] James' similarity score measure ..., find the five hitters with hitting statistics most similar to Willie Mays." (Chapter 8)
The closing pages of the book are devoted to technical arcana regarding the data sources, and how-to instructions on obtaining those data.

The authors have established a companion blog (http://baseballwithr.wordpress.com/), which has an expansion of the analytics presented in the book.  For example, the entry from December 12, 2013 goes deeper into ggplot2 capabilities to enhance and refine charts that were described in the book.

Analyzing Baseball Data with R provides readers with an excellent introduction to both R and sabermetrics, using examples that provide nuggets of insight into baseball player and team performance. The examples are clear, the R code is well explained and easy to follow, and I found the examples consistently interesting. All told, Analyzing Baseball Data with R will be an extremely valuable addition to the practicing sabermetrician's library, and is most highly recommended.

Additional Resources


Jim Albert and Jay Bennett (2003), Curve Ball: Baseball, Statistics, and the Role of Chance in the Game (revised edition), Copernicus Books.

Jim Albert and Maria Rizzo (2011), R by Example, Springer.

Jim Albert (2009), Bayesian Computation with R (2nd edition), Springer.

An interview with Max Marchi, originally posted at MilanoRnet and also available through R-bloggers

-30-


June 16, 2013

Annotating select points on an X-Y plot using ggplot2

or, Is the Seattle Mariners outfield a disaster?

The Backstory
Earlier this week (2013-06-10), a blog post by Dave Cameron appeared at USS Mariner under the title “Maybe It's Time For Dustin Ackley To Play Some Outfield”. In the first paragraph, Cameron describes the Seattle Mariners outfield this season as “a complete disaster” and Raul Ibanez as “nothing short of a total disaster”.

To back up the Ibanez assertion, the article included a link to a Fangraphs table showing the defensive metrics for all MLB outfielders with a minimum of 200 innings played to date, sorted in ascending order of UZR.150 (UZR is generally recognized as the best defensive metric). And there, at the top (or bottom) of the list, Raul Ibanez.

But surely, I thought, Ibanez's offensive production – starting with the 11 home runs he had hit at the time, now up to 13 – offsets to some degree the lack of defense. So I took a look at a variety of offensive measures to see how Ibanez stacks up. It quickly struck me that wRAA (Weighted Runs Above Average), the offensive component of WAR (Wins Above Replacement, the best comprehensive measure of a player's overall contribution, which also includes a base-running component not examined here), would make an interesting scatterplot against UZR. And a great opportunity to use ggplot2.

Manipulating the data
Using this table from Fangraphs (advanced batting stats of all MLB players so far this season), I created a new table “outfield” that appended the advanced hitting stats to the defensive stats in the original table, and then set about creating the plot using the ggplot2 package in R.

Note: once I had downloaded the two Fangraphs tables as csv files (with results through 2013-06-15), I edited the file names slightly.

# load the ggplot2 and grid packages
library(ggplot2)
library(grid)
# read data (note csv files are renamed)
tbl1 = read.csv("FanGraphs_Leaderboard_h.csv")
tbl2 = read.csv("FanGraphs_Leaderboard_d.csv")
# create new table with data from both tbl1 and tbl2 by link on variable
# 'playerid'
outfield = data.frame(merge(tbl1, tbl2, by = "playerid"))
# clean up the variable names of the two Name fields
names(outfield)[2] = paste("Name")
names(outfield)[21] = paste("Name.y")
#


A quick plot
With the two data sets now merged, we can start plotting the results. First of all, a quick plot using ggplot2's “qplot” needs only one line of code, and three specifications (X axis data, Y axis data, and the name of the source table):

  qplot(UZR.150, wRAA, data = outfield)



So that must be Raul Ibanez over there on the far left. It's clear from this plot that his hitting (represented on the Y axis) is just above the 0 line, and a long way below the outfielders who are hitting up a storm. It's worth keeping in mind that Ibanez's hitting contribution is helped to some degree by the fact that just over two-thirds of his plate appearances so far this year (126 of 187) have been as a designated hitter or pinch hitter.

In looking at this plot, you might ask the same thing I did: Where are the rest of the Mariners outfielders, and who are the stars of the X and Y axes?

Code to set up the tables for plotting

The next chunk of code takes three approaches to identifying groups and individuals on the chart. We don't want to plot the names of all 110 players; that would be utterly illegible. Instead, we'll focus on three groups: the Seattle Mariners, the top UZR.150 players, and the top wRAA players. The Mariners player points and names will be navy blue, and the others black. The code will label the Mariners players and the top performers on the wRAA axis automatically, and a manual approach will be used to identify the top UZR players.

But before plotting the results, new variables in the “outfield” table are created that have the names of the Mariners players, the UZR stars, and the wRAA stars.

# create new MarinerNames field that contains only the name of Mariners
# players (plagiarized from Winston Chang's R Graphics Cookbook, Recipe 5.11)
outfield$MarinerNames = outfield$Name
idx = (outfield$Team.x == "Mariners")
outfield$MarinerNames[!idx] = NA
# create a new table, taking a subset that has only the Mariners players
Mariners = subset(outfield, Team.x == "Mariners")
# sort the table by wRAA, then keep the names of the top 4 wRAA stars
outfield$wRAAstars = outfield$Name
outfield = outfield[order(-outfield$wRAA), ]
outfield$wRAAstars[5:110] = NA
# sort the table by UZR.150, then copy the first 3 names
outfield$UZRstars = outfield$Name
outfield = outfield[order(-outfield$UZR.150), ]
outfield$UZRstars[4:110] = NA
#


The final plot code
# the full ggplot version, creating an object called "WARcht"
WARcht = ggplot(outfield, aes(x=UZR.150, y=wRAA)) + #
   geom_point(colour="gray60", size=2.0) + # set the colour and size of the points
   theme_bw() + # and use the "background white" theme
   ggtitle("Everyday Outfielders, 2013 [to 2013-06-15]") # and put a title on the plot
#
# start with WARcht, add geom_text() [for auto labels] and annotate() [for manual labels and arrows]
#
#
WARcht + # print the chart object
   geom_text(aes(label=MarinerNames), size=4, fontface="bold", colour="navyblue",
      vjust=0, hjust=-0.1) + # add the names of the Mariners players
   geom_text(aes(label=wRAAstars), size=3, fontface="bold",
      vjust=0, hjust=-0.1) + # add the names of the top wRAA players
   annotate("text", label="Shane Victorino", x=40, y=3, size=3,
      fontface="bold.italic") + # manually place the label for Shane Victorino
   annotate("segment", x=50, y=2, xend=51.7, yend=-0.4, size=0.5,
      arrow=arrow(length=unit(.2, "cm"))) + # manually place the Victorino arrow
   annotate("text", label="Craig Gentry", x=40, y=-7.0, size=3,
      fontface="bold.italic") +
   annotate("segment", x=42, y=-6.6, xend=40.9, yend=-4.0, size=0.5,
      arrow=arrow(length=unit(.2, "cm"))) +
   annotate("text", label="A.J. Pollock", x=49, y=-2.5, size=3,
      fontface="bold.italic") +
   geom_point(data=Mariners, aes(x=UZR.150, y=wRAA), colour="navyblue", size=4) # over-plot the points for the Mariners players





The final analysis

In addition to Raul Ibanez, there are four other Mariners outfielders who have logged more than 200 innings. The only one on the plus side of the UZR.150 ledger is Jason Bay, at 5.5. And along with Ibanez, only Michael Morse has a positive wRAA. Put another way, all five are more or less in the lower right-hand quadrant of the chart. So yes, it's a fair assessment that the Mariners outfield is a disaster.

The Major League outfielders who are the top hitters (the Y axis on the chart) are led by the Rockies' Carlos Gonzalez (at 28.1), ahead of Shin-Soo Choo (21.2) and Mike Trout (19.8). And defensively (the X axis), Shane Victorino leads with 51.9, followed by Craig Gentry (40.9) and A.J. Pollock (39.1).

The only outfielder who shines on both dimensions is the Brewers' Carlos Gomez, who stands in fourth place on both UZR.150 and wRAA. As the chart shows, so far this season he's in a class by himself.

Note: the code above can be found in a gist at github.

-30-

April 5, 2013

Strikeout rates - update

Just a quick follow-up to my earlier post: James Gentile at Beyond the Boxscore has written another great analysis on strikeouts, this one titled "How can strikeouts be great for pitchers, but not that bad for hitters?" The analysis delves deeper into the question of increased strikeout rates by looking at the asymmetry between pitcher and hitter outcomes.

It boils down to this sentence: "Over time, hitters, managers, and front offices have slowly recognized more and more that they can trade additional strikeouts for an increase in production at the plate with very little repercussions."

-30-

March 29, 2013

On strikeout rates

A couple of recent articles have looked into the increased rate of strikeouts per game.

In the New York Times, Tyler Kepner has an article titled "Swing and a Mystery: Strikeout Rates Are Soaring".  This one has a sidebar article "Strikeouts on the rise", which includes an interactive chart displaying the changes over time (including an optional overlay for a selected team).

Some explanations offered--none conclusive--include increased use of relief pitchers, batters swinging more aggressively with two strikes, better information available to pitchers, and pitchers throwing more strikes (walk rates are the lowest they have been in 20 years).


At Beyond The Boxscore (part of SB Nation), James Gentile takes a look at the rise of the called strike. Gentile notes that pitches per plate appearance (PA) have risen, and it's been called strikes and foul strikes per PA, not swinging strikes, that have risen. Gentile suggests (in common with some of Kepner's ideas) that batters are being less aggressive, and more patient.

So far, the research is only scratching the surface.  I'm sure we will see more in the future.

-30-

February 23, 2013

Sabermetrics primer

Phil Birnbaum, author of the Sabermetric Research blog and editor of SABR's "By the Numbers", has written a primer on the topic, "A Guide to Sabermetric Research", which appears at the SABR site. This should be the first stop for anyone who wants to find out more about the field of sabermetrics, and a good read for those already active.

-30-

August 10, 2012

Trends in run scoring - comparing the leagues


My previous two posts have looked at using R to create trend lines for the run scoring environments in the American and National leagues. This time around, I'll plot the two against each other to allow for some comparisons.

(The code below assumes that you've read the data into your workspace and calculated the LOESS trend lines, as I did in the previous two posts.)

One of the things I quickly appreciated about the R environment is the option to quickly compare and manipulate (for example, multiply) data from two different source files without having to cut-and-paste the data together.  For everything in this post, we've got two data tables (one for each league) and they remain separate.
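As a minimal sketch of that kind of side-by-side comparison (the run-scoring numbers below are invented stand-ins for the two league tables, since the real data come from the earlier posts), base R can overlay the two LOESS trend lines on one set of axes while the source tables remain separate:

```r
set.seed(123)
# Toy stand-ins for the two league tables (year, runs per game).
years <- 1901:1960
al <- data.frame(Year = years, RPG = 4.5 + sin(years / 8) + rnorm(60, sd = 0.2))
nl <- data.frame(Year = years, RPG = 4.3 + cos(years / 8) + rnorm(60, sd = 0.2))

# A LOESS-style trend for each league, kept as separate objects.
al_trend <- lowess(al$Year, al$RPG)
nl_trend <- lowess(nl$Year, nl$RPG)

# Plot the AL points and trend, then add the NL trend on the same axes.
plot(al$Year, al$RPG, pch = 20, col = "grey70",
     xlab = "Year", ylab = "Runs per game")
lines(al_trend, col = "red", lwd = 2)
lines(nl_trend, col = "blue", lwd = 2)
legend("topright", legend = c("AL", "NL"), col = c("red", "blue"), lwd = 2)
```

Because the two tables are never pasted together, either one can be reworked (or multiplied against the other) without disturbing its counterpart.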

July 17, 2012

Trends in run scoring, NL edition (more R)

Last time around I used R to plot the average runs per game for the American League, starting in 1901. Now I’ll do the same for the National League.  I'll save a comparison of the two leagues for my next post.

A fundamental principle of programming is that code can be repurposed for different data sets. So much of what I’m going to describe recycles the R code I used for the AL exercise.

So starting with the preliminary step, I went back to Baseball Reference for the data, followed by the same sort of finessing described for the AL. Once the data were read into the R workspace, I simply copied the AL code and changed the variable names to create new objects and variables. (I could have simply rerun the same code, but I wanted to have both the AL and NL data and trend lines available for comparison.) This included creating new LOESS trend lines.
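The trend-line step looks roughly like this (a sketch: the NLseason data frame, its column names, and the span value are all hypothetical, and the RPG values are simulated rather than the real National League figures):

```r
set.seed(1)
# Hypothetical NL table: one row per season, runs per game in RPG.
NLseason <- data.frame(Year = 1901:2011,
                       RPG = 4.3 + rnorm(111, sd = 0.4))

# Fit a LOESS smoother of runs per game against year;
# span controls how much the trend line is smoothed.
NLtrend <- loess(RPG ~ Year, data = NLseason, span = 0.25)

# Smoothed (predicted) runs per game for each season,
# ready to plot as the NL trend line.
NLfit <- predict(NLtrend)
```

Giving the NL objects their own names (NLseason, NLtrend) is what keeps the AL versions available alongside them for the comparison post.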


July 14, 2012

Trends in AL run scoring (using R)


I have started to explore the functionality of R, the statistical and graphics programming language. And with what better data to play than that of Major League Baseball?

There have already been some good examples of using R to analyze baseball data. The most comprehensive is the on-going series at The Prince of Slides (Brian Mills, aka Millsy), cross-posted at the R-bloggers site. I am nowhere near that level, but explaining what I've done is a valuable exercise for me -- as Joseph Joubert said (no doubt in French) "To teach is to learn twice over." 

So after some reading (I have found Paul Teetor's R Cookbook particularly helpful) and working through some examples I found on the web, I decided to plot some time series data, calculate a trend line, and then plot the points and trend line. I started with the American League data, from its origins in 1901 through to the All Star break of 2012.  For this, I relied on this handy table at Baseball Reference.

Step 1: load the data into the R workspace. This required a bit of finessing in software outside R; any text editor such as Notepad or TextPad would do the trick. What I did was paste the table into the text editor, tidy up the things listed below, and then save the file with a .csv extension.
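Once the file is saved, reading it into the workspace is a one-liner. (A self-contained sketch: the file name and column names here are hypothetical, and the snippet writes a tiny two-row stand-in CSV first so it can run on its own.)

```r
# Write a miniature stand-in for the tidied Baseball Reference table.
writeLines(c("Year,R.G", "1901,5.35", "1902,4.89"), "ALseasons.csv")

# Step 1: load the data into the R workspace.
ALseason <- read.csv("ALseasons.csv")
str(ALseason)  # two numeric columns: Year and R.G
```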

January 13, 2012

SABR Analytics Conference

The Society for American Baseball Research (SABR) has just announced that they will be hosting the first-ever conference dedicated to baseball analytics from March 15-17, 2012.

All of the "featured speakers" shown are team general managers and executives -- but I anticipate that as more speakers are announced there will be some number crunchers added to the list.

Hopefully there will be a conference proceedings document produced for those of us who won't be able to attend.

-30-

January 7, 2012

The music of my mind

Jason Branon at Baseball Nation offers up "ProGS - The Pronunciation Guide For Sabermetricians", which is just what it says.

At first I thought, why bother? I tended to side with Jason's leader, Rob Neyer -- the only people who read this stuff never leave the house. And then I realized that isn't entirely true -- at the annual SABR convention (in Minneapolis this year), baseball nerds get together, and the quantitative specialists need to be able to argue about the "new stats" and the interpretation thereof, rather than wasting precious time and energy arguing about pronunciation. (Which is a whole other dimension of nerdiness.)

Perhaps someone can create a sabermetric rewrite of the Gershwin brothers' "Let's call the whole thing off" -- the tuh-MAY-toe / ta-MAH-toe song.

Fred shows Ginger how to execute an effective throw from second base on a double play pivot.

-30-

November 7, 2011

The Bayes Ball Bookshelf, #2

Baseball Analyst, 1982-1989 (Bill James, publisher and editor)

SABR is now hosting -- with the blessing of Bill James, and through the work of Phil Birnbaum -- the complete Baseball Analyst.  Between 1982 and 1989, Bill James published 40 issues of Baseball Analyst, which in retrospect is recognized as the launch pad for some fundamental thinking about using quantitative approaches to understand baseball.

The initial issue got off to a great start, with an article about fielding by Paul Schwarzenbart. In his introduction to the issue, James writes that the article "demonstrates that fielding statistics, like batting and pitching but apparently even more so, are the products in part of circumstances as well as men." This is a topic that, 30 years later, continues to provide plenty of fodder for analysis (e.g. this blog post from a month ago by Tangotiger, "Not all fielding opportunities are created the same").

In later issues, there are articles covering the usual parade of topics: clutch hitting, ballpark effects, how much young pitchers should work, ageing of ball players, and of course movie reviews.

There are also familiar names: Pete Palmer, Phil Birnbaum, and Bill James himself.

All in all, Baseball Analyst is an interesting time capsule. The tools the sabermetric community uses to communicate have shifted -- when was the last time you subscribed to a magazine produced on a typewriter and mimeograph? But more importantly, it demonstrates how thinking about these topics has shifted. This shift is both because of further research (we know more than we used to) and because of the proliferation of data and cheap computing power.

But it also shows that in spite of 30 years of analysis, there are still many questions unresolved.

-30-

May 6, 2011

When labour market research goes to the ballpark

In a recently issued paper called "Productivity, Wages, and Marriage: The Case of Major League Baseball", economists Francesca Cornaglia and Naomi E. Feldman examine the "marriage premium" -- the fact that controlling for other influencing factors, married men earn more than unmarried men. In most situations, a variety of confounding variables muddy the waters -- things like geographic location, differences across occupations, and poor productivity measures. Cornaglia and Feldman innovatively use information from MLB to control for those variables.

Derek Jeter, the exception that proves the rule.

The abstract:

Using a sample of professional baseball players from 1871 - 2007, this paper aims at analyzing a longstanding empirical observation that married men earn significantly more than their single counterparts holding all else equal. There are numerous conflicting explanations, some of which reflect subtle sample selection problems (that is, men who tend to be successful in the workplace or have high potential wage growth also tend to be successful in attracting a spouse) and some of which are causal (that is, marriage does indeed increase productivity for men). Baseball is a unique case study because it has a long history of statistics collection and numerous direct measurements of productivity. Our results show that the marriage premium also holds for baseball players, where married players earn up to 20% more than those who are not married, even after controlling for selection. The results are generally robust only for players in the top third of the ability distribution and post 1975 when changes in the rules that govern wage contracts allowed for players to be valued closer to their true market price. Nonetheless, there do not appear to be clear differences in productivity between married and nonmarried players. We discuss possible reasons why employers may discriminate in favor of married men.

You can hear Dr. Cornaglia discuss the research on the BBC programme More or Less (2011-04-29), starting at roughly 10'50".

My initial reaction regards neither the findings nor the methodology, but the fact that, other than a mention of the Lahman database, the list of references does not include any of the work of the sabermetric research community. At one point in the discussion of productivity measures the authors write "Most modern-day baseball enthusiasts and commentators consider the latter two statistics [OPS and EqA] to be the most accurate measures of a player’s productivity", but they neither refer to any authority to support that statement nor discuss the fact that others have critiqued those measures.

This is not the first time that academics have utilized the contributions of the sabermetric community in support of their research (in this case, it provides a vital element in the foundation of the productivity measure) but then failed to acknowledge that work. For a well-reasoned discussion of that topic, please read Phil Birnbaum's "Chopped liver II".

-30-

December 10, 2010

Slugging regression

Tango issued a multi-part challenge, of which the first part is:
1. Take the top 10 in SLG in each of the last 10 years, and tell me what the overall average SLG of these 100 players was in the following year.


The point of the challenge is to demonstrate that top performing players will regress toward the mean in subsequent seasons, and that the year under consideration accounts for, as a rule of thumb, 70% of the next season's performance, and the league average (to which their performance regresses) the other 30%.


In algebraic terms, X is predicted to be roughly 70%, where
X = (SLG2 - LSLG) / (SLG1 - LSLG)
and:
SLG1 is the player's season 1 slugging average,
SLG2 is the player's season 2 slugging average, and
LSLG is the league average slugging average (from season 1).

Leo quickly responded (comment #1 to Tango's post), with his calculation that for slugging, 73.3% was accounted for by the player's average in the first season. To my way of thinking, the challenge has been met -- job well done, Leo!
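A toy numeric check of that ratio (the slugging numbers here are invented for illustration, not Leo's actual figures):

```r
# Invented example: a player slugs .550 in season 1,
# the league slugs .423, and he slugs .512 the next season.
slg1 <- 0.550
slg2 <- 0.512
lslg <- 0.423

# Fraction of the season-1 deviation from the league average
# that is retained in season 2.
x <- (slg2 - lslg) / (slg1 - lslg)
round(x, 3)  # about 0.70 for this example
```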

But as I started to think about it further, I began to wonder how far through the rankings this rule of thumb holds -- as we approach the league average, the player's SLG and the league SLG become one and the same number. And at the opposite end of the scale -- the non-sluggers -- do they regress upwards towards the mean?

So my first step was to simplify the challenge, and only look at two consecutive seasons, 2007 and 2008. Using only those players who had a minimum of 400 at bats each season, I pruned the list down to 129 players in both the NL and AL. Simplifying matters further is the fact that the 2007 SLG for the NL was the same as the AL -- .423. So for my "top sluggers" I then looked at the top 25 across both leagues.

The result: for these 25 players, on average, 66% of their 2008 SLG was accounted for through their 2007 score. A few percentage points from Tango's rule of thumb, but close enough.

Charting the results shows that all but two of the top 25 sluggers regressed downwards towards the mean. And of the two, only one improved dramatically: Albert Pujols (who inched up still further in 2009, before regressing ever-so-slightly in 2010). Were Pujols not in the mix, the 2007 SLG would account for only 62% of the 2008 scores.






Another interesting observation is that of these top performers, not one fell so far in 2008 as to end up with a SLG below the league average. That's not to say that it wouldn't happen, but it suggests that at the extreme end of the performance curve, as determined over the course of a full season, top performers really are above average. (NOTE: further testing required!)

But what of the other end of the ranking? I looked at the lowest performing players that I had selected, and the rule of thumb does not work. From the bottom up, the percentage explained was 87%, 84%, -4.4%, -28%, ...

At this point, I started to wonder -- why minus values? A quick check of the numbers, and I saw that these players regressed up, and to a point above the league average.

So what's different about the bottom of the range? It's simple: survivorship bias. My "sample" of 139 players who had 400+ ABs in each of 2007 and 2008, while ensuring I found the top hitters, automatically excluded those weak-slugging players who don't get many plate appearances but who collectively drag down the league average. Thus the "worst" players of the 139 with lots of ABs were not (by and large) far from the league average. The bottom of the list was Jason Kendall, who slugged .309 in 2007 for the A's and the Cubs while catching. Perform much worse than that, and you'll end up playing Triple A. Or in Kendall's case, for the Royals.
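A quick simulation illustrates the survivorship effect (everything here is invented: true talent drawn from a normal distribution, observed SLG adding season noise, and playing time tied to talent):

```r
set.seed(42)
# Invented population: 500 players with true slugging talent;
# observed SLG = talent plus season-to-season noise.
talent <- rnorm(500, mean = 0.400, sd = 0.050)
observed <- talent + rnorm(500, sd = 0.030)

# Playing time follows talent: weak sluggers rarely reach 400 AB.
ab <- round(200 + 2000 * pmax(talent - 0.350, 0))
regulars <- observed[ab >= 400]

league_mean <- mean(observed)
# The worst of the 400+ AB "regulars" sit far closer to (and above)
# the league average than the worst of the whole population.
min(observed)  # whole population: well below average
min(regulars)  # regulars only: the floor is much higher
```

The filter on at-bats does exactly what the post describes: it silently drops the weak sluggers who drag down the league average, so the "worst" survivors look deceptively close to that average.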


On deck: regression toward the mean, SLG with 75+ ABs.


-30-

November 14, 2010

The Bayes Ball Bookshelf, #1

The Numbers Game: Baseball's Lifelong Fascination with Statistics, by Alan Schwarz. 2004, St. Martin's Press.




In The Numbers Game, Alan Schwarz presents a well-written and tidy history of the development and evolution of the statistics that record the history of the game. Or more accurately, it's a history of baseball, and its evolution over the past century and a half, from the perspective of the numerical record and analysis of the game.
Thus Schwarz begins in the mid-nineteenth century, with Henry Chadwick's influence on the information that got recorded. But more importantly, Schwarz points out (and this becomes a recurring theme) that how the game was played was an influence on what got recorded. In the early days, the ball was "pitched" to the batter in a way that facilitated batting it -- and because pitching was secondary to hitting and fielding, there was no record of pitching performance. And as the game evolved, so did the numbers that recorded the game and got used to evaluate the players.

A second recurring theme is the weaving of the technical aspects of the statistics with the personal characters of those who developed and promoted various measures. This is very much a character-driven book -- we hear not only about the "why" of the statistics that were recorded, but also about the people who developed them and the means of recording them. So we hear about Al Munro Elias, Allan Roth's career with the Dodgers, Hal Richman's development of Strat-O-Matic, and George Lindsey's articles that appeared in academic journals beginning in the late 1950s. We also get an entire chapter devoted to the 1969 publication of The Baseball Encyclopedia, and another to Bill James.

One of the things that jumps out at me is the impact that computers -- particularly the personal computer -- have had on the volume of statistics available, and the precision of the analysis that is now possible. (And, perhaps a topic for another day, the proliferation of analysts of varying quality.)

In The Numbers Game, Schwarz has written what may well be the single best introduction to sabermetrics. But it's not a technical manual that will tell you how to calculate any one statistic, or how another measure should be interpreted. Instead it's a lively history of major league baseball, and the numerical record and analysis that accompanies it.

Assessment: home run.

July 27, 2010

Skill, luck, and more than a little style

Pictured: "Mr. May", Dave Winfield, comes through in the clutch in the 1992 World Series with, in his words, "One stinkin' little hit." The 11th-inning double drove in two runs and sealed the World Series win for the Blue Jays.


The BBC has posted an article and calculator ("Can chance make you a killer?") that is used to demonstrate the challenges in differentiating luck from skill. In this case, a simple scenario with fixed parameters is linked to a calculator that generates the range of possibilities.

While I'm not sure how this could be used in a baseball setting, it is a very good tool for demonstrating that it can be difficult -- particularly if you just look at "the numbers" in a selective way -- to make definitive statements about a player's ability. Such as, say, clutch hitting.

(Acknowledgement: The Book.)

July 26, 2010

Baseball imitates real life

How understanding luck in baseball can help understanding real life, or at least your investment portfolio: "Untangling skill and luck" by Michael J. Mauboussin.

Mauboussin uses a variety of sabermetric analysis, including Jim Albert's 2004 paper “A Batting Average: Does It Represent Ability or Luck?” and Tango's True Talent Level analysis.