December 17, 2013

Book Review: Analyzing Baseball Data with R

by Max Marchi and Jim Albert (2014, CRC Press)


The Sabermetric bookshelf, #3


Here we have the perfect book for anyone who stumbles across this blog--the intersection of R and baseball data. The open source statistical programming environment of R is a great tool for analyzing baseball data, from its robust analytic functions to its great visualization packages. The particular readership niche might be small, but as both R and interest in sabermetrics expand, it's a natural fit.


And one would be hard-pressed to find better-qualified authors, writers with feet firmly planted in both worlds. Max Marchi is a writer for Baseball Prospectus, and it's clear from the ggplot2 charts in his blog entries (such as this entry on left-handed catchers) that he's an avid R user.

Jim Albert is a Professor in the Department of Mathematics and Statistics at Bowling Green State University; three of his previous books sit on my bookshelf. Curve Ball, written with Jay Bennett, is pure sabermetrics, and one of the best books ever written on the topic (it won SABR's Baseball Research Award in 2002). Albert's two R-focused books, the introductory R by Example (co-authored with Maria Rizzo) and the more advanced Bayesian Computation with R, are intended as supplementary texts for students learning statistical methods. Both employ plenty of baseball examples in their explanations of statistical analysis using R.

In Analyzing Baseball Data with R, Marchi and Albert consolidate this joint expertise, and have produced a book that is simultaneously interesting and useful.

The authors take a very logical approach to the subject at hand. The first chapter concerns the three sources of baseball data that are referenced throughout the book:
- the annual summaries contained within the Lahman database,
- the play-by-play data at Retrosheet, and
- the pitch-by-pitch PITCHf/x data.
The chapter doesn't delve into R, but summarizes the contents of the three data sets, and takes a quick look at the types of questions that can be answered with each.

The reader first encounters R in the second and third chapters, titled "Introduction to R" and "Traditional Graphics". These two chapters cover many of the basic topics that a new R user needs to know, starting with installing R and RStudio, then moving on to data structures like vectors and data frames, objects, functions, and data plots. Some of the key R packages are also covered in these chapters, both functional packages like plyr and data packages, notably Lahman, which contains the Lahman database.
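
To give a sense of how little code that takes (a minimal sketch of my own, not an excerpt from the book), the Lahman package puts the whole database one library() call away:
# load the Lahman data package and peek at the batting table
library(Lahman)
head(Batting)  # annual batting lines, one row per player-season
# Babe Ruth's season-by-season home run totals
subset(Batting, playerID == "ruthba01")$HR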

The material covered in these early chapters is what I learned early on in my own R experience, but whereas I relied on multiple sources and an unstructured, ad hoc approach, a newcomer to R will find the basics laid out in Analyzing Baseball Data with R in a straightforward and logical progression. These chapters will most certainly help them climb the steep learning curve faced by every neophyte R user. (It is worth noting that the "Introduction to R" chapter relies heavily on a fourth source of baseball data -- the 1965 Warren Spahn Topps card, from the last season of his storied career. Not all valuable data are big data.)

From that point on, the book tackles some of the core concepts of sabermetrics, including the relationship between runs and wins, run expectancy, career trajectories, and streaky performances. As the authors work through these and other topics, they weave in information about additional R functions and packages, along with statistical and analytic concepts. As one example, one chapter introduces Markov chains in the context of using R to simulate half-inning, season, and post-season outcomes.
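
To give a flavour of that kind of simulation (a toy sketch of my own, much simpler than the base-out run expectancy states the book works with), here is a bare-bones Markov chain in which the state of a half inning is simply the number of outs:
# each plate appearance produces an out with an assumed probability of 0.68
set.seed(1)
p_out <- 0.68
half_inning_pa <- function() {
    outs <- 0
    pa <- 0
    while (outs < 3) {
        pa <- pa + 1
        outs <- outs + rbinom(1, 1, p_out)
    }
    pa  # plate appearances until the third out
}
# average number of plate appearances per half inning over 10,000 simulations
mean(replicate(10000, half_inning_pa()))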

The chapter "Exploring Streaky Performances" provides the opportunity to take a closer look at how Analyzing Baseball Data with R compares to Albert's earlier work.  In this case, the chapter uses moving average and simulation methodologies, providing the data code to examine recent examples (Ichiro and Raul Ibanez).  This is methodologically similar to what is described in Curve Ball, but with the addition of "here's the data and the code so you can replicate the analysis yourself".  This approach differs substantially from the much more mathematical content in Albert's text Bayesian Computation with R, where the example of streaky hitters is used to explore beta functions and the laplace R function.

Woven among these later chapters are also ones that put R first and use baseball data as the examples. A chapter devoted to the advanced graphics capabilities of R splits its time between the lattice and ggplot2 packages. The examples in this chapter include visualizations used to analyze variation in Justin Verlander's pitch speeds.
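
As a generic illustration of the ggplot2 style used in that chapter (my own sketch with made-up data, not the book's Verlander analysis), here is a faceted density plot of pitch speeds by pitch type:
# made-up speeds for two hypothetical pitch types
library(ggplot2)
pitches <- data.frame(
    speed = c(rnorm(200, 95, 1.2), rnorm(200, 84, 1.5)),
    pitch_type = rep(c("four-seam", "curveball"), each = 200))
ggplot(pitches, aes(x = speed)) +
    geom_density() +
    facet_wrap(~pitch_type, scales = "free_x")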

Each chapter of the book also includes "Further Reading" and "Exercises" sections, which give readers the chance to dig deeper into the topic just covered and to apply their new-found skills. The exercises are consistently interesting and often draw on previous sabermetric research. Here are a couple of examples:
  • "By drawing a contour plot, compare the umpire's strike zone for left-handed and right-handed batters. Use only the rows of the data frame where the pitch type is a four-seam fastball." (Chapter 7)
  • "Using [Bill] James' similarity score measure ..., find the five hitters with hitting statistics most similar to Willie Mays." (Chapter 8)
The closing pages of the book are devoted to technical arcana regarding the data sources, and how-to instructions on obtaining those data.

The authors have established a companion blog (http://baseballwithr.wordpress.com/), which expands on the analytics presented in the book. For example, the entry from December 12, 2013 goes deeper into ggplot2's capabilities to enhance and refine charts that were described in the book.

Analyzing Baseball Data with R provides readers with an excellent introduction to both R and sabermetrics, using examples that provide nuggets of insight into baseball player and team performance. The examples are clear and consistently interesting, and the R code is well explained and easy to follow. All told, Analyzing Baseball Data with R will be an extremely valuable addition to the practicing sabermetrician's library, and is most highly recommended.

Additional Resources


Jim Albert and Jay Bennett (2003), Curve Ball: Baseball, Statistics, and the Role of Chance in the Game (revised edition), Copernicus Books.

Jim Albert and Maria Rizzo (2011), R by Example, Springer.

Jim Albert (2009), Bayesian Computation with R (2nd edition), Springer.

An interview with Max Marchi, originally posted at MilanoRnet and also available through R-bloggers

-30-


December 1, 2013

A few random things

The creation of random numbers, or the random selection of elements in a set (or population), is an important part of statistics and data science. From simulating coin tosses to selecting potential respondents for a survey, we have a heavy reliance on random number generation.
R offers us a variety of solutions for random number generation; here's a quick overview of some of the options.

runif, rbinom, rnorm

One simple solution is to use the runif function, which generates a stated number of values between two end points (but not the end points themselves!). The function draws from the continuous uniform distribution, meaning that every value between the two end points has an equal probability of being sampled.
Here's the code to produce 100 values between 1 and 100, and then print them.
RandomNumbers <- runif(100, 1, 100)
RandomNumbers
##   [1] 22.290 33.655 89.835 38.535 24.601 11.431  7.047 94.958 83.703 76.847
##  [11] 58.429 20.667 25.796 91.821  8.741 65.696 24.262  8.077 51.399 19.652
##  [21] 64.883 33.258 55.488  6.828 14.925 11.480 72.783  2.549 78.706 49.563
##  [31] 10.829 27.653 70.304 96.759 12.614 66.610 82.467  8.506 71.719 86.586
##  [41] 69.519 11.538 72.321 63.126 42.754 60.139 44.854 71.088 15.165 67.818
##  [51] 83.342  9.894 64.497 96.620 64.286 20.162 16.343 53.800 31.380 24.418
##  [61] 13.740 47.458 80.037 13.189 45.496 20.697 28.240 60.003 84.350 14.888
##  [71] 20.084  3.003  1.191 28.748  4.528 40.568 90.963 82.640 15.885 95.029
##  [81] 54.166 17.315 43.355  9.762 74.012 64.537 74.131 24.758 41.922 65.458
##  [91] 11.423 41.084 22.514 77.329 76.879 43.954 78.471 24.727 69.357 60.118
R helpfully has random generators for a plethora of distributions (see http://cran.r-project.org/web/views/Distributions.html under the heading “Random Number Generators”). For example, the equivalent function to pull random numbers from the binomial distribution is rbinom. In the following example, the code generates 100 iterations of a single trial where there's a 0.5 (50/50) probability of success – as you would get with one hundred coin tosses. So let's call the object OneHundredCoinTosses. The table function then gives us a count of the zeros and ones in the object.
OneHundredCoinTosses <- rbinom(100, 1, 0.5)
OneHundredCoinTosses
##   [1] 1 0 0 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 1 1 0 1 1 0 1 0 0 1 0 0 1 1 1 0 0
##  [36] 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 1 0 1 1 1 0 1 0 0 0 1 1 1 0 1 0 0 1
##  [71] 1 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 0 0 1 1 1 0 0 1 1 0 1 0 1
table(OneHundredCoinTosses)
## OneHundredCoinTosses
##  0  1 
## 48 52
In this variant, we'll toss our coin again, but this time it will be 100 iterations of 10 trials each. R will generate the number of successes per trial. A histogram (sketched below, after the output) shows how, with enough iterations, we get something that looks very much like a normal distribution curve.
OneHundredCoinTrials <- rbinom(100, 10, 0.5)
OneHundredCoinTrials
##   [1]  5  4  7  7  4  2  3  3  6  6  6  5  4  6  5  6  4  7  3  3  6  5  5
##  [24]  7  4  5  6  3  7  4 10  5  3  6  6  5  7  6  3  6  4  7  7  4  3  6
##  [47]  6  2  2  7  5  6  5  8  4  3  5  5  2  7  5  4  4  1  4  5  7  5  6
##  [70]  9  6  3  7  4  4  2  3  6  3  5  6  6  5  8  5  5  7  5  7  4  2  3
##  [93]  5  5  7  5  4  6  0  6
table(OneHundredCoinTrials)
## OneHundredCoinTrials
##  0  1  2  3  4  5  6  7  8  9 10 
##  1  1  6 13 16 23 21 15  2  1  1
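Here's that histogram check (my addition, using the object created above):
# one bin per possible number of successes (0 through 10)
hist(OneHundredCoinTrials, breaks = seq(-0.5, 10.5, 1),
    main = "100 iterations of 10 coin tosses", xlab = "successes per iteration")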
And there's rnorm for the normal distribution. In this case, the second number in the function is the mean and the third is the standard deviation. With this example, the code generates 100 values from a normal distribution with a mean of 50 and a standard deviation of 12.5.
RandomNormal <- rnorm(100, 50, 12.5)
RandomNormal
##   [1] 52.67 36.55 57.48 63.00 53.65 52.05 65.39 36.16 53.03 53.22 53.61
##  [12] 65.66 47.08 41.87 80.60 67.34 49.56 54.09 50.51 48.35 33.72 47.31
##  [23] 51.22 51.01 56.76 51.01 79.84 37.27 33.67 41.73 49.34 62.04 61.48
##  [34] 37.49 56.54 63.87 49.13 36.11 47.14 34.67 57.88 34.45 50.46 48.68
##  [45] 66.36 45.32 51.72 36.64 41.35 48.09 35.50 55.30 42.13 26.29 30.68
##  [56] 53.08 60.53 40.96 35.04 60.64 74.57 49.00 62.41 37.19 52.64 39.80
##  [67] 31.66 49.13 36.05 49.98 55.00 72.42 56.64 53.44 50.50 54.02 60.74
##  [78] 39.32 53.72 75.50 46.87 12.10 45.09 70.40 53.11 37.36 58.97 67.09
##  [89] 45.63 55.44 46.66 31.69 36.68 59.28 55.09 53.25 49.52 59.87 59.16
## [100] 80.30

sample

Another approach to randomization is the sample function, which pulls elements from an object (such as a vector) of defined values or, alternatively, can select cases from a sequence of integers. The function also has the option of specifying whether sampling is done with or without replacement. (See also the sampling package: http://cran.r-project.org/web/packages/sampling/index.html)
In the first example of sample, we'll generate 100 values (the second value specified in the function) from the integers between 1 and 99 (the first value specified), with replacement – so there's a possibility of duplicates. The code adds the sort function so that we can easily spot the duplicates.
RandomSample <- sort(sample(99, 100, replace = TRUE))
RandomSample
##   [1]  1  1  2  2  2  5  5  5  7  8 11 11 12 12 12 13 15 15 16 16 19 19 21
##  [24] 23 23 23 23 23 25 26 27 30 30 31 32 33 34 35 35 35 35 37 38 40 41 42
##  [47] 42 42 42 45 46 46 47 48 48 50 52 52 54 54 54 54 54 57 58 61 62 63 63
##  [70] 64 66 66 67 67 69 70 71 71 71 73 73 74 78 78 80 80 82 83 84 84 85 86
##  [93] 86 91 93 94 96 96 98 98
In a second example, we'll generate 5 values (the second value specified) from a list of 13 names that we predefine, without replacement. Note that the default setting in sample is “without replacement”, so there should be no duplicates.
# the list of party-goers
dwarves <- c("Fí­li", "Kíli", "Balin", "Dwalin", "Óin", "Glóin", "Bifur", "Bofur", 
    "Bombur", "Ori", "Nori", "Dori", "Thorin")  # draw a sorted sample of 50 with replacement
Party <- sort(sample(dwarves, 5))
# print the names
Party
## [1] "Dwalin" "Fí­li"   "Nori"   "Ori"    "Thorin"
There is also the sample.int variant, which samples directly from the integers 1 through n. Here's the code to randomly select 6 numbers between 1 and 49, without replacement.
six49numbers <- sort(sample.int(49, 6, replace = FALSE))
six49numbers
## [1]  3 15 16 24 35 44

Controlling random number generation: set.seed and RNGkind

It sounds like an oxymoron--how can you control something that is random? The answer is that in many computer programs and programming languages, R included, many of the functions that are dubbed random number generation really aren't. I won't get into the arcana, but runif (and its ilk) and sample all rely on pseudo-random approaches, methods that are close enough to being truly random for most purposes. (If you want to investigate this further in the context of R, I suggest starting with John Ramey's post at http://www.r-bloggers.com/pseudo-random-vs-random-numbers-in-r-2/ )
With the set.seed function, an integer is used to initialize the random number generator, allowing the same sequence of “random” numbers to be drawn repeatedly. In this example, we'll use the code written earlier to sample 6 numbers between 1 and 49, and repeat it three times.
The first time through, set.seed will define the starting seed as 1, then for the second time through, the seed will be set to 13, leading to a different set of 6 numbers. The third iteration will reset the starting seed to 1, and the third sample set of 6 numbers will be the same as the first sample.
set.seed(1)
six49numbers <- sort(sample.int(49, 6))
six49numbers
## [1] 10 14 18 27 40 42
set.seed(13)
six49numbers <- sort(sample.int(49, 6))
six49numbers
## [1]  1  5 12 19 35 44
set.seed(1)
six49numbers <- sort(sample.int(49, 6))
six49numbers
## [1] 10 14 18 27 40 42
The first and third draws contain the same 6 integers.
Another control over random number generation is RNGkind. This function defines the random number generation method, from an extensive list of methodologies. The default is the Mersenne Twister (http://en.wikipedia.org/wiki/Mersenne_twister), and a variety of others are available.
The R documentation page for Random, covering both set.seed and RNGkind, can be found here: http://stat.ethz.ch/R-manual/R-devel/library/base/html/Random.html
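A quick sketch of RNGkind in action (both calls are base R; the generator names come from that documentation page):
RNGkind()  # query the current generators, e.g. "Mersenne-Twister" "Inversion"
RNGkind(kind = "L'Ecuyer-CMRG")  # switch to L'Ecuyer's combined generator
set.seed(1)
runif(3)
RNGkind(kind = "Mersenne-Twister")  # restore the default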

random

While the methods above are pseudo-random, there are methods available that generate truly random numbers. One is the service provided by random.org (http://www.random.org/).
The R package random (documentation here: http://cran.r-project.org/web/packages/random/) uses the random.org service to generate random numbers and return them as an R object. The functions in the package can return random integers, randomized sequences, and random strings, and offer the flexibility to define the shape of the returned matrix (i.e. the number of columns).
It's worth noting that free users of random.org face daily limits on the volume of calls they can make (paying customers don't have these limits).
Here's an example to generate 20 random numbers from random.org, defined as being between 100 and 999 (that is to say, three digit numbers) and present them in two columns.
# load random
if (!require(random)) install.packages("random")
## Loading required package: random
library(random)
# note: 'check = TRUE' checks whether the quota at the server allows the request
twentytruerandom <- randomNumbers(n = 20, min = 100, max = 999, col = 2, check = TRUE)
twentytruerandom
##        V1  V2
##  [1,] 531 402
##  [2,] 559 367
##  [3,] 616 789
##  [4,] 830 853
##  [5,] 382 436
##  [6,] 336 737
##  [7,] 769 548
##  [8,] 293 818
##  [9,] 746 609
## [10,] 108 331

References

Paul Teetor's R Cookbook (O'Reilly, 2011) has a chapter on probability (Chapter 8) that includes good examples of the various random number generation functions in R.
Jim Albert & Maria Rizzo, R by Example (Springer, 2012), Chapter 11 “Simulation Experiments” and Chapter 13 “Monte Carlo Methods”, contain a variety of applications of random number generation using sample and rbinom to approximate and understand probability experiments.
For an in-depth look at random sampling in the context of survey design, see Thomas Lumley's Complex Surveys: A Guide to Analysis Using R (Wiley, 2010).
If you're interested in testing a random number generator, check out http://www.johndcook.com/blog/2010/12/06/how-to-test-a-random-number-generator-2/
Joseph Rickert's blog entry at blog.revolutionanalytics.com gives a good rundown of the applications and approaches for parallel random number generation: http://blog.revolutionanalytics.com/2013/06/intro-to-parallel-random-number-generation-with-revoscaler.html
-30-