July 26, 2014

Roger Angell and the Baseball Hall of Fame

The great baseball writer Roger Angell is the recipient of the 2014 J.G. Taylor Spink Award, the first time a non-BBWAA member has been given the award.  Angell will be presented with the award at the Baseball Hall of Fame in Cooperstown during the induction weekend, July 25 - 28, 2014.

Angell is, for my money, the best writer about baseball.  His accounts of the game are written from a fan's perspective, rather than as the typical listing of the game's dramatic moments. Indeed, many of his greatest observations are about the fans and their experiences.  One such essay is "The Interior Stadium", required reading for anyone interested in sports and people's responses to the games.

Much of Angell's writing was published by his employer, the New Yorker, which has recently compiled two collections of his work.  The first was offered by David Remnick, whose piece "Roger Angell Heads to Cooperstown" was published when the Spink Award was announced in December 2013 and has links to a variety of Angell's best. More recently, "Hall of Fame Weekend: Roger Angell's Baseball Writing" (by Sky Dylan-Robbins) provides a different list of great essays.

One of the best things the New Yorker has made available is Angell's scorecard for Game 6 of the 2011 World Series, when the Cardinals were down to their final strike twice, but managed to come back and win the game (and then, in an anti-climactic game 7, the Series).

-30-

July 23, 2014

Left-handed catchers


Benny Distefano – 1985 Donruss #166
(source: baseball-almanac.com)
We are approaching the twenty-fifth anniversary of the last time a left-handed throwing catcher appeared behind the plate in a Major League Baseball game; on August 18, 1989 Benny Distefano made his third and final appearance as a catcher for the Pirates. Distefano’s accomplishment was celebrated five years ago, in Alan Schwarz’s “Left-Handed and Left Out” (New York Times, 2009-08-15).

Jack Moore, writing on the site Sports on Earth in 2013 (“Why no left-handed catchers?”), points out that the lack of left-handed catchers goes back a long way. One interesting piece of evidence is a 1948 Ripley’s “Believe It Or Not” item featuring a left-handed catcher, Dick Bernard (you can read more about Bernard’s signing in the July 1, 1948 edition of the Tuscaloosa News). Bernard didn’t make the majors, and he doesn’t appear in any of the minor league records that are available on-line either.


Dick Bernard in Ripley’s “Believe It or Not”, 1948-12-30
(source: sportsonearth.com)


There are a variety of hypotheses about why there are no left-handed catchers, all of which are summarized in John Walsh’s “Top 10 Left-Handed Catchers for 2006” (a tongue-in-cheek title if ever there was one) at The Hardball Times. A compelling explanation, one supported by both Bill James and J.C. Bradbury (in his book The Baseball Economist), is natural selection: a left-handed little league player who can throw well will be groomed as a pitcher.

Throwing hand by fielding position as an example of a categorical variable



I was looking for some examples of categorical variables to display visually, and the lack of left-handed throwing catchers, compared to other positions, came to mind. The following analysis uses R and the Lahman database package.

The analysis requires merging the Master and Fielding tables in the Lahman database – the Master table gives the player's name and throwing hand, and the Fielding table tells us how many games he played at each position. For the purposes of this analysis, we’ll look at the seasons from 1954 (the first year in the Lahman database that has the outfield positions split into left, centre, and right) through 2012.

You may note that for the merging of the two tables, I used the new dplyr package. I used system.time to compare the basic version of “merge” with dplyr’s “inner_join” for combining the two tables. The latter is substantially faster: my aging computer ran “merge” in about 5.5 seconds, compared to 0.17 seconds with dplyr.
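Here's a rough sketch of that timing comparison (my own illustration, not code from the original analysis; results will vary by machine):

# compare base R's merge with dplyr's inner_join on the same tables
library(Lahman)
library(dplyr)
system.time(merge(Fielding, Master, by = "playerID"))        # base R merge
system.time(inner_join(Fielding, Master, by = "playerID"))   # dplyr inner_join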


# load the required packages
library(Lahman)
library(dplyr)
#

The first step is to create a new data table that merges the Fielding and Master tables, based on the common variable “playerID”. This new table has one row for each player, by position and season; we use the dim function to show the dimensions of the table.

Then, select only those seasons since 1954 and omit the records for Designated Hitter (DH) and the summary of outfield positions (OF) (i.e. keeping RF, CF, and LF).

MasterFielding <- inner_join(Fielding, Master, by="playerID")
dim(MasterFielding)
## [1] 164903     52
#
MasterFielding <- filter(MasterFielding, POS != "OF" & POS != "DH" & yearID > "1953")
dim(MasterFielding)
## [1] 91214    52

This table needs to be summarized one step further – a single row for each player, counting how many games played at each position.

Player_games <- MasterFielding %.%
  group_by(playerID, nameFirst, nameLast, POS, throws) %.%
  summarise(gamecount = sum(G)) %.%
  arrange(desc(gamecount)) 
dim(Player_games)
## [1] 19501     6
head(Player_games)
## Source: local data frame [6 x 6]
## Groups: playerID, nameFirst, nameLast, POS
## 
##    playerID nameFirst nameLast POS throws gamecount
## 1 robinbr01    Brooks Robinson  3B      R      2870
## 2 bondsba01     Barry    Bonds  LF      L      2715
## 3 vizquom01      Omar  Vizquel  SS      R      2709
## 4  mayswi01    Willie     Mays  CF      R      2677
## 5 aparilu01      Luis Aparicio  SS      R      2583
## 6 jeterde01     Derek    Jeter  SS      R      2531

This table shows the career records for the most games played at each position (for 1954-2012). We see that Brooks Robinson leads the way with 2,870 games played at third base, and that Derek Jeter, at the end of the 2012 season, was closing in on Omar Vizquel’s career record for games played as a shortstop.

Cross-tab Tables


The next step is to prepare a simple cross-tab table (also known as a contingency or pivot table) showing the number of players cross-tabulated by position (POS) and throwing hand (throws).

Here, I’ll demonstrate two ways to do this: first with dplyr’s “group_by” and “summarise” (with a bit of help from reshape2), and then with base R’s “table” function and the gmodels package.

# first method - dplyr
Player_POS <- Player_games %.%
  group_by(POS, throws) %.%
  summarise(playercount = length(gamecount))
Player_POS
## Source: local data frame [17 x 3]
## Groups: POS
## 
##    POS throws playercount
## 1   1B      L         411
## 2   1B      R        1515
## 3   2B      L           4
## 4   2B      R        1560
## 5   3B      L           4
## 6   3B      R        1889
## 7    C      L           4
## 8    C      R         980
## 9   CF      L         393
## 10  CF      R        1252
## 11  LF      L         544
## 12  LF      R        2161
## 13   P      L        1452
## 14   P      R        3623
## 15  RF      L         520
## 16  RF      R        1893
## 17  SS      R        1296

To transform this long-form table into a traditional cross-tab shape we can use the “dcast” function in reshape2.

library(reshape2)
## Loading required package: reshape2
dcast(Player_POS, POS ~ throws, value.var = "playercount")
##   POS    L    R
## 1  1B  411 1515
## 2  2B    4 1560
## 3  3B    4 1889
## 4   C    4  980
## 5  CF  393 1252
## 6  LF  544 2161
## 7   P 1452 3623
## 8  RF  520 1893
## 9  SS   NA 1296

A second method to get the same result is to use base R’s “table” function.

library(gmodels)
## Loading required package: gmodels
throwPOS <- with(Player_games, table(POS, throws))
throwPOS
##     throws
## POS     L    R
##   1B  411 1515
##   2B    4 1560
##   3B    4 1889
##   C     4  980
##   CF  393 1252
##   LF  544 2161
##   P  1452 3623
##   RF  520 1893
##   SS    0 1296

A more elaborate table can be created using the gmodels package. In this case, we’ll use the CrossTable function to generate a table with row percentages. You’ll note that the format is set to SPSS, so the table output resembles that software’s display style.

CrossTable(Player_games$POS, Player_games$throws, 
           digits=2, format="SPSS",
           prop.r=TRUE, prop.c=FALSE, prop.t=FALSE, prop.chisq=FALSE,  # keeping the row proportions
           chisq=TRUE)                                                 # adding the ChiSquare statistic
## 
##    Cell Contents
## |-------------------------|
## |                   Count |
## |             Row Percent |
## |-------------------------|
## 
## Total Observations in Table:  19501 
## 
##                  | Player_games$throws 
## Player_games$POS |        L  |        R  | Row Total | 
## -----------------|-----------|-----------|-----------|
##               1B |      411  |     1515  |     1926  | 
##                  |    21.34% |    78.66% |     9.88% | 
## -----------------|-----------|-----------|-----------|
##               2B |        4  |     1560  |     1564  | 
##                  |     0.26% |    99.74% |     8.02% | 
## -----------------|-----------|-----------|-----------|
##               3B |        4  |     1889  |     1893  | 
##                  |     0.21% |    99.79% |     9.71% | 
## -----------------|-----------|-----------|-----------|
##                C |        4  |      980  |      984  | 
##                  |     0.41% |    99.59% |     5.05% | 
## -----------------|-----------|-----------|-----------|
##               CF |      393  |     1252  |     1645  | 
##                  |    23.89% |    76.11% |     8.44% | 
## -----------------|-----------|-----------|-----------|
##               LF |      544  |     2161  |     2705  | 
##                  |    20.11% |    79.89% |    13.87% | 
## -----------------|-----------|-----------|-----------|
##                P |     1452  |     3623  |     5075  | 
##                  |    28.61% |    71.39% |    26.02% | 
## -----------------|-----------|-----------|-----------|
##               RF |      520  |     1893  |     2413  | 
##                  |    21.55% |    78.45% |    12.37% | 
## -----------------|-----------|-----------|-----------|
##               SS |        0  |     1296  |     1296  | 
##                  |     0.00% |   100.00% |     6.65% | 
## -----------------|-----------|-----------|-----------|
##     Column Total |     3332  |    16169  |    19501  | 
## -----------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  1759     d.f. =  8     p =  0 
## 
## 
##  
##        Minimum expected frequency: 168.1

Mosaic Plot


A mosaic plot is an effective way to graphically represent the contents of the summary tables. Note that the length (left to right) dimension of each bar is constant, comparing proportions, while the height of the bar (top to bottom) varies depending on the absolute number of cases. The mosaic plot function is in the vcd package.

library(vcd)
## Loading required package: vcd
## Loading required package: grid
mosaic(throwPOS, highlighting = "throws", highlighting_fill=c("darkgrey", "white"))


Conclusion

The clear result is that it’s not just catchers that are overwhelmingly right-handed throwers, it’s also infielders (except first base). There have been very few southpaws playing second and third base – and there have been absolutely no left-handed throwing shortstops in this period.

As J.G. Preston puts it in the blog post “Left-handed throwing second basemen, shortstops and third basemen”:
While right-handed throwers can be found at any of the nine positions on a baseball field, left-handers are, in practice, restricted to five of them.

So who are these left-handed oddities? Using the filter function, it’s easy to find out:

# catchers
filter(Player_games, POS == "C", throws == "L")
## Source: local data frame [4 x 6]
## Groups: playerID, nameFirst, nameLast, POS
## 
##    playerID nameFirst  nameLast POS throws gamecount
## 1 distebe01     Benny Distefano   C      L         3
## 2  longda02      Dale      Long   C      L         2
## 3 squirmi01      Mike   Squires   C      L         2
## 4 shortch02     Chris     Short   C      L         1

# second base
filter(Player_games, POS == "2B", throws == "L")
## Source: local data frame [4 x 6]
## Groups: playerID, nameFirst, nameLast, POS
## 
##    playerID nameFirst  nameLast POS throws gamecount
## 1 marqugo01   Gonzalo   Marquez  2B      L         2
## 2 crowege01    George     Crowe  2B      L         1
## 3 mattido01       Don Mattingly  2B      L         1
## 4 mcdowsa01       Sam  McDowell  2B      L         1

# third base
filter(Player_games, POS == "3B", throws == "L")
## Source: local data frame [4 x 6]
## Groups: playerID, nameFirst, nameLast, POS
## 
##    playerID nameFirst  nameLast POS throws gamecount
## 1 squirmi01      Mike   Squires  3B      L        14
## 2 mattido01       Don Mattingly  3B      L         3
## 3 francte01     Terry  Francona  3B      L         1
## 4 valdema02     Mario    Valdez  3B      L         1

My GitHub file for this entry, in Markdown, is here: https://github.com/MonkmanMH/Bayesball/blob/master/LeftHandedCatchers.md

-30-

December 17, 2013

Book Review: Analyzing Baseball Data with R

by Max Marchi and Jim Albert (2014, CRC Press)


The Sabermetric bookshelf, #3


Here we have the perfect book for anyone who stumbles across this blog--the intersection of R and baseball data. The open source statistical programming environment of R is a great tool for anyone analyzing baseball data, from the robust analytic functions to the great visualization packages. The particular readership niche might be small, but as both R and interest in sabermetrics expand, it's a natural fit.


And one would be hard pressed to find better-qualified authors, writers with feet firmly planted in both worlds.  Max Marchi is a writer for Baseball Prospectus, and it's clear from the ggplot2 charts in his blog entries (such as this entry on left-handed catchers) that he's an avid R user.

Jim Albert is a Professor in the Department of Mathematics and Statistics at Bowling Green State University; three of his previous books sit on my bookshelf. Curve Ball, written with Jay Bennett, is pure sabermetrics, and one of the best books ever written on the topic (and winner of SABR's Baseball Research Award in 2002).  Albert's two R-focussed  books, the introductory R by Example (co-authored with Maria Rizzo) and the more advanced Bayesian Computation with R, are intended as supplementary texts for students learning statistical methods. Both employ plenty of baseball examples in their explanations of statistical analysis using R.

In Analyzing Baseball Data with R Marchi and Albert consolidate this joint expertise, and have produced a book that is simultaneously interesting and useful.

The authors take a very logical approach to the subject at hand. The first chapter concerns the three sources of baseball data that are referenced throughout the book:
- the annual summaries contained within the Lahman database,
- the play-by-play data at Retrosheet, and
- the pitch-by-pitch PITCHf/x data.
The chapter doesn't delve into R, but summarizes the contents of the three data sets, and takes a quick look at the types of questions that can be answered with each.

The reader first encounters R in the second and third chapters, titled "Introduction to R" and "Traditional Graphics". These two chapters cover many of the basic topics that a new R user needs to know, starting with installing R and RStudio, then moving on to data structures like vectors and data frames, objects, functions, and data plots. Some of the key R packages are also covered in these chapters, both functional packages like plyr and data packages, notably Lahman, the data package containing the Lahman database.

The material covered in these early chapters is the sort of thing I learned early on in my own R experience, but whereas I relied on multiple sources and an unstructured, ad hoc approach, a newcomer to R will find the basics laid out in Analyzing Baseball Data with R in a straightforward and logical progression. These chapters will most certainly help them climb the steep learning curve faced by every neophyte R user.  (It is worth noting that the "Introduction to R" chapter relies heavily on a fourth source of baseball data -- the 1965 Warren Spahn Topps card, the last season of his storied career. Not all valuable data are big data.)

From that point on, the book tackles some of the core concepts of sabermetrics. This includes the relationship between runs and wins, run expectancy, career trajectories, and streaky performances.  As the authors work through these and other topics, they weave in information about additional R functions and packages, along with statistical and analytic concepts.  As one example, one chapter introduces Markov Chains in the context of using R to simulate half inning, season, and post-season outcomes.

The chapter "Exploring Streaky Performances" provides the opportunity to take a closer look at how Analyzing Baseball Data with R compares to Albert's earlier work.  In this case, the chapter uses moving average and simulation methodologies, providing the data and code to examine recent examples (Ichiro and Raul Ibanez).  This is methodologically similar to what is described in Curve Ball, but with the addition of "here's the data and the code so you can replicate the analysis yourself".  This approach differs substantially from the much more mathematical content in Albert's text Bayesian Computation with R, where the example of streaky hitters is used to explore beta functions and the laplace function in R.

Woven among these later chapters are also ones that put R first and use baseball data as the examples. A chapter devoted to the advanced graphics capabilities of R splits its time between the packages lattice and ggplot2. The examples used in this chapter include visualizations used to analyze variations in Justin Verlander's pitch speed.

Each chapter of the book also includes "Further Reading" and "Exercises", which provide readers with the chance to dig deeper into the topic just covered and to apply their new-found skills. The exercises are consistently interesting and often draw on previous sabermetric research.  Here are a couple of examples:
  • "By drawing a contour plot, compare the umpire's strike zone for left-handed and right-handed batters. Use only the rows of the data frame where the pitch type is a four-seam fastball." (Chapter 7)
  • "Using [Bill] James' similarity score measure ..., find the five hitters with hitting statistics most similar to Willie Mays." (Chapter 8)
The closing pages of the book are devoted to technical arcana regarding the data sources, and how-to instructions on obtaining those data.

The authors have established a companion blog (http://baseballwithr.wordpress.com/), which has an expansion of the analytics presented in the book.  For example, the entry from December 12, 2013 goes deeper into ggplot2 capabilities to enhance and refine charts that were described in the book.

Analyzing Baseball Data with R provides readers with an excellent introduction to both R and sabermetrics, using examples that provide nuggets of insight into baseball player and team performance. The examples are clear, the R code is well explained and easy to follow, and I found the examples consistently interesting. All told, Analyzing Baseball Data with R will be an extremely valuable addition to the practicing sabermetrician's library, and is most highly recommended.

Additional Resources


Jim Albert and Jay Bennett (2003), Curve Ball: Baseball, Statistics, and the Role of Chance in the Game (revised edition), Copernicus Books.

Jim Albert and Maria Rizzo (2011), R by Example, Springer.

Jim Albert (2009), Bayesian Computation with R (2nd edition), Springer.

An interview with Max Marchi, originally posted at MilanoRnet and also available through R-bloggers

-30-


December 1, 2013

A few random things

The creation of random numbers, or the random selection of elements in a set (or population), is an important part of statistics and data science. From simulating coin tosses to selecting potential respondents for a survey, we have a heavy reliance on random number generation.

R offers us a variety of solutions for random number generation; here's a quick overview of some of the options.

runif, rbinom, rnorm

One simple solution is to use the runif function, which generates a stated number of values between two end points (but not the end points themselves!). The function uses the continuous uniform distribution, meaning that every value between the two end points has an equal probability of being sampled.

Here's the code to produce 100 values between 1 and 100, and then print them.

RandomNumbers <- runif(100, 1, 100)
RandomNumbers
##   [1] 22.290 33.655 89.835 38.535 24.601 11.431  7.047 94.958 83.703 76.847
##  [11] 58.429 20.667 25.796 91.821  8.741 65.696 24.262  8.077 51.399 19.652
##  [21] 64.883 33.258 55.488  6.828 14.925 11.480 72.783  2.549 78.706 49.563
##  [31] 10.829 27.653 70.304 96.759 12.614 66.610 82.467  8.506 71.719 86.586
##  [41] 69.519 11.538 72.321 63.126 42.754 60.139 44.854 71.088 15.165 67.818
##  [51] 83.342  9.894 64.497 96.620 64.286 20.162 16.343 53.800 31.380 24.418
##  [61] 13.740 47.458 80.037 13.189 45.496 20.697 28.240 60.003 84.350 14.888
##  [71] 20.084  3.003  1.191 28.748  4.528 40.568 90.963 82.640 15.885 95.029
##  [81] 54.166 17.315 43.355  9.762 74.012 64.537 74.131 24.758 41.922 65.458
##  [91] 11.423 41.084 22.514 77.329 76.879 43.954 78.471 24.727 69.357 60.118

R helpfully has random generators for a plethora of distributions (see http://cran.r-project.org/web/views/Distributions.html under the heading “Random Number Generators”). For example, the equivalent function to pull random numbers from the binomial distribution is rbinom. In the following example, the code generates 100 iterations of a single trial where there's a 0.5 (50/50) probability – as you would get with one hundred coin tosses. So let's call the object OneHundredCoinTosses. The table function then gives us a count of the zeros and ones in the object.

OneHundredCoinTosses <- rbinom(100, 1, 0.5)
OneHundredCoinTosses
##   [1] 1 0 0 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 1 1 0 1 1 0 1 0 0 1 0 0 1 1 1 0 0
##  [36] 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 1 0 1 1 1 0 1 0 0 0 1 1 1 0 1 0 0 1
##  [71] 1 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 0 0 1 1 1 0 0 1 1 0 1 0 1
table(OneHundredCoinTosses)
## OneHundredCoinTosses
##  0  1 
## 48 52

In this variant, we'll toss our coin again, but this time it will be 100 iterations of 10 trials. R will generate the number of successes per trial. A histogram of the results would show how, with enough iterations, we'd get something that looks very much like a normal distribution curve.

OneHundredCoinTrials <- rbinom(100, 10, 0.5)
OneHundredCoinTrials
##   [1]  5  4  7  7  4  2  3  3  6  6  6  5  4  6  5  6  4  7  3  3  6  5  5
##  [24]  7  4  5  6  3  7  4 10  5  3  6  6  5  7  6  3  6  4  7  7  4  3  6
##  [47]  6  2  2  7  5  6  5  8  4  3  5  5  2  7  5  4  4  1  4  5  7  5  6
##  [70]  9  6  3  7  4  4  2  3  6  3  5  6  6  5  8  5  5  7  5  7  4  2  3
##  [93]  5  5  7  5  4  6  0  6
table(OneHundredCoinTrials)
## OneHundredCoinTrials
##  0  1  2  3  4  5  6  7  8  9 10 
##  1  1  6 13 16 23 21 15  2  1  1
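As a quick illustration of that claim (my own sketch, not part of the original analysis), increasing the number of iterations and plotting a histogram shows the familiar bell shape emerging:

# with many more iterations, the counts of successes approximate a normal curve
ManyCoinTrials <- rbinom(10000, 10, 0.5)
hist(ManyCoinTrials, breaks = seq(-0.5, 10.5, by = 1),
     main = "10,000 iterations of 10 coin tosses", xlab = "successes per trial")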

And there's rnorm for the normal distribution. In this case, the second number in the function is the mean and the third is the standard deviation. With this example, the code generates 100 values from a normal distribution with a mean of 50 and a standard deviation of 12.5.

RandomNormal <- rnorm(100, 50, 12.5)
RandomNormal
##   [1] 52.67 36.55 57.48 63.00 53.65 52.05 65.39 36.16 53.03 53.22 53.61
##  [12] 65.66 47.08 41.87 80.60 67.34 49.56 54.09 50.51 48.35 33.72 47.31
##  [23] 51.22 51.01 56.76 51.01 79.84 37.27 33.67 41.73 49.34 62.04 61.48
##  [34] 37.49 56.54 63.87 49.13 36.11 47.14 34.67 57.88 34.45 50.46 48.68
##  [45] 66.36 45.32 51.72 36.64 41.35 48.09 35.50 55.30 42.13 26.29 30.68
##  [56] 53.08 60.53 40.96 35.04 60.64 74.57 49.00 62.41 37.19 52.64 39.80
##  [67] 31.66 49.13 36.05 49.98 55.00 72.42 56.64 53.44 50.50 54.02 60.74
##  [78] 39.32 53.72 75.50 46.87 12.10 45.09 70.40 53.11 37.36 58.97 67.09
##  [89] 45.63 55.44 46.66 31.69 36.68 59.28 55.09 53.25 49.52 59.87 59.16
## [100] 80.30

sample

Another approach to randomization is the sample function, which pulls elements from an object (such as a vector) of defined values or, alternatively, can be specified to select cases from a sequence of integers. The function also has the option of specifying whether replacement will be used or not. (See http://cran.r-project.org/web/packages/sampling/index.html)

In the first example of sample, we'll generate 100 values (the second value specified in the function) from the integers between 1 and 99 (the first value specified), with replacement – so there's a possibility of duplicates. The code adds the sort function so that we can easily spot the duplicates.

RandomSample <- sort(sample(99, 100, replace = TRUE))
RandomSample
##   [1]  1  1  2  2  2  5  5  5  7  8 11 11 12 12 12 13 15 15 16 16 19 19 21
##  [24] 23 23 23 23 23 25 26 27 30 30 31 32 33 34 35 35 35 35 37 38 40 41 42
##  [47] 42 42 42 45 46 46 47 48 48 50 52 52 54 54 54 54 54 57 58 61 62 63 63
##  [70] 64 66 66 67 67 69 70 71 71 71 73 73 74 78 78 80 80 82 83 84 84 85 86
##  [93] 86 91 93 94 96 96 98 98

In a second example, we'll generate 5 values (the second value specified) from a list of 13 names that we predefine, without replacement. Note that the default setting in sample is “without replacement”, so there should be no duplicates.

# the list of party-goers
dwarves <- c("Fíli", "Kíli", "Balin", "Dwalin", "Óin", "Glóin", "Bifur", "Bofur", 
    "Bombur", "Ori", "Nori", "Dori", "Thorin")  # draw a sorted sample of 50 with replacement
Party <- sort(sample(dwarves, 5))
# print the names
Party
## [1] "Dwalin" "Fíli"   "Nori"   "Ori"    "Thorin"

There is also the sample.int variant which insists on integers for the values. Here's the code to randomly select 6 numbers between 1 and 49, without replacement.

six49numbers <- sort(sample.int(49, 6, replace = FALSE))
six49numbers
## [1]  3 15 16 24 35 44

Controlling random number generation: set.seed and RNGkind

It sounds like an oxymoron – how can you control something that is random? The answer is that in many computer programs and programming languages, R included, many of the functions that are dubbed random number generation really aren't. I won't get into the arcana, but runif (and its ilk) and sample all rely on pseudo-random approaches, methods that are close enough to being truly random for most purposes. (If you want to investigate this further in the context of R, I suggest starting with John Ramey's post at http://www.r-bloggers.com/pseudo-random-vs-random-numbers-in-r-2/ )

With the set.seed command, an integer is used to start a random number generation, allowing the same sequence of “random” numbers to be selected repeatedly. In this example, we'll use the code written earlier to sample 6 numbers between 1 and 49, and repeat it three times.

The first time through, set.seed will define the starting seed as 1, then for the second time through, the seed will be set to 13, leading to a different set of 6 numbers. The third iteration will reset the starting seed to 1, and the third sample set of 6 numbers will be the same as the first sample.

set.seed(1)
six49numbers <- sort(sample.int(49, 6))
six49numbers
## [1] 10 14 18 27 40 42
set.seed(13)
six49numbers <- sort(sample.int(49, 6))
six49numbers
## [1]  1  5 12 19 35 44
set.seed(1)
six49numbers <- sort(sample.int(49, 6))
six49numbers
## [1] 10 14 18 27 40 42

The first and third draws contain the same 6 integers.

Another control of the random number generation is RNGkind. This command defines the random number generation method, from an extensive list of methodologies. The default is the Mersenne Twister (http://en.wikipedia.org/wiki/Mersenne_twister), and a variety of others are available.
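Here's a minimal sketch of how RNGkind can be used (my own example, not from the original post; the non-default generator shown is just one of the available options):

# inspect the current generator, switch to another, and restore the default
RNGkind()                     # e.g. "Mersenne-Twister" "Inversion"
RNGkind("L'Ecuyer-CMRG")      # switch to the L'Ecuyer-CMRG generator
set.seed(1)
runif(3)
RNGkind("Mersenne-Twister")   # back to the default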

The R documentation page for Random, covering both set.seed and RNGkind, can be found here: http://stat.ethz.ch/R-manual/R-devel/library/base/html/Random.html

random

While the methods above are pseudo-random, there are methods available that generate truly random numbers. One is the service provided by random.org (http://www.random.org/).

The R package random (documentation here: http://cran.r-project.org/web/packages/random/) uses the random.org service to generate random numbers and return them into an R object. The functions in the package can return random integers, randomized sequences, and random strings, and have the flexibility to define the shape of the returned matrix (i.e. the number of columns).

It's worth noting that free users of random.org face daily limits on the volume of calls they can make (paying customers don't have these limits).

Here's an example to generate 20 random numbers from random.org, defined as being between 100 and 999 (that is to say, three digit numbers) and present them in two columns.

# load random
if (!require(random)) install.packages("random")
## Loading required package: random
library(random)
# 
twentytruerandom <- randomNumbers(n = 20, min = 100, max = 999, col = 2, check = TRUE)
# note: the 'check=' sets whether quota at server should be checked first
twentytruerandom
##        V1  V2
##  [1,] 531 402
##  [2,] 559 367
##  [3,] 616 789
##  [4,] 830 853
##  [5,] 382 436
##  [6,] 336 737
##  [7,] 769 548
##  [8,] 293 818
##  [9,] 746 609
## [10,] 108 331

References

Paul Teetor's R Cookbook (O'Reilly, 2011) has a chapter on probability (Chapter 8) that includes good examples of various random number generation in R.

Jim Albert & Maria Rizzo, R by Example (Springer, 2012), Chapter 11 “Simulation Experiments” and Chapter 13 “Monte Carlo Methods”, contain a variety of applications of random number generation using sample and rbinom to approximate and understand probability experiments.

For an in-depth look at random sampling in the context of survey design, see Thomas Lumley's Complex Surveys: A Guide to Analysis Using R (Wiley, 2010).

If you're interested in testing a random number generator, check out http://www.johndcook.com/blog/2010/12/06/how-to-test-a-random-number-generator-2/

Joseph Rickert's blog entry at blog.revolutionanalytics.com gives a good rundown of the applications and approach for parallel random number generation http://blog.revolutionanalytics.com/2013/06/intro-to-parallel-random-number-generation-with-revoscaler.html

-30-

September 1, 2013

Fair weather fans, redux

Or, a little larger small sample

On August 11 the Victoria HarbourCats closed out their 2013 West Coast League season with a 4-3 win over the Bellingham Bells.

In an earlier post, written mid-way through the season after the 'Cats had played 15 home games, I created a scatter plot matrix to look for correlations between the HarbourCats home attendance and possible influencing factors such as the day of the week and temperature. Now that the season has concluded, it's time to return to the data for all 27 home games, to see if temperature remained the strongest correlate with attendance.

I also took this opportunity to move the source data from Google Documents to GitHub, where it can be accessed directly by the R code – no manual downloads required. The code necessary to make this work is from Christopher Gandrud, who wrote the function source_GitHubData. (Gandrud has also written code to pull data directly from DropBox.)

Read the data

First up, the code installs the packages ggplot2 (for the plots) and devtools (for accessing the data from GitHub), if necessary, and loads them. Then the “source_GitHubData” function reads the data.

# load ggplot2
if (!require(ggplot2)) install.packages("ggplot2")
## Loading required package: ggplot2
library(ggplot2)
# Use the function source_GitHubData, which requires the package devtools
if (!require(devtools)) install.packages("devtools")
## Loading required package: devtools
library(devtools)
# The functions' gist ID is 4466237
source_gist("4466237")
## [1] "https://raw.github.com/gist/4466237"
## SHA-1 hash of file is fcb5fe0b4dd7d99d6e747fb8968176c229506ce4
# Download data, which is stored as a csv file at github
HarbourCat.attend <- source_GitHubData("https://raw.github.com/MonkmanMH/HarbourCats/master/HC_attendance_2013.csv")
## Loading required package: httr

Looking at the attendance data

Now that the data have been read into our R workspace, the first order of business is a simple plot of the raw attendance data.

# #####
# 
# simple line plot of data series
ggplot(HarbourCat.attend, aes(x = num, y = attend)) + geom_point() + geom_line() + 
    ggtitle("HarbourCats attendance \n2013 season") + annotate("text", label = "Opening Night", 
    x = 3.5, y = 3050, size = 3, fontface = "bold.italic")

From this plot, it's easy to see the spike on the opening game, and the end-of-season surge for the final two games.

When exploring data, it's valuable to get a sense of the distribution. R provides a “summary()” function as well as “sd()” for the standard deviation.

# summarize the distribution of 'attend'
summary(HarbourCat.attend$attend)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     885    1090    1250    1440    1580    3030
sd(HarbourCat.attend$attend)
## [1] 507.8

When looking at these summary stats, a couple of things jump out at me. First of all, the standard deviation is large compared to the total range, suggesting a very dispersed data set. The second thing I notice is that the mean is almost half a standard deviation larger than the median, indicating a skew in the data toward the large end.

While these numerical representations of the distribution are valuable, a plot of the data can help us understand the data still further. A great graphic tool for looking at a distribution and to identify outliers is the box plot (also known as the box-and-whisker plot).

#
boxplot(HarbourCat.attend$attend, ylab = "attendance", main = "Box plot of HarbourCat attendance")

The box is drawn with the first quartile as the lower edge, and the third quartile as the top edge. The median of the distribution is shown with the thick line that runs across the box. The whiskers show the range of the data, excluding the outliers. And the three dots (in this case, at the top of the chart) are the outliers, defined as being more than 1.5 times the interquartile range (i.e. Q3 - Q1) beyond Q3 or Q1.
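As a rough illustration of that rule (my own sketch, not part of the original code), the upper fence can be computed directly and used to pick out the outlying games:

# upper fence = Q3 + 1.5 * IQR; values above it are the outliers flagged in the box plot
q <- quantile(HarbourCat.attend$attend, c(0.25, 0.75))
upper.fence <- q[2] + 1.5 * (q[2] - q[1])
HarbourCat.attend$attend[HarbourCat.attend$attend > upper.fence]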

Since something special was clearly happening at those games (opening night and the end-of-season surge), let's omit those three values as extreme outliers that were influenced by something other than the weather or the day of the week. Once we've done that, we'll use the “summary()” function again to describe the distribution of the values.

# #####
# 
# prune the extreme outliers and structure the data so that attendance is
# last and will appear as the Y axis on plots
HarbourCat.attend.data <- (subset(HarbourCat.attend, num > 1 & num < 26, select = c(num, 
    day2, sun, temp.f, attend)))
# print the data table to screen
HarbourCat.attend.data
##    num day2 sun temp.f attend
## 2    2    1   4     64   1082
## 3    3    3   4     66   1542
## 4    4    1   2     63   1014
## 5    5    1   2     60   1003
## 6    6    1   3     66   1015
## 7    7    3   5     64   1248
## 8    8    3   5     70   1640
## 9    9    2   1     64   1246
## 10  10    3   5     70   1591
## 11  11    3   5     73   1620
## 12  12    2   5     70   1402
## 13  13    1   5     72   1426
## 14  14    1   5     73   1187
## 15  15    1   5     72   1574
## 16  16    3   5     73   1515
## 17  17    3   5     70   1052
## 18  18    2   5     72   1208
## 19  19    3   5     69   1292
## 20  20    2   5     71   1218
## 21  21    1   5     64   1013
## 22  22    1   5     62   1104
## 23  23    1   2     63    885
## 24  24    1   3     63   1179
## 25  25    3   4     73   1731
# summarize the distribution of the pruned version of 'attend'
summary(HarbourCat.attend.data$attend)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     885    1070    1230    1280    1520    1730
sd(HarbourCat.attend.data$attend)
## [1] 245.7

From these summary statistics, we see that the nature of the data set has changed significantly. The median and mean are almost identical, and the standard deviation is half the magnitude without the outliers.

The scatterplot matrix

With the outliers removed, we can move on to the scatter plot matrix. This time we'll just run the all-in version that includes a smoothing line on the scatter plot, as well as a histogram of the variable and the correlation coefficients.


# ################### scatter plot matrix ###################
# 
# scatter plot matrix - with correlation coefficients define panels
# (copy-paste from the 'pairs' help page)
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
    usr <- par("usr")
    on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if (missing(cex.cor)) 
        cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
#
panel.hist <- function(x, ...) {
    usr <- par("usr")
    on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5))
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks
    nB <- length(breaks)
    y <- h$counts
    y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
# run pairs plot
pairs(HarbourCat.attend.data[, 1:5], upper.panel = panel.cor, diag.panel = panel.hist, 
    lower.panel = panel.smooth)

Conclusion

A few more data points haven't fundamentally changed the analysis. Temperature remains the best predictor of attendance, with a correlation coefficient of 0.68. The day of the week was also a fairly strong predictor, with bigger crowds on Friday and Saturday nights than at the Sunday day games and the weekday evening games. (No surprise there, really.)

I was surprised to see that attendance seemed to flag as the season went on – you can see the drop shown in the smoothing line in the plot in the lower left corner (num by attend, where num is the number of the game from 2-25). But this drop can be explained by both the day of the week and the temperature. From Monday July 29 to Thursday August 1, the temperature was 17 or 18 Celsius (62 to 63 Fahrenheit). On the Wednesday of this stretch (July 31), under mainly cloudy skies and with the temperature at 17 Celsius (63 Fahrenheit), only 885 people turned up to watch the game – the only time all season the HarbourCats drew fewer than 1,000 fans to a game.

The code and data for this analysis can be found at GitHub: https://github.com/MonkmanMH/HarbourCats

-30-

August 12, 2013

Ichiro: not just hits

As we count down to Ichiro's 4,000th hit (combined Nippon Professional Baseball and Major League Baseball), it's worth remembering his outstanding abilities in the field.  The best representation for my dollar is this award-winning photo captured by Scott Eklund:

Ichiro makes a leaping catch (photo: Scott Eklund; click to link)

-30-

July 18, 2013

Fair weather fans? (An R scatter plot matrix)

The Victoria HarbourCats are roughly half way through their inaugural season in the West Coast League, and currently lead the league in average attendance.  In a recent conversation, one of the team's staff mentioned that after the first game in early June, the fans started to come out as the sun appeared and the weather got warmer.

In spite of the very small sample size, this question presented an opportunity to teach myself how to create a scatter plot matrix in R.

The first step was to get the necessary data fields from the HarbourCats website, pulling attendance, temperature, and other information from the box score for each home game.  (Here's the box score for the team's first game, played on June 5,  2013.)

I entered the day of the week, and then coded it as 1 = Monday - Thursday, 2 = Sunday, and 3 = Friday & Saturday.  I also thought that whether the game was played in the afternoon or the evening might matter, so I coded that as 0 = afternoon and 1 = evening (but so far, the only day games have been the two Sunday games).  The amount of sun in the sky was taken from the description in the box score, and coded as 1 = cloudy, 2 = mostly cloudy, 3 = partly cloudy, 4 = mostly sunny, and 5 = sunny.  I entered the temperature as both Celsius and Fahrenheit, and made additional notes about the game, such as opening night.  And finally, there's a variable "num", which is the number of the game (e.g. 15 is the 15th home game of the season).  The data table is here (Google drive).
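For illustration, here's a hypothetical sketch of the day-of-week recoding described above (the object names and weekday abbreviations are my own, not from the original spreadsheet):

# map abbreviated weekday names to the day2 codes: 1 = Mon-Thu, 2 = Sun, 3 = Fri & Sat
day.lookup <- c(Mon = 1, Tue = 1, Wed = 1, Thu = 1, Fri = 3, Sat = 3, Sun = 2)
day.name <- c("Wed", "Fri", "Sun")       # assumed weekday abbreviations for three games
day2 <- unname(day.lookup[day.name])     # gives 1, 3, 2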

First up was to run a quick line plot, to look at the variation in attendance as the season has progressed.

# ######################
# HARBOURCATS ATTENDANCE
# ######################
#
# data: pulled from www.harbourcats.com
# saved on Google Drive:
# https://docs.google.com/spreadsheet/ccc?key=0Art4wpcrwqkBdHZvTUFzOUo5U3BzMHFveXdYOTdTWUE&usp=sharing
# File / Download as >> Comma Separated Values (CSV)
#
# read the csv file
HarbourCat.attend <- read.csv("HarbourCat_attendance - Sheet1.csv")
#
# load ggplot2
if (!require(ggplot2)) install.packages("ggplot2")
library(ggplot2)
#
#
# #####
#
# simple line plot of data series
#
ggplot(HarbourCat.attend, aes(x=num, y=attend)) + 
    geom_point() + 
    geom_line()+
    ggtitle("HarbourCats attendance \n2013 season") +
    annotate("text", label="Opening Night", x=3, y=3050, size=3,    
      fontface="bold.italic")
#




As the plot shows, opening night was an exceptional success, with just over 3,000 baseball fans turning up to see the HarbourCats take the field for the first time.  For the purpose of this analysis, though, it's an extreme outlier and (particularly because of the extremely small sample size) it will play havoc with the correlation results; for the analysis that follows it will be removed.

The next thing that the plot shows is that four of the five games that followed the home opener remain the lowest-attendance games. The lowest attendance was June 12, when 1,003 fans turned out on a 16 degree Wednesday evening. After that, attendance has been variable, but has generally been stronger as the games moved into the later part of June and into early July.

A way to look at correlations between a variety of variables is a scatter plot matrix. For this, I turned to Recipe 5.13 of the R Graphics Cookbook by Winston Chang (who maintains the Cookbook for R site and is one of the software engineers behind the IDE RStudio, which you should use.)

Note: this analysis is over-the-top for the number of data points available!  I didn't bother to test for statistical significance.

# #####
# prune the extreme outlier that is opening night
# and structure the data so that attendance is last and will appear as the Y axis on plots
HarbourCat.attend.data <- (subset(HarbourCat.attend, num > 1, select=c(num, day2, sun, temp.f, attend)))
# print the data table to screen
HarbourCat.attend.data
#
   num day2 sun temp.f attend
2    2    1   4     64   1082
3    3    3   4     66   1542
4    4    1   2     63   1014
5    5    1   2     60   1003
6    6    1   3     66   1015
7    7    3   5     64   1248
8    8    3   5     70   1640
9    9    2   1     64   1246
10  10    3   5     70   1591
11  11    3   5     73   1620
12  12    2   5     70   1402
13  13    1   5     72   1426
14  14    1   5     73   1187
15  15    1   5     72   1574
#
# ###################
# scatter plot matrix
# ###################
#
# scatter plot matrix - simple
# (see Winston Chang, "R Graphics Cookbook", recipe 5.13)
pairs(HarbourCat.attend.data[,1:5])

Simple scatter plot matrix


The "pairs" function used above creates a simple scatter plot matrix, with multiple X-Y plots comparing all of the variables that have been defined.  While this is a quick way to visually check to see if there are any relationships, it is also possible to create a more complex scatter plot matrix, with the correlation coefficients and a histogram showing the distribution of each variable.  (And to the histogram, I'll also add a loess trend line.) For exploratory data analysis, this is much more valuable.


# #####
#
# scatter plot matrix - with correlation coefficients
# define panels (copy-paste from the "pairs" help page)
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
#
panel.hist <- function(x, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5) )
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
#
pairs(HarbourCat.attend.data[,1:5], upper.panel = panel.cor,
                                    diag.panel = panel.hist,
                                    lower.panel = panel.smooth)



Fully loaded scatter plot matrix

What we see from this plot is that attendance is most strongly correlated with temperature -- a correlation coefficient of 0.70.  (Before you jump to any conclusions:  did I mention the extremely small sample size?!?) There are also moderately strong correlations with the sunniness of the day, and with whether the game was played mid-week, on Sunday, or on Friday/Saturday.  In summary, attendance has been highest on warm Friday and Saturday evenings, and cool mid-week days have seen the lowest attendance.  (Are we surprised by this?)

With more sunny warm weather in the forecast we can only hope that the correlation with attendance continues through the remainder of the season.  The schedule looks favourable for high attendance, too -- they've got four mid-week games, three Sunday matinees (including the last game of the season), and six Friday-Saturday games.

When the season is concluded and more data points are added to the data set, I will revisit this analysis.

-30-