December 1, 2013

A few random things

The creation of random numbers, or the random selection of elements in a set (or population), is an important part of statistics and data science. From simulating coin tosses to selecting potential respondents for a survey, we have a heavy reliance on random number generation.

R offers us a variety of solutions for random number generation; here's a quick overview of some of the options.

runif, rbinom, rnorm

One simple solution is to use the runif function, which generates a stated number of values between two end points (but not the end points themselves!) The function uses the continuous uniform distribution, meaning that every value between the two end points has an equal probability of being sampled.

Here's the code to produce 100 values between 1 and 100, and then print them.

RandomNumbers <- runif(100, 1, 100)
RandomNumbers
##   [1] 22.290 33.655 89.835 38.535 24.601 11.431  7.047 94.958 83.703 76.847
##  [11] 58.429 20.667 25.796 91.821  8.741 65.696 24.262  8.077 51.399 19.652
##  [21] 64.883 33.258 55.488  6.828 14.925 11.480 72.783  2.549 78.706 49.563
##  [31] 10.829 27.653 70.304 96.759 12.614 66.610 82.467  8.506 71.719 86.586
##  [41] 69.519 11.538 72.321 63.126 42.754 60.139 44.854 71.088 15.165 67.818
##  [51] 83.342  9.894 64.497 96.620 64.286 20.162 16.343 53.800 31.380 24.418
##  [61] 13.740 47.458 80.037 13.189 45.496 20.697 28.240 60.003 84.350 14.888
##  [71] 20.084  3.003  1.191 28.748  4.528 40.568 90.963 82.640 15.885 95.029
##  [81] 54.166 17.315 43.355  9.762 74.012 64.537 74.131 24.758 41.922 65.458
##  [91] 11.423 41.084 22.514 77.329 76.879 43.954 78.471 24.727 69.357 60.118

R helpfully has random generators from a plethora of distributions (see http://cran.r-project.org/web/views/Distributions.html under the heading “Random Number Generators”). For example, the equivalent function to pull random numbers from the binomial distribution is rbinom. In the following example, the code generates 100 iterations of a single trial where there's a 0.5 (50/50) probabilty – as you would get with one hundred coin tosses. So let's call the object OneHundredCoinTosses. The table function then gives us a count of the zeros and ones in the object.

OneHundredCoinTosses <- rbinom(100, 1, 0.5)
OneHundredCoinTosses
##   [1] 1 0 0 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 1 1 0 1 1 0 1 0 0 1 0 0 1 1 1 0 0
##  [36] 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 1 0 1 1 1 0 1 0 0 0 1 1 1 0 1 0 0 1
##  [71] 1 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 0 0 1 1 1 0 0 1 1 0 1 0 1
table(OneHundredCoinTosses)
## OneHundredCoinTosses
##  0  1 
## 48 52

In this variant, we'll toss our coin again, but this time it will be 100 iterations of 10 trials. R will generate the number of successes per trial. A plot of the histogram would show how with enough iterations, we'd get something that looks very much like a normal distribution curve.

OneHundredCoinTrials <- rbinom(100, 10, 0.5)
OneHundredCoinTrials
##   [1]  5  4  7  7  4  2  3  3  6  6  6  5  4  6  5  6  4  7  3  3  6  5  5
##  [24]  7  4  5  6  3  7  4 10  5  3  6  6  5  7  6  3  6  4  7  7  4  3  6
##  [47]  6  2  2  7  5  6  5  8  4  3  5  5  2  7  5  4  4  1  4  5  7  5  6
##  [70]  9  6  3  7  4  4  2  3  6  3  5  6  6  5  8  5  5  7  5  7  4  2  3
##  [93]  5  5  7  5  4  6  0  6
table(OneHundredCoinTrials)
## OneHundredCoinTrials
##  0  1  2  3  4  5  6  7  8  9 10 
##  1  1  6 13 16 23 21 15  2  1  1

And there's rnorm for the normal distribution. In this case, the second number in the function is the mean and the third is the standard deviation. With this example, the code generates 100 values from a normal distribution with a mean of 50 and a standard deviation of 12.5.

RandomNormal <- rnorm(100, 50, 12.5)
RandomNormal
##   [1] 52.67 36.55 57.48 63.00 53.65 52.05 65.39 36.16 53.03 53.22 53.61
##  [12] 65.66 47.08 41.87 80.60 67.34 49.56 54.09 50.51 48.35 33.72 47.31
##  [23] 51.22 51.01 56.76 51.01 79.84 37.27 33.67 41.73 49.34 62.04 61.48
##  [34] 37.49 56.54 63.87 49.13 36.11 47.14 34.67 57.88 34.45 50.46 48.68
##  [45] 66.36 45.32 51.72 36.64 41.35 48.09 35.50 55.30 42.13 26.29 30.68
##  [56] 53.08 60.53 40.96 35.04 60.64 74.57 49.00 62.41 37.19 52.64 39.80
##  [67] 31.66 49.13 36.05 49.98 55.00 72.42 56.64 53.44 50.50 54.02 60.74
##  [78] 39.32 53.72 75.50 46.87 12.10 45.09 70.40 53.11 37.36 58.97 67.09
##  [89] 45.63 55.44 46.66 31.69 36.68 59.28 55.09 53.25 49.52 59.87 59.16
## [100] 80.30

sample

Another approach to randomization is the sample function, which pulls elements from an object (such as a vector) of defined values or, alternatively, can be specified to select cases from a string of integers. The function also has the option of specifying whether replacement will be used or not. (See http://cran.r-project.org/web/packages/sampling/index.html)

In the first example of sample, we'll generate 100 values (the second value specified in the function) from the integers between 1 and 99 (the first value specified), with replacement – so there's a possibility of duplicates. The code adds the sort function so that we can easily spot the duplicates.

RandomSample <- sort(sample(99, 100, replace = TRUE))
RandomSample
##   [1]  1  1  2  2  2  5  5  5  7  8 11 11 12 12 12 13 15 15 16 16 19 19 21
##  [24] 23 23 23 23 23 25 26 27 30 30 31 32 33 34 35 35 35 35 37 38 40 41 42
##  [47] 42 42 42 45 46 46 47 48 48 50 52 52 54 54 54 54 54 57 58 61 62 63 63
##  [70] 64 66 66 67 67 69 70 71 71 71 73 73 74 78 78 80 80 82 83 84 84 85 86
##  [93] 86 91 93 94 96 96 98 98

In a second example, we'll generate 5 values (the second value specified) from a list of 13 names that we predefine, without replacement. Note that the default setting in sample is “without replacement”, so there should be no duplicates.

# the list of party-goers
dwarves <- c("Fíli", "Kíli", "Balin", "Dwalin", "Óin", "Glóin", "Bifur", "Bofur", 
    "Bombur", "Ori", "Nori", "Dori", "Thorin")  # draw a sorted sample of 50 with replacement
Party <- sort(sample(dwarves, 5))
# print the names
Party
## [1] "Dwalin" "Fíli"   "Nori"   "Ori"    "Thorin"

There is also the sample.int variant which insists on integers for the values. Here's the code to randomly select 6 numbers between 1 and 49, without replacement.

six49numbers <- sort(sample.int(49, 6, replace = FALSE))
six49numbers
## [1]  3 15 16 24 35 44

Controlling random number generation: set.seed and RNGkind

It sounds like an oxymoron – how can you control something that is random? The answer is that in many computer programs and programming languages, R included, many of the functions that are dubbed random number generation really aren't. I won't get into the arcana, but runif (and it's ilk) and sample all rely on pseudo-random approaches, methods that are close enough to being truly random for most purposes. (If you want to investigate this further in the context of R, I suggest starting with John Ramey's post at http://www.r-bloggers.com/pseudo-random-vs-random-numbers-in-r-2/ )

With the set.seed command, an integer is used to start a random number generation, allowing the same sequence of “random” numbers to be selected repeatedly. In this example, we'll use the code written earlier to sample 6 numbers between 1 and 49, and repeat it three times.

The first time through, set.seed will define the starting seed as 1, then for the second time through, the seed will be set to 13, leading to a different set of 6 numbers. The third iteration will reset the starting seed to 1, and the third sample set of 6 numbers will be the same as the first sample.

set.seed(1)
six49numbers <- sort(sample.int(49, 6))
six49numbers
## [1] 10 14 18 27 40 42
set.seed(13)
six49numbers <- sort(sample.int(49, 6))
six49numbers
## [1]  1  5 12 19 35 44
set.seed(1)
six49numbers <- sort(sample.int(49, 6))
six49numbers
## [1] 10 14 18 27 40 42

The first and third draws contain the same 6 integers.

Another control of the random number generation is RNGkind. This command defines the random number generation method, from an extensive list of methodologies. The default is Mersene Twister (http://en.wikipedia.org/wiki/Mersenne_twister), and a variety of others are available.

The R documentation page on Random{}, with both set.seed and RNGkind, can be found here: http://stat.ethz.ch/R-manual/R-devel/library/base/html/Random.html

random

While the methods above are pseudo-random, there are methods available that generate truly random numbers. One is the service provided by random.org (http://www.random.org/).

The R package random (documentation here: http://cran.r-project.org/web/packages/random/) uses the random.org service to generate random numbers and return them into an R object. While the functions in the package can return random integers, randomized sequences, and random strings, and has the flexibility to define the shape of the matrix (i.e. the number of columns).

It's worth nothing that free users or random.org are confronted by daily limits to the volume of calls you can make to random.org (paying customers don't have these limits).

Here's an example to generate 20 random numbers from random.org, defined as being between 100 and 999 (that is to say, three digit numbers) and present them in two columns.

# load random
if (!require(random)) install.packages("random")
## Loading required package: random
library(random)
# 
twentytruerandom <- randomNumbers(n = 20, min = 100, max = 999, col = 2, check = TRUE)
# note: the 'check=' sets whether quota at server should be checked first
twentytruerandom
##        V1  V2
##  [1,] 531 402
##  [2,] 559 367
##  [3,] 616 789
##  [4,] 830 853
##  [5,] 382 436
##  [6,] 336 737
##  [7,] 769 548
##  [8,] 293 818
##  [9,] 746 609
## [10,] 108 331

References

Paul Teetor's R Cookbook (O'Reilly, 2011) has a chapter on probability (Chapter 8) that includes good examples of various random number generation in R.

Jim Albert & Maria Rizzo, R by Example (Springer, 2012), Chapter 11 “Simulation Experiments” and Chapter 13 “Monte Carlo Methods”, contain a variety of applications of random number generation using sample and rbinom to approximate and understand probability experiments.

For an in-depth look at random sampling in the context of survey design, see Thomas Lumley Complex Surveys: A Guide to Analysis Using R (Wiley, 2010).

If you're interested in testing a random number generator, check out http://www.johndcook.com/blog/2010/12/06/how-to-test-a-random-number-generator-2/

Joseph Rickert's blog entry at blog.revolutionanalytics.com gives a good rundown of the applications and approach for parallel random number generation http://blog.revolutionanalytics.com/2013/06/intro-to-parallel-random-number-generation-with-revoscaler.html

-30-

No comments:

Post a Comment