Berton Gunter
2004-May-19 01:20 UTC
[R] R Optimization 101 for R Newcomers: An extended example
To new R programmers: This long-winded note gives an example of optimizing in R that should be of interest only to newcomers to the language; others should ignore it. My hope is that it might help illuminate some basic notions of code improvement, looping, and vectorization in R. However, I welcome comments from anyone that enlarge, correct, or improve it. I remain a novice R programmer. For the curious, the hardware/opsys details are a 1.5 GHz Windows 2000 machine running R 1.9.0.

The task is this: I have a data set, mydat, consisting of values >= 0, and a fixed constant, K, that is "much" larger than any of the values. I want to simulate the distribution of the number of bootstrap draws (random sampling with replacement) from mydat needed to make the sum of the values drawn just exceed K. That is, repeatedly bootstrap sample from mydat until the sum of the results exceeds K. Suppose this requires 7 draws; then 7 is a random sample of size 1 from the distribution I'm interested in. I want to repeat this procedure a "large" number, nsim, of times to estimate the distribution. Before reading on, I invite R novices to consider how -- or better yet, write code -- to do this.

Approach 1 (naive/straightforward): Perform the simulation exactly as described by using a nested loop. The outer loop goes from 1 to nsim and records the number of draws needed. The inner loop repeatedly draws a sample of size 1 from mydat and accumulates and checks the sum until it exceeds K, returning the number of draws this took to the outer loop. This is easy to program using a "while" loop for the inner loop (the first sketch below shows the idea) and takes about a short look's time out my window (a few seconds) to run for modest nsim (5-10K). But surely one can do better...

Approach 2: Essentially the same, except that all the sampling is done "at once." Each execution of the inner loop in Approach 1 required R to call the sample() function to get a bootstrap sample of size 1. This is very inefficient, because there is a high overhead in making that function call at each iteration. It can be avoided by calling sample() only once -- or at most a small number of times -- to get a large sample, and then using indexing to work one's way through that large sample. This turns out to be a little more complicated than Approach 1, because you first have to figure out how large a "large" sample should be, and then you have to be a little careful about setting up your indexing. But it's not very difficult (especially for any C programmer who can manipulate pointers), and random sampling is so fast that it's no problem to generate a sample WAYYY bigger than you need (5 or 10 times as large, even) just to be safe. Or generate a not-so-big version and just check at each outer-loop iteration whether you need to generate some more. Doing it this way -- now using indexing, not function calls, in the inner loop (the second sketch below) -- made a considerable improvement (5-10 fold). It now took about one quick glance out the window to generate the distribution (less than a second). This was more than adequate for my needs under any conceivable situation.

But still, self-respecting R programmers are supposed to eschew looping, and here was this ugly(?) loop within a loop! So let us persevere.

Approach 3: The question is -- how to remove that inner loop? Again, I invite novice R programmers to think about that before continuing on...

The key here is that the inner loop is doing a very simple calculation: just accumulating a sum. Forsooth! -- the answer fairly leaps out: R already has a cumsum() function that does this (looping in the underlying C code), so one only has to figure out how to make use of it. The essential idea is to take a small sample from the large mydat sample that is guaranteed to be bigger than one needs to total up to more than K. Again, one can be profligate: if I need about 10 mydat values on average to sum up to K, then if I sample 30 or 40 or some such thing, I'm sure to have enough (and can always add a check, or use try(), to make sure if I am faint of heart). If smallsamp is this sample of the next 30 or 40 values from my large bootstrap sample, then the following code vectorizes the inner loop:

   count <- sum(cumsum(smallsamp) < K) + 1

Note a trick here: the < K produces a vector of logical TRUEs for all cumsum values less than K. These are silently treated as numeric 1's when added by sum(). A moment's thought will make clear why 1 must be added at the end. Sure enough, this vectorization reduced the time about another twofold -- to an eyeblink -- for my modest nsim sizes. (The third sketch below combines this with the indexing of Approach 2.)
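Here is a minimal sketch of what Approach 1 might look like. It assumes mydat and K already exist; nsim, result, total, and ndraws are just names I've chosen for illustration:

   ## Approach 1: naive nested loop -- one sample() call per draw
   nsim <- 5000
   result <- numeric(nsim)               # one draw count per simulated trial
   for (i in 1:nsim) {
       total <- 0
       ndraws <- 0
       while (total < K) {               # accumulate until the sum reaches K
           total <- total + sample(mydat, 1, replace = TRUE)
           ndraws <- ndraws + 1
       }
       result[i] <- ndraws
   }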
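And a sketch of Approach 2 in the same spirit. The tenfold oversize factor and the crude guess at the average number of draws are my own choices, not sacred ones; a careful version would also check that pos never runs off the end of bigsamp, and replenish if it does:

   ## Approach 2: call sample() once, then index through the big sample
   nsim <- 5000
   avgdraws <- ceiling(K / mean(mydat))  # rough guess at draws needed per trial
   bigsamp <- sample(mydat, 10 * avgdraws * nsim, replace = TRUE)  # WAYYY more than needed
   result <- numeric(nsim)
   pos <- 1                              # current position in bigsamp
   for (i in 1:nsim) {
       total <- 0
       ndraws <- 0
       while (total < K) {
           total <- total + bigsamp[pos] # indexing, not a function call, per draw
           pos <- pos + 1
           ndraws <- ndraws + 1
       }
       result[i] <- ndraws
   }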
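Finally, a sketch putting the cumsum() trick together with the big-sample indexing. Here blocksize plays the role of the "30 or 40" above (again my choice), and the stopifnot() line is the faint-of-heart check just mentioned:

   ## Approach 3: cumsum() replaces the inner loop entirely
   nsim <- 5000
   blocksize <- 40                       # surely enough values to exceed K
   bigsamp <- sample(mydat, blocksize * nsim, replace = TRUE)  # generous sizing
   result <- numeric(nsim)
   pos <- 1
   for (i in 1:nsim) {
       smallsamp <- bigsamp[pos:(pos + blocksize - 1)]
       count <- sum(cumsum(smallsamp) < K) + 1  # the vectorized inner "loop"
       stopifnot(count <= blocksize)     # verify blocksize really was enough
       result[i] <- count
       pos <- pos + count                # step past the values actually used
   }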
Approach 4: The last stage, of course, is to get rid of the outer loop, which fills the vector for the distribution one element at a time -- again an R no-no. One way to do this -- the ONLY way I could think of (so a challenge to smarter R programmers) -- is to make use of apply(), which basically can always remove loops. Unfortunately, this is an illusion, because what the apply() functions actually do is hide the loops in clever R code, not remove them. Their virtue, as will be seen here, is making the code more transparent by transferring the bookkeeping of indexing to the R functions rather than having to handle it explicitly yourself. To use apply(), the simple idea is just to make up a matrix with nsim columns and enough rows that each column contains "enough" random draws from mydat to sum to more than K. Let's say that we decide M rows will do it. Then the code for the whole simulation is:

   drawmat <- matrix(sample(mydat, M*nsim, replace = TRUE), ncol = nsim)
   drawmat <- apply(drawmat, 2, cumsum) < K
   result <- colSums(drawmat) + 1

The first statement generates the matrix of bootstrap samples, while the second and third use the same trick as previously to get the number of rows needed for each column to total more than K. Note that I used the handy (and fast) internal colSums() function to add up the 1's in the columns. This code is much shorter, and I would say more transparent, than the code produced by the previous explicit looping strategies. However, not surprisingly, it runs no faster (nor slower) than the Approach 3 code. The looping is still there, hidden in the apply().

I hope that this long-winded example will help R newcomers in their entry into what may seem like a daunting programming language. As others have, I would urge all to consult R's quite remarkable documentation and Help facility (BEFORE they post to this list), as well as Venables and Ripley's two indispensable books on S language programming and data analysis.

Cheers,

Bert Gunter
Non-Clinical Biostatistics
Genentech

"The business of the statistician is to catalyze the scientific learning process." -- George E.P. Box