Paul Johnson
2010-Sep-04 18:37 UTC
[R] Please explain "do.call" in this context, or critique to "stack this list faster"
I've been doing some consulting with students who seem to come to R
from SAS. They are usually preoccupied with do loops and it is tough
to persuade them to trust R lists rather than keeping 100s of named
matrices floating around.

Often it happens that there is a list with lots of matrices or data
frames in it and we need to "stack those together". I thought it would
be a simple thing, but it turns out there are several ways to get it
done, and in this case, the most "elegant" way using do.call is not
the fastest, but it does appear to be the least prone to programmer
error.

I have been staring at ?do.call for quite a while and I have to admit
that I just need some more explanations in order to interpret it. I
can't really get why this does work

do.call("rbind", mylist)

but it does not work to do

sapply(mylist, rbind)

Anyway, here's the self-contained working example that compares the
speed of various approaches. If you send yet more ways to do this, I
will add them on and then post the result to my Working Example
collection.

## stackMerge.R
## Paul Johnson <pauljohn at ku.edu>
## 2010-09-02

## rbind is neat, but how to do it to a lot of
## data frames?

## Here is a test case

df1 <- data.frame(x=rnorm(100), y=rnorm(100))
df2 <- data.frame(x=rnorm(100), y=rnorm(100))
df3 <- data.frame(x=rnorm(100), y=rnorm(100))
df4 <- data.frame(x=rnorm(100), y=rnorm(100))

mylist <- list(df1, df2, df3, df4)

## Usually we have done a stupid
## loop to get this done

resultDF <- mylist[[1]]
for (i in 2:4) resultDF <- rbind(resultDF, mylist[[i]])

## My intuition was that this should work:
## lapply(mylist, rbind)
## but no! It just makes a new list

## This obliterates the columns
## unlist(mylist)

## I got this idea from code in the
## "complete" function in the "mice" package.
## It uses brute force to allocate a big matrix of 0's and
## then it places the individual data frames into that matrix.

m <- 4
nr <- nrow(df1)
nc <- ncol(df1)
dataComplete <- as.data.frame(matrix(0, nrow = nr*m, ncol = nc))
for (j in 1:m) dataComplete[(((j-1)*nr) + 1):(j*nr), ] <- mylist[[j]]

## I searched a long time for an answer that looked better.
## This website is helpful:
## http://stackoverflow.com/questions/tagged/r
## I started to type in the question and 3 plausible answers
## popped up before I could finish.

## The terse answer is:
shortAnswer <- do.call("rbind", mylist)

## That's the right answer, see:

shortAnswer == dataComplete

## But I don't understand why it works.
## More importantly, I don't know if it is fastest, or best.
## It is certainly less error prone than "dataComplete"

## First, make a bigger test case and use system.time to evaluate

phony <- function(i){
  data.frame(w=rnorm(1000), x=rnorm(1000), y=rnorm(1000), z=rnorm(1000))
}
mylist <- lapply(1:1000, phony)

### First, try the terse way
system.time( shortAnswer <- do.call("rbind", mylist) )

### Second, try the complete way:
m <- 1000
nr <- nrow(mylist[[1]])  ## dimensions of the new, larger data frames
nc <- ncol(mylist[[1]])

system.time(
  dataComplete <- as.data.frame(matrix(0, nrow = nr*m, ncol = nc))
)

system.time(
  for (j in 1:m) dataComplete[(((j-1)*nr) + 1):(j*nr), ] <- mylist[[j]]
)

## On my Thinkpad T62 dual core, the "shortAnswer" approach takes about
## three times as long:

## > system.time( bestAnswer <- do.call("rbind", mylist) )
##    user  system elapsed
##  14.270   1.170  15.433

## > system.time(
## +   dataComplete <- as.data.frame(matrix(0, nrow = nr*m, ncol = nc))
## + )
##    user  system elapsed
##   0.000   0.000   0.006

## > system.time(
## +   for (j in 1:m) dataComplete[(((j-1)*nr) + 1):(j*nr), ] <- mylist[[j]]
## + )
##    user  system elapsed
##   4.940   0.050   4.989

## That makes the do.call way look slow, and I said "hey,
## our stupid for loop at the beginning may not be so bad."
## Wrong. It is a disaster. Check this out:

## > resultDF <- phony(1)
## > system.time(
## +   for (i in 2:1000) resultDF <- rbind(resultDF, mylist[[i]])
## + )
##    user  system elapsed
## 159.740   4.150 163.996

-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas
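One more way to stack the list, offered here as a sketch in the spirit
of the request above: base R's Reduce() folds rbind over the list two
elements at a time. It is as terse as do.call, but it re-copies the
accumulated result at every step, so expect it to scale like the
explicit loop rather than like do.call.

## fold rbind pairwise over the list (slow on long lists)
reduceAnswer <- Reduce(rbind, mylist)
## collapse the element-wise comparison to a single TRUE/FALSE
all(reduceAnswer == shortAnswer)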
Erik Iverson
2010-Sep-04 18:49 UTC
[R] Please explain "do.call" in this context, or critique to "stack this list faster"
On 09/04/2010 01:37 PM, Paul Johnson wrote:

> [...]
> I have been staring at ?do.call for quite a while and I have to admit
> that I just need some more explanations in order to interpret it. I
> can't really get why this does work
>
> do.call( "rbind", mylist)

do.call is *constructing* a function call from the list of arguments,
mylist. It is shorthand for

  rbind(mylist[[1]], mylist[[2]], mylist[[3]])

assuming mylist has 3 elements.

> but it does not work to do
>
> sapply ( mylist, rbind).

That's because sapply is calling rbind once for each item in mylist,
which is not what you want in order to accomplish your goal. It might
help to use a debugging technique to watch when rbind gets called, and
to see how many times it gets called and with what arguments under the
two approaches.

[rest of quoted message snipped]
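A small sketch of the debugging idea Erik mentions, using the
four-data-frame mylist from the top of Paul's script; the wrapper name
countingRbind is made up for illustration. It counts how many times
rbind is actually invoked under each approach.

## wrap rbind so each call increments a counter
calls <- 0
countingRbind <- function(...) {
  calls <<- calls + 1
  rbind(...)
}

calls <- 0
invisible(do.call(countingRbind, mylist))
calls  ## 1: do.call builds one big call, rbind(df1, df2, df3, df4)

calls <- 0
invisible(sapply(mylist, countingRbind))
calls  ## 4: sapply calls rbind once per element, each with one argument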
Joshua Wiley
2010-Sep-04 20:41 UTC
[R] Please explain "do.call" in this context, or critique to "stack this list faster"
To echo what Erik said, the second argument of do.call(), args, takes a
list of arguments that it passes to the specified function. Since
rbind() can bind any number of data frames, every data frame in mylist
is rbind()ed in a single call. These two calls should take about the
same time (except for time saved typing):

rbind(mylist[[1]], mylist[[2]], mylist[[3]], mylist[[4]]) # 1
do.call("rbind", mylist)                                  # 2

On my system, using:

set.seed(1)
dat <- rnorm(10^6)
df1 <- data.frame(x=dat, y=dat)
mylist <- list(df1, df1, df1, df1)

they do take about the same time (I started two instances of R and ran
both calls, but switched the order, because R has a way of being faster
the second time you do the same thing).

[1] "Order: 1, 2"
   user  system elapsed
   0.60    0.14    0.75
   user  system elapsed
   0.41    0.14    0.54
[1] "Order: 2, 1"
   user  system elapsed
   0.56    0.21    0.76
   user  system elapsed
   0.41    0.14    0.55

Using the for loop is much slower in your later example because rbind()
is getting called over and over, plus you are incrementally increasing
the size of the object containing your results.

> Often it happens that there is a list with lots of matrices or data
> frames in it and we need to "stack those together"

For my own curiosity, are you reading in a bunch of separate data
files, or are these the results of various operations that you
eventually want to combine?

Cheers,

Josh

On Sat, Sep 4, 2010 at 11:37 AM, Paul Johnson <pauljohn32 at gmail.com> wrote:
[quoted message snipped]
-- 
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/
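A short sketch of the scaling behavior Josh describes; growLoop is an
illustrative helper, not from his message. Doubling the number of data
frames should much more than double the loop's elapsed time, because
every iteration re-copies the accumulated result.

## time the grow-by-rbind loop at two sizes and compare
growLoop <- function(n) {
  lst <- lapply(seq_len(n),
                function(i) data.frame(x = rnorm(100), y = rnorm(100)))
  out <- lst[[1]]
  system.time(for (i in 2:n) out <- rbind(out, lst[[i]]))[["elapsed"]]
}
growLoop(250)
growLoop(500)  ## expect substantially more than 2x growLoop(250)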
Gabor Grothendieck
2010-Sep-04 21:36 UTC
[R] Please explain "do.call" in this context, or critique to "stack this list faster"
On Sat, Sep 4, 2010 at 2:37 PM, Paul Johnson <pauljohn32 at gmail.com> wrote:

> Often it happens that there is a list with lots of matrices or data
> frames in it and we need to "stack those together". I thought it
> would be a simple thing [...]

This has nothing specifically to do with do.call, but note that R is
faster at handling matrices than data frames. Below we see that
rbind-ing 4 data frames takes over 100 times as long as rbind-ing
matrices holding the same data:

> mylist <- list(iris[-5], iris[-5], iris[-5], iris[-5])
> L <- lapply(mylist, as.matrix)
>
> library(rbenchmark)
> benchmark(
+   df  = do.call("rbind", mylist),
+   mat = do.call("rbind", L),
+   order = "relative", replications = 250
+ )
  test replications elapsed relative user.self sys.self user.child sys.child
2  mat          250    0.01        1      0.02     0.00         NA        NA
1   df          250    1.06      106      1.03     0.01         NA        NA

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
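A sketch building on Gabor's observation; stackViaMatrix is an invented
name, and the approach assumes every list element has identical,
all-numeric columns, so the matrix round-trip loses nothing.

## stack via matrices, converting back to a data frame once at the end
## (assumes identical, all-numeric columns in every element)
stackViaMatrix <- function(lst) {
  as.data.frame(do.call("rbind", lapply(lst, as.matrix)))
}
stacked <- stackViaMatrix(list(iris[-5], iris[-5], iris[-5], iris[-5]))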
Dennis Murphy
2010-Sep-04 21:38 UTC
[R] Please explain "do.call" in this context, or critique to "stack this list faster"
Hi:

Here's my test:

l <- vector('list', 1000)
for(i in seq_along(l)) l[[i]] <- data.frame(x=rnorm(100), y=rnorm(100))

system.time(u1 <- do.call(rbind, l))
   user  system elapsed
   0.49    0.06    0.60

resultDF <- data.frame()
system.time(for (i in 1:1000) resultDF <- rbind(resultDF, l[[i]]))
   user  system elapsed
  10.34    0.06   10.53

identical(u1, resultDF)
[1] TRUE

The problem with the second approach, which is really kind of an FAQ by
now, is that repeated application of rbind as a standalone function
results in 'Spaceballs: the search for more memory!' The base object
gets bigger as the iterations proceed, and something new is being
added, so more memory is needed to hold both the old and new objects.
This is an inefficient time killer, because as the loop proceeds,
increasingly more time is invested in finding new memory.

Interestingly, this doesn't scale linearly: if we make a list of 10000
100 x 2 data frames, I get the following:

> l <- vector('list', 10000)
> for(i in seq_along(l)) l[[i]] <- data.frame(x=rnorm(100), y=rnorm(100))
> system.time(u1 <- do.call(rbind, l))
   user  system elapsed
  55.56   30.62   88.11
> dim(u1)
[1] 1000000       2
> str(u1)
'data.frame':   1000000 obs. of  2 variables:
 $ x: num  -0.9516 -0.6948 0.0523 2.5798 -0.0862 ...
 $ y: num  1.466 0.165 1.375 0.571 -1.099 ...
> rm(u1)
> rm(resultDF)
> resultDF <- data.frame()
# go take a shower and come back....
> system.time(for (i in 1:10000) resultDF <- rbind(resultDF, l[[i]]))
   user  system elapsed
 977.33  121.41 1130.26
> dim(resultDF)
[1] 1000000       2

This time, neither do.call nor iterative rbind did very well. One
common way around this is to pre-allocate memory and then to populate
the object using a loop, but a somewhat easier solution here turns out
to be ldply() in the plyr package. The following is the same idea as
do.call(rbind, l), only faster:

> library(plyr)
> system.time(u3 <- ldply(l, rbind))
   user  system elapsed
   6.07    0.01    6.09
> dim(u3)
[1] 1000000       2
> str(u3)
'data.frame':   1000000 obs. of  2 variables:
 $ x: num  -0.9516 -0.6948 0.0523 2.5798 -0.0862 ...
 $ y: num  1.466 0.165 1.375 0.571 -1.099 ...

HTH,
Dennis

On Sat, Sep 4, 2010 at 11:37 AM, Paul Johnson <pauljohn32@gmail.com> wrote:
[quoted message snipped]
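Finally, a sketch of the pre-allocation workaround Dennis mentions;
stackPrealloc is an invented name, and the code assumes equal-sized,
all-numeric data frames. Allocate the full-size object once, fill it by
row index, and convert to a data frame a single time at the end.

## pre-allocate a matrix, fill by row index, convert once at the end
## (assumes equal-sized, all-numeric data frames)
stackPrealloc <- function(lst) {
  nr  <- nrow(lst[[1]])
  nc  <- ncol(lst[[1]])
  out <- matrix(0, nrow = nr * length(lst), ncol = nc)
  for (j in seq_along(lst))
    out[((j - 1) * nr + 1):(j * nr), ] <- as.matrix(lst[[j]])
  as.data.frame(out)
}
u4 <- stackPrealloc(l)
dim(u4)  ## 1000000 2 for the 10000-element list above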