Paul Johnson
2010-Sep-09 01:53 UTC
[R] Fast / dependable way to "stack together" data frames from a list
Hi, everybody:

I asked about this in r-help last week and promised a summary of answers. Special thanks to the folks who helped me understand do.call and pointed me toward plyr.

We face this problem all the time. A procedure generates a list of data frames. How to stack them together?

The short answer is that the plyr package's rbind.fill function is probably the fastest method that is not prone to trouble and does not require much user caution.

result <- rbind.fill(mylist)

A slower alternative that also works is

result <- do.call("rbind", mylist)

That is always available in R and it works well enough, even though it is not quite as fast. Both of these are much faster than a loop that repeatedly applies "rbind".

Truly blazing speed can be found if we convert the data frames into matrices, but that is only an option when every column has the same type (all numeric, say); a data frame with mixed column types cannot be held in a matrix without coercion.

I've run this quite a few times, and the relative speed of the different approaches has never changed much.

If you run this, I hope you will feel smarter, as I do! :)

## stackListItems.R
## Paul Johnson <pauljohn at ku.edu>
## 2010-09-07

## Here is a test case
df1 <- data.frame(x=rnorm(100), y=rnorm(100))
df2 <- data.frame(x=rnorm(100), y=rnorm(100))
df3 <- data.frame(x=rnorm(100), y=rnorm(100))
df4 <- data.frame(x=rnorm(100), y=rnorm(100))

mylist <- list(df1, df2, df3, df4)

## Here's the way we have done it. We understand this,
## we believe the result, it is easy to remember. It is
## also horribly slow for a long list.
resultDF <- mylist[[1]]
for (i in 2:4) resultDF <- rbind(resultDF, mylist[[i]])

## It works better to just call rbind once, as in:
resultDF2 <- rbind(mylist[[1]], mylist[[2]], mylist[[3]], mylist[[4]])

## That is faster because it calls rbind only once.
## But who wants to do all of that typing? How tiresome.
## Thanks to Erik Iverson in r-help, I understand that
resultDF3 <- do.call("rbind", mylist)
## is doing the EXACT same thing.
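The equivalence between the typed-out rbind call and do.call can be checked on a tiny list. This is only a minimal sketch: the list `parts` and the object names below are made up for illustration, and the explicit as.call/eval step is one way to picture what do.call does, not a description of its internals.

```r
## Small stand-in list (hypothetical data, not the mylist used in this script)
parts <- list(
    data.frame(x = 1:2, y = c("a", "b")),
    data.frame(x = 3:4, y = c("c", "d")),
    data.frame(x = 5:6, y = c("e", "f"))
)

## Every argument typed out by hand
byHand <- rbind(parts[[1]], parts[[2]], parts[[3]])

## do.call hands each list element to rbind as a separate argument
viaDoCall <- do.call("rbind", parts)

## The same idea spelled out: build the call object, then evaluate it
builtCall <- as.call(c(list(as.name("rbind")), parts))
viaEval <- eval(builtCall)

all.equal(byHand, viaDoCall)
all.equal(byHand, viaEval)
```

The do.call form does not care how long the list is, which is the whole point when the list has a thousand elements.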
## Erik explained that "do.call("rbind", mylist)"
## is *constructing* a function call from the list of arguments.
## It is shorthand for "rbind(mylist[[1]], mylist[[2]], mylist[[3]])"
## assuming mylist has 3 elements.

## Check the result:
all.equal(resultDF2, resultDF3)

## You often see people claim it is fast to allocate all
## of the required space in one shot and then fill it in.
## I got this algorithm from code in the
## "complete" function in the "mice" package.
## It allocates a big matrix of 0's and
## then it places the individual data frames into that matrix.
m <- 4
nr <- nrow(df1)
nc <- ncol(df1)
resultDF4 <- as.data.frame(matrix(0, nrow = nr*m, ncol = nc))
for (j in 1:m) resultDF4[(((j-1)*nr) + 1):(j*nr), ] <- mylist[[j]]

## This is a bit error prone for my taste. If the data frames have
## different numbers of rows, some major code surgery will be needed.
##
## Dennis Murphy pointed out the plyr package, by Hadley Wickham.
## Dennis said "ldply() in the plyr package. The following is the same
## idea as do.call(rbind, l), only faster."
library("plyr")
resultDF5 <- ldply(mylist, rbind)
all.equal(resultDF, resultDF5)

## Plyr author Hadley Wickham followed up with
## "I think all you want here is rbind.fill:"
resultDF6 <- rbind.fill(mylist)
all.equal(resultDF, resultDF6)

## Gabor Grothendieck noted that if the elements in mylist were
## matrices, this would all work faster.
mylist2 <- lapply(mylist, as.matrix)
matrixDoCall <- do.call("rbind", mylist2)
all.equal(as.data.frame(matrixDoCall), resultDF)

## Gabor also showed a better way than 'system.time' to find out how
## long this takes on average using the rbenchmark package. Awesome!
#> library(rbenchmark)
#> benchmark(
#+ df = do.call("rbind", mylist),
#+ mat = do.call("rbind", L),
#+ order = "relative", replications = 250
#+ )

## To see the potentially HUGE impact of these changes, we need to
## make a bigger test case.
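If rbenchmark is not installed, a rough average can be had from base R alone. The following is a minimal sketch, not a replacement for rbenchmark: the helper `meanElapsed` and the small lists `smallList` and `smallList2` are hypothetical names invented for this illustration, and the timings will vary from machine to machine.

```r
## Hypothetical helper: mean elapsed seconds over `reps` calls of f()
meanElapsed <- function(f, reps = 20) {
    times <- replicate(reps, system.time(f())[["elapsed"]])
    mean(times)
}

## Small stand-in data: 50 data frames, plus the same data as matrices
smallList <- lapply(1:50, function(i) data.frame(x = rnorm(20), y = rnorm(20)))
smallList2 <- lapply(smallList, as.matrix)

## Average each approach over repeated runs instead of trusting one run
timings <- c(
    df  = meanElapsed(function() do.call("rbind", smallList)),
    mat = meanElapsed(function() do.call("rbind", smallList2))
)
sort(timings)
```

On inputs this small the elapsed times can round to zero, so the averages are only meaningful once the list is big enough for a single run to take a measurable amount of time.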
## I just used system.time to evaluate, but
## if this involved a close call, I'd use rbenchmark.

phony <- function(i){
    data.frame(w=rnorm(1000), x=rnorm(1000), y=rnorm(1000), z=rnorm(1000))
}
mylist <- lapply(1:1000, phony)

### First, try my usual way
resultDF <- mylist[[1]]
system.time(
    for (i in 2:1000) resultDF <- rbind(resultDF, mylist[[i]])
)
## wow, that's slow:
##    user  system elapsed
## 168.040   4.770 173.028

### Now do.call method:
system.time( resultDF3 <- do.call("rbind", mylist) )
all.equal(resultDF, resultDF3)
## Faster! Takes roughly one-eleventh as long
##   user  system elapsed
##  14.64    0.85   15.49

### Third, my adaptation of the complete function in the mice
### package:
m <- length(mylist)
nr <- nrow(mylist[[1]])
nc <- ncol(mylist[[1]])

system.time(
    resultDF4 <- as.data.frame(matrix(0, nrow = nr*m, ncol = nc))
)
colnames(resultDF4) <- colnames(mylist[[1]])
system.time(
    for (j in 1:m) resultDF4[(((j-1)*nr) + 1):(j*nr), ] <- mylist[[j]]
)
all.equal(resultDF, resultDF4)
## Disappointingly slow on the big case:
##   user  system elapsed
## 80.400   3.970  84.573

### That took much longer than I expected. Gabor's
### hint about the difference between matrix and data.frame
### turns out to be important. Do it again, but don't
### make the intermediate storage thing a data.frame:
mylist2 <- lapply(mylist, as.matrix)
m <- length(mylist2)
nr <- nrow(mylist2[[1]])
nc <- ncol(mylist2[[1]])

system.time(
    resultDF4B <- matrix(0, nrow = nr*m, ncol = nc)
)
colnames(resultDF4B) <- colnames(mylist[[1]])
system.time(
    for (j in 1:m) resultDF4B[(((j-1)*nr) + 1):(j*nr), ] <- mylist2[[j]]
)
### That's FAST!
###   user  system elapsed
###   0.07    0.00    0.07
all.equal(resultDF, as.data.frame(resultDF4B))

### Now the two methods from plyr.
system.time( resultDF5 <- ldply(mylist, rbind) )
## Just about as fast, much less error prone
##   user  system elapsed
##  1.290   0.000   1.306
all.equal(resultDF, resultDF5)

system.time( resultDF6 <- rbind.fill(mylist) )
##   user  system elapsed
##  0.450   0.000   0.459
all.equal(resultDF, resultDF6)

## Gabor was right. If we have matrices, do.call is
## just about as good as anything.
system.time( matrixDoCall <- do.call("rbind", mylist2) )
##   user  system elapsed
##  0.030   0.000   0.032
all.equal(as.data.frame(matrixDoCall), resultDF)

--
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas