Hi Bert,

You are most likely right. I just thought that do.call("rbind", ...) was somehow more clever and allocated the memory up front. My error. After more searching I did find rbind.fill from the plyr package, which seems to do the job (it computes the size of the result data.frame and allocates it first).

best

On 27 June 2016 at 18:49, Bert Gunter <bgunter.4567 at gmail.com> wrote:
> The following might be nonsense, as I have no understanding of R
> internals; but ....
>
> "Growing" structures in R by iteratively adding new pieces is often
> warned to be inefficient when the number of iterations is large, and
> your rbind() invocation might fall under this rubric. If so, you might
> try issuing the call, say, 20 times over 10k disjoint subsets of the
> list, and then rbinding up the 20 large frames.
>
> Again, caveat emptor.
>
> Cheers,
> Bert
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
>
> On Mon, Jun 27, 2016 at 8:51 AM, Witold E Wolski <wewolski at gmail.com> wrote:
>> I have a list (variable name data.list) of approx. 200k data.frames,
>> each with dim(data.frame) approx. 100x3.
>>
>> The call
>>
>> data <- do.call("rbind", data.list)
>>
>> does not complete - the run time is prohibitive (I killed the R session
>> after 5 minutes).
>>
>> I would think that merging data.frames is a common operation. Is
>> there a better (more performant) function that I could use?
>>
>> Thank you.
>> Witold
>>
>> --
>> Witold Eryk Wolski

--
Witold Eryk Wolski
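A minimal sketch of the rbind.fill() route mentioned above, assuming the plyr package is installed; the toy data.list below is a stand-in for the original 200k-element list. rbind.fill() accepts a list of data frames directly, and would also pad with NA any columns missing from individual frames:

library(plyr)

## Toy input: a list of small data.frames with identical columns
## (a stand-in for the original poster's 200k x (100x3) list).
data.list <- replicate(100,
                       data.frame(a = runif(5), b = runif(5), c = runif(5)),
                       simplify = FALSE)

## rbind.fill() sizes the result up front rather than growing it piecewise.
data <- rbind.fill(data.list)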
Hi,

Just to add my tuppence, which might not even be worth that these days...

I found the following blog post from 2013, which is likely dated to some extent, but provides some benchmarks for a few methods:

http://rcrastinate.blogspot.com/2013/05/the-rbinding-race-for-vs-docall-vs.html

There is also a comment there with a reference to using the data.table package, which I don't use, but which may be something to evaluate.

As Bert and Sarah hinted at, there is overhead in taking the repetitive piecemeal approach.

If all of your data frames have exactly the same column structure (column order, column types), it may be prudent to do your own pre-allocation: create a data frame of the target total row size, then "insert" each "sub" data frame by row indexing into the target structure.

Regards,

Marc Schwartz

> On Jun 27, 2016, at 11:54 AM, Witold E Wolski <wewolski at gmail.com> wrote:
> [quoted text elided; see the earlier messages above]
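A short sketch of the two suggestions above: data.table's rbindlist() and manual pre-allocation with row indexing. This is illustrative only; the toy data.list stands in for the original 200k-element list, and both routes assume every data frame has identical columns (same names, order, and types).

library(data.table)

## Toy stand-in for the original list of data frames
data.list <- replicate(1000,
                       data.frame(a = runif(100), b = runif(100), c = runif(100)),
                       simplify = FALSE)

## Route 1: rbindlist() allocates the full result once, then copies.
## It returns a data.table; coerce if a plain data.frame is needed.
res1 <- as.data.frame(rbindlist(data.list))

## Route 2: manual pre-allocation, then fill by row index.
n <- vapply(data.list, nrow, integer(1))
end <- cumsum(n)
start <- end - n + 1L
res2 <- data.frame(a = numeric(sum(n)), b = numeric(sum(n)), c = numeric(sum(n)))
for (i in seq_along(data.list)) {
  res2[start[i]:end[i], ] <- data.list[[i]]
}

As Sarah's timings in the next message show, Route 2 remains slow when the target is a data frame; if all columns share a single type, pre-allocating a matrix instead is much faster.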
That's not what I said, though, and it's not necessarily true. Growing an object within a loop _is_ a slow process, but that's not the problem here. The problem is using data frames instead of matrices: the need to manage column classes is very costly. Converting to matrices will almost always be enormously faster.

Here's an expansion of the previous example I posted, in four parts:

1. do.call with data frames - very slow - 34.317 s elapsed for 5000 data frames
2. do.call with matrices - very fast - 0.311 s elapsed
3. pre-allocated loop with data frames - even slower (!) - 82.162 s
4. pre-allocated loop with matrix pieces - still slow - 68.009 s (note that the pre-allocated target in the code below is still a data frame, which is why converting the pieces helps so little)

It matters whether the columns are converted to numeric or character, and the time doesn't scale linearly with list length. For a particular problem the best solution may vary greatly (and I didn't even include packages beyond the base functionality). In general, though, using matrices is faster than using data frames, and using do.call is faster than using a pre-allocated loop, which in turn is much faster than growing an object.

Sarah

> testsize <- 5000
>
> set.seed(1234)
> testdf <- data.frame(matrix(runif(300), nrow=100, ncol=3))
> testdf.list <- lapply(seq_len(testsize), function(x) testdf)
>
> ## Part 1: do.call("rbind", ...) on the list of data frames
> system.time(r.df <- do.call("rbind", testdf.list))
   user  system elapsed
 34.280   0.009  34.317
>
> ## Part 2: the same, after converting each piece to a matrix
> system.time({
+   testm.list <- lapply(testdf.list, as.matrix)
+   r.m <- do.call("rbind", testm.list)
+ })
   user  system elapsed
  0.310   0.000   0.311
>
> ## Part 3: pre-allocated data frame, filled by row indexing
> system.time({
+   l.df <- data.frame(matrix(NA, nrow=100 * testsize, ncol=3))
+   for(i in seq_len(testsize)) {
+     start <- (i-1)*100 + 1
+     end <- i*100
+     l.df[start:end, ] <- testdf.list[[i]]
+   }
+ })
   user  system elapsed
 81.890   0.069  82.162
>
> ## Part 4: matrix pieces, but the target is still a data frame
> system.time({
+   l.m <- data.frame(matrix(NA, nrow=100 * testsize, ncol=3))
+   testm.list <- lapply(testdf.list, as.matrix)
+   for(i in seq_len(testsize)) {
+     start <- (i-1)*100 + 1
+     end <- i*100
+     l.m[start:end, ] <- testm.list[[i]]
+   }
+ })
   user  system elapsed
 67.664   0.047  68.009

On Mon, Jun 27, 2016 at 1:05 PM, Marc Schwartz <marc_schwartz at me.com> wrote:
> [quoted text elided; see the earlier messages above]
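For completeness, a sketch of how the package routes mentioned earlier in the thread could be timed against Sarah's test objects. This assumes testdf.list from the transcript above is still in the workspace, and no timings are quoted here since they vary by machine:

library(data.table)
library(plyr)

## Package routes on the same 5000-element test list
system.time(r.dt <- rbindlist(testdf.list))   # data.table; returns a data.table
system.time(r.pf <- rbind.fill(testdf.list))  # plyr; returns a data.frame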