I have a list (variable name data.list) of approx. 200k data.frames, each with dim() of approx. 100x3.

A call

    data <- do.call("rbind", data.list)

does not complete -- the run time is prohibitive (I killed the R session after 5 minutes).

I would think that merging data.frames is a common operation. Is there a better (more performant) function I could use?

Thank you.
Witold

--
Witold Eryk Wolski
The following might be nonsense, as I have no understanding of R internals; but...

"Growing" structures in R by iteratively adding new pieces is often warned to be inefficient when the number of iterations is large, and your rbind() invocation might fall under this rubric. If so, you might try issuing the call, say, 20 times over 10k disjoint subsets of the list, and then rbinding up the 20 large frames.

Again, caveat emptor.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
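A minimal sketch of the chunked approach Bert describes, assuming data.list is the original list of 200k data.frames; the chunk count of 20 follows his example and the variable names are illustrative:

    # Split the list into ~20 groups, rbind within each group,
    # then rbind the 20 intermediate frames.
    n.chunks <- 20
    grp <- cut(seq_along(data.list), breaks = n.chunks, labels = FALSE)
    chunks <- split(data.list, grp)
    partial <- lapply(chunks, function(x) do.call(rbind, x))
    result <- do.call(rbind, partial)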
There is substantial overhead in rbind.data.frame() because of the need to check the column types. Converting to matrix makes a huge difference in speed, but be careful of type coercion.

    testdf <- data.frame(matrix(runif(300), nrow=100, ncol=3))
    testdf.list <- lapply(1:10000, function(x) testdf)

    system.time(r.df <- do.call("rbind", testdf.list))
    system.time({
      testm.list <- lapply(testdf.list, as.matrix)
      r.m <- do.call("rbind", testm.list)
    })

On my machine:

    > system.time(r.df <- do.call("rbind", testdf.list))
       user  system elapsed
    195.105  36.419 231.930
    > system.time({
    +   testm.list <- lapply(testdf.list, as.matrix)
    +   r.m <- do.call("rbind", testm.list)
    + })
       user  system elapsed
      0.603   0.009   0.612

Sarah
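To make the type-coercion caveat concrete (a hypothetical toy example, not from the thread): as.matrix() on a data frame with mixed column types coerces everything to character, so the matrix trick is only safe when all columns share one type.

    # With a character column present, as.matrix() coerces the
    # numeric column to character as well.
    df <- data.frame(x = 1:2, y = c("a", "b"))
    m <- as.matrix(df)
    typeof(m)  # "character"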
Hi Bert,

You are most likely right. I just thought that do.call("rbind", ...) was somehow more clever and allocated the memory up front. My error.

After more searching I did find rbind.fill from the plyr package, which seems to do the job (it computes the size of the result data.frame and allocates it first).

best

--
Witold Eryk Wolski
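A sketch of the plyr route Witold mentions; rbind.fill accepts a list of data frames directly as its first argument, and it also tolerates frames whose column sets differ:

    # plyr::rbind.fill pre-computes the output size, then fills it.
    library(plyr)
    data <- rbind.fill(data.list)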
Your description of the data frames as "approx" imposes considerable difficulty and a speed penalty on any solution. If you want better performance, you need a better handle on the data you are working with.

For example, if you knew that every data frame had exactly three identically-named columns and exactly 100 rows, then you could preallocate the result data frame and loop through the input data, copying values directly to the appropriate destination locations in the result.

To the extent that you can figure out things like the union of all column names or the total number of rows before you start copying, you can adapt the above approach even if the input data frames are not identical. The key is not having to restructure/reallocate your result data frame as you go.

The bind_rows function in the dplyr package can do a lot of this for you... but being a general-purpose function it may not be as optimized as what you could do yourself with better knowledge of your data.

--
Sent from my phone. Please excuse my brevity.
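A minimal sketch of the preallocation idea Jeff describes, assuming every input frame has exactly 100 rows and the same three numeric columns (all variable names are illustrative):

    # Preallocate one vector per column, fill by offset, then assemble.
    n.per <- 100                        # assumed fixed row count per frame
    n <- length(data.list)
    total <- n * n.per
    cols <- names(data.list[[1]])
    out <- lapply(cols, function(j) numeric(total))  # assumes numeric columns
    names(out) <- cols
    for (i in seq_len(n)) {
      rows <- ((i - 1L) * n.per + 1L):(i * n.per)
      for (j in cols) out[[j]][rows] <- data.list[[i]][[j]]
    }
    result <- as.data.frame(out)

    # Or the general-purpose route Jeff mentions, which also takes a list:
    # dplyr::bind_rows(data.list)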
Hi,

Note that if your list of 200k data frames is the result of splitting a big data frame, then trying to rbind the result of the split is equivalent to reordering the original big data frame. More precisely,

    do.call(rbind, unname(split(df, f)))

is equivalent to

    df[order(f), , drop=FALSE]

(except for the rownames), but the latter is *much* faster!

Cheers,
H.

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
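A small self-contained check of the equivalence Hervé describes, on toy data with illustrative names:

    # Rbinding the split pieces reorders the original frame.
    df <- data.frame(x = runif(10), g = sample(letters[1:3], 10, replace = TRUE))
    f <- df$g
    a <- do.call(rbind, unname(split(df, f)))
    b <- df[order(f), , drop = FALSE]
    rownames(a) <- rownames(b) <- NULL   # rownames differ, values do not
    identical(a, b)  # TRUE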