I have a list (variable name data.list) of approx. 200k data.frames, each with dim() of approx. 100x3.

A call

    data <- do.call("rbind", data.list)

does not complete -- the run time is prohibitive (I killed the R session after 5 minutes).

I would think that merging data.frames is a common operation. Is there a better (more performant) function I could use?

Thank you.
Witold

--
Witold Eryk Wolski
The following might be nonsense, as I have no understanding of R internals; but...

"Growing" structures in R by iteratively adding new pieces is often warned to be inefficient when the number of iterations is large, and your rbind() invocation might fall under this rubric. If so, you might try issuing the call, say, 20 times over 10k disjoint subsets of the list, and then rbinding up the 20 large frames.

Again, caveat emptor.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
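A minimal sketch of the chunked approach Bert describes, assuming data.list is the original list of 200k data.frames; the chunk count of 20 follows his example and the variable names are illustrative:

    # Split the list into ~20 groups, rbind within each group,
    # then rbind the 20 intermediate frames.
    n.chunks <- 20
    grp <- cut(seq_along(data.list), breaks = n.chunks, labels = FALSE)
    chunks <- split(data.list, grp)
    partial <- lapply(chunks, function(x) do.call(rbind, x))
    result <- do.call(rbind, partial)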
There is substantial overhead in rbind.data.frame() because of the need to check the column types. Converting to matrix makes a huge difference in speed, but be careful of type coercion.

    testdf <- data.frame(matrix(runif(300), nrow=100, ncol=3))
    testdf.list <- lapply(1:10000, function(x) testdf)

    system.time(r.df <- do.call("rbind", testdf.list))
    system.time({
      testm.list <- lapply(testdf.list, as.matrix)
      r.m <- do.call("rbind", testm.list)
    })

On my machine:

    > system.time(r.df <- do.call("rbind", testdf.list))
       user  system elapsed
    195.105  36.419 231.930
    > system.time({
    +   testm.list <- lapply(testdf.list, as.matrix)
    +   r.m <- do.call("rbind", testm.list)
    + })
       user  system elapsed
      0.603   0.009   0.612

Sarah
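To make the type-coercion caveat concrete (a hypothetical toy example, not from the thread): as.matrix() on a data frame with mixed column types coerces everything to character, so the matrix trick is only safe when all columns share one type.

    # With a character column present, as.matrix() coerces the
    # numeric column to character as well.
    df <- data.frame(x = 1:2, y = c("a", "b"))
    m <- as.matrix(df)
    typeof(m)  # "character"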
Hi Bert,

You are most likely right. I just thought that do.call("rbind", ...) was somehow more clever and allocated the memory up front. My error.

After more searching I did find rbind.fill from the plyr package, which seems to do the job (it computes the size of the result data.frame and allocates it first).

best

--
Witold Eryk Wolski
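A sketch of the plyr route Witold mentions; rbind.fill accepts a list of data frames directly as its first argument, and it also tolerates frames whose column sets differ:

    # plyr::rbind.fill pre-computes the output size, then fills it.
    library(plyr)
    data <- rbind.fill(data.list)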
Your description of the data frames as "approx" imposes considerable difficulty and a speed penalty on any solution. If you want better performance, you need a better handle on the data you are working with.

For example, if you knew that every data frame had exactly three identically-named columns and exactly 100 rows, then you could preallocate the result data frame and loop through the input data, copying values directly to the appropriate destination locations in the result.

To the extent that you can figure out things like the union of all column names or the total number of rows before you start copying, you can adapt the above approach even if the input data frames are not identical. The key is not having to restructure/reallocate your result data frame as you go.

The bind_rows function in the dplyr package can do a lot of this for you... but being a general-purpose function it may not be as optimized as what you could do yourself with better knowledge of your data.

--
Sent from my phone. Please excuse my brevity.
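A minimal sketch of the preallocation idea Jeff describes, assuming every input frame has exactly 100 rows and the same three numeric columns (all variable names are illustrative):

    # Preallocate one vector per column, fill by offset, then assemble.
    n.per <- 100                        # assumed fixed row count per frame
    n <- length(data.list)
    total <- n * n.per
    cols <- names(data.list[[1]])
    out <- lapply(cols, function(j) numeric(total))  # assumes numeric columns
    names(out) <- cols
    for (i in seq_len(n)) {
      rows <- ((i - 1L) * n.per + 1L):(i * n.per)
      for (j in cols) out[[j]][rows] <- data.list[[i]][[j]]
    }
    result <- as.data.frame(out)

    # Or the general-purpose route Jeff mentions, which also takes a list:
    # dplyr::bind_rows(data.list)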
Hi,

Note that if your list of 200k data frames is the result of splitting a big data frame, then trying to rbind the result of the split is equivalent to reordering the original big data frame. More precisely,

    do.call(rbind, unname(split(df, f)))

is equivalent to

    df[order(f), , drop=FALSE]

(except for the rownames), but the latter is *much* faster!

Cheers,
H.

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
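A small self-contained check of the equivalence Hervé describes, on toy data with illustrative names:

    # Rbinding the split pieces reorders the original frame.
    df <- data.frame(x = runif(10), g = sample(letters[1:3], 10, replace = TRUE))
    f <- df$g
    a <- do.call(rbind, unname(split(df, f)))
    b <- df[order(f), , drop = FALSE]
    rownames(a) <- rownames(b) <- NULL   # rownames differ, values do not
    identical(a, b)  # TRUE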