thr3ads.net - R devel - [Rd] combining large list of data.frames [Apr 2012]

If this information is useful, please help other people find it:
Share via:

Cole Beck

2012-Apr-19 22:34 UTC

[Rd] combining large list of data.frames

It's normal for me to create a list of data.frames and then use 
do.call('rbind', list(...)) to create a single data.frame.  However, 
I've noticed as the size of the list grows large, it is perhaps better 
to do this in chunks.  As an example here's a list of 20,000 similar 
data.frames.

# create list of data.frames
dat <- vector("list", 20000)
for(i in seq_along(dat)) {
   size <- sample(1:30, 1)
   dat[[i]] <- data.frame(id=rep(i, size), value=rnorm(size), 
letter=sample(LETTERS, size, replace=TRUE), ind=sample(c(TRUE,FALSE), 
size, replace=TRUE))
}
# combine into one data.frame, normal usage
# system.time(do.call('rbind', dat)) # takes 2-3 minutes
combine <- function(x, steps=NA, verbose=FALSE) {
   nr <- length(x)
   if(is.na(steps)) steps <- nr
   while(nr %% steps != 0) steps <- steps+1
   if(verbose) cat(sprintf("step size: %s\r\n", steps))
   dl <- vector("list", steps)
   for(i in seq(steps)) {
     ix <- seq(from=(i-1)*nr/steps+1, length.out=nr/steps)
     dl[[i]] <- do.call("rbind", x[ix])
   }
   do.call("rbind", dl)
}
# combine into one data.frame
system.time(combine(dat, 100)) # takes 5-10 seconds

I'm very surprised by this result.  Does this improvement seem 
reasonable?  I would think "do.call" could utilize something similar
by
default when the length of "args" is too high.  Is using
"do.call" not
recommended in this scenario?

Regards,
Cole Beck

Patrick Aboyoun

2012-Apr-21 00:55 UTC

head link

[Rd] combining large list of data.frames

Cole,
Bioconductor's high throughput sequencing infrastructure package IRanges 
contains code that may be useful for speeding up base::rbind.data.frame. 
I've extracted the salient bits from that rbind method, but left the 
corner case handling code out. IRanges's rbind method took the approach 
of treating a data set as a list of equal length columns, and so it 
contains a number of lapplys and vector concatenation calls. Given that 
base::rbind.data.frame sits at the core of many operations, I'm not sure 
if patches would be accepted for it, but I could take a crack at it.

biocRBind <- function(..., deparse.level=1)
{
   ## Simplified version of IRanges's rbind method for DataFrame
   ## Removed all data checks, ignored row names
   args <- list(...)
   df <- args[[1L]]
   cn <- colnames(df)
   cl <- unlist(lapply(as.list(df, use.names = FALSE), class))
   factors <- unlist(lapply(as.list(df, use.names = FALSE), is.factor))
   cols <- lapply(seq_len(length(df)), function(i) {
     cols <- lapply(args, `[[`, cn[i])
     if (factors[i]) { # combine factor levels, coerce to character
       levs <- unique(unlist(lapply(cols, levels), use.names=FALSE))
       cols <- lapply(cols, as.character)
     }
     combined <- do.call(c, unname(cols))
     if (factors[i])
       combined <- factor(combined, levs)
     as(combined, cl[i])
   })
   names(cols) <- colnames(df)
   do.call(data.frame, cols)
}

# create list of data.frames
set.seed(123)
dat <- vector("list", 20000)
for(i in seq_along(dat)) {
   size <- sample(1:30, 1)
   dat[[i]] <- data.frame(id=rep(i, size), value=rnorm(size), 
letter=sample(LETTERS, size, replace=TRUE), ind=sample(c(TRUE,FALSE), 
size, replace=TRUE))
}

# sample runs
 > system.time(do.call(biocRBind, dat))
    user  system elapsed
   2.120   0.000   2.125
 > system.time(do.call(biocRBind, dat))
    user  system elapsed
   2.092   0.000   2.091
 > system.time(do.call(biocRBind, dat))
    user  system elapsed
   2.080   0.000   2.077
 > sessionInfo()
R Under development (unstable) (2012-04-19 r59111)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=C                 LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_2.16.0


Cheers,
Patrick


On 4/19/2012 3:34 PM, Cole Beck wrote:> It's normal for me to create a list of data.frames and then use 
> do.call('rbind', list(...)) to create a single data.frame. 
However,
> I've noticed as the size of the list grows large, it is perhaps better 
> to do this in chunks.  As an example here's a list of 20,000 similar 
> data.frames.
>
> # create list of data.frames
> dat <- vector("list", 20000)
> for(i in seq_along(dat)) {
>   size <- sample(1:30, 1)
>   dat[[i]] <- data.frame(id=rep(i, size), value=rnorm(size), 
> letter=sample(LETTERS, size, replace=TRUE), ind=sample(c(TRUE,FALSE), 
> size, replace=TRUE))
> }
> # combine into one data.frame, normal usage
> # system.time(do.call('rbind', dat)) # takes 2-3 minutes
> combine <- function(x, steps=NA, verbose=FALSE) {
>   nr <- length(x)
>   if(is.na(steps)) steps <- nr
>   while(nr %% steps != 0) steps <- steps+1
>   if(verbose) cat(sprintf("step size: %s\r\n", steps))
>   dl <- vector("list", steps)
>   for(i in seq(steps)) {
>     ix <- seq(from=(i-1)*nr/steps+1, length.out=nr/steps)
>     dl[[i]] <- do.call("rbind", x[ix])
>   }
>   do.call("rbind", dl)
> }
> # combine into one data.frame
> system.time(combine(dat, 100)) # takes 5-10 seconds
>
> I'm very surprised by this result.  Does this improvement seem 
> reasonable?  I would think "do.call" could utilize something
similar
> by default when the length of "args" is too high.  Is using
"do.call"
> not recommended in this scenario?
>
> Regards,
> Cole Beck
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel 
	[[alternative HTML version deleted]]

Maybe Matching Threads

Search for more maybe matching threads

R devel - Apr 2012 - combining large list of data.frames

[Rd] combining large list of data.frames

[Rd] combining large list of data.frames

Maybe Matching Threads