dear R experts:  apologies for all my speed and memory questions.  I have a bet with my coauthors that I can make R reasonably efficient through R-appropriate programming techniques.  this is not just for kicks, but for work.  for benchmarking, my [3 year old] Mac Pro has 2.8GHz Xeons, 16GB of RAM, and R 2.13.1.

right now, it seems that 'split()' is why I am losing my bet.  (split is an integral component of *apply() and by(), so I need split() to be fast.  its resulting list can then be fed, e.g., to mclapply().)  I made up an example to illustrate my ills:

    library(data.table)
    N <- 1000
    T <- N*10
    d <- data.table(data.frame( key= rep(1:T, rep(N,T)), val=rnorm(N*T) ))
    setkey(d, "key"); gc()  ## force a garbage collection
    cat("N=", N, ".  Size of d=", object.size(d)/1024/1024, "MB\n")
    print(system.time( s <- split(d, d$key) ))

My ordered input data table (or data frame; doesn't make a difference) is 114MB in size.  it takes about a second to create.  split() only needs to reshape it.  this simple operation takes almost 5 minutes on my computer.

with a data set that is larger, this explodes further.

am I doing something wrong?  is there an alternative to split()?

sincerely,

/iaw

----
Ivo Welch (ivo.welch at gmail.com)
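to make the intended downstream use concrete, the pipeline would be along these lines.  this is only an illustrative sketch, not part of the benchmark: it assumes the multicore package (where mclapply() lives under R 2.13.x), a made-up core count, and a stand-in per-group summary (mean of val).

    ## sketch of the intended pipeline: split, then process groups in parallel.
    ## assumes the multicore package; mc.cores and the summary are placeholders.
    library(multicore)
    s <- split(d, d$key)                          # the slow step in question
    res <- mclapply(s, function(g) mean(g$val), mc.cores = 4)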
instead of splitting the entire dataframe, split the indices and then use these to access your data:

    system.time(s <- split(seq(nrow(d)), d$key))

this should be faster and less memory intensive.  you can then use the indices to access the subset:

    result <- lapply(s, function(.indx) {
        doSomething <- sum(d$someCol[.indx])
    })

Sent from my iPad
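applied to the example data above, the same idea end to end would look something like this sketch (sum(val) is just a stand-in for whatever per-group computation you actually need):

    ## split row indices instead of the data.table itself,
    ## then index into d once per group; sum(val) is a placeholder summary
    s <- split(seq_len(nrow(d)), d$key)
    res <- lapply(s, function(.indx) sum(d$val[.indx]))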
I tried this:

    library(data.table)
    N <- 1000
    T <- N*10
    d <- data.table(gp = rep(1:T, rep(N,T)), val = rnorm(N*T), key = 'gp')

    > dim(d)
    [1] 10000000        2

    # On my humble 8Gb system,
    > system.time(l <- d[, split(val, gp)])
       user  system elapsed
       4.15    0.09    4.27

I wouldn't be surprised if there were a much faster way to do this operation in data.table, since split() is a data frame operation.

This is about as fast as Jim Holtman's suggestion:

    > system.time(s <- split(seq_len(nrow(d)), d$gp))
       user  system elapsed
       4.15    0.09    4.29

HTH,
Dennis
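Following up on that hunch: when the end goal is a per-group computation rather than the list of pieces itself, data.table can do the grouping internally, so no split() call is needed at all.  A minimal sketch, with sum(val) standing in for the real per-group calculation:

    ## grouped computation done natively by data.table; no split() involved.
    ## sum(val) is only a placeholder summary.
    system.time(res <- d[, sum(val), by = gp])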