dear R experts:  apologies for all my speed and memory questions.  I have a bet with my coauthors that I can make R reasonably efficient through R-appropriate programming techniques.  this is not just for kicks, but for work.  for benchmarking, my [3 year old] Mac Pro has 2.8GHz Xeons, 16GB of RAM, and R 2.13.1.

right now, it seems that 'split()' is why I am losing my bet.  (split is an integral component of *apply() and by(), so I need split() to be fast.  its resulting list can then be fed, e.g., to mclapply().)  I made up an example to illustrate my ills:

    library(data.table)
    N <- 1000
    T <- N*10
    d <- data.table(data.frame( key= rep(1:T, rep(N,T)), val=rnorm(N*T) ))
    setkey(d, "key"); gc()  ## force a garbage collection
    cat("N=", N, ".  Size of d=", object.size(d)/1024/1024, "MB\n")
    print(system.time( s <- split(d, d$key) ))

My ordered input data table (or data frame; doesn't make a difference) is 114MB in size.  it takes about a second to create.  split() only needs to reshape it.  this simple operation takes almost 5 minutes on my computer.

with a data set that is larger, this explodes further.

am I doing something wrong?  is there an alternative to split()?

sincerely,

/iaw

----
Ivo Welch (ivo.welch at gmail.com)
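to make the intended downstream use concrete, the pipeline would be along these lines.  this is only an illustrative sketch, not part of the benchmark: it assumes the multicore package (where mclapply() lives under R 2.13.x), a made-up core count, and a stand-in per-group summary (mean of val).

    ## sketch of the intended pipeline: split, then process groups in parallel.
    ## assumes the multicore package; mc.cores and the summary are placeholders.
    library(multicore)
    s <- split(d, d$key)                          # the slow step in question
    res <- mclapply(s, function(g) mean(g$val), mc.cores = 4)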
instead of splitting the entire dataframe, split the indices and then use these to access your data:

    system.time(s <- split(seq(nrow(d)), d$key))

this should be faster and less memory intensive.  you can then use the indices to access the subset:

    result <- lapply(s, function(.indx) {
        doSomething <- sum(d$someCol[.indx])
    })

Sent from my iPad
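applied to the example data above, the same idea end to end would look something like this sketch (sum(val) is just a stand-in for whatever per-group computation you actually need):

    ## split row indices instead of the data.table itself,
    ## then index into d once per group; sum(val) is a placeholder summary
    s <- split(seq_len(nrow(d)), d$key)
    res <- lapply(s, function(.indx) sum(d$val[.indx]))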
I tried this:

    library(data.table)
    N <- 1000
    T <- N*10
    d <- data.table(gp = rep(1:T, rep(N,T)), val = rnorm(N*T), key = 'gp')

    > dim(d)
    [1] 10000000        2

    # On my humble 8Gb system,
    > system.time(l <- d[, split(val, gp)])
       user  system elapsed
       4.15    0.09    4.27

I wouldn't be surprised if there were a much faster way to do this operation in data.table, since split() is a data frame operation.

This is about as fast as Jim Holtman's suggestion:

    > system.time(s <- split(seq_len(nrow(d)), d$gp))
       user  system elapsed
       4.15    0.09    4.29

HTH,
Dennis
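Following up on that hunch: when the end goal is a per-group computation rather than the list of pieces itself, data.table can do the grouping internally, so no split() call is needed at all.  A minimal sketch, with sum(val) standing in for the real per-group calculation:

    ## grouped computation done natively by data.table; no split() involved.
    ## sum(val) is only a placeholder summary.
    system.time(res <- d[, sum(val), by = gp])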