Witold E Wolski
2016-Jun-29 09:16 UTC
[R] Splitting data.frame into a list of small data.frames given indices
This is the inverse of the problem of merging a list of data.frames into one large data.frame, just discussed in the "performance of do.call("rbind")" thread.

I would like to split a data.frame into a list of data.frames according to its first column. This seems easy to do with the function base::by. However, once the data.frame has a few million rows, the function is unusable unless you have plenty of time.

For 'by', runtime grows roughly as nrow^2, or formally O(n^2) (see the benchmark below).

So basically I am looking for a similar function with better complexity.

> nrows <- c(1e5,1e6,2e6,3e6,5e6)
> timing <- list()
> for(i in nrows){
+   dum <- peaks[1:i,]
+   timing[[length(timing)+1]] <- system.time(x <- by(dum[,2:3], INDICES=list(dum[,1]), FUN=function(x){x}, simplify = FALSE))
+ }
> names(timing) <- nrows
> timing
$`1e+05`
   user  system elapsed
   0.05    0.00    0.05

$`1e+06`
   user  system elapsed
   1.48    2.98    4.46

$`2e+06`
   user  system elapsed
   7.25   11.39   18.65

$`3e+06`
   user  system elapsed
  16.15   25.81   41.99

$`5e+06`
   user  system elapsed
  43.22   74.72  118.09

--
Witold Eryk Wolski
Rolf Turner
2016-Jun-29 10:00 UTC
[R] [FORGED] Splitting data.frame into a list of small data.frames given indices
I'm not sure that I follow what you're doing, and your example is not reproducible, since we have no idea what "peaks" is, but on a toy example with 5e6 rows in the data frame I got a timing result of

   user  system elapsed
  0.379   0.025   0.406

when I applied split(). Is this adequately fast? Seems to me that if you want to split something, split() would be a good place to start.

cheers,

Rolf Turner

--
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
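For readers who want to try the split() suggestion, here is a minimal sketch. It is not the code Rolf ran, which was not posted. The data frame `toy` and its column names (`id`, `mz`, `intensity`) are made up to stand in for the unknown `peaks` object, and the row count is kept small so it runs quickly.

```r
# Toy stand-in for `peaks` (its real structure is not given in the thread):
# an id column to split on, followed by two numeric columns.
set.seed(1)
n <- 1e5
toy <- data.frame(id        = sample.int(1000, n, replace = TRUE),
                  mz        = runif(n, 1, 1000),
                  intensity = runif(n, 1, 1000))

# Split columns 2:3 into a list of data frames, one per distinct id.
pieces <- split(toy[, 2:3], toy$id)

length(pieces)                  # number of distinct ids
sum(vapply(pieces, nrow, 1L))   # total rows across pieces, equals nrow(toy)
```

The result is a plain named list, so `pieces[["42"]]` would return the rows whose id is 42. Wrapping the split() call in system.time() gives a timing comparable in form to the one quoted above.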
Witold E Wolski
2016-Jun-29 13:21 UTC
[R] [FORGED] Splitting data.frame into a list of small data.frames given indices
Hi,

Here is a complete example which shows that the complexity of split or by is O(n^2):

nrows <- c(1e3, 5e3, 1e4, 5e4, 1e5, 2e5)
res <- list()
for(i in nrows){
  dum <- data.frame(x = runif(i,1,1000), y = runif(i,1,1000))
  # split by row number: every row becomes its own one-row data.frame
  res[[length(res)+1]] <- system.time(x <- split(dum, 1:nrow(dum)))
}
res <- do.call("rbind", res)
plot(nrows^2, res[,"elapsed"])

And I can't see a reason why this has to be so slow.

cheers

--
Witold Eryk Wolski
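The benchmark in this message splits on 1:nrow(dum), so every row becomes its own group and the number of groups grows together with the number of rows. The sketch below is not from the thread. It is one way to check how much of the slowdown comes from the group count rather than the row count. The helper `time_split` and the choice of 100 groups are assumptions made for illustration, and the sizes are kept small so it finishes quickly.

```r
# Hypothetical helper: elapsed time of split() on n rows in n_groups groups.
time_split <- function(n, n_groups) {
  dum <- data.frame(x = runif(n, 1, 1000), y = runif(n, 1, 1000))
  g <- rep_len(seq_len(n_groups), n)   # labels 1..n_groups, recycled to n rows
  system.time(split(dum, g))[["elapsed"]]
}

nrows <- c(1e3, 5e3, 1e4)

# Case 1: the number of groups stays fixed while the rows grow.
fixed_groups <- vapply(nrows, function(n) time_split(n, 100), numeric(1))

# Case 2: one group per row, as in the benchmark above.
one_per_row <- vapply(nrows, function(n) time_split(n, n), numeric(1))

data.frame(nrows, fixed_groups, one_per_row)
```

Comparing the two timing columns would show whether the fixed-group case scales more gently than the one-group-per-row case. Timings depend on the machine and the R version, so no expected numbers are given here.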