thr3ads.net - R help - [R] Splitting data.frame into a list of small data.frames given indices [Jun 2016]


Witold E Wolski

2016-Jun-29 09:16 UTC

[R] Splitting data.frame into a list of small data.frames given indices

It's the inverse problem to merging a list of data.frames into a large
data.frame, just discussed in the "performance of do.call("rbind")"
thread.

I would like to split a data.frame into a list of data.frames
according to the first column.
This SEEMS to be easily possible with the function base::by. However,
as soon as the data.frame has a few million rows, this function CANNOT
BE USED (unless you have PLENTY OF TIME).

For 'by', runtime ~ nrow^2, or formally O(n^2) (see benchmark below).

So basically I am looking for a similar function with better complexity.


> nrows <- c(1e5,1e6,2e6,3e6,5e6)
> timing <- list()
> for(i in nrows){
+ dum <- peaks[1:i,]
+ timing[[length(timing)+1]] <- system.time(x<- by(dum[,2:3],
+ INDICES=list(dum[,1]), FUN=function(x){x}, simplify = FALSE))
+ }
> names(timing)<- nrows
> timing
$`1e+05`
   user  system elapsed
   0.05    0.00    0.05

$`1e+06`
   user  system elapsed
   1.48    2.98    4.46

$`2e+06`
   user  system elapsed
   7.25   11.39   18.65

$`3e+06`
   user  system elapsed
  16.15   25.81   41.99

$`5e+06`
   user  system elapsed
  43.22   74.72  118.09





-- 
Witold Eryk Wolski
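
The "runtime ~ nrow^2" reading of the benchmark above can be checked with a little arithmetic. The following is only a rough sketch: it copies the elapsed times from the post and fits a straight line on a log-log scale, where the slope estimates the exponent. Leaving out the 1e5 run is a choice made here for illustration, not something the original post does.

```r
# Sizes and elapsed times copied from the benchmark in the post above.
# The 1e5 run (0.05 s) is left out because it has too few significant digits.
n       <- c(1e6, 2e6, 3e6, 5e6)
elapsed <- c(4.46, 18.65, 41.99, 118.09)

# If elapsed ~ c * n^k, then log(elapsed) is linear in log(n) with slope k.
fit <- lm(log(elapsed) ~ log(n))
k <- unname(coef(fit)[2])
k  # estimated exponent; a value near 2 is consistent with quadratic growth
```

Four data points from a single machine cannot establish a complexity class, so the fitted exponent is best read as a description of this one benchmark.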

Rolf Turner

2016-Jun-29 10:00 UTC


[R] [FORGED] Splitting data.frame into a list of small data.frames given indices

On 29/06/16 21:16, Witold E Wolski wrote:
> It's the inverse problem to merging a list of data.frames into a large
> data.frame just discussed in the "performance of do.call("rbind")"
> thread
>
> I would like to split a data.frame into a list of data.frames
> according to first column.
> This SEEMS to be easily possible with the function base::by. However,
> as soon as the data.frame has a few million rows this function CAN NOT
> BE USED (except you have A PLENTY OF TIME).
>
> for 'by' runtime ~ nrow^2, or formally O(n^2)  (see benchmark below).
>
> So basically I am looking for a similar function with better complexity.
>
>
> > nrows <- c(1e5,1e6,2e6,3e6,5e6)
> > timing <- list()
> > for(i in nrows){
> + dum <- peaks[1:i,]
> + timing[[length(timing)+1]] <- system.time(x<- by(dum[,2:3],
> + INDICES=list(dum[,1]), FUN=function(x){x}, simplify = FALSE))
> + }
> > names(timing)<- nrows
> > timing
> $`1e+05`
>    user  system elapsed
>    0.05    0.00    0.05
>
> $`1e+06`
>    user  system elapsed
>    1.48    2.98    4.46
>
> $`2e+06`
>    user  system elapsed
>    7.25   11.39   18.65
>
> $`3e+06`
>    user  system elapsed
>   16.15   25.81   41.99
>
> $`5e+06`
>    user  system elapsed
>   43.22   74.72  118.09

I'm not sure that I follow what you're doing, and your example is not 
reproducible, since we have no idea what "peaks" is, but on a toy 
example with 5e6 rows in the data frame I got a timing result of

    user  system elapsed
   0.379   0.025   0.406

when I applied split().  Is this adequately fast? Seems to me that if 
you want to split something, split() would be a good place to start.

cheers,

Rolf Turner

-- 
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
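
For readers who want to try the split() suggestion above, here is a minimal sketch. The original "peaks" data is not available, so the data frame toy and its column names (id, mz, intensity) are hypothetical stand-ins; the only assumption carried over from the thread is that column 1 holds the grouping index and columns 2:3 hold the values.

```r
# Toy stand-in for the unavailable `peaks` data (all names are hypothetical).
set.seed(42)
n <- 1e5
toy <- data.frame(id        = sample(1000, n, replace = TRUE),
                  mz        = runif(n),
                  intensity = runif(n))

# Split columns 2:3 into a list of data.frames, one per value of column 1.
parts <- split(toy[, 2:3], toy[, 1])

length(parts)     # number of distinct values in the first column
head(parts[[1]])  # rows belonging to the first group
```

The list elements are named after the group values, so a single group can also be looked up by name, e.g. parts[["17"]].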

Witold E Wolski

2016-Jun-29 13:21 UTC


[R] [FORGED] Splitting data.frame into a list of small data.frames given indices

Hi,

Here is a complete example which shows that the complexity of split or
by is O(n^2):

nrows <- c(1e3,5e3, 1e4 ,5e4, 1e5 ,2e5)
res<-list()

for(i in nrows){
  dum <- data.frame(x = runif(i,1,1000), y=runif(i,1,1000))
  res[[length(res)+1]]<-(system.time(x<- split(dum, 1:nrow(dum))))
}
res <- do.call("rbind",res)
plot(nrows^2, res[,"elapsed"])

And I can't see a reason why this has to be so slow.


cheers









-- 
Witold Eryk Wolski
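
One detail of the benchmark in this last message may be worth isolating: split(dum, 1:nrow(dum)) uses one group per row, so the result is a list of n one-row data.frames, whereas the original question groups by the first column. The sketch below is an illustration under stated assumptions, not a diagnosis of the slowdown. It times the same split() call on the same data with one group per row and with a hypothetical grouping vector g of 100 distinct values; n is kept small so that it runs quickly.

```r
set.seed(1)
n <- 2e4  # kept small so the sketch runs quickly
dum <- data.frame(x = runif(n, 1, 1000), y = runif(n, 1, 1000))

# Case 1: one group per row, as in the benchmark above (n groups).
t_rows <- system.time(by_row <- split(dum, seq_len(nrow(dum))))

# Case 2: a hypothetical grouping vector with only 100 distinct values.
g <- sample(100, n, replace = TRUE)
t_groups <- system.time(by_group <- split(dum, g))

length(by_row)    # n list elements, each a one-row data.frame
length(by_group)  # one list element per distinct value of g

# Elapsed seconds for both cases; values vary by machine and R version.
rbind(one_group_per_row = t_rows, hundred_groups = t_groups)[, "elapsed"]
```

Repeating this for several values of n, as in the loop above, would show how each case scales separately.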
