thr3ads.net - R help - [R] [FORGED] Splitting data.frame into a list of small data.frames given indices [Jun 2016]

If this information is useful, please help other people find it:
Share via:

Witold E Wolski

2016-Jun-29 13:21 UTC

[R] [FORGED] Splitting data.frame into a list of small data.frames given indices

Hi,

Here is an complete example which shows the the complexity of split or
by is O(n^2)

nrows <- c(1e3,5e3, 1e4 ,5e4, 1e5 ,2e5)
res<-list()

for(i in nrows){
  dum <- data.frame(x = runif(i,1,1000), y=runif(i,1,1000))
  res[[length(res)+1]]<-(system.time(x<- split(dum, 1:nrow(dum))))
}
res <- do.call("rbind",res)
plot(nrows^2, res[,"elapsed"])

And I can't see a reason why this has to be so slow.


cheers







On 29 June 2016 at 12:00, Rolf Turner <r.turner at auckland.ac.nz>
wrote:> On 29/06/16 21:16, Witold E Wolski wrote:
>>
>> It's the inverse problem to merging a list of data.frames into a
large
>> data.frame just discussed in the "performance of
do.call("rbind")"
>> thread
>>
>> I would like to split a data.frame into a list of data.frames
>> according to first column.
>> This SEEMS to be easily possible with the function base::by. However,
>> as soon as the data.frame has a few million rows this function CAN NOT
>> BE USED (except you have A PLENTY OF TIME).
>>
>> for 'by' runtime ~ nrow^2, or formally O(n^2)  (see benchmark
below).
>>
>> So basically I am looking for a similar function with better
complexity.
>>
>>
>>  > nrows <- c(1e5,1e6,2e6,3e6,5e6)
>>>
>>> timing <- list()
>>> for(i in nrows){
>>
>> + dum <- peaks[1:i,]
>> + timing[[length(timing)+1]] <- system.time(x<- by(dum[,2:3],
>> INDICES=list(dum[,1]), FUN=function(x){x}, simplify = FALSE))
>> + }
>>>
>>> names(timing)<- nrows
>>> timing
>>
>> $`1e+05`
>>    user  system elapsed
>>    0.05    0.00    0.05
>>
>> $`1e+06`
>>    user  system elapsed
>>    1.48    2.98    4.46
>>
>> $`2e+06`
>>    user  system elapsed
>>    7.25   11.39   18.65
>>
>> $`3e+06`
>>    user  system elapsed
>>   16.15   25.81   41.99
>>
>> $`5e+06`
>>    user  system elapsed
>>   43.22   74.72  118.09
>
>
> I'm not sure that I follow what you're doing, and your example is
not
> reproducible, since we have no idea what "peaks" is, but on a toy
example
> with 5e6 rows in the data frame I got a timing result of
>
>    user  system elapsed
>   0.379 0.025 0.406
>
> when I applied split().  Is this adequately fast? Seems to me that if you
> want to split something, split() would be a good place to start.
>
> cheers,
>
> Rolf Turner
>
> --
> Technical Editor ANZJS
> Department of Statistics
> University of Auckland
> Phone: +64-9-373-7599 ext. 88276


-- 
Witold Eryk Wolski

Ivan Calandra

2016-Jun-29 13:54 UTC

head link

[R] [FORGED] Splitting data.frame into a list of small data.frames given indices

Hi,

I don't really understand why you split every row... This makes it very 
slow. Try with a more realistic example (with a factor to split).

Ivan

--
Ivan Calandra, PhD
Scientific Mediator
University of Reims Champagne-Ardenne
GEGENAA - EA 3795
CREA - 2 esplanade Roland Garros
51100 Reims, France
+33(0)3 26 77 36 89
ivan.calandra at univ-reims.fr
--
https://www.researchgate.net/profile/Ivan_Calandra
https://publons.com/author/705639/

Le 29/06/2016 ? 15:21, Witold E Wolski a ?crit :> Hi,
>
> Here is an complete example which shows the the complexity of split or
> by is O(n^2)
>
> nrows <- c(1e3,5e3, 1e4 ,5e4, 1e5 ,2e5)
> res<-list()
>
> for(i in nrows){
>    dum <- data.frame(x = runif(i,1,1000), y=runif(i,1,1000))
>    res[[length(res)+1]]<-(system.time(x<- split(dum, 1:nrow(dum))))
> }
> res <- do.call("rbind",res)
> plot(nrows^2, res[,"elapsed"])
>
> And I can't see a reason why this has to be so slow.
>
>
> cheers
>
>
>
>
>
>
>
> On 29 June 2016 at 12:00, Rolf Turner <r.turner at auckland.ac.nz>
wrote:
>> On 29/06/16 21:16, Witold E Wolski wrote:
>>> It's the inverse problem to merging a list of data.frames into
a large
>>> data.frame just discussed in the "performance of
do.call("rbind")"
>>> thread
>>>
>>> I would like to split a data.frame into a list of data.frames
>>> according to first column.
>>> This SEEMS to be easily possible with the function base::by.
However,
>>> as soon as the data.frame has a few million rows this function CAN
NOT
>>> BE USED (except you have A PLENTY OF TIME).
>>>
>>> for 'by' runtime ~ nrow^2, or formally O(n^2)  (see
benchmark below).
>>>
>>> So basically I am looking for a similar function with better
complexity.
>>>
>>>
>>>   > nrows <- c(1e5,1e6,2e6,3e6,5e6)
>>>> timing <- list()
>>>> for(i in nrows){
>>> + dum <- peaks[1:i,]
>>> + timing[[length(timing)+1]] <- system.time(x<- by(dum[,2:3],
>>> INDICES=list(dum[,1]), FUN=function(x){x}, simplify = FALSE))
>>> + }
>>>> names(timing)<- nrows
>>>> timing
>>> $`1e+05`
>>>     user  system elapsed
>>>     0.05    0.00    0.05
>>>
>>> $`1e+06`
>>>     user  system elapsed
>>>     1.48    2.98    4.46
>>>
>>> $`2e+06`
>>>     user  system elapsed
>>>     7.25   11.39   18.65
>>>
>>> $`3e+06`
>>>     user  system elapsed
>>>    16.15   25.81   41.99
>>>
>>> $`5e+06`
>>>     user  system elapsed
>>>    43.22   74.72  118.09
>>
>> I'm not sure that I follow what you're doing, and your example
is not
>> reproducible, since we have no idea what "peaks" is, but on a
toy example
>> with 5e6 rows in the data frame I got a timing result of
>>
>>     user  system elapsed
>>    0.379 0.025 0.406
>>
>> when I applied split().  Is this adequately fast? Seems to me that if
you
>> want to split something, split() would be a good place to start.
>>
>> cheers,
>>
>> Rolf Turner
>>
>> --
>> Technical Editor ANZJS
>> Department of Statistics
>> University of Auckland
>> Phone: +64-9-373-7599 ext. 88276
>
>

William Dunlap

2016-Jun-29 14:13 UTC

head link

[R] [FORGED] Splitting data.frame into a list of small data.frames given indices

I won't go into why splitting data.frames (or factors) uses time
proportional to the number of input rows times the number of
levels in the splitting factor, but you will get much better mileage
if you call split individually on each 'atomic' (numeric, character,
...)
variable and use mapply on the resulting lists.

The plyr and dplyr packages were developed to deal with this
sort of problem.  Check them out.


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Wed, Jun 29, 2016 at 6:21 AM, Witold E Wolski <wewolski at gmail.com>
wrote:
> Hi,
>
> Here is an complete example which shows the the complexity of split or
> by is O(n^2)
>
> nrows <- c(1e3,5e3, 1e4 ,5e4, 1e5 ,2e5)
> res<-list()
>
> for(i in nrows){
>   dum <- data.frame(x = runif(i,1,1000), y=runif(i,1,1000))
>   res[[length(res)+1]]<-(system.time(x<- split(dum, 1:nrow(dum))))
> }
> res <- do.call("rbind",res)
> plot(nrows^2, res[,"elapsed"])
>
> And I can't see a reason why this has to be so slow.
>
>
> cheers
>
>
>
>
>
>
>
> On 29 June 2016 at 12:00, Rolf Turner <r.turner at auckland.ac.nz>
wrote:
> > On 29/06/16 21:16, Witold E Wolski wrote:
> >>
> >> It's the inverse problem to merging a list of data.frames into
a large
> >> data.frame just discussed in the "performance of
do.call("rbind")"
> >> thread
> >>
> >> I would like to split a data.frame into a list of data.frames
> >> according to first column.
> >> This SEEMS to be easily possible with the function base::by.
However,
> >> as soon as the data.frame has a few million rows this function CAN
NOT
> >> BE USED (except you have A PLENTY OF TIME).
> >>
> >> for 'by' runtime ~ nrow^2, or formally O(n^2)  (see
benchmark below).
> >>
> >> So basically I am looking for a similar function with better
complexity.
> >>
> >>
> >>  > nrows <- c(1e5,1e6,2e6,3e6,5e6)
> >>>
> >>> timing <- list()
> >>> for(i in nrows){
> >>
> >> + dum <- peaks[1:i,]
> >> + timing[[length(timing)+1]] <- system.time(x<-
by(dum[,2:3],
> >> INDICES=list(dum[,1]), FUN=function(x){x}, simplify = FALSE))
> >> + }
> >>>
> >>> names(timing)<- nrows
> >>> timing
> >>
> >> $`1e+05`
> >>    user  system elapsed
> >>    0.05    0.00    0.05
> >>
> >> $`1e+06`
> >>    user  system elapsed
> >>    1.48    2.98    4.46
> >>
> >> $`2e+06`
> >>    user  system elapsed
> >>    7.25   11.39   18.65
> >>
> >> $`3e+06`
> >>    user  system elapsed
> >>   16.15   25.81   41.99
> >>
> >> $`5e+06`
> >>    user  system elapsed
> >>   43.22   74.72  118.09
> >
> >
> > I'm not sure that I follow what you're doing, and your example
is not
> > reproducible, since we have no idea what "peaks" is, but on
a toy example
> > with 5e6 rows in the data frame I got a timing result of
> >
> >    user  system elapsed
> >   0.379 0.025 0.406
> >
> > when I applied split().  Is this adequately fast? Seems to me that if
you
> > want to split something, split() would be a good place to start.
> >
> > cheers,
> >
> > Rolf Turner
> >
> > --
> > Technical Editor ANZJS
> > Department of Statistics
> > University of Auckland
> > Phone: +64-9-373-7599 ext. 88276
>
>
>
> --
> Witold Eryk Wolski
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
	[[alternative HTML version deleted]]

Ivan Calandra

2016-Jun-29 14:22 UTC

head link

[R] [FORGED] Splitting data.frame into a list of small data.frames given indices

Your answer did not make it to the list...

Le 29/06/2016 ? 16:06, Witold E Wolski a ?crit :> If you do not understand than why do you reply?
>
>
> On 29 June 2016 at 15:54, Ivan Calandra <ivan.calandra at
univ-reims.fr> wrote:
>> Hi,
>>
>> I don't really understand why you split every row... This makes it
very
>> slow. Try with a more realistic example (with a factor to split).
>>
>> Ivan
>>
>> --
>> Ivan Calandra, PhD
>> Scientific Mediator
>> University of Reims Champagne-Ardenne
>> GEGENAA - EA 3795
>> CREA - 2 esplanade Roland Garros
>> 51100 Reims, France
>> +33(0)3 26 77 36 89
>> ivan.calandra at univ-reims.fr
>> --
>> https://www.researchgate.net/profile/Ivan_Calandra
>> https://publons.com/author/705639/
>>
>>
>> Le 29/06/2016 ? 15:21, Witold E Wolski a ?crit :
>>> Hi,
>>>
>>> Here is an complete example which shows the the complexity of split
or
>>> by is O(n^2)
>>>
>>> nrows <- c(1e3,5e3, 1e4 ,5e4, 1e5 ,2e5)
>>> res<-list()
>>>
>>> for(i in nrows){
>>>     dum <- data.frame(x = runif(i,1,1000), y=runif(i,1,1000))
>>>     res[[length(res)+1]]<-(system.time(x<- split(dum,
1:nrow(dum))))
>>> }
>>> res <- do.call("rbind",res)
>>> plot(nrows^2, res[,"elapsed"])
>>>
>>> And I can't see a reason why this has to be so slow.
>>>
>>>
>>> cheers
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 29 June 2016 at 12:00, Rolf Turner <r.turner at
auckland.ac.nz> wrote:
>>>> On 29/06/16 21:16, Witold E Wolski wrote:
>>>>> It's the inverse problem to merging a list of
data.frames into a large
>>>>> data.frame just discussed in the "performance of
do.call("rbind")"
>>>>> thread
>>>>>
>>>>> I would like to split a data.frame into a list of
data.frames
>>>>> according to first column.
>>>>> This SEEMS to be easily possible with the function
base::by. However,
>>>>> as soon as the data.frame has a few million rows this
function CAN NOT
>>>>> BE USED (except you have A PLENTY OF TIME).
>>>>>
>>>>> for 'by' runtime ~ nrow^2, or formally O(n^2)  (see
benchmark below).
>>>>>
>>>>> So basically I am looking for a similar function with
better complexity.
>>>>>
>>>>>
>>>>>    > nrows <- c(1e5,1e6,2e6,3e6,5e6)
>>>>>> timing <- list()
>>>>>> for(i in nrows){
>>>>> + dum <- peaks[1:i,]
>>>>> + timing[[length(timing)+1]] <- system.time(x<-
by(dum[,2:3],
>>>>> INDICES=list(dum[,1]), FUN=function(x){x}, simplify =
FALSE))
>>>>> + }
>>>>>> names(timing)<- nrows
>>>>>> timing
>>>>> $`1e+05`
>>>>>      user  system elapsed
>>>>>      0.05    0.00    0.05
>>>>>
>>>>> $`1e+06`
>>>>>      user  system elapsed
>>>>>      1.48    2.98    4.46
>>>>>
>>>>> $`2e+06`
>>>>>      user  system elapsed
>>>>>      7.25   11.39   18.65
>>>>>
>>>>> $`3e+06`
>>>>>      user  system elapsed
>>>>>     16.15   25.81   41.99
>>>>>
>>>>> $`5e+06`
>>>>>      user  system elapsed
>>>>>     43.22   74.72  118.09
>>>>
>>>> I'm not sure that I follow what you're doing, and your
example is not
>>>> reproducible, since we have no idea what "peaks" is,
but on a toy example
>>>> with 5e6 rows in the data frame I got a timing result of
>>>>
>>>>      user  system elapsed
>>>>     0.379 0.025 0.406
>>>>
>>>> when I applied split().  Is this adequately fast? Seems to me
that if you
>>>> want to split something, split() would be a good place to
start.
>>>>
>>>> cheers,
>>>>
>>>> Rolf Turner
>>>>
>>>> --
>>>> Technical Editor ANZJS
>>>> Department of Statistics
>>>> University of Auckland
>>>> Phone: +64-9-373-7599 ext. 88276
>>>
>>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>

Witold E Wolski

2016-Jul-01 14:40 UTC

head link

[R] [FORGED] Splitting data.frame into a list of small data.frames given indices

Hi William,

I tested plyrs dlply function, and it seems to have  have an O(N*
log(R)) complexity (tested for R=N) so I do not know if N is the
number of rows or nr of categories.

For the data.frame example with 2e5 rows and 2e5 categories it is
approx. 10 times faster than split. Still, it is 10 seconds on an
i7-5930K 3.5GHz Intel.

It would be nice if the documentation would contain runtime
complexity information and the documentation of base package function
would point to function which should be used instead.

Thanks




On 29 June 2016 at 16:13, William Dunlap <wdunlap at tibco.com>
wrote:> I won't go into why splitting data.frames (or factors) uses time
> proportional to the number of input rows times the number of
> levels in the splitting factor, but you will get much better mileage
> if you call split individually on each 'atomic' (numeric,
character, ...)
> variable and use mapply on the resulting lists.
>
> The plyr and dplyr packages were developed to deal with this
> sort of problem.  Check them out.
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Wed, Jun 29, 2016 at 6:21 AM, Witold E Wolski <wewolski at
gmail.com> wrote:
>>
>> Hi,
>>
>> Here is an complete example which shows the the complexity of split or
>> by is O(n^2)
>>
>> nrows <- c(1e3,5e3, 1e4 ,5e4, 1e5 ,2e5)
>> res<-list()
>>
>> for(i in nrows){
>>   dum <- data.frame(x = runif(i,1,1000), y=runif(i,1,1000))
>>   res[[length(res)+1]]<-(system.time(x<- split(dum,
1:nrow(dum))))
>> }
>> res <- do.call("rbind",res)
>> plot(nrows^2, res[,"elapsed"])
>>
>> And I can't see a reason why this has to be so slow.
>>
>>
>> cheers
>>
>>
>>
>>
>>
>>
>>
>> On 29 June 2016 at 12:00, Rolf Turner <r.turner at
auckland.ac.nz> wrote:
>> > On 29/06/16 21:16, Witold E Wolski wrote:
>> >>
>> >> It's the inverse problem to merging a list of data.frames
into a large
>> >> data.frame just discussed in the "performance of
do.call("rbind")"
>> >> thread
>> >>
>> >> I would like to split a data.frame into a list of data.frames
>> >> according to first column.
>> >> This SEEMS to be easily possible with the function base::by.
However,
>> >> as soon as the data.frame has a few million rows this function
CAN NOT
>> >> BE USED (except you have A PLENTY OF TIME).
>> >>
>> >> for 'by' runtime ~ nrow^2, or formally O(n^2)  (see
benchmark below).
>> >>
>> >> So basically I am looking for a similar function with better
>> >> complexity.
>> >>
>> >>
>> >>  > nrows <- c(1e5,1e6,2e6,3e6,5e6)
>> >>>
>> >>> timing <- list()
>> >>> for(i in nrows){
>> >>
>> >> + dum <- peaks[1:i,]
>> >> + timing[[length(timing)+1]] <- system.time(x<-
by(dum[,2:3],
>> >> INDICES=list(dum[,1]), FUN=function(x){x}, simplify = FALSE))
>> >> + }
>> >>>
>> >>> names(timing)<- nrows
>> >>> timing
>> >>
>> >> $`1e+05`
>> >>    user  system elapsed
>> >>    0.05    0.00    0.05
>> >>
>> >> $`1e+06`
>> >>    user  system elapsed
>> >>    1.48    2.98    4.46
>> >>
>> >> $`2e+06`
>> >>    user  system elapsed
>> >>    7.25   11.39   18.65
>> >>
>> >> $`3e+06`
>> >>    user  system elapsed
>> >>   16.15   25.81   41.99
>> >>
>> >> $`5e+06`
>> >>    user  system elapsed
>> >>   43.22   74.72  118.09
>> >
>> >
>> > I'm not sure that I follow what you're doing, and your
example is not
>> > reproducible, since we have no idea what "peaks" is, but
on a toy
>> > example
>> > with 5e6 rows in the data frame I got a timing result of
>> >
>> >    user  system elapsed
>> >   0.379 0.025 0.406
>> >
>> > when I applied split().  Is this adequately fast? Seems to me that
if
>> > you
>> > want to split something, split() would be a good place to start.
>> >
>> > cheers,
>> >
>> > Rolf Turner
>> >
>> > --
>> > Technical Editor ANZJS
>> > Department of Statistics
>> > University of Auckland
>> > Phone: +64-9-373-7599 ext. 88276
>>
>>
>>
>> --
>> Witold Eryk Wolski
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>


-- 
Witold Eryk Wolski

R help - Jun 2016 - [FORGED] Splitting data.frame into a list of small data.frames given indices

[R] [FORGED] Splitting data.frame into a list of small data.frames given indices

[R] [FORGED] Splitting data.frame into a list of small data.frames given indices

[R] [FORGED] Splitting data.frame into a list of small data.frames given indices

[R] [FORGED] Splitting data.frame into a list of small data.frames given indices

[R] [FORGED] Splitting data.frame into a list of small data.frames given indices