Antonio Piccolboni
2012-Apr-30 23:28 UTC
[Rd] fast version of split.data.frame or conversion from data.frame to list of its rows
Hi,

I was wondering if there is anything more efficient than split to do the
kind of conversion in the subject. If I create a data frame as in

    system.time({fd = data.frame(x = 1:2000, y = rnorm(2000),
                                 id = paste("x", 1:2000, sep = ""))})
       user  system elapsed
      0.004   0.000   0.004

and then I try to split it

    > system.time(split(fd, 1:nrow(fd)))
       user  system elapsed
      0.333   0.031   0.415

You will be quick to notice the roughly two orders of magnitude difference
in time between creation and conversion. Granted, it's not written anywhere
that they should be similar, but the latter seems interpreter-slow to me
(split is implemented with an lapply in the data frame case). There is also
a memory issue when I hit about 20000 elements (about 3GB allocated by the
time I interrupted it). So before I resort to Rcpp, despite the electrifying
feeling of approaching the bare metal, and for the sake of getting things
done, I thought I would ask the experts. Thanks

Antonio
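A minimal sketch of one cheaper conversion (illustrative only; timings vary
by machine): building plain named lists per row, rather than 1-row data
frames, sidesteps most of the attribute handling that makes split slow on
this input.

    fd <- data.frame(x = 1:2000, y = rnorm(2000),
                     id = paste("x", 1:2000, sep = ""))
    ## Each row becomes a plain named list of length-1 column slices,
    ## not a 1-row data frame:
    rows <- lapply(seq_len(nrow(fd)), function(i) lapply(fd, "[[", i))
    str(rows[[1]])  # a list with components x, y, id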
Matthew Dowle
2012-May-01 09:26 UTC
[Rd] fast version of split.data.frame or conversion from data.frame to list of its rows
Antonio Piccolboni <antonio <at> piccolboni.info> writes:

> Hi,
> I was wondering if there is anything more efficient than split to do the
> kind of conversion in the subject. If I create a data frame as in
>
>     system.time({fd = data.frame(x = 1:2000, y = rnorm(2000),
>                                  id = paste("x", 1:2000, sep = ""))})
>        user  system elapsed
>       0.004   0.000   0.004
>
> and then I try to split it
>
>     > system.time(split(fd, 1:nrow(fd)))
>        user  system elapsed
>       0.333   0.031   0.415
>
> You will be quick to notice the roughly two orders of magnitude difference
> in time between creation and conversion. [...] So before I resort to Rcpp,
> despite the electrifying feeling of approaching the bare metal, and for
> the sake of getting things done, I thought I would ask the experts. Thanks
>
> Antonio

Perhaps r-help or Stack Overflow would have been more appropriate to try
first, before r-devel. If you did, please say so. Answering anyway.

Do you really want to split every single row? What's the bigger picture?
Perhaps you don't need to split at all.

On the off chance that the example was just for exposition, and applying
some (biased) guesswork: have you seen the data.table package? It doesn't
use the split-apply-combine paradigm because, as your (extreme) example
shows, that doesn't scale. When you use the 'by' argument of [.data.table,
it allocates memory once for the largest group, then reuses that same
memory for each group. That's one reason it's fast and memory efficient at
grouping (an order of magnitude faster than tapply). Independent timings:

    http://www.r-bloggers.com/comparison-of-ave-ddply-and-data-table/

If you really do want to split every single row, then

    DT[, <something>, by = 1:nrow(DT)]

will give perhaps two orders of magnitude of speedup, but that's an unfair
example because it isn't very realistic. Scaling applies both to the size
of the data.frame and to how much you want to split it up: your example is
extreme in the latter but not the former, and data.table scales in both.

It's nothing to do with the interpreter, by the way, just memory usage.

Matthew
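A minimal sketch of the grouping described above, assuming the data.table
package is installed (the grp column and the sum(y) expression are
illustrative stand-ins, since the original <something> is unspecified):

    library(data.table)
    DT <- data.table(x = 1:2000, y = rnorm(2000),
                     grp = rep(c("a", "b", "c", "d"), each = 500))
    ## Grouped computation without split-apply-combine; working memory
    ## is allocated once for the largest group and reused:
    DT[, list(total = sum(y)), by = "grp"]
    ## Per-row grouping, as in the (extreme) example from the question:
    DT[, list(total = sum(y)), by = 1:nrow(DT)]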
Prof Brian Ripley
2012-May-01 12:46 UTC
[Rd] fast version of split.data.frame or conversion from data.frame to list of its rows
On 01/05/2012 00:28, Antonio Piccolboni wrote:

> Hi,
> I was wondering if there is anything more efficient than split to do the
> kind of conversion in the subject. If I create a data frame as in
>
>     system.time({fd = data.frame(x = 1:2000, y = rnorm(2000),
>                                  id = paste("x", 1:2000, sep = ""))})
>        user  system elapsed
>       0.004   0.000   0.004
>
> and then I try to split it
>
>     > system.time(split(fd, 1:nrow(fd)))
>        user  system elapsed
>       0.333   0.031   0.415
>
> You will be quick to notice the roughly two orders of magnitude difference
> in time between creation and conversion.

Unsurprising when you create three orders of magnitude more data frames,
is it? That's a list of 2000 data frames. Try

    system.time(for(i in 1:2000)
        data.frame(x = i, y = rnorm(1), id = paste0("x", i)))

> Granted, it's not written anywhere that they should be similar, but the
> latter seems interpreter-slow to me (split is implemented with an lapply
> in the data frame case). There is also a memory issue when I hit about
> 20000 elements (about 3GB allocated by the time I interrupted it). So
> before I resort to Rcpp, despite the electrifying feeling of approaching
> the bare metal, and for the sake of getting things done, I thought I
> would ask the experts. Thanks

You need to re-think your data structures: 1-row data frames are not
sensible.

> Antonio

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
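A minimal sketch of the restructuring suggested above (illustrative; the
per-row function f is a placeholder): if the underlying goal is to apply a
function to each row, iterating over the columns in parallel avoids
materialising any 1-row data frames at all.

    fd <- data.frame(x = 1:2000, y = rnorm(2000),
                     id = paste("x", 1:2000, sep = ""))
    f <- function(x, y) x * y        # placeholder per-row computation
    res <- mapply(f, fd$x, fd$y)     # one result per row, no splitting
    ## If f is vectorised, no iteration is needed: res <- f(fd$x, fd$y)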