thr3ads.net - R devel - [Rd] split() is slow on data.frame (PR#14123) [Dec 2009]

pengyu.ut at gmail.com

2009-Dec-09 22:10 UTC

[Rd] split() is slow on data.frame (PR#14123)

Please see the following code for a runtime comparison between
split() and mysplit.data.frame() (they are intended to do the same
thing semantically). mysplit.data.frame() is a fix for split() in terms
of performance. Could somebody include this fix (with possible checks
for corner cases) in a future version of R and let me know when it has
been included?

m=300000
n=6
k=30000

set.seed(0)
x=replicate(n,rnorm(m))
f=sample(1:k, size=m, replace=TRUE)

mysplit.data.frame<-function(x,f) {
  print('processing data.frame')
  # Split each column separately: v[[j]] holds column j's values by group.
  v=lapply(
      seq_len(ncol(x))
      , function(i) {
        split(x[,i],f)
      }
      )

  # For each group, bind the matching piece of every column into a matrix.
  w=lapply(
      seq_along(v[[1]])
      , function(i) {
        result=do.call(
            cbind
            , lapply(v,
                function(vj) {
                  vj[[i]]
                }
                )
            )
        colnames(result)=colnames(x)
        return(result)
      }
      )
  names(w)=names(v[[1]])
  return(w)
}

system.time(split(as.data.frame(x),f))
system.time(mysplit.data.frame(as.data.frame(x),f))
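
For readers who want to experiment at a smaller scale, here is a minimal sketch of a related idea: when every column is numeric, convert the data frame to a matrix once and subset rows by group. The helper name split_numeric_df is hypothetical (it is not part of R and not the fix proposed in this thread), and like the proposal above it returns matrices rather than data frames, so it is not a drop-in replacement for split().

```r
# Hypothetical helper (an assumption for illustration, not part of base R):
# split an all-numeric data frame by converting it to a matrix once and
# then subsetting rows group by group.
split_numeric_df <- function(x, f) {
  stopifnot(is.data.frame(x), all(vapply(x, is.numeric, logical(1))))
  m <- as.matrix(x)                  # one conversion; column names are kept
  idx <- split(seq_len(nrow(m)), f)  # row indices belonging to each group
  lapply(idx, function(i) m[i, , drop = FALSE])
}

# Smaller version of the data used in the post above.
set.seed(0)
x <- as.data.frame(replicate(3, rnorm(1000)))
f <- sample(1:50, size = 1000, replace = TRUE)

res_matrix <- split_numeric_df(x, f)  # list of matrices
res_df <- split(x, f)                 # list of data frames (base R)
```

The two results contain the same numbers per group, but the classes differ, which is worth keeping in mind before comparing timings with system.time().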

William Dunlap

2009-Dec-09 22:26 UTC

[Rd] split() is slow on data.frame (PR#14123)

Here are some differences between the current and proposed
split.data.frame.
> d<-data.frame(Matrix=I(matrix(1:10, ncol=2)),
+   Named=c(one=1,two=2,three=3,four=4,five=5),
+   row.names=as.character(1001:1005))
> group<-c("A","B","A","A","B")
> split.data.frame(d,group)
$A
     Matrix.1 Matrix.2 Named
1001        1        6     1
1003        3        8     3
1004        4        9     4

$B
     Matrix.1 Matrix.2 Named
1002        2        7     2
1005        5       10     5
> mysplit.data.frame(d,group) # lost row.names and 2nd column of Matrix
[1] "processing data.frame"
$A
     Matrix Named
[1,]      1     1
[2,]      3     3
[3,]      4     4

$B
     Matrix Named
[1,]      2     2
[2,]      5     5
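
The two properties highlighted in this example (custom row names and a matrix-valued column) can be checked programmatically. The sketch below is only an illustration: split_by_index is a hypothetical helper that splits the row indices once and then subsets rows, which is the general approach a data-frame-preserving split has to take; it makes no claim about speed.

```r
# Example data from this message: a matrix column and custom row names.
d <- data.frame(Matrix = I(matrix(1:10, ncol = 2)),
                Named = c(one = 1, two = 2, three = 3, four = 4, five = 5),
                row.names = as.character(1001:1005))
group <- c("A", "B", "A", "A", "B")

# Hypothetical helper (an assumption for illustration): split row indices
# by group, then take the matching rows of the data frame.
split_by_index <- function(x, f) {
  idx <- split(seq_len(nrow(x)), f)
  lapply(idx, function(i) x[i, , drop = FALSE])
}

pieces <- split_by_index(d, group)

# Properties any replacement for split() on data frames should preserve:
keeps_row_names <- identical(rownames(pieces$A), c("1001", "1003", "1004"))
keeps_matrix_col <- identical(dim(pieces$A$Matrix), c(3L, 2L))
```

Both checks can be run against any candidate implementation before considering it as a replacement.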


Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  
