pengyu.ut at gmail.com
2009-Dec-09 22:10 UTC
[Rd] split() is slow on data.frame (PR#14123)
Please see the following code for a runtime comparison between split() and mysplit.data.frame() (they do the same thing semantically). mysplit.data.frame() is a fix of split() in terms of performance. Could somebody include this fix (with possible checking for corner cases) in a future version of R and let me know when it is included?

m=300000   # number of rows
n=6        # number of columns
k=30000    # number of groups

set.seed(0)
x=replicate(n,rnorm(m))
f=sample(1:k, size=m, replace=TRUE)

mysplit.data.frame<-function(x,f) {
  print('processing data.frame')
  # split each column separately by f
  v=lapply(
      1:dim(x)[[2]]
      , function(i) {
        split(x[,i],f)
      }
      )

  # for each group, bind that group's pieces of every column back together
  w=lapply(
      seq(along=v[[1]])
      , function(i) {
        result=do.call(
            cbind
            , lapply(v,
                function(vj) {
                  vj[[i]]
                }
                )
            )
        colnames(result)=colnames(x)
        return(result)
      }
      )
  names(w)=names(v[[1]])
  return(w)
}

system.time(split(as.data.frame(x),f))
system.time(mysplit.data.frame(as.data.frame(x),f))
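For orientation, here is a minimal sketch of what splitting a data frame by rows amounts to: split the row indices by f, then subset the data frame once per group. This is only my understanding of the general approach, not a quotation of R's source; the name split_rows and the small sizes are made up for illustration. With k groups it performs k data-frame subsetting calls, which is presumably where most of the time goes when k is large.

```r
# Hypothetical helper (not part of R): split row indices, then subset per group.
split_rows <- function(x, f) {
  idx <- split(seq_len(nrow(x)), f)              # row numbers for each group
  lapply(idx, function(i) x[i, , drop = FALSE])  # one data-frame subset per group
}

# Scaled-down version of the data in the post (sizes chosen arbitrarily).
set.seed(0)
df <- as.data.frame(replicate(3, rnorm(300)))
f <- sample(1:30, size = 300, replace = TRUE)

pieces <- split_rows(df, f)
# Compare with the built-in result.
identical(pieces, split(df, f))
```

Each element of pieces is still a data frame, so row names and column classes are kept, unlike a result assembled with cbind().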
Here are some differences between the current and proposed split.data.frame.

> d<-data.frame(Matrix=I(matrix(1:10, ncol=2)),Named=c(one=1,two=2,three=3,four=4,five=5), row.names=as.character(1001:1005))
> group<-c("A","B","A","A","B")
> split.data.frame(d,group)
$A
     Matrix.1 Matrix.2 Named
1001        1        6     1
1003        3        8     3
1004        4        9     4

$B
     Matrix.1 Matrix.2 Named
1002        2        7     2
1005        5       10     5

> mysplit.data.frame(d,group) # lost row.names and 2nd column of Matrix
[1] "processing data.frame"
$A
     Matrix Named
[1,]      1     1
[2,]      3     3
[3,]      4     4

$B
     Matrix Named
[1,]      2     2
[2,]      5     5

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-devel-bounces at r-project.org
> [mailto:r-devel-bounces at r-project.org] On Behalf Of
> pengyu.ut at gmail.com
> Sent: Wednesday, December 09, 2009 2:10 PM
> To: r-devel at stat.math.ethz.ch
> Cc: R-bugs at r-project.org
> Subject: [Rd] split() is slow on data.frame (PR#14123)
>
> Please see the following code for the runtime comparison between
> split() and mysplit.data.frame() (they do the same thing
> semantically). mysplit.data.frame() is a fix of split() in term of
> performance. Could somebody include this fix (with possible checking
> for corner cases) in future version of R and let me know the inclusion
> of the fix?
>
> m=300000
> n=6
> k=30000
>
> set.seed(0)
> x=replicate(n,rnorm(m))
> f=sample(1:k, size=m, replace=T)
>
> mysplit.data.frame<-function(x,f) {
>   print('processing data.frame')
>   v=lapply(
>       1:dim(x)[[2]]
>       , function(i) {
>         split(x[,i],f)
>       }
>       )
>
>   w=lapply(
>       seq(along=v[[1]])
>       , function(i) {
>         result=do.call(
>             cbind
>             , lapply(v,
>                 function(vj) {
>                   vj[[i]]
>                 }
>                 )
>             )
>         colnames(result)=colnames(x)
>         return(result)
>       }
>       )
>   names(w)=names(v[[1]])
>   return(w)
> }
>
> system.time(split(as.data.frame(x),f))
> system.time(mysplit.data.frame(as.data.frame(x),f))
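A comparison like the one above can be automated before proposing any replacement. The following is a rough sketch, not an established test: same_as_split, by_index and drop_rownames are names invented here, and drop_rownames is a deliberately lossy stand-in for a candidate that discards row names. The data frame d and the grouping vector group are the ones from the reply.

```r
# Data from the reply: a matrix column, a named column, and explicit row names.
d <- data.frame(Matrix = I(matrix(1:10, ncol = 2)),
                Named = c(one = 1, two = 2, three = 3, four = 4, five = 5),
                row.names = as.character(1001:1005))
group <- c("A", "B", "A", "A", "B")

# Hypothetical helper: does a candidate reproduce split() exactly on (x, f)?
same_as_split <- function(candidate, x, f) {
  identical(candidate(x, f), split(x, f))
}

# Candidate 1 (hypothetical): subset rows by split indices, keeping data frames.
by_index <- function(x, f) {
  lapply(split(seq_len(nrow(x)), f), function(i) x[i, , drop = FALSE])
}

# Candidate 2 (hypothetical, deliberately lossy): discards the row names.
drop_rownames <- function(x, f) {
  lapply(split(x, f), function(p) { rownames(p) <- NULL; p })
}

ok_index <- same_as_split(by_index, d, group)
ok_lossy <- same_as_split(drop_rownames, d, group)
```

identical() is used rather than a looser comparison so that differences in row names, column classes, or attributes all count as failures, which are the corner cases the reply points out.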