pengyu.ut at gmail.com
2009-Dec-09 22:10 UTC
[Rd] split() is slow on data.frame (PR#14123)
Please see the following code for a runtime comparison between
split() and mysplit.data.frame() (they do the same thing
semantically). mysplit.data.frame() is a fix for split() in terms of
performance. Could somebody include this fix (with possible checks
for corner cases) in a future version of R and let me know when it
has been included?

m=300000
n=6
k=30000

set.seed(0)
x=replicate(n,rnorm(m))
f=sample(1:k, size=m, replace=T)

mysplit.data.frame<-function(x,f) {
  print('processing data.frame')
  v=lapply(
    1:dim(x)[[2]]
    , function(i) {
      split(x[,i],f)
    }
  )

  w=lapply(
    seq(along=v[[1]])
    , function(i) {
      result=do.call(
        cbind
        , lapply(v,
          function(vj) {
            vj[[i]]
          }
        )
      )
      colnames(result)=colnames(x)
      return(result)
    }
  )
  names(w)=names(v[[1]])
  return(w)
}

system.time(split(as.data.frame(x),f))
system.time(mysplit.data.frame(as.data.frame(x),f))
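
The claim that the two functions "do the same thing semantically" can be checked on a small input before looking at timings. The following is only a sketch: the data sizes are illustrative rather than taken from the post, and colwise_split is a hypothetical, condensed restatement of the column-wise approach, not a proposed replacement. It compares the values group by group and prints the class of one element from each result, since split() on a data frame returns data frames while the cbind()-based approach returns matrices.

```r
# Small data so the check runs instantly (sizes are illustrative only)
set.seed(0)
x <- as.data.frame(replicate(3, rnorm(20)))
f <- sample(1:4, size = 20, replace = TRUE)

# Hypothetical condensed version of the column-wise approach:
# split each column by f, then cbind the pieces belonging to each group
colwise_split <- function(x, f) {
  v <- lapply(seq_len(ncol(x)), function(i) split(x[, i], f))
  w <- lapply(seq_along(v[[1]]), function(i) {
    result <- do.call(cbind, lapply(v, function(vj) vj[[i]]))
    colnames(result) <- colnames(x)
    result
  })
  names(w) <- names(v[[1]])
  w
}

a <- split(x, f)          # list of data frames
b <- colwise_split(x, f)  # list of matrices

# Compare values only, ignoring class and dimnames
same_values <- identical(names(a), names(b)) &&
  all(mapply(function(df, m) {
    isTRUE(all.equal(unname(as.matrix(df)), unname(m)))
  }, a, b))

print(same_values)
print(class(a[[1]]))
print(class(b[[1]]))
```

The values agree for an all-numeric data frame with default row names, but the element classes differ, so code that expects data frames from split() would need adjusting.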

Here are some differences between the current and proposed split.data.frame.

> d<-data.frame(Matrix=I(matrix(1:10, ncol=2)),Named=c(one=1,two=2,three=3,four=4,five=5), row.names=as.character(1001:1005))
> group<-c("A","B","A","A","B")
> split.data.frame(d,group)
$A
     Matrix.1 Matrix.2 Named
1001        1        6     1
1003        3        8     3
1004        4        9     4

$B
     Matrix.1 Matrix.2 Named
1002        2        7     2
1005        5       10     5

> mysplit.data.frame(d,group) # lost row.names and 2nd column of Matrix
[1] "processing data.frame"
$A
     Matrix Named
[1,]      1     1
[2,]      3     3
[3,]      4     4

$B
     Matrix Named
[1,]      2     2
[2,]      5     5

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-devel-bounces at r-project.org
> [mailto:r-devel-bounces at r-project.org] On Behalf Of
> pengyu.ut at gmail.com
> Sent: Wednesday, December 09, 2009 2:10 PM
> To: r-devel at stat.math.ethz.ch
> Cc: R-bugs at r-project.org
> Subject: [Rd] split() is slow on data.frame (PR#14123)
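
One possible way to keep the row names and the expanded matrix columns, while still avoiding per-group subsetting of a data frame, is to convert to a matrix once and then split the row indices. This is a sketch under assumptions, not the patch proposed in this thread: split_as_matrix is a hypothetical name, the result is a list of matrices rather than data frames, and it is only reasonable when all columns can share one type (as.matrix() would otherwise coerce everything to character). Whether it is actually faster on the data above has not been measured here.

```r
# Hypothetical helper: convert once, split row indices, subset the matrix
split_as_matrix <- function(x, f) {
  m <- as.matrix(x)                  # expands matrix columns, keeps row names
  idx <- split(seq_len(nrow(m)), f)  # row indices belonging to each group
  lapply(idx, function(i) m[i, , drop = FALSE])
}

# Example data from the reply above
d <- data.frame(Matrix = I(matrix(1:10, ncol = 2)),
                Named = c(one = 1, two = 2, three = 3, four = 4, five = 5),
                row.names = as.character(1001:1005))
group <- c("A", "B", "A", "A", "B")

res <- split_as_matrix(d, group)
print(res)
```

Using drop = FALSE keeps single-row groups as one-row matrices instead of collapsing them to vectors.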