Roberto Perdisci
2009-Oct-22 00:17 UTC
[R] loop vs. apply(): strange behavior with data frame?
Hi everybody,

I noticed a strange behavior when using loops versus apply() on a data frame. The example below "explicitly" computes a distance matrix given a dataset. When the dataset is a matrix, everything works fine. But when the dataset is a data.frame, the dist.for function written using nested loops will take a lot longer than the dist.apply function.

######## USING FOR #######

dist.for <- function(data) {

  d <- matrix(0,nrow=nrow(data),ncol=nrow(data))
  n <- ncol(data)
  r <- nrow(data)

  for(i in 1:r) {
    for(j in 1:r) {
      d[i,j] <- sum(abs(data[i,]-data[j,]))/n
    }
  }

  return(as.dist(d))
}

######## USING APPLY #######

f <- function(data.row,data.rest) {

  r2 <- as.double(apply(data.rest,1,g,data.row))

}

g <- function(row2,row1) {
  return(sum(abs(row1-row2))/length(row1))
}

dist.apply <- function(data) {
  d <- apply(data,1,f,data)

  return(as.dist(d))
}

######## TESTING #######

library(mvtnorm)
data <- rmvnorm(100,mean=seq(1,10),sigma=diag(1,nrow=10,ncol=10))

tf <- system.time(df <- dist.for(data))
ta <- system.time(da <- dist.apply(data))

print(paste('diff = ',sum(as.matrix(df) - as.matrix(da))))
print("tf = ")
print(tf)
print("ta = ")
print(ta)

print('----------------------------------')
print('Same experiment on data.frame...')
data2 <- as.data.frame(data)

tf <- system.time(df <- dist.for(data2))
ta <- system.time(da <- dist.apply(data2))

print(paste('diff = ',sum(as.matrix(df) - as.matrix(da))))
print("tf = ")
print(tf)
print("ta = ")
print(ta)

########################

Here is the output I get on my system (R version 2.7.1 on a Debian lenny):

[1] "diff = 0"
[1] "tf = "
   user  system elapsed
  0.088   0.000   0.087
[1] "ta = "
   user  system elapsed
  0.128   0.000   0.128
[1] "----------------------------------"
[1] "Same experiment on data.frame..."
[1] "diff = 0"
[1] "tf = "
   user  system elapsed
 35.031   0.000  35.029
[1] "ta = "
   user  system elapsed
  0.184   0.000   0.185

Could you explain why that happens?

thank you,
regards

Roberto
Try running Rprof on the two examples to see what the difference is. What you will probably see is that a lot of the time with the data frame is spent accessing it like a matrix ('['). Rprof is very helpful for seeing where time is spent in your scripts.
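For anyone who wants to try the Rprof suggestion, here is a minimal sketch of how it might look. It reuses dist.for from the original post. The smaller data set (60 x 10, generated with rnorm() rather than mvtnorm::rmvnorm()), the temporary file, and the 0.01 s sampling interval are my assumptions, chosen only to keep the example quick and free of extra packages. Only the data frame run is profiled, because the matrix run may finish too quickly to collect any samples.

```r
# dist.for as posted in the thread (nested loops, row indexing with [i, ]).
dist.for <- function(data) {
  d <- matrix(0, nrow = nrow(data), ncol = nrow(data))
  n <- ncol(data)
  r <- nrow(data)
  for (i in 1:r) {
    for (j in 1:r) {
      d[i, j] <- sum(abs(data[i, ] - data[j, ])) / n
    }
  }
  return(as.dist(d))
}

# Assumed test data: smaller than in the post, base R only.
set.seed(1)
m   <- matrix(rnorm(60 * 10), nrow = 60, ncol = 10)
dfr <- as.data.frame(m)

# Timing without profiling, for comparison.
t_mat <- system.time(d_mat <- dist.for(m))
t_df  <- system.time(d_df  <- dist.for(dfr))
print(t_mat)
print(t_df)

# Profile the data frame run only.
prof_file <- tempfile(fileext = ".out")   # assumed output location
Rprof(prof_file, interval = 0.01)         # start sampling
d_df2 <- dist.for(dfr)
Rprof(NULL)                               # stop sampling

prof <- summaryRprof(prof_file)
# Functions ranked by time spent in their own code; with a data frame,
# data-frame methods such as `[.data.frame` would be expected near the top.
print(head(prof$by.self, 10))
```

If the summary is dominated by data-frame methods, converting once with as.matrix(data) before the loops would likely remove most of the cost. That is also a plausible reason dist.apply stays fast: apply() coerces a data frame to a matrix before iterating over its rows.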