Roberto Perdisci
2009-Oct-22 00:17 UTC
[R] loop vs. apply(): strange behavior with data frame?
Hi everybody,
I noticed some strange behavior when using loops versus apply() on a data frame.
The example below "explicitly" computes a distance matrix for a given
dataset. When the dataset is a matrix, both versions run quickly. But when
the dataset is a data.frame, the dist.for function, written with
nested loops, takes far longer than dist.apply.
######## USING FOR #######
dist.for <- function(data) {
  # d[i,j] holds the mean absolute difference between rows i and j
  d <- matrix(0,nrow=nrow(data),ncol=nrow(data))
  n <- ncol(data)
  r <- nrow(data)
  for(i in 1:r) {
    for(j in 1:r) {
      d[i,j] <- sum(abs(data[i,]-data[j,]))/n
    }
  }
  return(as.dist(d))
}
######## USING APPLY #######
# distances from one row (data.row) to every row of data.rest
f <- function(data.row,data.rest) {
  return(as.double(apply(data.rest,1,g,data.row)))
}
# mean absolute difference between two rows
g <- function(row2,row1) {
  return(sum(abs(row1-row2))/length(row1))
}
dist.apply <- function(data) {
  d <- apply(data,1,f,data)
  return(as.dist(d))
}
######## TESTING #######
library(mvtnorm)
data <- rmvnorm(100,mean=seq(1,10),sigma=diag(1,nrow=10,ncol=10))
tf <- system.time(df <- dist.for(data))
ta <- system.time(da <- dist.apply(data))
print(paste('diff = ',sum(as.matrix(df) - as.matrix(da))))
print("tf = ")
print(tf)
print("ta = ")
print(ta)
print('----------------------------------')
print('Same experiment on data.frame...')
data2 <- as.data.frame(data)
tf <- system.time(df <- dist.for(data2))
ta <- system.time(da <- dist.apply(data2))
print(paste('diff = ',sum(as.matrix(df) - as.matrix(da))))
print("tf = ")
print(tf)
print("ta = ")
print(ta)
########################
Here is the output I get on my system (R version 2.7.1 on Debian lenny):
[1] "diff = 0"
[1] "tf = "
   user  system elapsed
  0.088   0.000   0.087
[1] "ta = "
   user  system elapsed
  0.128   0.000   0.128
[1] "----------------------------------"
[1] "Same experiment on data.frame..."
[1] "diff = 0"
[1] "tf = "
   user  system elapsed
 35.031   0.000  35.029
[1] "ta = "
   user  system elapsed
  0.184   0.000   0.185
Could you explain why that happens?
thank you,
regards
Roberto
Try running Rprof on the two examples to see where the difference comes from.
You will probably see that, for the data frame, much of the time is
spent indexing it like a matrix (the '[' function). Rprof is very helpful
for seeing where time is spent in your scripts.
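As a rough sketch of what that profiling run could look like (the 60-row test data, the seed, and the names m, df, prof_file, d_df and d_m are assumptions for illustration; dist.for is repeated from the original post so the snippet runs on its own, and Rprof needs an R build with profiling enabled):

```r
# Illustrative data: the same values stored as a matrix and as a data frame.
set.seed(1)
m  <- matrix(rnorm(60 * 10), nrow = 60)
df <- as.data.frame(m)

# Nested-loop distance function from the original post.
dist.for <- function(data) {
  d <- matrix(0, nrow = nrow(data), ncol = nrow(data))
  n <- ncol(data)
  r <- nrow(data)
  for (i in 1:r) {
    for (j in 1:r) {
      d[i, j] <- sum(abs(data[i, ] - data[j, ])) / n
    }
  }
  as.dist(d)
}

# Profile the slow case: the loop running on a data frame.
prof_file <- tempfile()
Rprof(prof_file)   # start sampling the call stack
d_df <- dist.for(df)
Rprof(NULL)        # stop profiling

# Functions ranked by total time; data frame methods such as
# `[.data.frame` would be expected near the top.
print(head(summaryRprof(prof_file)$by.total, 10))

# Possible workaround: convert to a matrix once, before the loop.
d_m <- dist.for(as.matrix(df))
print(all.equal(as.vector(d_df), as.vector(d_m)))
```

One likely reason dist.apply is not affected is that apply() converts a data frame to a matrix with as.matrix() before iterating, so the per-row work never goes through the data frame indexing method.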