Herve Pages
2007-Mar-02 18:39 UTC
[Rd] extracting rows from a data frame by looping over the row names: performance issues
Hi,

I have a big data frame:

  > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
  > dat <- as.data.frame(mat)

and I need to do some computation on each row. Currently I'm doing this:

  > for (key in row.names(dat)) { row <- dat[key, ]; ... do some computation on row... }

which could probably be considered a very natural (and R'ish) way of doing it
(but maybe I'm wrong and the real idiom for doing this is something different).

The problem with this "idiomatic form" is that it is _very_ slow. The loop
itself plus the simple extraction of the rows (no computation on the rows)
takes 10 hours on a powerful server (quad-core Linux with 8G of RAM)!

Looping over the first 100 rows takes 12 seconds:

  > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
     user  system elapsed
   12.637   0.120  12.756

But if, instead of the above, I do this:

  > for (i in nrow(dat)) { row <- sapply(dat, function(col) col[i]) }

then it's 20 times faster!!

  > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
     user  system elapsed
    0.576   0.096   0.673

I hope you will agree that this second form is much less natural.

So I was wondering why the "idiomatic form" is so slow? Shouldn't the
idiomatic form be not only elegant and easy to read, but also efficient?

Thanks,
H.

> sessionInfo()
R version 2.5.0 Under development (unstable) (2007-01-05 r40386)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C

attached base packages:
[1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"
[7] "base"
Herve Pages
2007-Mar-02 19:03 UTC
[Rd] extracting rows from a data frame by looping over the row names: performance issues
Herve Pages wrote:
...
> But if, instead of the above, I do this:
>
>   > for (i in nrow(dat)) { row <- sapply(dat, function(col) col[i]) }

Should have been:

  > for (i in 1:nrow(dat)) { row <- sapply(dat, function(col) col[i]) }

> then it's 20 times faster!!
>
>   > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
>      user  system elapsed
>     0.576   0.096   0.673
...

Cheers,
H.
Roger D. Peng
2007-Mar-02 19:43 UTC
[Rd] extracting rows from a data frame by looping over the row names: performance issues
Extracting rows from data frames is tricky, since each of the columns could
be of a different class. For your toy example, it seems a matrix would be a
more reasonable option. R-devel has some improvements to row extraction, if
I remember correctly. You might want to try your example there.

-roger

Herve Pages wrote:
> Hi,
>
> I have a big data frame:
>
>   > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
>   > dat <- as.data.frame(mat)
>
> and I need to do some computation on each row. [...]
>
> So I was wondering why the "idiomatic form" is so slow? Shouldn't the
> idiomatic form be, not only elegant and easy to read, but also efficient?
>
> Thanks,
> H.
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

--
Roger D. Peng | http://www.biostat.jhsph.edu/~rpeng/
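Roger's suggestion of working with the matrix directly can be sketched as follows (a small illustrative example, not from the thread; the object names are ours). A row of a character matrix is just a slice at a computed offset, whereas a row of a data frame is a new one-row data frame assembled column by column:

```r
# Same toy data as in the original post, at a smaller size for illustration.
mat <- matrix(rep(paste(letters, collapse = ""), 5 * 1000), ncol = 5)
dat <- as.data.frame(mat)

# Row extraction from the matrix: a single offset computation,
# returning a plain character vector.
row_from_mat <- mat[1, ]

# Row extraction from the data frame: builds a one-row data frame,
# touching every column and its attributes.
row_from_dat <- dat[1, ]

# Both carry the same 5 values, but with very different shapes and costs.
stopifnot(is.character(row_from_mat), length(row_from_mat) == 5)
stopifnot(is.data.frame(row_from_dat), nrow(row_from_dat) == 1)
```

The shape difference is the point: if every column has the same type, keeping the data as a matrix sidesteps the per-column work that `[.data.frame` has to do on each extraction.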
Greg Snow
2007-Mar-02 19:51 UTC
[Rd] extracting rows from a data frame by looping over the row names: performance issues
Your 2 examples have 2 differences, and their effects are therefore
confounded. What are your results for:

  system.time(for (i in 1:100) { row <- dat[i, ] })

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111

> -----Original Message-----
> From: r-devel-bounces at r-project.org
> [mailto:r-devel-bounces at r-project.org] On Behalf Of Herve Pages
> Sent: Friday, March 02, 2007 11:40 AM
> To: r-devel at r-project.org
> Subject: [Rd] extracting rows from a data frame by looping
> over the row names: performance issues
>
> Hi,
>
> I have a big data frame:
>
>   > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
>   > dat <- as.data.frame(mat)
>
> and I need to do some computation on each row. [...]
>
> Thanks,
> H.
Herve Pages
2007-Mar-03 02:03 UTC
[Rd] extracting rows from a data frame by looping over the row names: performance issues
Hi Greg,

Greg Snow wrote:
> Your 2 examples have 2 differences and they are therefore confounded in
> their effects.
>
> What are your results for:
>
>   system.time(for (i in 1:100) { row <- dat[i, ] })

Right. What you suggest is even faster (and simpler):

  > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
  > dat <- as.data.frame(mat)

  > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
     user  system elapsed
   13.241   0.460  13.702

  > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
     user  system elapsed
    0.280   0.372   0.650

  > system.time(for (i in 1:100) { row <- dat[i, ] })
     user  system elapsed
    0.044   0.088   0.130

So apparently here extracting with dat[i, ] is 300 times faster than
extracting with dat[key, ]!

  > system.time(for (i in 1:100) dat["1", ])
     user  system elapsed
   12.680   0.396  13.075

  > system.time(for (i in 1:100) dat[1, ])
     user  system elapsed
    0.060   0.076   0.137

Good to know!

Thanks a lot,
H.
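If the loop really does have to be driven by row names, one workaround (a sketch under our own assumptions, not proposed in the thread) is to translate the names into integer positions once with `match()`, and then use fast positional indexing inside the loop:

```r
# Toy data, smaller than the original post's for illustration.
mat <- matrix(rep(paste(letters, collapse = ""), 5 * 1000), ncol = 5)
dat <- as.data.frame(mat)

keys <- row.names(dat)

# One vectorized name lookup up front...
idx <- match(keys, row.names(dat))

# ...then cheap positional indexing inside the loop.
for (i in idx) {
  row <- dat[i, ]
  # ... do some computation on row ...
}

# Positional and name-based extraction agree for the same row.
stopifnot(identical(dat[idx[1], ], dat[keys[1], ]))
```

This pays the string-matching cost once for all rows instead of once per row, while still letting the computation be organized around the keys.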
Greg Snow
2007-Mar-05 16:07 UTC
[Rd] extracting rows from a data frame by looping over the row names: performance issues
The difference is in indexing by row number vs. indexing by row name. It has
long been known that names slow matrices down; some routines make a copy of
the dimnames of a matrix, remove the dimnames, do the computations with the
bare matrix, then put the dimnames back on. This can speed things up quite a
bit in some circumstances.

For your example, indexing by number means jumping to a specific offset in
the matrix, while indexing by name means searching through all the names and
doing string comparisons to find the match. A 300-fold difference in speed
is not surprising.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111

> -----Original Message-----
> From: Herve Pages [mailto:hpages at fhcrc.org]
> Sent: Friday, March 02, 2007 7:04 PM
> To: Greg Snow
> Cc: r-devel at r-project.org
> Subject: Re: [Rd] extracting rows from a data frame by
> looping over the row names: performance issues
>
> [...]
>
> > system.time(for (i in 1:100) { row <- dat[i, ] })
>    user  system elapsed
>   0.044   0.088   0.130
>
> So apparently here extracting with dat[i, ] is 300 times faster than
> extracting with dat[key, ]!
>
> > system.time(for (i in 1:100) dat["1", ])
>    user  system elapsed
>  12.680   0.396  13.075
>
> > system.time(for (i in 1:100) dat[1, ])
>    user  system elapsed
>   0.060   0.076   0.137
>
> Good to know!
>
> Thanks a lot,
> H.
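The dimnames trick Greg describes (save the names, compute on the bare matrix, put the names back) can be sketched like this; the function name and the toy computation are illustrative, not from any R routine:

```r
# Illustrative sketch of the pattern Greg describes: strip the dimnames,
# do the per-row work on the bare matrix, then restore the names.
row_sums_named <- function(m) {
  dn <- dimnames(m)    # save the names
  dimnames(m) <- NULL  # compute on the bare (nameless) matrix
  s <- numeric(nrow(m))
  for (i in seq_len(nrow(m))) {
    s[i] <- sum(m[i, ])
  }
  names(s) <- dn[[1]]  # put the row names back on the result
  s
}

m <- matrix(1:6, nrow = 2, dimnames = list(c("a", "b"), NULL))
stopifnot(identical(row_sums_named(m), c(a = 9, b = 12)))
```

The result is the same as computing on the named matrix directly; the point is only that the inner loop never pays for name handling.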