Herve Pages
2007-Mar-02 18:39 UTC
[Rd] extracting rows from a data frame by looping over the row names: performance issues
Hi,

I have a big data frame:

  > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
  > dat <- as.data.frame(mat)

and I need to do some computation on each row. Currently I'm doing this:

  > for (key in row.names(dat)) { row <- dat[key, ]; ... do some computation on row... }

which could probably be considered a very natural (and R'ish) way of doing it
(but maybe I'm wrong and the real idiom for doing this is something different).

The problem with this "idiomatic form" is that it is _very_ slow. The loop
itself plus the simple extraction of the rows (no computation on the rows)
takes 10 hours on a powerful server (quad-core Linux with 8G of RAM)!

Looping over the first 100 rows takes 12 seconds:

  > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
     user  system elapsed
   12.637   0.120  12.756

But if, instead of the above, I do this:

  > for (i in nrow(dat)) { row <- sapply(dat, function(col) col[i]) }

then it's 20 times faster!!

  > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
     user  system elapsed
    0.576   0.096   0.673

I hope you will agree that this second form is much less natural.

So I was wondering why the "idiomatic form" is so slow? Shouldn't the
idiomatic form be not only elegant and easy to read, but also efficient?

Thanks,
H.

> sessionInfo()
R version 2.5.0 Under development (unstable) (2007-01-05 r40386)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C

attached base packages:
[1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"
[7] "base"
Herve Pages
2007-Mar-02 19:03 UTC
[Rd] extracting rows from a data frame by looping over the row names: performance issues
Herve Pages wrote:
...
> But if, instead of the above, I do this:
>
>   > for (i in nrow(dat)) { row <- sapply(dat, function(col) col[i]) }

Should have been:

  > for (i in 1:nrow(dat)) { row <- sapply(dat, function(col) col[i]) }

> then it's 20 times faster!!
>
>   > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
>      user  system elapsed
>     0.576   0.096   0.673
...

Cheers,
H.
Roger D. Peng
2007-Mar-02 19:43 UTC
[Rd] extracting rows from a data frame by looping over the row names: performance issues
Extracting rows from data frames is tricky, since each of the columns could
be of a different class. For your toy example, it seems a matrix would be a
more reasonable option. R-devel has some improvements to row extraction, if
I remember correctly. You might want to try your example there.

-roger

Herve Pages wrote:
> Hi,
>
> I have a big data frame:
>
>   > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
>   > dat <- as.data.frame(mat)
>
> and I need to do some computation on each row. [...]
>
> So I was wondering why the "idiomatic form" is so slow? Shouldn't the
> idiomatic form be, not only elegant and easy to read, but also efficient?
>
> Thanks,
> H.
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

--
Roger D. Peng | http://www.biostat.jhsph.edu/~rpeng/
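Roger's suggestion of working with the matrix directly can be sketched as follows (a small illustrative example, not from the thread; the object names are ours). A row of a character matrix is just a slice at a computed offset, whereas a row of a data frame is a new one-row data frame assembled column by column:

```r
# Same toy data as in the original post, at a smaller size for illustration.
mat <- matrix(rep(paste(letters, collapse = ""), 5 * 1000), ncol = 5)
dat <- as.data.frame(mat)

# Row extraction from the matrix: a single offset computation,
# returning a plain character vector.
row_from_mat <- mat[1, ]

# Row extraction from the data frame: builds a one-row data frame,
# touching every column and its attributes.
row_from_dat <- dat[1, ]

# Both carry the same 5 values, but with very different shapes and costs.
stopifnot(is.character(row_from_mat), length(row_from_mat) == 5)
stopifnot(is.data.frame(row_from_dat), nrow(row_from_dat) == 1)
```

The shape difference is the point: if every column has the same type, keeping the data as a matrix sidesteps the per-column work that `[.data.frame` has to do on each extraction.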
Greg Snow
2007-Mar-02 19:51 UTC
[Rd] extracting rows from a data frame by looping over the row names: performance issues
Your 2 examples have 2 differences, and their effects are therefore
confounded. What are your results for:

  system.time(for (i in 1:100) { row <- dat[i, ] })

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111

> -----Original Message-----
> From: r-devel-bounces at r-project.org
> [mailto:r-devel-bounces at r-project.org] On Behalf Of Herve Pages
> Sent: Friday, March 02, 2007 11:40 AM
> To: r-devel at r-project.org
> Subject: [Rd] extracting rows from a data frame by looping
> over the row names: performance issues
>
> Hi,
>
> I have a big data frame:
>
>   > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
>   > dat <- as.data.frame(mat)
>
> and I need to do some computation on each row. [...]
>
> Thanks,
> H.
Herve Pages
2007-Mar-03 02:03 UTC
[Rd] extracting rows from a data frame by looping over the row names: performance issues
Hi Greg,

Greg Snow wrote:
> Your 2 examples have 2 differences and they are therefore confounded in
> their effects.
>
> What are your results for:
>
>   system.time(for (i in 1:100) { row <- dat[i, ] })

Right. What you suggest is even faster (and simpler):

  > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
  > dat <- as.data.frame(mat)

  > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
     user  system elapsed
   13.241   0.460  13.702

  > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
     user  system elapsed
    0.280   0.372   0.650

  > system.time(for (i in 1:100) { row <- dat[i, ] })
     user  system elapsed
    0.044   0.088   0.130

So apparently here extracting with dat[i, ] is 300 times faster than
extracting with dat[key, ]!

  > system.time(for (i in 1:100) dat["1", ])
     user  system elapsed
   12.680   0.396  13.075

  > system.time(for (i in 1:100) dat[1, ])
     user  system elapsed
    0.060   0.076   0.137

Good to know!

Thanks a lot,
H.
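If the loop really does have to be driven by row names, one workaround (a sketch under our own assumptions, not proposed in the thread) is to translate the names into integer positions once with `match()`, and then use fast positional indexing inside the loop:

```r
# Toy data, smaller than the original post's for illustration.
mat <- matrix(rep(paste(letters, collapse = ""), 5 * 1000), ncol = 5)
dat <- as.data.frame(mat)

keys <- row.names(dat)

# One vectorized name lookup up front...
idx <- match(keys, row.names(dat))

# ...then cheap positional indexing inside the loop.
for (i in idx) {
  row <- dat[i, ]
  # ... do some computation on row ...
}

# Positional and name-based extraction agree for the same row.
stopifnot(identical(dat[idx[1], ], dat[keys[1], ]))
```

This pays the string-matching cost once for all rows instead of once per row, while still letting the computation be organized around the keys.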
Greg Snow
2007-Mar-05 16:07 UTC
[Rd] extracting rows from a data frame by looping over the row names: performance issues
The difference is in indexing by row number vs. indexing by row name. It has
long been known that names slow matrices down; some routines make a copy of
the dimnames of a matrix, remove the dimnames, do the computations with the
bare matrix, then put the dimnames back on. This can speed things up quite a
bit in some circumstances.

For your example, indexing by number means jumping to a specific offset in
the matrix, while indexing by name means searching through all the names and
doing string comparisons to find the match. A 300-fold difference in speed
is not surprising.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111

> -----Original Message-----
> From: Herve Pages [mailto:hpages at fhcrc.org]
> Sent: Friday, March 02, 2007 7:04 PM
> To: Greg Snow
> Cc: r-devel at r-project.org
> Subject: Re: [Rd] extracting rows from a data frame by
> looping over the row names: performance issues
>
> [...]
>
> > system.time(for (i in 1:100) { row <- dat[i, ] })
>    user  system elapsed
>   0.044   0.088   0.130
>
> So apparently here extracting with dat[i, ] is 300 times faster than
> extracting with dat[key, ]!
>
> > system.time(for (i in 1:100) dat["1", ])
>    user  system elapsed
>  12.680   0.396  13.075
>
> > system.time(for (i in 1:100) dat[1, ])
>    user  system elapsed
>   0.060   0.076   0.137
>
> Good to know!
>
> Thanks a lot,
> H.
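The dimnames trick Greg describes (save the names, compute on the bare matrix, put the names back) can be sketched like this; the function name and the toy computation are illustrative, not from any R routine:

```r
# Illustrative sketch of the pattern Greg describes: strip the dimnames,
# do the per-row work on the bare matrix, then restore the names.
row_sums_named <- function(m) {
  dn <- dimnames(m)    # save the names
  dimnames(m) <- NULL  # compute on the bare (nameless) matrix
  s <- numeric(nrow(m))
  for (i in seq_len(nrow(m))) {
    s[i] <- sum(m[i, ])
  }
  names(s) <- dn[[1]]  # put the row names back on the result
  s
}

m <- matrix(1:6, nrow = 2, dimnames = list(c("a", "b"), NULL))
stopifnot(identical(row_sums_named(m), c(a = 9, b = 12)))
```

The result is the same as computing on the named matrix directly; the point is only that the inner loop never pays for name handling.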