dear R wizards: here is the strange question for the day.  It seems to me
that nrow() is very slow.  Let me explain what I mean:

ds= data.frame( NA, x=rnorm(10000) )   ## a sample data set

> system.time( { for (i in 1:10000) NA } )   ## doing nothing takes virtually no time
   user  system elapsed
  0.000   0.000   0.001

## this is something that should take time; we add up 10,000 values, 10,000 times
> system.time( { for (i in 1:10000) mean(ds$x) } )
   user  system elapsed
  0.416   0.001   0.416

## alas, this should be very fast: it is just reading off an attribute of ds.
## yet it takes nearly a third of the time of mean()!
> system.time( { for (i in 1:10000) nrow(ds) } )
   user  system elapsed
  0.124   0.001   0.125

## here is an alternative way to get the number of rows, which is already much faster:
> system.time( { for (i in 1:10000) length(ds$x) } )
   user  system elapsed
  0.041   0.000   0.041

is there a faster way to learn how big a data frame is?  I know this sounds
silly, but this is inside a "by" statement, where I figure out how many
observations are in each subset.  strangely, this takes a whole lot of time.
I don't believe it is possible to ask "by" to attach an attribute to the data
frame that stores the number of observations that it is actually passing.

pointers appreciated.

regards,

/iaw

--
Ivo Welch (ivo.welch@brown.edu, ivo.welch@gmail.com)
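
P.S.  For concreteness, here is roughly the pattern I mean, with a made-up
grouping column g (my real data are different):

ds <- data.frame( g=sample(letters, 10000, replace=TRUE), x=rnorm(10000) )

## what I do now: the function passed to by() calls nrow() once per subset
counts.nrow <- by( ds, ds$g, function(d) nrow(d) )

## counting through a single column is already faster per call
counts.len <- by( ds, ds$g, function(d) length(d$x) )

## if nothing but the counts were needed, table() would avoid by() altogether
counts.tab <- table( ds$g )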
On Sep 15, 2009, at 10:48 AM, ivo welch wrote:

> dear R wizards: here is the strange question for the day.  It seems to me
> that nrow() is very slow.  Let me explain what I mean:
>
> ds= data.frame( NA, x=rnorm(10000) )   ## a sample data set
>
>> system.time( { for (i in 1:10000) NA } )   ## doing nothing takes virtually no time
>    user  system elapsed
>   0.000   0.000   0.001
>
> ## this is something that should take time; we add up 10,000 values, 10,000 times
>> system.time( { for (i in 1:10000) mean(ds$x) } )
>    user  system elapsed
>   0.416   0.001   0.416
>
> ## alas, this should be very fast: it is just reading off an attribute of ds.
> ## yet it takes nearly a third of the time of mean()!
>> system.time( { for (i in 1:10000) nrow(ds) } )
>    user  system elapsed
>   0.124   0.001   0.125

I am guessing that you are coming from a statistical paradigm where there is
an implicit looping construct in a data step.  In R you find the number of
rows not with a loop, but with the nrow function used just once:

> ds= data.frame( NA, x=rnorm(10000) )
> system.time(nrow(ds))
   user  system elapsed
      0       0       0

> ## here is an alternative way to get the number of rows, which is already much faster:
>> system.time( { for (i in 1:10000) length(ds$x) } )
>    user  system elapsed
>   0.041   0.000   0.041
>
> is there a faster way to learn how big a data frame is?

> length(ds)
[1] 2
> nrow(ds)
[1] 10000
# Or:
> dim(ds)
[1] 10000     2

> I know this sounds silly, but this is inside a "by" statement, where I
> figure out how many observations are in each subset.  strangely, this takes
> a whole lot of time.  I don't believe it is possible to ask "by" to attach
> an attribute to the data frame that stores the number of observations that
> it is actually passing.
>
> pointers appreciated.
>
> regards,
>
> /iaw

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
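
P.S.  One detail worth spelling out, since length() appears above with two
different arguments: on a data frame, length() counts columns, while on a
single column it counts elements, i.e. rows.  A quick check with the same
toy data:

> length(ds)     # the data frame has 2 columns
[1] 2
> length(ds$x)   # one column has 10000 elements, i.e. the row count
[1] 10000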
On Tue, Sep 15, 2009 at 9:48 AM, ivo welch <ivowel@gmail.com> wrote:

> dear R wizards: here is the strange question for the day.  It seems to me
> that nrow() is very slow.  Let me explain what I mean:
>
> ds= data.frame( NA, x=rnorm(10000) )   ## a sample data set
>
>> system.time( { for (i in 1:10000) NA } )   ## doing nothing takes virtually no time
>    user  system elapsed
>   0.000   0.000   0.001
>
> ## this is something that should take time; we add up 10,000 values, 10,000 times
>> system.time( { for (i in 1:10000) mean(ds$x) } )
>    user  system elapsed
>   0.416   0.001   0.416
>
> ## alas, this should be very fast: it is just reading off an attribute of ds.
> ## yet it takes nearly a third of the time of mean()!
>> system.time( { for (i in 1:10000) nrow(ds) } )
>    user  system elapsed
>   0.124   0.001   0.125

I just encountered this same problem.  nrow is so slow because it works like
this, each call dispatching to the next:

nrow(df)
dim(df)[1]
dim.data.frame(df)[1]
c(.row_names_info(df, 2L), length(df))

If you use .row_names_info(df, 2L) directly it's about 6 times faster.

> system.time( { for (i in 1:10000) nrow(ds) })
   user  system elapsed
  0.183   0.002   0.187

> system.time( { for (i in 1:10000) .row_names_info(ds, 2) })
   user  system elapsed
  0.026   0.000   0.027

Hadley

--
had.co.nz
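
P.S.  Plugged back into the by() setting from the original post, it might
look like this (the grouping column g is made up for illustration, as in the
postscript to the first message):

ds <- data.frame( g=sample(letters, 10000, replace=TRUE), x=rnorm(10000) )

## count rows per subset without going through nrow()'s dispatch to dim()
counts <- by( ds, ds$g, function(d) .row_names_info(d, 2L) )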