I have always known that "matrices are faster than data frames"; for
instance, this function:

dumkoll <- function(n = 1000, df = TRUE){
    dfr <- data.frame(x = rnorm(n), y = rnorm(n))
    if (df){
        for (i in 2:NROW(dfr)){
            if (!(i %% 100)) cat("i = ", i, "\n")
            dfr$x[i] <- dfr$x[i-1]
        }
    }else{
        dm <- as.matrix(dfr)
        for (i in 2:NROW(dm)){
            if (!(i %% 100)) cat("i = ", i, "\n")
            dm[i, 1] <- dm[i-1, 1]
        }
        dfr$x <- dm[, 1]
    }
}

--------------------
> system.time(dumkoll())
   user  system elapsed
  0.046   0.000   0.045

> system.time(dumkoll(df = FALSE))
   user  system elapsed
  0.007   0.000   0.008
----------------------

OK, no big deal, but I stumbled over a data frame with one million
records. Then, with df = TRUE,

----------------------------
     user    system   elapsed
44677.141  1271.544 46016.754
----------------------------

This is around 12 hours.

With df = FALSE, it took only six seconds! About 7500 times faster.

I was really surprised by the huge difference, and I wonder whether this
is to be expected or some peculiarity of my installation: I'm running
Ubuntu 13.10 on a MacBook Pro with 8 GB memory, R-3.0.3.

Göran B.
Hello,

This is to be expected. Matrices can hold only one type of data, so the
type question is settled once and for all; data frames can hold many
types of data, so the code that handles them must determine which type
it is dealing with on every access.

Hope this helps,

Rui Barradas

On 16-03-2014 18:57, Göran Broström wrote:
> I have always known that "matrices are faster than data frames" [...]
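Rui's first point can be seen directly with str(): a data frame keeps a
separate type per column, while as.matrix() must coerce everything to a
single common type. (A minimal sketch, base R only; in dumkoll() both
columns are numeric, so nothing is actually coerced there.)

mixed <- data.frame(x = rnorm(3), y = letters[1:3])
str(mixed)             # two columns, two types: numeric and character
str(as.matrix(mixed))  # one common type for the whole matrix: character

Because a matrix commits to a single type up front, element access needs
no per-access type dispatch, which is part of the speed difference.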
On 14-03-16 2:57 PM, Göran Broström wrote:
> I have always known that "matrices are faster than data frames" [...]

I don't find it surprising. The line

    dfr$x[i] <- dfr$x[i-1]

will be executed about a million times. It does the following:

1. Get a pointer to the x element of dfr. This requires R to look
through all the names of dfr to figure out which one is "x".

2. Extract the i-1 element from it. Not particularly slow.

3. Get a pointer to the x element of dfr again. (R doesn't cache these
things.)

4. Set the i element of it to a new value. This could require the
entire column or even the entire data frame to be copied, if R hasn't
kept track of the fact that it is really being changed in place. In a
complex assignment like that, I wouldn't be surprised if that took
place. (In the matrix equivalent, it would be easier to recognize that
it is safe to change the existing value.)

Luke Tierney is making some changes in R-devel that might help a lot in
cases like this, but I expect the matrix code will always be faster.

Duncan Murdoch
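A workaround that follows directly from steps 1-4 (a sketch, not from
the thread; dumkoll_vec is a hypothetical name): do the column lookup
once, loop over a local numeric vector, and write the column back a
single time.

dumkoll_vec <- function(n = 1000){
    dfr <- data.frame(x = rnorm(n), y = rnorm(n))
    x <- dfr$x                # step 1 happens once, not once per iteration
    for (i in 2:length(x))
        x[i] <- x[i - 1]      # plain numeric vector: no lookup, no copying
    dfr$x <- x                # one write back into the data frame
    invisible(dfr)
}

The inner loop never touches the data frame, so this keeps the data
frame interface while running at roughly the speed of the matrix
version.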
Did you really intend to make all of the x values the same? If so, try
one line instead of the for loop:

    dfr$x[2:n] <- dfr$x[1]

If that was merely an error in your example, then you could use a
different one-liner:

    dfr$x[2:n] <- dfr$x[seq.int(n - 1)]

In either case, the speedup is considerable. I use data frames far more
than matrices and don't feel I am suffering for it, but then I also use
creative indexing way more than for loops.

---------------------------------------------------------------------------
Jeff Newmiller                        DCN: <jdnewmil at dcn.davis.ca.us>
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.

On March 16, 2014 11:57:33 AM PDT, "Göran Broström"
<goran.brostrom at umu.se> wrote:
> I have always known that "matrices are faster than data frames" [...]
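A quick check, at a small n, that the first one-liner reproduces what
the original loop computes (a sketch, not part of the reply above):

n   <- 1000
dfr <- data.frame(x = rnorm(n), y = rnorm(n))

ref <- dfr
for (i in 2:n) ref$x[i] <- ref$x[i - 1]   # the loop: x[1] propagates forward

dfr$x[2:n] <- dfr$x[1]                    # the one-liner
identical(dfr$x, ref$x)                   # TRUE: every x equals the original x[1]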