I have always known that "matrices are faster than data frames"; for
instance, this function:

dumkoll <- function(n = 1000, df = TRUE){
    dfr <- data.frame(x = rnorm(n), y = rnorm(n))
    if (df){
        for (i in 2:NROW(dfr)){
            if (!(i %% 100)) cat("i = ", i, "\n")
            dfr$x[i] <- dfr$x[i-1]
        }
    }else{
        dm <- as.matrix(dfr)
        for (i in 2:NROW(dm)){
            if (!(i %% 100)) cat("i = ", i, "\n")
            dm[i, 1] <- dm[i-1, 1]
        }
        dfr$x <- dm[, 1]
    }
}

--------------------
> system.time(dumkoll())
   user  system elapsed
  0.046   0.000   0.045

> system.time(dumkoll(df = FALSE))
   user  system elapsed
  0.007   0.000   0.008
----------------------

OK, no big deal, but I stumbled over a data frame with one million
records. Then, with df = TRUE,

----------------------------
     user    system   elapsed
44677.141  1271.544 46016.754
----------------------------

This is around 12 hours.

With df = FALSE, it took only six seconds! About 7500 times faster.

I was really surprised by the huge difference, and I wonder whether this
is to be expected or some peculiarity of my installation: I'm running
Ubuntu 13.10 on a MacBook Pro with 8 GB memory, R-3.0.3.

Göran B.
Hello,

This is to be expected. Matrices can hold only one type of data, so the
type question is settled once and for all; data frames can hold many
types of data, so the code that handles them must determine which type
it is dealing with on every access.

Hope this helps,

Rui Barradas

On 16-03-2014 18:57, Göran Broström wrote:
> I have always known that "matrices are faster than data frames" [...]
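Rui's first point can be seen directly with str(): a data frame keeps a
separate type per column, while as.matrix() must coerce everything to a
single common type. (A minimal sketch, base R only; in dumkoll() both
columns are numeric, so nothing is actually coerced there.)

mixed <- data.frame(x = rnorm(3), y = letters[1:3])
str(mixed)             # two columns, two types: numeric and character
str(as.matrix(mixed))  # one common type for the whole matrix: character

Because a matrix commits to a single type up front, element access needs
no per-access type dispatch, which is part of the speed difference.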
On 14-03-16 2:57 PM, Göran Broström wrote:
> I have always known that "matrices are faster than data frames" [...]

I don't find it surprising. The line

    dfr$x[i] <- dfr$x[i-1]

will be executed about a million times. It does the following:

1. Get a pointer to the x element of dfr. This requires R to look
through all the names of dfr to figure out which one is "x".

2. Extract the i-1 element from it. Not particularly slow.

3. Get a pointer to the x element of dfr again. (R doesn't cache these
things.)

4. Set the i element of it to a new value. This could require the
entire column or even the entire data frame to be copied, if R hasn't
kept track of the fact that it is really being changed in place. In a
complex assignment like that, I wouldn't be surprised if that took
place. (In the matrix equivalent, it would be easier to recognize that
it is safe to change the existing value.)

Luke Tierney is making some changes in R-devel that might help a lot in
cases like this, but I expect the matrix code will always be faster.

Duncan Murdoch
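A workaround that follows directly from steps 1-4 (a sketch, not from
the thread; dumkoll_vec is a hypothetical name): do the column lookup
once, loop over a local numeric vector, and write the column back a
single time.

dumkoll_vec <- function(n = 1000){
    dfr <- data.frame(x = rnorm(n), y = rnorm(n))
    x <- dfr$x                # step 1 happens once, not once per iteration
    for (i in 2:length(x))
        x[i] <- x[i - 1]      # plain numeric vector: no lookup, no copying
    dfr$x <- x                # one write back into the data frame
    invisible(dfr)
}

The inner loop never touches the data frame, so this keeps the data
frame interface while running at roughly the speed of the matrix
version.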
Did you really intend to make all of the x values the same? If so, try
one line instead of the for loop:

    dfr$x[2:n] <- dfr$x[1]

If that was merely an error in your example, then you could use a
different one-liner:

    dfr$x[2:n] <- dfr$x[seq.int(n - 1)]

In either case, the speedup is considerable. I use data frames far more
than matrices and don't feel I am suffering for it, but then I also use
creative indexing way more than for loops.

---------------------------------------------------------------------------
Jeff Newmiller                        DCN: <jdnewmil at dcn.davis.ca.us>
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.

On March 16, 2014 11:57:33 AM PDT, "Göran Broström"
<goran.brostrom at umu.se> wrote:
> I have always known that "matrices are faster than data frames" [...]
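A quick check, at a small n, that the first one-liner reproduces what
the original loop computes (a sketch, not part of the reply above):

n   <- 1000
dfr <- data.frame(x = rnorm(n), y = rnorm(n))

ref <- dfr
for (i in 2:n) ref$x[i] <- ref$x[i - 1]   # the loop: x[1] propagates forward

dfr$x[2:n] <- dfr$x[1]                    # the one-liner
identical(dfr$x, ref$x)                   # TRUE: every x equals the original x[1]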