Dear R developers:

R is supposed to be slow for iterative calculations. actually, it isn't. matrix operations are fast. it is data frame operations that are slow.

R <- 1000
C <- 1000

example <- function(m) {
  cat("rows: ")
  cat(system.time(for (r in 1:R) m[r,20] <- sqrt(abs(m[r,20])) + rnorm(1)), "\n")
  cat("columns: ")
  cat(system.time(for (c in 1:C) m[20,c] <- sqrt(abs(m[20,c])) + rnorm(1)), "\n")
  if (is.data.frame(m)) {
    cat("df: columns as names: ")
    cat(system.time(for (c in 1:C) m[[c]][20] <- sqrt(abs(m[[c]][20])) + rnorm(1)), "\n")
  }
}

cat("\n**** Now as matrix\n")
example( matrix( rnorm(C*R), nrow=R ) )

cat("\n**** Now as data frame\n")
example( as.data.frame( matrix( rnorm(C*R), nrow=R ) ) )

When m is a data frame, the operation is about 300 times slower than when m is a matrix. The program is basically accessing 1000 numbers. When m is a data frame, the speed of R is about 20 accesses per second on a Mac Pro. This is pretty pathetic.

I do not know the R internals, so the following is pure speculation. I understand that an index calculation is faster than a vector lookup for arbitrary-size objects, but it seems as if R relies on search to find its elements. maybe there isn't even a basic vector lookup table. a vector lookup table should be possible at least along the dimension of consecutive storage. another possible improvement would be an operation that attaches an attribute to the data frame containing a full index table for quick lookup. (if the index table is there, it could be used. otherwise, R could simply use the existing internal mechanism.)

I think faster data frame access would significantly improve the impression that R makes on novices. just my 5 cents.

/iaw

----
Ivo Welch (ivo.welch at gmail.com)
http://www.ivo-welch.info/
J. Fred Weston Professor of Finance
Anderson School at UCLA, C519
This is just a quick, incomplete response, but the main misconception is really the use of data.frames. If you don't use the elaborate mechanics of data frames that involve the management of row names, then they are definitely the wrong tool to use, because most of the overhead is exactly to manage the row names, and you pay a substantial penalty for that. Just drop that one feature and you get timings similar to a matrix:

> m=matrix(rnorm(C*R), nrow=R)
> example(m)
rows: 0.015 0 0.015 0 0
columns: 0.01 0 0.01 0 0
> l=as.list(as.data.frame( matrix( rnorm(C*R), nrow=R ) ))
> example(l)
rows: 0.015 0 0.016 0 0
columns: 0.012 0 0.011 0 0

(with example modified to use m[[y]][x] instead of m[x,y])

I would not be surprised that many people use data.frames for the convenience of the matrix subsetting/subassignment operators and don't really care about the row names, and for all those uses data.frames are the wrong tool. (Just look at `[.data.frame` and `[<-.data.frame`.)

As a side note, it's a bit pointless to compare the performance to matrices, as they impose a much more rigorous structure (all columns have the same type) - if you use data frames in such special (rare) cases, it's really your fault ;).

So the bottom line is to educate users not to use data frames where they are not needed and/or provide alternatives (and there may be some things coming up, too). And as I said, this is just a quick note, so carry on and comment on the original question ;).

Cheers,
Simon

On Jul 2, 2011, at 2:23 PM, ivo welch wrote:

> Dear R developers: R is supposed to be slow for iterative
> calculations. actually, it isn't. matrix operations are fast. it is
> data frame operations that are slow.
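A minimal sketch of the modification Simon describes, assuming the same R and C as in the original post (the name example2 is purely illustrative; Simon's exact code is not shown in the thread):

R <- 1000
C <- 1000

## same per-element benchmark as in the original post, but every access goes
## through m[[col]][row] (plain list-of-vectors indexing) instead of m[row, col]
example2 <- function(m) {
  cat("rows: ")
  cat(system.time(for (r in 1:R) m[[20]][r] <- sqrt(abs(m[[20]][r])) + rnorm(1)), "\n")
  cat("columns: ")
  cat(system.time(for (c in 1:C) m[[c]][20] <- sqrt(abs(m[[c]][20])) + rnorm(1)), "\n")
}

## columns become plain numeric vectors; the row names are dropped
l <- as.list(as.data.frame(matrix(rnorm(C*R), nrow=R)))
example2(l)   # timings comparable to the matrix case, per Simon's numbers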
thank you, simon. this was very interesting indeed. I also now understand how far out of my depth I am here. fortunately, as an end user, obviously, *I* now know how to avoid the problem. I particularly like the as.list() transformation and back to as.data.frame() to speed things up without loss of (much) functionality.

more broadly, I view the avoidance of individual access through the use of apply and vector operations as a mixed "IQ test" and "knowledge test" (which I often fail). However, even for the most clever, there are also situations where the KISS programming principle makes explicit loops still preferable.

Personally, I would have preferred it if R had, in its standard "statistical data set" data structure, foregone the row names feature in exchange for retaining fast direct access. R could have reserved its current implementation "with row names but slow access" for a less common (possibly pseudo-inheriting) data structure.

If end users commonly do iterations over a data frame, which I would guess to be the case, then the impression of R by (novice) end users could be greatly enhanced if the extreme penalties could be eliminated or at least flagged. For example, I wonder if modest special internal code could store data frames internally and transparently as lists of vectors UNTIL a row name is assigned to. Easier and uglier, a simple but specific warning message could be issued with a suggestion if there is an individual read/write into a data frame ("Warning: data frames are much slower than lists of vectors for individual element access").

I would also suggest changing the "Introduction to R" 6.3 from

"A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes. It may be displayed in matrix form, and its rows and columns extracted using matrix indexing conventions."

to

"A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes. It may be displayed in matrix form, and its rows and columns extracted using matrix indexing conventions. However, data frames can be much slower than matrices or even lists of vectors (which, like data frames, can contain different types of columns) when individual elements need to be accessed."

Reading about it immediately upon introduction could flag the problem in a more visible manner.

regards,

/iaw
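A sketch of the as.list()/as.data.frame() round trip praised above, under the assumption that the intermediate code only needs per-element access by column (the object name df is illustrative; V20 is the default name as.data.frame() gives the 20th column of an unnamed matrix):

df <- as.data.frame(matrix(rnorm(1000 * 1000), nrow = 1000))

l <- as.list(df)                     # one-off conversion to a plain list of column vectors
for (r in seq_len(length(l[[20]])))  # per-element access is now cheap
  l[[20]][r] <- sqrt(abs(l[[20]][r])) + rnorm(1)
df <- as.data.frame(l)               # back to a data frame (default row names)

## and where the calculation allows it, the vectorized form skips the loop entirely
df$V20 <- sqrt(abs(df$V20)) + rnorm(nrow(df))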