Henrik Bengtsson
2013-May-25 19:48 UTC
[Rd] Assigning NULL to large variables is much faster than rm() - any reason why I should still use rm()?
Hi,

in my packages/functions/code I tend to remove large temporary variables
as soon as possible, e.g. large intermediate vectors used in iterations.
I sometimes also have the habit of doing this to make it explicit in the
source code when a temporary object is no longer needed. However, I did
notice that this can add a noticeable overhead when the rest of the
iteration step does not take that much time.

Trying to speed this up, I first noticed that rm(list="a") is much faster
than rm(a). While at it, I realized that for the purpose of keeping the
memory footprint small, I can equally well reassign the variable the value
of a small object (e.g. a <- NULL), which is significantly faster than
using rm().

SOME BENCHMARKS:
A toy example imitating an iterative algorithm with "large" temporary objects.

x <- matrix(rnorm(100e6), ncol=10e3)

t1 <- system.time(for (k in 1:ncol(x)) {
  a <- x[,k]
  colSum <- sum(a)
  rm(a)  # Not needed anymore
  b <- x[k,]
  rowSum <- sum(b)
  rm(b)  # Not needed anymore
})

t2 <- system.time(for (k in 1:ncol(x)) {
  a <- x[,k]
  colSum <- sum(a)
  rm(list="a")  # Not needed anymore
  b <- x[k,]
  rowSum <- sum(b)
  rm(list="b")  # Not needed anymore
})

t3 <- system.time(for (k in 1:ncol(x)) {
  a <- x[,k]
  colSum <- sum(a)
  a <- NULL  # Not needed anymore
  b <- x[k,]
  rowSum <- sum(b)
  b <- NULL  # Not needed anymore
})

> t1
   user  system elapsed
   8.03    0.00    8.08
> t1/t2
    user   system  elapsed
1.322900 0.000000 1.320261
> t1/t3
    user   system  elapsed
1.715812 0.000000 1.662551

Is there a reason why I shouldn't assign NULL instead of using rm()?
As far as I understand it, the garbage collector will be equally
efficient cleaning out the previous object whether I use rm(a) or
a <- NULL. Is there anything else I'm overlooking? Am I adding overhead
somewhere else?

/Henrik

PS. With the above toy example one can obviously be a bit smarter by using:

t4 <- system.time({
  for (k in 1:ncol(x)) {
    a <- x[,k]
    colSum <- sum(a)
    a <- x[k,]
    rowSum <- sum(a)
  }
  rm(list="a")
})

but that's not my point.
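[One semantic difference between the two idioms is worth keeping in mind; a
minimal illustration, not from the original post. a <- NULL keeps the binding
"a" in the environment, now pointing at NULL, whereas rm() removes the binding
altogether. Both release the large object for garbage collection, but code
that later tests exists("a") will see different answers:

a <- rnorm(1e6)
a <- NULL       # the large vector becomes unreferenced; the binding 'a' remains
exists("a")     # TRUE: 'a' still exists, bound to NULL
is.null(a)      # TRUE

b <- rnorm(1e6)
rm(b)           # the binding is removed entirely
exists("b")     # FALSE: evaluating 'b' now raises "object 'b' not found"
]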
William Dunlap
2013-May-25 21:00 UTC
[Rd] Assigning NULL to large variables is much faster than rm() - any reason why I should still use rm()?
Another way to avoid using rm() in loops is to use throw-away functions. E.g.,

> t3 <- system.time(for (k in 1:ncol(x)) {  # your last, fastest, example
+   a <- x[,k]
+   colSum <- sum(a)
+   a <- NULL  # Not needed anymore
+   b <- x[k,]
+   rowSum <- sum(b)
+   b <- NULL  # Not needed anymore
+ })
> t4 <- system.time({  # use some throw-away functions
+   colKSum <- function(k) { a <- x[,k] ; sum(a) }
+   rowKSum <- function(k) { b <- x[k,] ; sum(b) }
+   for(k in 1:ncol(x)) {
+     colSum <- colKSum(k)
+     rowSum <- rowKSum(k)
+   }})
> t3
   user  system elapsed
   7.89    0.02    7.93
> t4
   user  system elapsed
   7.88    0.02    7.93

I think the code is clearer. It might make the compiler's job easier.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf
> Of Henrik Bengtsson
> Sent: Saturday, May 25, 2013 12:49 PM
> To: R-devel
> Subject: [Rd] Assigning NULL to large variables is much faster than rm() - any reason why
> I should still use rm()?
>
> [...]
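[In the same spirit, a minimal sketch, not from Bill's message, of the local()
variant that Simon Urbanek recommends below. Each temporary exists only inside
its local() block and becomes collectable as soon as the block returns:

for (k in 1:ncol(x)) {
  colSum <- local({
    a <- x[, k]   # 'a' lives only in this local() environment
    sum(a)
  })
  rowSum <- local({
    b <- x[k, ]   # 'b' is collectable once local() returns
    sum(b)
  })
}
]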
Simon Urbanek
2013-May-25 23:38 UTC
[Rd] Assigning NULL to large variables is much faster than rm() - any reason why I should still use rm()?
On May 25, 2013, at 3:48 PM, Henrik Bengtsson wrote:

> Trying to speed this up, I first noticed that rm(list="a") is much
> faster than rm(a). While at it, I realized that for the purpose of
> keeping the memory footprint small, I can equally well reassign the
> variable the value of a small object (e.g. a <- NULL), which is
> significantly faster than using rm().

Yes, as you probably noticed, rm() is quite a complex function because it
has to deal with different ways of specifying its input etc. When you
remove that overhead (by calling .Internal(remove("a", parent.frame(),
FALSE))), you get the same performance as the assignment.

If you really want to go overboard, you can define your own function:

SEXP rm_C(SEXP x, SEXP rho) {
    setVar(x, R_UnboundValue, rho);
    return R_NilValue;
}

poof <- function(x) .Call(rm_C, substitute(x), parent.frame())

That will be faster than anything else (mainly because it avoids the trip
through strings, as it can use the symbol directly).

But as Bill noted, in practice I'd recommend using either local() or
functions to control the scope - using rm() or assignments seems too
error-prone to me.

Cheers,
Simon

> [...]
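[For completeness, the .Internal() shortcut Simon mentions can be wrapped in a
small helper. A sketch only: the name rmq is made up here, and .Internal() is
not allowed in CRAN package code, so treat this as interactive experimentation:

# Hypothetical helper: skips rm()'s argument handling and calls the
# internal removal directly; accepts a single variable name as a string.
rmq <- function(name, envir = parent.frame()) {
  .Internal(remove(name, envir, FALSE))
}

a <- rnorm(1e6)
rmq("a")
exists("a")   # FALSE
]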