Henrik Bengtsson
2013-May-25 19:48 UTC
[Rd] Assigning NULL to large variables is much faster than rm() - any reason why I should still use rm()?
Hi,

in my packages/functions/code I tend to remove large temporary variables
as soon as possible, e.g. large intermediate vectors used in iterations.
I sometimes also have the habit of doing this to make it explicit in the
source code when a temporary object is no longer needed. However, I did
notice that this can add a noticeable overhead when the rest of the
iteration step does not take that much time.

Trying to speed this up, I first noticed that rm(list="a") is much faster
than rm(a). While at it, I realized that for the purpose of keeping the
memory footprint small, I can equally well reassign the variable the value
of a small object (e.g. a <- NULL), which is significantly faster than
using rm().

SOME BENCHMARKS:
A toy example imitating an iterative algorithm with "large" temporary objects.

x <- matrix(rnorm(100e6), ncol=10e3)

t1 <- system.time(for (k in 1:ncol(x)) {
  a <- x[,k]
  colSum <- sum(a)
  rm(a)  # Not needed anymore
  b <- x[k,]
  rowSum <- sum(b)
  rm(b)  # Not needed anymore
})

t2 <- system.time(for (k in 1:ncol(x)) {
  a <- x[,k]
  colSum <- sum(a)
  rm(list="a")  # Not needed anymore
  b <- x[k,]
  rowSum <- sum(b)
  rm(list="b")  # Not needed anymore
})

t3 <- system.time(for (k in 1:ncol(x)) {
  a <- x[,k]
  colSum <- sum(a)
  a <- NULL  # Not needed anymore
  b <- x[k,]
  rowSum <- sum(b)
  b <- NULL  # Not needed anymore
})

> t1
   user  system elapsed
   8.03    0.00    8.08
> t1/t2
    user   system  elapsed
1.322900 0.000000 1.320261
> t1/t3
    user   system  elapsed
1.715812 0.000000 1.662551

Is there a reason why I shouldn't assign NULL instead of using rm()?
As far as I understand it, the garbage collector will be equally
efficient cleaning out the previous object whether I use rm(a) or
a <- NULL. Is there anything else I'm overlooking? Am I adding overhead
somewhere else?

/Henrik

PS. With the above toy example one can obviously be a bit smarter by using:

t4 <- system.time({
  for (k in 1:ncol(x)) {
    a <- x[,k]
    colSum <- sum(a)
    a <- x[k,]
    rowSum <- sum(a)
  }
  rm(list="a")
})

but that's not my point.
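[One semantic difference between the two idioms is worth keeping in mind; a
minimal illustration, not from the original post. a <- NULL keeps the binding
"a" in the environment, now pointing at NULL, whereas rm() removes the binding
altogether. Both release the large object for garbage collection, but code
that later tests exists("a") will see different answers:

a <- rnorm(1e6)
a <- NULL       # the large vector becomes unreferenced; the binding 'a' remains
exists("a")     # TRUE: 'a' still exists, bound to NULL
is.null(a)      # TRUE

b <- rnorm(1e6)
rm(b)           # the binding is removed entirely
exists("b")     # FALSE: evaluating 'b' now raises "object 'b' not found"
]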
William Dunlap
2013-May-25 21:00 UTC
[Rd] Assigning NULL to large variables is much faster than rm() - any reason why I should still use rm()?
Another way to avoid using rm() in loops is to use throw-away functions. E.g.,

> t3 <- system.time(for (k in 1:ncol(x)) {  # your last, fastest, example
+   a <- x[,k]
+   colSum <- sum(a)
+   a <- NULL  # Not needed anymore
+   b <- x[k,]
+   rowSum <- sum(b)
+   b <- NULL  # Not needed anymore
+ })
> t4 <- system.time({  # use some throw-away functions
+   colKSum <- function(k) { a <- x[,k] ; sum(a) }
+   rowKSum <- function(k) { b <- x[k,] ; sum(b) }
+   for(k in 1:ncol(x)) {
+     colSum <- colKSum(k)
+     rowSum <- rowKSum(k)
+   }})
> t3
   user  system elapsed
   7.89    0.02    7.93
> t4
   user  system elapsed
   7.88    0.02    7.93

I think the code is clearer. It might make the compiler's job easier.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf
> Of Henrik Bengtsson
> Sent: Saturday, May 25, 2013 12:49 PM
> To: R-devel
> Subject: [Rd] Assigning NULL to large variables is much faster than rm() - any reason why
> I should still use rm()?
>
> [...]
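[In the same spirit, a minimal sketch, not from Bill's message, of the local()
variant that Simon Urbanek recommends below. Each temporary exists only inside
its local() block and becomes collectable as soon as the block returns:

for (k in 1:ncol(x)) {
  colSum <- local({
    a <- x[, k]   # 'a' lives only in this local() environment
    sum(a)
  })
  rowSum <- local({
    b <- x[k, ]   # 'b' is collectable once local() returns
    sum(b)
  })
}
]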
Simon Urbanek
2013-May-25 23:38 UTC
[Rd] Assigning NULL to large variables is much faster than rm() - any reason why I should still use rm()?
On May 25, 2013, at 3:48 PM, Henrik Bengtsson wrote:

> Trying to speed this up, I first noticed that rm(list="a") is much
> faster than rm(a). While at it, I realized that for the purpose of
> keeping the memory footprint small, I can equally well reassign the
> variable the value of a small object (e.g. a <- NULL), which is
> significantly faster than using rm().

Yes, as you probably noticed, rm() is quite a complex function because it
has to deal with different ways of specifying its input etc. When you
remove that overhead (by calling .Internal(remove("a", parent.frame(),
FALSE))), you get the same performance as the assignment.

If you really want to go overboard, you can define your own function:

SEXP rm_C(SEXP x, SEXP rho) {
    setVar(x, R_UnboundValue, rho);
    return R_NilValue;
}

poof <- function(x) .Call(rm_C, substitute(x), parent.frame())

That will be faster than anything else (mainly because it avoids the trip
through strings, as it can use the symbol directly).

But as Bill noted, in practice I'd recommend using either local() or
functions to control the scope - using rm() or assignments seems too
error-prone to me.

Cheers,
Simon

> [...]
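[For completeness, the .Internal() shortcut Simon mentions can be wrapped in a
small helper. A sketch only: the name rmq is made up here, and .Internal() is
not allowed in CRAN package code, so treat this as interactive experimentation:

# Hypothetical helper: skips rm()'s argument handling and calls the
# internal removal directly; accepts a single variable name as a string.
rmq <- function(name, envir = parent.frame()) {
  .Internal(remove(name, envir, FALSE))
}

a <- rnorm(1e6)
rmq("a")
exists("a")   # FALSE
]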