thr3ads.net - R devel - [Rd] Cost of garbage collection seems excessive [Jan 2015]

Nathan Kurz

2015-Jan-09 09:31 UTC

[Rd] Cost of garbage collection seems excessive

When doing repeated regressions on large data sets, I'm finding that
the time spent on garbage collection often exceeds the time spent on
the regression itself. Consider this test program, which I'm running
on an Intel Haswell i7-4470 processor under Linux 3.13 using R 3.1.2
compiled with ICPC 14.1:

nate@haswell:~$ cat > gc.R
  library(speedglm)
  createData <- function(n) {
      int <- -5
      x <- rnorm(n, 50, 7)
      e <- rnorm(n, 0, 1)
      y <- int + (1.2 * x) + e
      return(data.frame(y, x))
  }
  gc.time()
  data <- createData(500000)
  data.y <- as.matrix(data[1])
  data.x <- model.matrix(y ~ ., data)
  for (i in 1:100) speedglm.wfit(X=data.x, y=data.y, family=gaussian())
  gc.time()

nate@haswell:~$ time Rscript gc.R
 Loading required package: Matrix
 Loading required package: methods
 [1] 0 0 0 0 0
 [1] 10.410  0.024 10.441  0.000  0.000
real 0m17.167s
user 0m16.996s
sys 0m0.176s

The total execution time is 17 seconds, and the time spent on garbage
collection is almost 2/3 of that.  My actual use case is a package
that creates an ensemble from a variety of cross-validated
regressions, and exhibits the same poor performance. Is this expected
behavior?
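
For reference, the ratio quoted above follows from the two outputs: 10.441 s of elapsed GC time out of 17.167 s of wall-clock time is about 0.61. Below is a minimal sketch of taking the same measurement around a single expression, using only base R. The name `gc_share` is a hypothetical helper, not part of any package, and the workload is a stand-in for the regression loop.

```r
# Hypothetical helper (not from the original post): report how much of the
# elapsed time of an expression was spent in the garbage collector.
gc_share <- function(expr) {
  gc_before <- gc.time()        # cumulative GC times: user, system, elapsed, ...
  t_before  <- proc.time()
  force(expr)                   # evaluate the lazily passed expression here
  elapsed    <- unname((proc.time() - t_before)[["elapsed"]])
  gc_elapsed <- unname((gc.time() - gc_before)[3])
  c(elapsed = elapsed,
    gc_elapsed = gc_elapsed,
    share = if (elapsed > 0) gc_elapsed / elapsed else NA_real_)
}

# Stand-in workload that allocates many short-lived vectors.
res <- gc_share(for (i in 1:200) x <- rnorm(1e5))
print(res)

# The figures from the post: GC elapsed time over total wall-clock time.
print(round(10.441 / 17.167, 2))
```

Wrapping only the fitting loop this way would exclude R startup and package loading, which the `time Rscript` figure includes.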

I've found that I can reduce the garbage collection time to a
tolerable level by setting the R_VSIZE environment variable to a
large enough value:

nate@haswell:~$ time R_VSIZE=1GB Rscript gc.R
Loading required package: Matrix
Loading required package: methods
[1] 0 0 0 0 0
[1] 0.716 0.025 0.739 0.000 0.000
real 0m7.694s
user 0m7.388s
sys 0m0.309s

I can do slightly better with even higher values, and by using
R_GC_MEM_GROW=3.  But while using the environment variables solves the
issue for me, I fear that the end users of my package won't be able to
set them.  Is there a way that I can achieve the higher performance
from within R rather than from the command line?
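
One possible workaround, sketched here under assumptions rather than as a confirmed answer: since R_VSIZE is read when R starts, R code can launch the heavy computation in a child R process with the variable set, using `system2()` and its `env` argument. The child command below only echoes the variable back to show that it arrived; a real package would run its own script instead. This does not change the limits of the session that is already running.

```r
# Sketch: start a child R process with R_VSIZE set, from within R.
# The value "1GB" is the one used in the post; the echoed command is a
# stand-in for the real workload (assumption for illustration).
rscript <- file.path(R.home("bin"), "Rscript")
child_code <- "cat(Sys.getenv('R_VSIZE'))"
out <- system2(rscript,
               args = c("-e", shQuote(child_code)),
               env = "R_VSIZE=1GB",
               stdout = TRUE)
print(out)
```

The cost of this design is a second R process and the need to pass data to it, so it only pays off when the GC overhead is large relative to that.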

Thanks!

--nate

luke-tierney at uiowa.edu

2015-Jan-11 19:25 UTC


[Rd] Cost of garbage collection seems excessive

This is a known issue that is being looked into. The primary culprit
seems to be the case labels that are created and need to be scanned by
the GC.
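
If the case labels in question are the per-observation row names that `model.matrix()` attaches to its result (an assumption; the reply does not spell this out), they can be inspected and dropped before fitting. The sketch below uses a small data set shaped like the one in the original script. Whether this reduces GC time for `speedglm.wfit` is not measured here.

```r
# Small stand-in for the data in the original script (n reduced for speed).
set.seed(1)
n <- 1000
d <- data.frame(y = rnorm(n), x = rnorm(n, 50, 7))

X <- model.matrix(y ~ ., d)
str(rownames(X))            # one character label per observation

# Drop the row labels but keep the column names.
X_bare <- X
rownames(X_bare) <- NULL
str(dimnames(X_bare))
```

The numeric contents of the matrix are unchanged; only the character vector of row labels is removed.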

Best,

luke

-- 
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   luke-tierney at uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
