On Thu, 15 Jan 2015, Christian Sigg wrote:
> Given a large data.frame, a function trains a series of models by looping
> over two steps:
>
> 1. Create a model-specific subset of the complete training data
> 2. Train a model on the subset data
>
> The function returns a list of trained models which are later used for
> prediction on test data.
>
> Due to how models and closures work in R, each model contains a lot of data
> that is not necessary for prediction. The space requirements of all models
> combined can become prohibitive if the number of samples in the training data
> and the number of models are sufficiently large:
>
> 1. A trained linear model (and other models that follow the same
> conventions) contains the training data itself and related quantities (such as
> residuals). While this is convenient for some kinds of analysis, it negates the
> space-saving effect of compacting the training data into the model parameters.
>
> 2. Any function created in the loop contains the training data in its
> enclosing environment. For example, a linearising transform defined as
>
> linearise <- function(x) {
>     x^gamma
> }
>
> (where gamma is derived from the training data) contains not only
> `gamma` but also other objects in its enclosing environment (e.g.
> intermediate computations in the loop). If `linearise` is returned with the
> model, those objects are also returned implicitly.
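>
> A minimal sketch of the capture problem (the names `train_one_model` and
> `big_subset` are hypothetical): even though only `gamma` is used, the
> closure's enclosing environment is the entire local frame of the training
> function, so the subset data travels along with the returned model.
>
> train_one_model <- function(big_subset) {
>     gamma <- mean(big_subset$y)          # derived from the training data
>     linearise <- function(x) x^gamma     # captures the whole local frame
>     list(linearise = linearise)
> }
>
> m <- train_one_model(data.frame(y = runif(1e6)))
> ls(environment(m$linearise))   # "big_subset" "gamma" "linearise"
> object.size(environment(m$linearise)$big_subset)   # the data is still there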
>
> The first point can be dealt with by removing those components of the model
> that are not necessary for prediction (e.g. model$residuals <- NULL). For
> the second point, more work and care are needed to clean up all enclosing
> environments of created functions (not only `linearise` but also model$terms
> etc.).
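>
> For the first point, a minimal sketch of such a cleanup for an `lm` fit
> (the data here is a stand-in, and which components can safely go depends on
> how `predict` is later called; residuals, for example, are still needed for
> se.fit = TRUE):
>
> subset_data <- data.frame(x = runif(100), y = runif(100))   # stand-in data
> fit <- lm(y ~ x, data = subset_data)
>
> ## components predict.lm() does not need for plain predictions on newdata
> fit$residuals     <- NULL
> fit$fitted.values <- NULL
> fit$effects       <- NULL
> fit$model         <- NULL   # the embedded model frame
>
> ## the terms object references the environment the formula was created in;
> ## repointing it lets that environment be garbage collected
> environment(fit$terms) <- globalenv()
>
> (lm() also accepts model = FALSE, x = FALSE, y = FALSE to avoid storing some
> of this in the first place.)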
>
> I have read that V8's garbage collector avoids this problem by
> distinguishing between local and context variables:
>
> https://stackoverflow.com/questions/5326300/garbage-collection-with-node-js
>
> Can something similar be done in R? Is there a programming technique that
> is less tedious than "manual" cleanup of all enclosing environments?

R's semantics do not permit this sort of optimization in general. There may
be something we could do if users could provide annotations that allow the
semantics to be relaxed; that sort of thing is being considered, but it won't
be available anytime soon.

The approach I use in situations like this is to write top-level functions
that create closures. So for your example, replace

linearise <- function(x) {
    x^gamma
}

with

linearise <- makeLinearize(gamma)

where makeLinearize is defined at top level as

makeLinearize <- function(gamma) {
    function(x) {
        x^gamma
    }
}
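
A small usage sketch, using the same hypothetical names as the sketch in the
quoted message: because makeLinearize is defined at top level, the enclosing
environment of the returned closure is just the call frame of makeLinearize,
which holds only gamma, and its parent is the global environment rather than
the training function's frame.

makeLinearize <- function(gamma) {
    function(x) {
        x^gamma
    }
}

train_one_model <- function(big_subset) {
    gamma <- mean(big_subset$y)
    linearise <- makeLinearize(gamma)    # enclosing env holds only `gamma`
    list(linearise = linearise)
}

m <- train_one_model(data.frame(y = runif(1e6)))
ls(environment(m$linearise))   # just "gamma"

So the large subset is no longer reachable from the returned closure and can
be collected as soon as the loop iteration is done.
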
Best,
luke
>
> Thanks,
> Christian
>
--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
Department of Statistics and Actuarial Science, University of Iowa
241 Schaeffer Hall, Iowa City, IA 52242
Phone: 319-335-3386   Fax: 319-335-3017
email: luke-tierney at uiowa.edu   WWW: http://www.stat.uiowa.edu