Kenny Bell
2016-Jul-27 17:48 UTC
[Rd] Model object, when generated in a function, saves entire environment when saved
In the example below, I generate a model inside a function, i.e. in an environment that isn't .GlobalEnv, which also contains a large object unrelated to the model. Saving the model appears to save that irrelevant object as well. In my actual use case, I am running and saving many models in a loop that each use a single large data.frame (which gets collapsed into a small data.frame for estimation), so removing it isn't an option.

In the case where the model is created in .GlobalEnv, everything is peachy. So replicating, at the return() stage of the function call, whatever happens when saving a model that was generated in .GlobalEnv would fix this problem.

I was referred to this list from r-bugs. First time r-devel poster.

Hope this helps,

Kendon

```
tmp_fun <- function(x) {
  iris_big <- lapply(1:10000, function(x) iris)
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

out <- tmp_fun(1)
object.size(out)
# 48008
save(out, file = "tmp.RData", compress = FALSE)
file.size("tmp.RData")
# 57196752 - way too big

# Works fine when in .GlobalEnv
iris_big <- lapply(1:10000, function(x) iris)
out <- lm(Sepal.Length ~ Sepal.Width, data = iris)

object.size(out)
# 48008
save(out, file = "tmp.RData", compress = FALSE)
file.size("tmp.RData")
# 16641 - good size.
```
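A quick way to see where the extra bytes come from is to look at the environment attached to the fitted model's terms/formula. The following diagnostic sketch is not part of the original message; it assumes the tmp_fun defined above.

```
# Diagnostic sketch (assumes tmp_fun from the message above): lm() stores the
# formula/terms with the environment they were created in, and save()
# serializes that environment -- including iris_big -- along with the model.
out <- tmp_fun(1)
environment(out$terms)      # the evaluation frame of tmp_fun, not .GlobalEnv
ls(environment(out$terms))  # "iris_big" "x"
```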
William Dunlap
2016-Jul-27 18:19 UTC
[Rd] Model object, when generated in a function, saves entire environment when saved
One way around this problem is to make a new environment whose parent environment is .GlobalEnv and which contains only what the call to lm() requires, and to compute lm() in that environment. E.g.,

tfun1 <- function (subset)
{
    junk <- 1:1e+06
    env <- new.env(parent = globalenv())
    env$subset <- subset
    with(env, lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset))
}

Then we get

> saveSize(tfun1(1:4)) # see below for def. of saveSize
[1] 910

instead of the 2129743 bytes in the save file when using the naive method.

saveSize <- function (object) {
    tf <- tempfile(fileext = ".RData")
    on.exit(unlink(tf))
    save(object, file = tf)
    file.size(tf)
}

Bill Dunlap
TIBCO Software
wdunlap tibco.com
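Applied to the original poster's tmp_fun, the same idea might look as follows. This is a sketch based on the message above, not code from the thread, and the exact file size will vary by platform and R version.

```
# Sketch: the new.env()/with() workaround dropped into the original tmp_fun.
# The formula's environment becomes `env`, which contains nothing large, so
# the saved file should be close to the ~16 KB seen in the .GlobalEnv case.
tmp_fun2 <- function(x) {
  iris_big <- lapply(1:10000, function(x) iris)  # large, unrelated object
  env <- new.env(parent = globalenv())
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris))
}

out <- tmp_fun2(1)
save(out, file = "tmp2.RData", compress = FALSE)
file.size("tmp2.RData")  # small; exact size will vary
```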
Duncan Murdoch
2016-Jul-27 19:11 UTC
[Rd] Model object, when generated in a function, saves entire environment when saved
On 27/07/2016 1:48 PM, Kenny Bell wrote:
> In the example below, I generate a model inside a function, i.e. in an
> environment that isn't .GlobalEnv, which also contains a large object
> unrelated to the model. Saving the model appears to save that irrelevant
> object as well. In my actual use case, I am running and saving many models
> in a loop that each use a single large data.frame (which gets collapsed
> into a small data.frame for estimation), so removing it isn't an option.

If each of those many models refers to the object in the formula, then you need to keep it. But you'll only have one copy of it, because environments are reference objects in R.

If your loop looks like this,

for (i in 1:n) {
    subset <- bigdf[ fn(i), ]
    model[[i]] <- lm(y ~ x, data = subset)
}

then you might be in trouble. You'll only get one copy of the "subset" variable in the environment, so in any case where code gets it from there, it will get the last one, not the one for model[[i]].

One way around this is to write a nested function to create the subset variable, e.g.

nested <- function(subset) {
    lm(y ~ x, data = subset)
}

for (i in 1:n)
    model[[i]] <- nested(bigdf[ fn(i), ])

rm(bigdf)

and it will be safe to remove bigdf after the loop.

(I see that Bill Dunlap has posted a different way of achieving the same sort of thing.)

Duncan Murdoch
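To make the difference concrete, here is a small check (a sketch, not from the thread) that uses iris in place of the poster's large data.frame and reuses Bill Dunlap's saveSize() helper from the previous message: each fit's formula environment holds only that fit's own small subset, so the fits stay cheap to save.

```
# Sketch: verify that the nested-function approach keeps each model small.
# Assumes saveSize() as defined in Bill Dunlap's reply above.
nested <- function(subset) lm(Sepal.Length ~ Sepal.Width, data = subset)

models <- vector("list", 3)
for (i in 1:3)
  models[[i]] <- nested(iris[sample(nrow(iris), 20), ])

ls(environment(formula(models[[1]])))  # just "subset": that fit's 20-row data
sapply(models, saveSize)               # each save file is small
```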
William Dunlap
2016-Jul-27 19:28 UTC
[Rd] Model object, when generated in a function, saves entire environment when saved
Another solution is to save only the parts of the model object that interest you. As long as they don't include the formula (which is what drags along the environment it was created in), you will save space. E.g.,

tfun2 <- function(subset) {
    junk <- 1:1e6
    list(subset = subset,
         lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset)$coef)
}

saveSize(tfun2(1:4))
#[1] 152

Bill Dunlap
TIBCO Software
wdunlap tibco.com
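A related trick, not suggested in the thread, is to keep the full lm object but reset the environment carried by its terms after fitting, so save() has nothing extra to serialize. It comes with a caveat: anything that later re-evaluates the formula (update(), for example) will search the replaced environment instead of the original call frame, so treat the following as a sketch rather than a drop-in fix. It assumes the saveSize() helper defined earlier in the thread.

```
# Sketch (not from the thread): strip the captured environment from the fit
# instead of avoiding it.  Use with care -- functions that re-evaluate the
# formula, such as update(), will then look in the replaced environment.
tfun3 <- function(subset) {
  junk <- 1:1e6
  fit <- lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset)
  environment(fit$terms) <- globalenv()                 # drop the call frame
  environment(attr(fit$model, "terms")) <- globalenv()  # same for the model frame's copy
  fit
}

saveSize(tfun3(1:4))  # comparable to fitting in .GlobalEnv
```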