William Dunlap
2016-Jul-27 19:28 UTC
[Rd] Model object, when generated in a function, saves entire environment when saved
Another solution is to only save the parts of the model object that
interest you. As long as they don't include the formula (which is
what drags along the environment it was created in), you will
save space. E.g.,

tfun2 <- function(subset) {
    junk <- 1:1e6
    list(subset=subset,
         lm(Sepal.Length ~ Sepal.Width, data=iris, subset=subset)$coef)
}

saveSize(tfun2(1:4))
#[1] 152

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Wed, Jul 27, 2016 at 11:19 AM, William Dunlap <wdunlap at tibco.com> wrote:

> One way around this problem is to make a new environment whose
> parent environment is .GlobalEnv and which contains only what
> the call to lm() requires and to compute lm() in that environment. E.g.,
>
> tfun1 <- function (subset)
> {
>     junk <- 1:1e+06
>     env <- new.env(parent = globalenv())
>     env$subset <- subset
>     with(env, lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset))
> }
>
> Then we get
>
> saveSize(tfun1(1:4)) # see below for def. of saveSize
> [1] 910
>
> instead of the 2129743 bytes in the save file when using the naive method.
>
> saveSize <- function (object) {
>     tf <- tempfile(fileext = ".RData")
>     on.exit(unlink(tf))
>     save(object, file = tf)
>     file.size(tf)
> }
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Wed, Jul 27, 2016 at 10:48 AM, Kenny Bell <kmb56 at berkeley.edu> wrote:
>
>> In the below, I generate a model from an environment that isn't
>> .GlobalEnv with a large object that is unrelated to the model
>> generation. It seems to save the irrelevant object unnecessarily. In
>> my actual use case, I am running and saving many models in a loop that
>> each use a single large data.frame (that gets collapsed into a small
>> data.frame for estimation), so removing it isn't an option.
>>
>> In the case where the model exists in .GlobalEnv, everything is
>> peachy. So replicating whatever happens when saving the model that was
>> generated in .GlobalEnv at the return() stage of the function call
>> would fix this problem.
>>
>> I was referred to this list from r-bugs. First time r-devel poster.
>>
>> Hope this helps,
>>
>> Kendon
>>
>> ```
>> tmp_fun <- function(x){
>>   iris_big <- lapply(1:10000, function(x) iris)
>>   lm(Sepal.Length ~ Sepal.Width, data = iris)
>> }
>>
>> out <- tmp_fun(1)
>> object.size(out)
>> # 48008
>> save(out, file = "tmp.RData", compress = FALSE)
>> file.size("tmp.RData")
>> # 57196752 - way too big
>>
>> # Works fine when in .GlobalEnv
>> iris_big <- lapply(1:10000, function(x) iris)
>> out <- lm(Sepal.Length ~ Sepal.Width, data = iris)
>>
>> object.size(out)
>> # 48008
>> save(out, file = "tmp.RData", compress = FALSE)
>> file.size("tmp.RData")
>> # 16641 - good size.
>> ```
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
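To see why the formula matters, it can help to look at the environment a fitted model carries around. The following is a minimal sketch using only base R and the built-in iris data; the names fit_in_function, captured, full_size and coef_size are made up for illustration and do not come from the thread.

```r
# Hypothetical example: fit a model inside a function that also holds
# an unrelated object, then inspect what the model object refers to.
fit_in_function <- function() {
  junk <- runif(1e5)   # unrelated object living in the function's frame
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

fit <- fit_in_function()

# The formula (and the terms object) remember where they were created.
captured <- environment(formula(fit))
ls(captured)   # names bound in the function's frame, including "junk"
identical(captured, attr(terms(fit), ".Environment"))

# Uncompressed serialized size of the whole model versus the
# coefficients alone, which carry no formula and hence no environment.
full_size <- length(serialize(fit, NULL))
coef_size <- length(serialize(coef(fit), NULL))
c(full = full_size, coef_only = coef_size)
```

Measuring with serialize(object, NULL) avoids temporary files; it reports uncompressed bytes, so the numbers will not match the compressed save() sizes quoted in the messages.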
Kenny Bell
2016-Jul-27 19:31 UTC
[Rd] Model object, when generated in a function, saves entire environment when saved
Thanks so much for all this.

The first solution is what I'm going with as I want the terms object to
come along so that predict still works.
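As a rough check of that choice, the sketch below fits the model in a fresh environment whose parent is the global environment and then calls predict() on new data. It is an illustration under assumptions, not code from the thread: fit_clean and rows are hypothetical names, and the variable is called rows rather than subset only to avoid shadowing the base function.

```r
# Hypothetical example: the model is computed in an environment that
# contains only what lm() needs, so the terms object stays small.
fit_clean <- function(rows) {
  junk <- runif(1e5)                     # deliberately unrelated to the model
  env <- new.env(parent = globalenv())   # can see iris, cannot see junk
  env$rows <- rows
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris, subset = rows))
}

fit <- fit_clean(1:100)

# Only the objects placed in env travel with the formula.
ls(environment(formula(fit)))

# The terms object is intact, so prediction on new data still works.
pred <- predict(fit, newdata = data.frame(Sepal.Width = c(3.0, 3.5)))
pred
```

Because the whole lm object is kept, methods such as predict(), summary() and update() remain available, which is the trade-off against saving only the coefficients.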
Kenny Bell
2020-Jan-29 19:25 UTC
[Rd] Model object, when generated in a function, saves entire environment when saved
Reviving an old thread. I haven't noticed this being a problem for a while
when saving RDS files, which is great. However, I noticed the problem
again when saving `qs` files (https://github.com/traversc/qs), an
RDS replacement with a fast serialization / compression system.

I'd like to get an idea of what change was made within R to address this
issue for `saveRDS`. My thought is that this will help the author of the
`qs` package do something similar. I have had a browse through the release
notes for the last few years (Ctrl-F-ing "environment") and couldn't see it.

Many thanks for any help and best wishes to all.

The following code uses R 3.6.2 and requires you to run
install.packages("qs") first:

save_size_qs <- function (object) {
  tf <- tempfile(fileext = ".qs")
  on.exit(unlink(tf))
  qs::qsave(object, file = tf)
  file.size(tf)
}

save_size_rds <- function (object) {
  tf <- tempfile(fileext = ".rds")
  on.exit(unlink(tf))
  saveRDS(object, file = tf)
  file.size(tf)
}

normal_lm <- function(){
  junk <- 1:1e+08
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

normal_ggplot <- function(){
  junk <- 1:1e+08
  ggplot2::ggplot()
}

clean_lm <- function () {
  junk <- 1:1e+08
  # Run the lm in its own environment
  env <- new.env(parent = globalenv())
  # There is no `subset` argument here, so this copies base::subset (harmless)
  env$subset <- subset
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris))
}

# The qs save size includes the junk but the rds does not
save_size_qs(normal_lm())
#> [1] 848396
save_size_rds(normal_lm())
#> [1] 4163

save_size_qs(normal_ggplot())
#> [1] 857446
save_size_rds(normal_ggplot())
#> [1] 12895

# Both exclude the junk when separating the lm into its own environment
save_size_qs(clean_lm())
#> [1] 6154
save_size_rds(clean_lm())
#> [1] 4255
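One way to probe whether saveRDS really leaves the enclosing environment out is to repeat the comparison with junk that cannot be stored compactly. The sketch below is an assumption-laden illustration, not something stated in the thread: it swaps 1:1e+08 for runif(1e6) and writes uncompressed files. A possible explanation for the small RDS sizes above, which the thread does not confirm, is that recent R versions keep an integer sequence such as 1:1e+08 in a compact form that also serializes compactly, so the environment may still be saved even though the file stays small. The names rds_size, normal_lm_dense and clean_lm_dense are hypothetical.

```r
# Hypothetical helper: uncompressed size of the RDS file for an object.
rds_size <- function(object) {
  tf <- tempfile(fileext = ".rds")
  on.exit(unlink(tf))
  saveRDS(object, file = tf, compress = FALSE)
  file.size(tf)
}

# Same shape as normal_lm() above, but the junk is one million random
# doubles (about 8 MB), which has no compact representation.
normal_lm_dense <- function() {
  junk <- runif(1e6)
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

# Same junk, but the model is fitted in its own small environment.
clean_lm_dense <- function() {
  junk <- runif(1e6)
  env <- new.env(parent = globalenv())
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris))
}

size_normal <- rds_size(normal_lm_dense())
size_clean  <- rds_size(clean_lm_dense())
c(normal = size_normal, clean = size_clean)
```

If size_normal comes out in the megabytes while size_clean stays small, the function's environment is still being written by saveRDS, and the clean-environment approach remains the thing to rely on for any serializer.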