Kenny Bell
2020-Jan-29 19:25 UTC
[Rd] Model object, when generated in a function, saves entire environment when saved
Reviving an old thread. I haven't noticed this being a problem for a while when saving RDS files, which is great. However, I noticed the problem again when saving `qs` files (https://github.com/traversc/qs), which is an RDS replacement with a fast serialization / compression system.

I'd like to get an idea of what change was made within R to address this issue for `saveRDS`. My thought is that this will help the author of the `qs` package do something similar. I have had a browse through the release notes for the last few years (Ctrl-F-ing "environment") and couldn't see it.

Many thanks for any help and best wishes to all.

The following code uses R 3.6.2 and requires you to run install.packages("qs") first:

    save_size_qs <- function (object) {
      tf <- tempfile(fileext = ".qs")
      on.exit(unlink(tf))
      qs::qsave(object, file = tf)
      file.size(tf)
    }

    save_size_rds <- function (object) {
      tf <- tempfile(fileext = ".rds")
      on.exit(unlink(tf))
      saveRDS(object, file = tf)
      file.size(tf)
    }

    normal_lm <- function(){
      junk <- 1:1e+08
      lm(Sepal.Length ~ Sepal.Width, data = iris)
    }

    normal_ggplot <- function(){
      junk <- 1:1e+08
      ggplot2::ggplot()
    }

    clean_lm <- function () {
      junk <- 1:1e+08
      # Run the lm in its own environment
      env <- new.env(parent = globalenv())
      env$subset <- subset
      with(env, lm(Sepal.Length ~ Sepal.Width, data = iris))
    }

    # The qs save size includes the junk but the rds does not
    save_size_qs(normal_lm())
    #> [1] 848396
    save_size_rds(normal_lm())
    #> [1] 4163
    save_size_qs(normal_ggplot())
    #> [1] 857446
    save_size_rds(normal_ggplot())
    #> [1] 12895

    # Both exclude the junk when separating the lm into its own environment
    save_size_qs(clean_lm())
    #> [1] 6154
    save_size_rds(clean_lm())
    #> [1] 4255

On Thu, Jul 28, 2016 at 7:31 AM Kenny Bell <kmbell56 at gmail.com> wrote:

> Thanks so much for all this.
>
> The first solution is what I'm going with as I want the terms object to
> come along so that predict still works.
>
> On Wed, Jul 27, 2016 at 12:28 PM, William Dunlap via R-devel <
> r-devel at r-project.org> wrote:
>
>> Another solution is to only save the parts of the model object that
>> interest you. As long as they don't include the formula (which is
>> what drags along the environment it was created in), you will
>> save space. E.g.,
>>
>> tfun2 <- function(subset) {
>>    junk <- 1:1e6
>>    list(subset=subset, lm(Sepal.Length ~ Sepal.Width, data=iris,
>>         subset=subset)$coef)
>> }
>>
>> saveSize(tfun2(1:4))
>> #[1] 152
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>>
>> On Wed, Jul 27, 2016 at 11:19 AM, William Dunlap <wdunlap at tibco.com>
>> wrote:
>>
>> > One way around this problem is to make a new environment whose
>> > parent environment is .GlobalEnv and which contains only what the
>> > call to lm() requires and to compute lm() in that environment.
>> > E.g.,
>> >
>> > tfun1 <- function (subset)
>> > {
>> >     junk <- 1:1e+06
>> >     env <- new.env(parent = globalenv())
>> >     env$subset <- subset
>> >     with(env, lm(Sepal.Length ~ Sepal.Width, data = iris,
>> >          subset = subset))
>> > }
>> > Then we get
>> >    > saveSize(tfun1(1:4)) # see below for def. of saveSize
>> >    [1] 910
>> > instead of the 2129743 bytes in the save file when using the naive
>> > method.
>> >
>> > saveSize <- function (object) {
>> >     tf <- tempfile(fileext = ".RData")
>> >     on.exit(unlink(tf))
>> >     save(object, file = tf)
>> >     file.size(tf)
>> > }
>> >
>> > Bill Dunlap
>> > TIBCO Software
>> > wdunlap tibco.com
>> >
>> > On Wed, Jul 27, 2016 at 10:48 AM, Kenny Bell <kmb56 at berkeley.edu>
>> > wrote:
>> >
>> >> In the below, I generate a model from an environment that isn't
>> >> .GlobalEnv with a large object that is unrelated to the model
>> >> generation. It seems to save the irrelevant object unnecessarily. In
>> >> my actual use case, I am running and saving many models in a loop that
>> >> each use a single large data.frame (that gets collapsed into a small
>> >> data.frame for estimation), so removing it isn't an option.
>> >>
>> >> In the case where the model exists in .GlobalEnv, everything is
>> >> peachy. So replicating whatever happens when saving the model that was
>> >> generated in .GlobalEnv at the return() stage of the function call
>> >> would fix this problem.
>> >>
>> >> I was referred to this list from r-bugs. First time r-devel poster.
>> >>
>> >> Hope this helps,
>> >>
>> >> Kendon
>> >>
>> >> ```
>> >> tmp_fun <- function(x){
>> >>   iris_big <- lapply(1:10000, function(x) iris)
>> >>   lm(Sepal.Length ~ Sepal.Width, data = iris)
>> >> }
>> >>
>> >> out <- tmp_fun(1)
>> >> object.size(out)
>> >> # 48008
>> >> save(out, file = "tmp.RData", compress = FALSE)
>> >> file.size("tmp.RData")
>> >> # 57196752 - way too big
>> >>
>> >> # Works fine when in .GlobalEnv
>> >> iris_big <- lapply(1:10000, function(x) iris)
>> >> out <- lm(Sepal.Length ~ Sepal.Width, data = iris)
>> >>
>> >> object.size(out)
>> >> # 48008
>> >> save(out, file = "tmp.RData", compress = FALSE)
>> >> file.size("tmp.RData")
>> >> # 16641 - good size.
>> >> ```
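Bill Dunlap's remark in the quoted 2016 message, that the formula is what drags along the environment it was created in, can be checked directly. The following is only a minimal sketch: the names `fit_in_function`, `env`, `size_with_env` and `size_of_junk` are illustrative and not from the thread, and a smaller `junk` vector is assumed so that it runs quickly.

```r
# Illustrative helper (hypothetical name): fit a model inside a function
# that also holds an unrelated local object.
fit_in_function <- function() {
  junk <- runif(1e5)  # unrelated to the model
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

fit <- fit_in_function()

# The terms object stored in the fit carries a reference to the
# function's evaluation frame, where the formula was created.
env <- environment(terms(fit))
ls(env)

# Serializing the fit therefore also serializes that frame's contents.
size_with_env <- length(serialize(fit, NULL))
size_of_junk  <- length(serialize(get("junk", envir = env), NULL))
c(fit = size_with_env, junk_alone = size_of_junk)
```

If `junk` shows up in `ls(env)`, the serialized fit should be at least as large as `junk` serialized on its own, which matches the behaviour reported in the original post.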
Duncan Murdoch
2020-Jan-29 20:24 UTC
[Rd] Model object, when generated in a function, saves entire environment when saved
On 29/01/2020 2:25 p.m., Kenny Bell wrote:
> Reviving an old thread. I haven't noticed this be a problem for a while
> when saving RDS's which is great. However, I noticed the problem again when
> saving `qs` files (https://github.com/traversc/qs) which is an RDS
> replacement with a fast serialization / compression system.
>
> I'd like to get an idea of what change was made within R to address this
> issue for `saveRDS`. My thought is that this will help the author of the
> `qs` package do something similar. I have had a browse through the release
> notes for the last few years (Ctrl-F-ing "environment") and couldn't see it.

The vector 1:1e+08 is stored very compactly in recent R versions (the start and end plus a marker that it's a sequence), and it appears saveRDS takes advantage of that while qs::qsave doesn't. That's not a very useful test, because environments typically aren't filled with long sequence vectors.

If you replace the line

    junk <- 1:1e+08

with

    junk <- runif(1e+08)

you'll see drastically different results:

> save_size_qs(normal_lm())
[1] 417953609
> #> [1] 848396
> save_size_rds(normal_lm())
[1] 532614827
> #> [1] 4163
> save_size_qs(normal_ggplot())
[1] 417967987
> #> [1] 857446
> save_size_rds(normal_ggplot())
[1] 532624477
> #> [1] 12895

Duncan Murdoch
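The compact storage of sequences described above can be seen without any model at all. This is a rough sketch, assuming R >= 3.5.0 (where serialization format version 3 is available); the variable names `n`, `seq_bytes` and `rand_bytes` are made up for illustration, and a shorter vector is used to keep it fast.

```r
n <- 1e6

# A freshly created integer sequence is held in compact form, and
# serialization format version 3 can write that compact form out.
seq_bytes <- length(serialize(1:n, NULL, version = 3))

# Random doubles have no compact representation, so all n values
# (8 bytes each) end up in the serialized output.
rand_bytes <- length(serialize(runif(n), NULL, version = 3))

c(sequence = seq_bytes, random = rand_bytes)
```

The first number should be tiny compared with the second, which is why `junk <- 1:1e+08` makes the captured environment look almost free when saved with `saveRDS`.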
Harvey Smith
2020-Jan-30 21:53 UTC
[Rd] Model object, when generated in a function, saves entire environment when saved
Depending on whether you need the data in the referenced environments later, you could fit the model normally and use the refhook argument in saveRDS/readRDS to replace references to environments in the model with a dummy value.

    normal_lm <- function(){
      junk <- runif(1e+08)
      lm(Sepal.Length ~ Sepal.Width, data = iris)
    }

    object = normal_lm()
    tf <- tempfile(fileext = ".rds")
    saveRDS(object, file = tf, refhook = function(...) {""})
    object2 = readRDS(file = tf, refhook = function(...) { .GlobalEnv })
    file.size(tf)
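Since the 2016 part of the thread mentions wanting `predict` to keep working, here is a small sketch of how one might check that after a refhook round trip. It assumes a reduced `junk` size for speed; `fit_in_function`, `fit` and `fit2` are illustrative names, and this is only a sanity check, not a guarantee for every model class.

```r
# Illustrative helper (hypothetical name), with a smaller junk vector.
fit_in_function <- function() {
  junk <- runif(1e5)
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

fit <- fit_in_function()
tf <- tempfile(fileext = ".rds")

# On save, write a dummy string instead of each referenced environment;
# on load, substitute the global environment for each dummy.
saveRDS(fit, file = tf, refhook = function(x) "")
fit2 <- readRDS(tf, refhook = function(x) globalenv())
unlink(tf)

# The restored terms now point at the global environment ...
env_is_global <- identical(environment(terms(fit2)), globalenv())

# ... and predictions on new data can be compared with the original fit.
same_predictions <- all.equal(predict(fit, iris[1:5, ]),
                              predict(fit2, iris[1:5, ]))
```

Note that the hook replaces every non-special environment in the object, so anything the model would later need to look up in its original frame is no longer available.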