Kenny Bell
2020-Jan-29  19:25 UTC
[Rd] Model object, when generated in a function, saves entire environment when saved
Reviving an old thread. I haven't noticed this be a problem for a while
when saving RDS's which is great. However, I noticed the problem again when
saving `qs` files (https://github.com/traversc/qs) which is an RDS
replacement with a fast serialization / compression system.
I'd like to get an idea of what change was made within R to address this
issue for `saveRDS`. My thought is that this will help the author of the
`qs` package do something similar. I have had a browse through the release
notes for the last few years (Ctrl-F-ing "environment") and
couldn't see it.
Many thanks for any help and best wishes to all.
The following code uses R 3.6.2 and requires you to run
install.packages("qs") first:
save_size_qs <- function (object) {
  tf <- tempfile(fileext = ".qs")
  on.exit(unlink(tf))
  qs::qsave(object, file = tf)
  file.size(tf)
}
save_size_rds <- function (object) {
  tf <- tempfile(fileext = ".rds")
  on.exit(unlink(tf))
  saveRDS(object, file = tf)
  file.size(tf)
}
normal_lm <- function(){
  junk <- 1:1e+08
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}
normal_ggplot <- function(){
  junk <- 1:1e+08
  ggplot2::ggplot()
}
clean_lm <- function () {
  junk <- 1:1e+08
  # Run the lm in its own environment
  env <- new.env(parent = globalenv())
  env$subset <- subset
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris))
}
# The qs save size includes the junk but the rds does not
save_size_qs(normal_lm())
#> [1] 848396
save_size_rds(normal_lm())
#> [1] 4163
save_size_qs(normal_ggplot())
#> [1] 857446
save_size_rds(normal_ggplot())
#> [1] 12895
# Both exclude the junk when separating the lm into its own environment
save_size_qs(clean_lm())
#> [1] 6154
save_size_rds(clean_lm())
#> [1] 4255
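The size difference can be traced to the environment that the fitted model keeps a reference to. The following is a minimal sketch, not part of the original post: `fit_in_function` is a hypothetical name, and `junk` is kept tiny so it runs instantly. It looks up the environment attached to the model's terms object and lists what lives there.

```r
# Sketch: inspect the environment captured by an lm fit created inside a function.
# `fit_in_function` is a hypothetical helper used only for illustration.
fit_in_function <- function() {
  junk <- runif(10)  # stand-in for a large, unrelated object
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

fit <- fit_in_function()

# The terms object carries a reference to the frame in which the formula
# was created, i.e. the function's evaluation frame.
captured <- attr(terms(fit), ".Environment")

# List the objects that a serializer following this reference would also write.
ls(captured)
```

Any serializer that follows this reference has to write out everything in that frame, which is why the unrelated `junk` object can end up in the saved file.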
On Thu, Jul 28, 2016 at 7:31 AM Kenny Bell <kmbell56 at gmail.com> wrote:
> Thanks so much for all this.
>
> The first solution is what I'm going with as I want the terms object to
> come along so that predict still works.
>
> On Wed, Jul 27, 2016 at 12:28 PM, William Dunlap via R-devel <
> r-devel at r-project.org> wrote:
>
>> Another solution is to only save the parts of the model object that
>> interest you.  As long as they don't include the formula (which is
>> what drags along the environment it was created in), you will
>> save space.  E.g.,
>>
>> tfun2 <- function(subset) {
>>    junk <- 1:1e6
>>    list(subset=subset, lm(Sepal.Length ~ Sepal.Width, data=iris,
>> subset=subset)$coef)
>> }
>>
>> saveSize(tfun2(1:4))
>> #[1] 152
>>
>>
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>>
>> On Wed, Jul 27, 2016 at 11:19 AM, William Dunlap <wdunlap at tibco.com>
>> wrote:
>>
>> > One way around this problem is to make a new environment whose
>> > parent environment is .GlobalEnv and which contains only what
>> > the call to lm() requires and to compute lm() in that environment.
>> > E.g.,
>> >
>> > tfun1 <- function (subset)
>> > {
>> >     junk <- 1:1e+06
>> >     env <- new.env(parent = globalenv())
>> >     env$subset <- subset
>> >     with(env, lm(Sepal.Length ~ Sepal.Width, data = iris,
>> >                  subset = subset))
>> > }
>> > Then we get
>> >    > saveSize(tfun1(1:4)) # see below for def. of saveSize
>> >    [1] 910
>> > instead of the 2129743 bytes in the save file when using the naive
>> > method.
>> >
>> > saveSize <- function (object) {
>> >     tf <- tempfile(fileext = ".RData")
>> >     on.exit(unlink(tf))
>> >     save(object, file = tf)
>> >     file.size(tf)
>> > }
>> >
>> >
>> >
>> > Bill Dunlap
>> > TIBCO Software
>> > wdunlap tibco.com
>> >
>> > On Wed, Jul 27, 2016 at 10:48 AM, Kenny Bell <kmb56 at berkeley.edu>
>> > wrote:
>> >
>> >> In the below, I generate a model from an environment that isn't
>> >> .GlobalEnv with a large object that is unrelated to the model
>> >> generation. It seems to save the irrelevant object unnecessarily. In
>> >> my actual use case, I am running and saving many models in a loop that
>> >> each use a single large data.frame (that gets collapsed into a small
>> >> data.frame for estimation), so removing it isn't an option.
>> >>
>> >> In the case where the model exists in .GlobalEnv, everything is
>> >> peachy. So replicating whatever happens when saving the model that was
>> >> generated in .GlobalEnv at the return() stage of the function call
>> >> would fix this problem.
>> >>
>> >> I was referred to this list from r-bugs. First time r-devel poster.
>> >>
>> >> Hope this helps,
>> >>
>> >> Kendon
>> >>
>> >> ```
>> >> tmp_fun <- function(x){
>> >>   iris_big <- lapply(1:10000, function(x) iris)
>> >>   lm(Sepal.Length ~ Sepal.Width, data = iris)
>> >> }
>> >>
>> >> out <- tmp_fun(1)
>> >> object.size(out)
>> >> # 48008
>> >> save(out, file = "tmp.RData", compress = FALSE)
>> >> file.size("tmp.RData")
>> >> # 57196752 - way too big
>> >>
>> >> # Works fine when in .GlobalEnv
>> >> iris_big <- lapply(1:10000, function(x) iris)
>> >> out <- lm(Sepal.Length ~ Sepal.Width, data = iris)
>> >>
>> >> object.size(out)
>> >> # 48008
>> >> save(out, file = "tmp.RData", compress = FALSE)
>> >> file.size("tmp.RData")
>> >> # 16641 - good size.
>> >> ```
>> >>
>> >>         [[alternative HTML version deleted]]
>> >>
>> >> ______________________________________________
>> >> R-devel at r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-devel
>> >>
Duncan Murdoch
2020-Jan-29  20:24 UTC
[Rd] Model object, when generated in a function, saves entire environment when saved
On 29/01/2020 2:25 p.m., Kenny Bell wrote:
> Reviving an old thread. I haven't noticed this be a problem for a while
> when saving RDS's which is great. However, I noticed the problem again when
> saving `qs` files (https://github.com/traversc/qs) which is an RDS
> replacement with a fast serialization / compression system.
>
> I'd like to get an idea of what change was made within R to address this
> issue for `saveRDS`. My thought is that this will help the author of the
> `qs` package do something similar. I have had a browse through the release
> notes for the last few years (Ctrl-F-ing "environment") and couldn't see it.

The vector 1:1e+08 is stored very compactly in recent R versions (the
start and end plus a marker that it's a sequence), and it appears
saveRDS takes advantage of that while qs::qsave doesn't. That's not a
very useful test, because environments typically aren't filled with long
sequence vectors. If you replace the line

   junk <- 1:1e+08

with

   junk <- runif(1e+08)

you'll see drastically different results:

 > save_size_qs(normal_lm())
[1] 417953609
 > #> [1] 848396
 > save_size_rds(normal_lm())
[1] 532614827
 > #> [1] 4163
 > save_size_qs(normal_ggplot())
[1] 417967987
 > #> [1] 857446
 > save_size_rds(normal_ggplot())
[1] 532624477
 > #> [1] 12895

Duncan Murdoch
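The compact-storage point can be checked directly by serializing to a raw vector. This is a minimal sketch that assumes R 3.6.0 or later, where the default serialization format can record a compact integer sequence without expanding it. The variable names are illustrative, and the vectors are shorter than in the thread so the example runs quickly.

```r
# Sketch, assuming R >= 3.6.0 (default serialization format version 3).
# Serialize to an in-memory raw vector and count the bytes.
compact_bytes <- length(serialize(1:1e6, connection = NULL))
dense_bytes   <- length(serialize(runif(1e6), connection = NULL))

# The sequence is written as a short description of itself, while the
# random vector needs 8 bytes per element plus a small header.
c(compact = compact_bytes, dense = dense_bytes)
```

A serializer that expands the sequence before writing it will not see this saving, which fits the difference reported between `saveRDS` and `qs::qsave` for `1:1e+08`.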
Harvey Smith
2020-Jan-30  21:53 UTC
[Rd] Model object, when generated in a function, saves entire environment when saved
Depending on whether you need the data in the referenced environments later,
you could fit the model normally and use the refhook argument of
saveRDS/readRDS to replace references to environments in the model with a
dummy value.
normal_lm <- function(){
  junk <- runif(1e+08)
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}
object = normal_lm()
tf <- tempfile(fileext = ".rds")
saveRDS(object, file = tf, refhook = function(...) {""})
object2 = readRDS(file = tf, refhook = function(...) { .GlobalEnv })
file.size(tf)
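To see what the refhook approach buys, the sketch below compares file sizes with and without the hook and checks that `predict()` still works on the restored object. It is an illustration under assumptions, not a tested recommendation: `fit_in_function` and `rds_size` are hypothetical helpers, and `junk` is reduced to 1e5 values so the example runs quickly.

```r
# Hypothetical helper: fit a model inside a function next to an unrelated object.
fit_in_function <- function() {
  junk <- runif(1e5)  # poorly compressible stand-in, kept small for speed
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}
fit <- fit_in_function()

# Hypothetical helper: size of the RDS file; extra arguments go to saveRDS().
rds_size <- function(object, ...) {
  tf <- tempfile(fileext = ".rds")
  on.exit(unlink(tf))
  saveRDS(object, file = tf, ...)
  file.size(tf)
}

size_default <- rds_size(fit)                               # follows the environment
size_hooked  <- rds_size(fit, refhook = function(...) "")   # replaces it with ""

# Round trip: on reading, every replaced reference becomes the global environment.
tf <- tempfile(fileext = ".rds")
saveRDS(fit, file = tf, refhook = function(...) "")
fit2 <- readRDS(tf, refhook = function(...) globalenv())
unlink(tf)

# predict() only needs the terms and coefficients here, so it still works.
preds <- predict(fit2, newdata = head(iris))
c(default = size_default, hooked = size_hooked)
```

The trade-off is that anything the model would later need to look up in the original function frame is gone after the round trip, so this suits models whose formula variables all come from the `data` argument.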
On Wed, Jan 29, 2020 at 3:24 PM Duncan Murdoch <murdoch.duncan at gmail.com>
wrote:
> On 29/01/2020 2:25 p.m., Kenny Bell wrote:
> > Reviving an old thread. I haven't noticed this be a problem for a while
> > when saving RDS's which is great. However, I noticed the problem again
> > when saving `qs` files (https://github.com/traversc/qs) which is an RDS
> > replacement with a fast serialization / compression system.
> >
> > I'd like to get an idea of what change was made within R to address this
> > issue for `saveRDS`. My thought is that this will help the author of the
> > `qs` package do something similar. I have had a browse through the
> > release notes for the last few years (Ctrl-F-ing "environment") and
> > couldn't see it.
>
> The vector 1:1e+08 is stored very compactly in recent R versions (the
> start and end plus a marker that it's a sequence), and it appears
> saveRDS takes advantage of that while qs::qsave doesn't.  That's not a
> very useful test, because environments typically aren't filled with long
> sequence vectors.  If you replace the line
>
>    junk <- 1:1e+08
>
> with
>
>    junk <- runif(1e+08)
>
> you'll see drastically different results:
>
>  > save_size_qs(normal_lm())
> [1] 417953609
>  > #> [1] 848396
>  > save_size_rds(normal_lm())
> [1] 532614827
>  > #> [1] 4163
>  > save_size_qs(normal_ggplot())
> [1] 417967987
>  > #> [1] 857446
>  > save_size_rds(normal_ggplot())
> [1] 532624477
>  > #> [1] 12895
>
> Duncan Murdoch