Tierney, Luke
2019-Jan-22 16:21 UTC
[Rd] Objectsize function visiting every element for alt-rep strings
On Mon, 21 Jan 2019, Martin Maechler wrote:

>>>>>> Travers Ching
>>>>>>     on Tue, 15 Jan 2019 12:50:45 -0800 writes:
>
>    > I have a toy alt-rep string package that generates
>    > randomly seeded strings. example:
>    >   library(altstringisode)
>    >   x <- altrandomStrings(1e8)
>    >   head(x)
>    >   [1] "2PN0bdwPY7CA8M06zVKEkhHgZVgtV1"
>    >       "5PN2qmWqBlQ9wQj99nsQzldVI5ZuGX" ... etc
>    >   object.size(1e8)
>
>    > Object.size will call the set_altstring_Elt_method for
>    > every single element, materializing (slowly) every element
>    > of the vector. This is a problem mostly in R-studio since
>    > object.size is called automatically, defeating the purpose
>    > of alt-rep.

There is no sensible way in general to figure out how large the
strings would be without computing them. There might be one
specifically for a deferred sequence conversion, but it would require
a fair bit of effort to figure out, effort that would be better spent
elsewhere.

I've never been a big fan of object.size since what it is trying to
compute isn't very well defined in the context of sharing and possible
internal state changes (even before ALTREP, byte code compilation
could change the internals of a function [which object.size sees], and
assigning into environments or evaluating promises can change
environments [which object.size ignores]). The issue is not unlike the
one faced by identical(), which has a bunch of options for the
different ways objects can be identical, and might need even more.

We could in general have object.size for an ALTREP return the
object.size results of the current internal representation, but that
might not always be appropriate. Again, what object.size is trying to
compute isn't very well defined.

RStudio does seem to call object.size on every assignment to
.GlobalEnv. That might be worth revisiting.

Best,

luke

> Hmm. But still, the idea had been that object.size() *should*
> return the size of the "de-ALTREP'ed" object *but* should not
> de-ALTREP it.
> That's what happens for integers, but indeed fails to happen for
> such as.character(.)ed integers.
>
> From my eRum presentation (which took from the official ALTREP
> documentation svn.r-project.org/R/branches/ALTREP/ALTREP.html ) :
>
>  > x <- 1:1e15
>  > object.size(x) # 8000'000'000'000'048 bytes : 8000 TBytes -- ok, not really
>  8000000000000048 bytes
>  > is.unsorted(x) # FALSE : i.e., R *knows* it is sorted
>  [1] FALSE
>  > xs <- sort(x) #
>  > .Internal(inspect(x))
>  @80255f8 14 REALSXP g0c0 [NAM(7)]  1 : 1000000000000000 (compact)
>  >
>
>  > cx <- as.character(x)
>  > .Internal(inspect(cx))
>  @80485d8 16 STRSXP g0c0 [NAM(1)]   <deferred string conversion>
>    @80255f8 14 REALSXP g1c0 [MARK,NAM(7)]  1 : 1000000000000000 (compact)
>  > system.time( print(object.size(x)), gc=FALSE)
>  8000000000000048 bytes
>     user  system elapsed
>    0.000   0.000   0.001
>  > system.time( print(object.size(cx)), gc=FALSE)
>  Error: cannot allocate vector of size 8388608.0 Gb
>  Timing stopped at: 11.43 0 11.46
>  >
>
> One could consider it a bug that object.size(cx) is indeed
> inspecting every string, i.e., accessing cx[i] for all i.
> Note that it is *not* deALTREPing cx itself :
>
>  > x <- 1:1e6
>  > cx <- as.character(x)
>  > .Internal(inspect(cx))
>  @7f5b1a0 16 STRSXP g0c0 [NAM(1)]   <deferred string conversion>
>    @7f5adb0 13 INTSXP g0c0 [NAM(7)]  1 : 1000000 (compact)
>  > system.time( print(object.size(cx)), gc=FALSE)
>  64000048 bytes
>     user  system elapsed
>    0.369   0.005   0.374
>  > .Internal(inspect(cx))
>  @7f5b1a0 16 STRSXP g0c0 [NAM(7)]   <deferred string conversion>
>    @7f5adb0 13 INTSXP g0c0 [NAM(7)]  1 : 1000000 (compact)
>  >
>
>    > Is there a way to avoid the problem of forced
>    > materialization in rstudio?
>
>    > PS: Is there a way to tell if a post has been received by
>    > the mailing list? How long does it take to show up in the
>    > archives?
>
> [ that (waiting time) distribution is quite right skewed... I'd
>   guess its median to be less than 10 minutes... but we had
>   artificially delayed it somewhat in the past to fight
>   spammers, and ETH (the hosting institution) and others have
>   increased spam and virus filtering so everything has become
>   quite a bit slower ]
>
> ______________________________________________
> R-devel at r-project.org mailing list
> stat.ethz.ch/mailman/listinfo/r-devel
>

-- 
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:   luke-tierney at uiowa.edu
Iowa City, IA 52242                 WWW:  stat.uiowa.edu
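The behaviour discussed in the message above can be reproduced on a small scale. The following is only a sketch, assuming R >= 3.5.0 (where 1:n is a compact ALTREP sequence and as.character() on it yields a deferred string conversion); the variable names are illustrative, and exact sizes and timings will vary by R version and platform. It also shows Luke's point that object.size() ignores the contents of environments.

```r
# Compact integer sequence: object.size() reports the size of the
# expanded vector without having to visit the elements.
x <- 1:1e6
size_x <- object.size(x)

# Deferred string conversion: object.size() needs each string's length,
# so it accesses every element, which takes measurable time.
cx <- as.character(x)
t_cx <- system.time(size_cx <- object.size(cx))

print(size_x)
print(size_cx)
print(t_cx)

# Environments: object.size() does not look inside them, so the
# reported size is the same before and after a large assignment.
e <- new.env()
size_empty <- as.numeric(object.size(e))
assign("big", numeric(1e6), envir = e)
size_filled <- as.numeric(object.size(e))
cat("empty env: ", size_empty, "bytes\n")
cat("filled env:", size_filled, "bytes\n")
```

With n raised to something like 1e15, as in Martin's transcript, the first call stays cheap while the second is no longer feasible.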
Kevin Ushey
2019-Jan-22 17:17 UTC
[Rd] Objectsize function visiting every element for alt-rep strings
I think that object.size() is most commonly used to answer the
question, "what R objects are consuming the most memory currently in
my R session?" and for that reason I think returning the size of the
internal representations of objects (e.g. for ALTREP objects or
unevaluated promises) is the right default behavior.

I also agree it would be worth considering adding arguments that
control how object.size() is computed for different kinds of R
objects, since users might want to use object.size() to answer
different types of questions.

All that said, if the ultimate goal here is to avoid having RStudio
materialize ALTREP objects in the background, then perhaps that change
should happen in RStudio :-)

Best,
Kevin
Tomas Kalibera
2019-Jan-23 09:33 UTC
[Rd] Objectsize function visiting every element for alt-rep strings
On 1/22/19 6:17 PM, Kevin Ushey wrote:
> I think that object.size() is most commonly used to answer the question,
> "what R objects are consuming the most memory currently in my R session?"
> and for that reason I think returning the size of the internal
> representations of objects (for e.g. ALTREP objects; unevaluated promises)
> is the right default behavior.

I don't think one could answer that question at all in the presence of
sharing (of objects with value semantics due to copy on write, string
cache or other caches, sharing of objects with referential semantics
such as environments, etc). Also the mapping from R objects (SEXPs) to
what users might understand as objects would not be clear (which SEXPs
belong to which "object", which SEXPs are too low-level for the user
to be considered, etc).

In principle, there could be a memory profiler working at SEXP level
and exposing all the intricacies of the memory layout, answering
reachability questions on a heap dump (so one could find out about a
1G integer vector and then list all bindings, say in namespace
environments, from which it is reachable), but of course that would be
a lot of work to implement and to maintain. The problem is not unique
to R (e.g. see Java, with the same problems of sharing that prevent a
meaningful definition of object size).

I am not persuaded it makes sense to add more options to a function
that does not have and cannot have well-defined user-level semantics,
and I would discourage writing code that is trying to build on that
function, as I think that it might lead to confusion and frustration.
I think equality, for example, is easier to define (just that one
could come up with multiple meaningful definitions, so it makes sense
to have multiple options).

Best
Tomas
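The sharing problem Tomas describes is easy to see with a list that holds the same vector twice. This is a minimal sketch with illustrative names; it relies only on the documented behaviour that object.size() does not detect shared list elements.

```r
# One numeric vector of about 8 MB.
x <- runif(1e6)

# Both list elements refer to the same vector; R makes no copy here.
shared <- list(a = x, b = x)

size_x <- as.numeric(object.size(x))
size_shared <- as.numeric(object.size(shared))

# object.size() counts the vector once per element, so the list is
# reported as at least twice the size of x although the data is
# stored only once.
cat("one vector: ", size_x, "bytes\n")
cat("list of two:", size_shared, "bytes\n")
```

Summing object.size() over all objects in a session can therefore overstate actual memory use, which is one reason the "which objects consume the most memory" question has no clean answer.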