Ivan Krylov
2024-Jan-18 15:28 UTC
[Rd] Choices to remove `srcref` (and its buddies) when serializing objects
On Tue, 16 Jan 2024 14:16:19 -0500, Dipterix Wang <dipterix.wang at gmail.com> wrote:

> Could you recommend any packages/functions that compute hash such
> that the source references and sexpinfo_struct are ignored? Basically
> a version of `serialize` that convert R objects to raw without
> storing the ancillary source reference and sexpinfo.

I can show how this can be done, but it's not currently on CRAN or even
a well-defined package API. I have adapted a copy of R's serialize()
[*] with the following changes:

 * Function bytecode and flags are ignored:

    f <- function() invisible()
    depcache:::hash(f, 2) # This is plain FNV1a-64 of serialize() output
    # [1] "9b7a1af5468deba4"
    .Call(depcache:::C_hash2, f) # This is the new hash
    # [1] 91 5f b8 a1 b0 6b cb 40

    f() # called once: function gets the MAYBEJIT_MASK flag
    depcache:::hash(f, 2)
    # [1] "7d30e05546e7a230"
    .Call(depcache:::C_hash2, f)
    # [1] 91 5f b8 a1 b0 6b cb 40

    f() # called twice: function now has bytecode
    depcache:::hash(f, 2)
    # [1] "2a2cba4150e722b8"
    .Call(depcache:::C_hash2, f)
    # [1] 91 5f b8 a1 b0 6b cb 40 # new hash stays the same

 * Source references are ignored:

    .Call(depcache:::C_hash2, \( ) invisible( ))
    # [1] 91 5f b8 a1 b0 6b cb 40 # compare vs. above

    # For quoted function definitions, source references have to be handled
    # differently
    .Call(depcache:::C_hash2, quote(function(){}))
    # [1] 58 0d 44 8e d4 fd 37 6f
    .Call(depcache:::C_hash2, quote(\( ){ }))
    # [1] 58 0d 44 8e d4 fd 37 6f

 * ALTREP is ignored:

    identical(1:10, 1:10+0L)
    # [1] TRUE
    identical(serialize(1:10, NULL), serialize(1:10+0L, NULL))
    # [1] FALSE
    identical(
     .Call(depcache:::C_hash2, 1:10),
     .Call(depcache:::C_hash2, 1:10+0L)
    )
    # [1] TRUE

 * Strings not marked as bytes are encoded into UTF-8:

    identical('\uff', iconv('\uff', 'UTF-8', 'latin1'))
    # [1] TRUE
    identical(
     serialize('\uff', NULL),
     serialize(iconv('\uff', 'UTF-8', 'latin1'), NULL)
    )
    # [1] FALSE
    identical(
     .Call(depcache:::C_hash2, '\uff'),
     .Call(depcache:::C_hash2, iconv('\uff', 'UTF-8', 'latin1'))
    )
    # [1] TRUE

 * NaNs with different payloads (except NA_numeric_) are replaced by
   R_NaN.

One of the many downsides to the current approach is that we rely on
the non-API entry point getPRIMNAME() in order to hash builtins.
Looking at the source code for identical() is no help here, because it
uses the private PRIMOFFSET macro.

The bitstream being hashed is also, unfortunately, not exactly
compatible with R serialization format version 2: I had to ignore the
LEVELS of the language objects being hashed both because identical()
seems to ignore those and because I was missing multiple private
definitions (e.g. the MAYBEJIT flag) to handle them properly.

Then there's also the problem of immediate bindings [**]: I've seen bits
of vctrs, rstudio, rlang blow up when calling CAR() on SEXP objects that
are not safe to handle this way, but R_expand_binding_value() (used by
serialize()) is again a private function that is not accessible from
packages. identical() won't help here, because it compares reference
objects (which may or may not contain such immediate bindings) by their
pointer values instead of digging down into them.

Dropping the (already violated) requirement to be compatible with R
serialization bitstream will make it possible to simplify the code
further.

Finally:

    a <- new.env()
    b <- new.env()
    a$x <- b$x <- 42
    identical(a, b)
    # [1] FALSE
    .Call(depcache:::C_hash2, a)
    # [1] 44 21 f1 36 5d 92 03 1b
    .Call(depcache:::C_hash2, b)
    # [1] 44 21 f1 36 5d 92 03 1b

...but that's unavoidable when looking at frozen object contents
instead of their live memory layout.

If you're interested, here's the development version of the package:

    install.packages('depcache',contriburl='https://aitap.github.io/Rpackages')

-- 
Best regards,
Ivan

[*] https://github.com/aitap/depcache/blob/serialize_canonical/src/serialize.c

[**] https://svn.r-project.org/R/trunk/doc/notes/immbnd.md
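
Two ingredients named in the message above, FNV-1a-64 hashing and replacing NaN payloads with one canonical value while leaving NA alone, can be illustrated in a few lines of standalone C. This is only a sketch under stated assumptions, not the code from depcache: the function names (fnv1a64, canonical_bits, hash_doubles) are hypothetical, the canonical NaN bit pattern is an arbitrary stand-in for R_NaN, and the big-endian byte order fed to the hash is a choice made here for portability. The one R-specific fact it relies on is that R marks NA_real_ as a NaN whose low 32 bits equal 1954.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Standard FNV-1a 64-bit parameters. */
#define FNV_OFFSET_BASIS UINT64_C(0xcbf29ce484222325)
#define FNV_PRIME        UINT64_C(0x100000001b3)

/* IEEE 754 binary64 field masks. */
#define EXP_MASK  UINT64_C(0x7ff0000000000000)
#define FRAC_MASK UINT64_C(0x000fffffffffffff)

/* Assumption: any fixed quiet NaN works as the canonical value;
   this pattern is an illustrative stand-in for R_NaN. */
#define CANONICAL_NAN UINT64_C(0x7ff8000000000000)

/* FNV-1a: XOR each byte into the state, then multiply by the prime.
   Pass FNV_OFFSET_BASIS as h to start, or a previous result to continue. */
static uint64_t fnv1a64(const void *data, size_t len, uint64_t h)
{
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= FNV_PRIME;
    }
    return h;
}

/* NaN: exponent bits all set and a non-zero fraction. */
static int is_nan_bits(uint64_t b)
{
    return (b & EXP_MASK) == EXP_MASK && (b & FRAC_MASK) != 0;
}

/* R's NA_real_ is a NaN whose low 32-bit word is 1954. */
static int is_r_na_bits(uint64_t b)
{
    return is_nan_bits(b) && (uint32_t)(b & UINT64_C(0xffffffff)) == 1954u;
}

/* Collapse every NaN except NA onto one bit pattern; leave the rest alone. */
static uint64_t canonical_bits(uint64_t b)
{
    return (is_nan_bits(b) && !is_r_na_bits(b)) ? CANONICAL_NAN : b;
}

/* Hash a vector of doubles after canonicalising NaN payloads. */
static uint64_t hash_doubles(const double *v, size_t n)
{
    assert(v != NULL || n == 0);
    uint64_t h = FNV_OFFSET_BASIS;
    for (size_t i = 0; i < n; i++) {
        uint64_t b;
        unsigned char buf[8];
        memcpy(&b, &v[i], sizeof b);      /* read the bit pattern safely */
        b = canonical_bits(b);
        for (int k = 0; k < 8; k++)       /* big-endian, platform-independent */
            buf[k] = (unsigned char)(b >> (56 - 8 * k));
        h = fnv1a64(buf, sizeof buf, h);
    }
    return h;
}
```

With these definitions, two vectors that differ only in the payload (or sign bit) of a non-NA NaN hash identically, while NA keeps its own bit pattern and therefore stays distinguishable from NaN, which mirrors the behaviour described for the new hash.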
luke-tierney at uiowa.edu
2024-Jan-18 15:59 UTC
[Rd] [External] Re: Choices to remove `srcref` (and its buddies) when serializing objects
On Thu, 18 Jan 2024, Ivan Krylov via R-devel wrote:

> [...]
> Then there's also the problem of immediate bindings [**]: I've seen bits
> of vctrs, rstudio, rlang blow up when calling CAR() on SEXP objects that
> are not safe to handle this way, but R_expand_binding_value() (used by
> serialize()) is again a private function that is not accessible from
> packages. identical() won't help here, because it compares reference
> objects (which may or may not contain such immediate bindings) by their
> pointer values instead of digging down into them.

What does 'blow up' mean? If it is anything other than signaling a "bad
binding access" error, then it would be good to have more details.

Best,

luke

> [...]

-- 
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:   luke-tierney at uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu