Dipterix Wang
2024-Jan-16 19:16 UTC
[Rd] Choices to remove `srcref` (and its buddies) when serializing objects
Could you recommend any packages/functions that compute a hash such that the source references and sexpinfo_struct are ignored? Basically, a version of `serialize` that converts R objects to raw without storing the ancillary source reference and sexpinfo.

I think most people would think of `digest`, but that package uses `serialize` (see discussion https://github.com/eddelbuettel/digest/issues/200#issuecomment-1894289875).

> On Jan 12, 2024, at 11:33 AM, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
>
> On 1/12/24 06:11, Dipterix Wang wrote:
>> Dear R devs,
>>
>> I was digging into a package issue today when I realized that R's serialize function does not always generate the same results on equivalent objects when users choose to run the code differently. For example, the following code
>>
>>     serialize(with(new.env(), { function(){} }), NULL, TRUE)
>>
>> generates different results when I copy-paste it into the console vs. when I use ctrl+shift+enter to source the file in RStudio.
>>
>> Looking deeper into the cause, I found that functions and language objects get a source reference when getOption("keep.source") is TRUE. This means the source reference makes the functions differ, while in most cases keeping the function source should not affect how a function behaves.
>>
>> While it's OK that serialize generates different results, functions such as `rlang::hash` and `digest::digest`, which depend on `serialize`, might eventually deliver false positives on the same inputs. I've checked the source code of the digest package hoping to get around this issue (for example, serialize(..., refhook = ...)). However, my workaround did not work. It seems that the markers to the objects are different even if I use `refhook` to force srcref to be the same. I also tried `removeSource` and `rlang::zap_srcref`. Neither of them works directly on nested environments with multiple functions.
>>
>> I wonder how hard it would be to have an option to discard the source when serializing R objects?
>>
>> Currently my analyses heavily depend on the digest function to generate file caches and automatically schedule pipelines (to update the cache) when changes are detected. The pipelines save the hashes of source code, inputs, and outputs together so other people can easily verify the calculation without accessing the original data (which could be sensitive), running hour-long analyses, or having to buy servers. All of these require `serialize` to produce the same results regardless of how users choose to run the code.
>>
>> It would be great if this feature could be in a future R. Other pipeline packages such as `targets` and `drake` could also benefit from it.
>
> I don't think such functionality would belong to serialize(). This function is not meant to produce stable results based on the input; the serialized representation may even differ based on properties not seen by users.
>
> I think an option to ignore source code would belong to a function that computes the hash, analogous to the options of identical().
>
> Tomas
>
>> Thanks,
>>
>> - Dipterix
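
The difference described in the quoted report can be reproduced in base R. What follows is a minimal sketch, assuming that `parse(text = ..., keep.source = ...)` is an acceptable stand-in for the console vs. sourced-file difference; the variable names are illustrative and do not come from the thread.

```r
# Same function text, parsed with and without source references.
code <- "function(x) x + 1"
f_src   <- eval(parse(text = code, keep.source = TRUE))
f_plain <- eval(parse(text = code, keep.source = FALSE))

# Only the first closure carries a "srcref" attribute.
has_srcref <- c(src   = !is.null(attr(f_src, "srcref")),
                plain = !is.null(attr(f_plain, "srcref")))

# identical() skips source references by default (ignore.srcref = TRUE) ...
same_by_identical <- identical(f_src, f_plain)

# ... whereas serialize() writes the attribute out, so the bytes differ.
same_bytes <- identical(serialize(f_src, NULL), serialize(f_plain, NULL))

print(has_srcref)
print(c(identical = same_by_identical, serialize = same_bytes))
```

Any hash computed from the `serialize()` output inherits the second result, not the first.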
Tomas Kalibera
2024-Jan-17 09:31 UTC
[Rd] Choices to remove `srcref` (and its buddies) when serializing objects
On 1/16/24 20:16, Dipterix Wang wrote:
> Could you recommend any packages/functions that compute hash such that
> the source references and sexpinfo_struct are ignored? Basically a
> version of `serialize` that convert R objects to raw without storing
> the ancillary source reference and sexpinfo.
> I think most people would think of `digest` but that package uses
> `serialize` (see discussion
> https://github.com/eddelbuettel/digest/issues/200#issuecomment-1894289875)

I think one could implement hashing on the fly without any serialization, similarly to how identical() works, but I am not aware of any existing implementation. Again, in case that wasn't clear: I don't think trying to compute a hash of an object from its serialized representation is a good idea. It is of course convenient, but it has problems like the one you have run into.

In some applications it may still be good enough: if, by various tweaks such as ensuring source references are off in your case, you reach a state where false alarms (identical objects having different hashes) are rare, and hence unnecessary re-computation is rare, maybe that is good enough.

Tomas
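
One way to apply the "source references off" tweak to objects that already carry them is to strip the references before hashing. The following is a rough sketch in base R under stated assumptions: `drop_srcref()` and `hash_bytes()` are hypothetical helpers written for this illustration (not part of any package mentioned in the thread), MD5 via `tools::md5sum()` is an arbitrary choice of hash, and environments are deliberately not traversed.

```r
# Hypothetical helper: remove source references from closures, recursing
# into plain (non-classed) lists. Everything else is returned unchanged.
drop_srcref <- function(x) {
  if (is.function(x) && !is.primitive(x)) {
    removeSource(x)
  } else if (is.list(x) && !is.object(x)) {
    x[] <- lapply(x, drop_srcref)
    x
  } else {
    x
  }
}

# Hypothetical helper: MD5 of the serialized bytes, via a temporary file.
hash_bytes <- function(x) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(serialize(x, NULL), tmp)
  unname(tools::md5sum(tmp))
}

fns <- list(
  inc = eval(parse(text = "function(x) x + 1", keep.source = TRUE)),
  dbl = eval(parse(text = "function(x) x * 2", keep.source = TRUE))
)
cleaned <- drop_srcref(fns)

has_src <- function(f) !is.null(attr(f, "srcref"))
has_src_before <- vapply(fns, has_src, logical(1))
has_src_after  <- vapply(cleaned, has_src, logical(1))

# Stripping changes the serialized bytes, and therefore the hash.
changed <- hash_bytes(fns) != hash_bytes(cleaned)
```

This only removes one known source of variation. Whether two cleaned objects hash identically still depends on other details of the serialized representation, which is the caveat raised in the message above.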
Ivan Krylov
2024-Jan-18 15:28 UTC
[Rd] Choices to remove `srcref` (and its buddies) when serializing objects
On Tue, 16 Jan 2024 14:16:19 -0500, Dipterix Wang <dipterix.wang at gmail.com> wrote:

> Could you recommend any packages/functions that compute hash such
> that the source references and sexpinfo_struct are ignored? Basically
> a version of `serialize` that convert R objects to raw without
> storing the ancillary source reference and sexpinfo.

I can show how this can be done, but it's not currently on CRAN or even a well-defined package API. I have adapted a copy of R's serialize() [*] with the following changes:

 * Function bytecode and flags are ignored:

    f <- function() invisible()
    depcache:::hash(f, 2) # This is plain FNV1a-64 of serialize() output
    # [1] "9b7a1af5468deba4"
    .Call(depcache:::C_hash2, f) # This is the new hash
    # [1] 91 5f b8 a1 b0 6b cb 40
    f() # called once: function gets the MAYBEJIT_MASK flag
    depcache:::hash(f, 2)
    # [1] "7d30e05546e7a230"
    .Call(depcache:::C_hash2, f)
    # [1] 91 5f b8 a1 b0 6b cb 40
    f() # called twice: function now has bytecode
    depcache:::hash(f, 2)
    # [1] "2a2cba4150e722b8"
    .Call(depcache:::C_hash2, f)
    # [1] 91 5f b8 a1 b0 6b cb 40 # new hash stays the same

 * Source references are ignored:

    .Call(depcache:::C_hash2, \( ) invisible( ))
    # [1] 91 5f b8 a1 b0 6b cb 40 # compare vs. above

    # For quoted function definitions, source references have to be handled
    # differently
    .Call(depcache:::C_hash2, quote(function(){}))
    # [1] 58 0d 44 8e d4 fd 37 6f
    .Call(depcache:::C_hash2, quote(\( ){ }))
    # [1] 58 0d 44 8e d4 fd 37 6f

 * ALTREP is ignored:

    identical(1:10, 1:10+0L)
    # [1] TRUE
    identical(serialize(1:10, NULL), serialize(1:10+0L, NULL))
    # [1] FALSE
    identical(
     .Call(depcache:::C_hash2, 1:10),
     .Call(depcache:::C_hash2, 1:10+0L)
    )
    # [1] TRUE

 * Strings not marked as bytes are encoded into UTF-8:

    identical('\uff', iconv('\uff', 'UTF-8', 'latin1'))
    # [1] TRUE
    identical(
     serialize('\uff', NULL),
     serialize(iconv('\uff', 'UTF-8', 'latin1'), NULL)
    )
    # [1] FALSE
    identical(
     .Call(depcache:::C_hash2, '\uff'),
     .Call(depcache:::C_hash2, iconv('\uff', 'UTF-8', 'latin1'))
    )
    # [1] TRUE

 * NaNs with different payloads (except NA_numeric_) are replaced by R_NaN.

One of the many downsides to the current approach is that we rely on the non-API entry point getPRIMNAME() in order to hash builtins. Looking at the source code for identical() is no help here, because it uses the private PRIMOFFSET macro.

The bitstream being hashed is also, unfortunately, not exactly compatible with R serialization format version 2: I had to ignore the LEVELS of the language objects being hashed both because identical() seems to ignore those and because I was missing multiple private definitions (e.g. the MAYBEJIT flag) to handle them properly.

Then there's also the problem of immediate bindings [**]: I've seen bits of vctrs, rstudio, rlang blow up when calling CAR() on SEXP objects that are not safe to handle this way, but R_expand_binding_value() (used by serialize()) is again a private function that is not accessible from packages. identical() won't help here, because it compares reference objects (which may or may not contain such immediate bindings) by their pointer values instead of digging down into them.

Dropping the (already violated) requirement to be compatible with R serialization bitstream will make it possible to simplify the code further.

Finally:

    a <- new.env()
    b <- new.env()
    a$x <- b$x <- 42
    identical(a, b)
    # [1] FALSE
    .Call(depcache:::C_hash2, a)
    # [1] 44 21 f1 36 5d 92 03 1b
    .Call(depcache:::C_hash2, b)
    # [1] 44 21 f1 36 5d 92 03 1b

...but that's unavoidable when looking at frozen object contents instead of their live memory layout.

If you're interested, here's the development version of the package:

    install.packages('depcache',contriburl='https://aitap.github.io/Rpackages')

-- 
Best regards,
Ivan

[*] https://github.com/aitap/depcache/blob/serialize_canonical/src/serialize.c

[**] https://svn.r-project.org/R/trunk/doc/notes/immbnd.md
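
The distinction between live identity and frozen contents in the final environment example can also be seen with base R alone. This is a small sketch that does not use depcache; comparing `as.list()` snapshots is an assumption made for illustration and is not how the package computes its hash.

```r
a <- new.env()
b <- new.env()
a$x <- b$x <- 42

# identical() compares environments by reference: two distinct objects.
same_object <- identical(a, b)

# Comparing snapshots of the bindings looks only at the contents.
# sorted = TRUE makes the element order independent of insertion order.
same_contents <- identical(
  as.list(a, all.names = TRUE, sorted = TRUE),
  as.list(b, all.names = TRUE, sorted = TRUE)
)

print(c(object = same_object, contents = same_contents))
```

A content-based hash necessarily behaves like the second comparison, so two separate environments with equal bindings collide by design.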