Dipterix Wang
2024-Jan-12 05:11 UTC
[Rd] Choices to remove `srcref` (and its buddies) when serializing objects
Dear R devs,

I was digging into a package issue today when I realized that R's `serialize` function does not always generate the same results for equivalent objects, depending on how users choose to run the code. For example, the following code

    serialize(with(new.env(), { function(){} }), NULL, TRUE)

generates different results when I copy-paste it into the console versus when I use ctrl+shift+enter to source the file in RStudio.

Looking deeper into the cause, I found that functions and language objects get source references when getOption("keep.source") is TRUE. This means the source reference makes the functions differ, even though in most cases keeping the function source does not affect how a function behaves.

While it's OK that `serialize` generates different results, functions such as `rlang::hash` and `digest::digest`, which depend on `serialize`, might eventually deliver false positives on the same inputs. I've checked the source code of the digest package hoping to get around this issue (for example with serialize(..., refhook = ...)). However, my workaround did not work. It seems that the markers to the objects are different even if I use `refhook` to force the srcref to be the same. I also tried `removeSource` and `rlang::zap_srcref`. Neither of them works directly on nested environments with multiple functions.

I wonder how hard it would be to have options to discard source when serializing R objects?

Currently my analyses depend heavily on the digest function to generate file caches and automatically schedule pipelines (to update the cache) when changes are detected. The pipelines save the hashes of source code, inputs, and outputs together so other people can easily verify the calculation without accessing the original data (which could be sensitive), running hour-long analyses, or having to buy servers. All of these require `serialize` to produce the same results regardless of how users choose to run the code.

It would be great if this feature could be in a future version of R. Other pipeline packages such as `targets` and `drake` could also benefit from it.

Thanks,

- Dipterix
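The effect described in this message can be reproduced without RStudio. The following is a minimal sketch, assuming that parsing the same text with and without `keep.source` stands in for the console versus sourced-file difference; `make_fn` and `has_srcref` are hypothetical helper names introduced only for this illustration.

```r
# Build the same function twice: once with source references kept,
# once without. Only the keep.source setting differs.
make_fn <- function(keep) {
  expr <- parse(text = "function(x) x + 1", keep.source = keep)
  # Evaluating the `function` call creates the closure; the global
  # environment is used so both closures share the same enclosure.
  eval(expr[[1]], envir = globalenv())
}

f_src   <- make_fn(TRUE)
f_nosrc <- make_fn(FALSE)

# A closure created from source-annotated code carries a "srcref" attribute.
has_srcref <- function(f) !is.null(attr(f, "srcref"))
print(has_srcref(f_src))
print(has_srcref(f_nosrc))

# The serialized bytes differ because the attribute is written out too.
s1 <- serialize(f_src, NULL)
s2 <- serialize(f_nosrc, NULL)
print(identical(s1, s2))
```

Any hash computed over `s1` and `s2` would therefore differ as well, even though both functions compute the same thing.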
Ivan Krylov
2024-Jan-12 08:42 UTC
[Rd] Choices to remove `srcref` (and its buddies) when serializing objects
On Fri, 12 Jan 2024 00:11:45 -0500
Dipterix Wang <dipterix.wang at gmail.com> wrote:

> I wonder how hard it would be to have options to discard source when
> serializing R objects?

> Currently my analyses heavily depend on digest function to generate
> file caches and automatically schedule pipelines (to update cache)
> when changes are detected.

Source references may be the main problem here, but not the only one. There are also string encodings and function bytecode (which may or may not be present and probably changes between R versions). I've been collecting the ways that objects that are identical() to each other can serialize() differently in my package 'depcache'; I'm sure I missed a few.

Admittedly, string encodings are less important nowadays (except on older Windows and weirdly set up Unix-like systems). Thankfully, the digest package already knows to skip the serialization header (which contains the current version of R).

serialize() only knows about basic types [*], and source references are implemented on top of these as objects of class 'srcref'. Sometimes they are attached as attributes to other objects, other times (e.g. in quote(function(){}), [**]) just sitting there as arguments to a call.

Sometimes you can hash the output of deparse(x) instead of serialize(x) [***]. Text representations aren't without their own problems (e.g. IEEE floating-point numbers not being representable as decimal fractions), but at least deparsing both ignores the source references and punts the encoding problem to the abstraction layer above it: deparse() is the same for both '\uff' and iconv('\uff', 'UTF-8', 'latin1'): just "ÿ".

Unfortunately, this doesn't solve the environment problem. For these, you really need a way to canonicalize the reference-semantics objects before serializing them without changing the originals, even in cases like a <- new.env(); b <- new.env(); a$x <- b; b$x <- a. I'm not sure that reference hooks can help with that. In order to implement it properly, the fixup process will have to rely on global state and keep weak references to the environments it visits and creates shadow copies of.

I think it's not impossible to implement serialize_to_canonical_representation() for an R package, but it will be a lot of work to decide which parts are canonical and which should be discarded.

-- 
Best regards,
Ivan

[*] https://cran.r-project.org/doc/manuals/R-ints.html#Serialization-Formats
[**] https://bugs.r-project.org/show_bug.cgi?id=18638
[***] https://stat.ethz.ch/pipermail/r-devel/2023-March/082505.html
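The deparse() suggestion in this message can be sketched as follows. This is an illustration under stated assumptions, not the approach used by 'depcache' or digest: `deparse_key` is a hypothetical helper, and the default deparse() options are assumed.

```r
# Hypothetical helper: a text key for an object, built from deparse()
# output instead of serialize() output.
deparse_key <- function(x) paste(deparse(x), collapse = "\n")

# The same function, created with and without source references.
f_src   <- eval(parse(text = "function(x) x + 1", keep.source = TRUE)[[1]],
                envir = globalenv())
f_nosrc <- eval(parse(text = "function(x) x + 1", keep.source = FALSE)[[1]],
                envir = globalenv())

# deparse() rebuilds the text from the parsed code, so the source
# reference does not influence the key.
print(identical(deparse_key(f_src), deparse_key(f_nosrc)))

# The floating-point caveat: two doubles that are not equal can still
# deparse to the same text at the default precision.
print(0.1 + 0.2 == 0.3)
print(identical(deparse_key(0.1 + 0.2), deparse_key(0.3)))
```

The resulting string could then be passed to a hash function without serializing it first, for example `digest::digest(key, serialize = FALSE)`. As the message notes, this does nothing for environments, which deparse() does not expand.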
Tomas Kalibera
2024-Jan-12 16:33 UTC
[Rd] Choices to remove `srcref` (and its buddies) when serializing objects
On 1/12/24 06:11, Dipterix Wang wrote:
> Dear R devs,
>
> I was digging into a package issue today when I realized R serialize function not always generate the same results on equivalent objects when users choose to run differently. For example, the following code
>
> serialize(with(new.env(), { function(){} }), NULL, TRUE)
>
> generates different results when I copy-paste into console vs when I use ctrl+shift+enter to source the file in RStudio.
>
> With a deeper inspect into the cause, I found that function and language get source reference when getOption("keep.source") is TRUE. This means the source reference will make the functions different while in most cases, whether keeping function source might not impact how a function behaves.
>
> While it's OK that function serialize generates different results, functions such as `rlang::hash` and `digest::digest`, which depend on `serialize` might eventually deliver false positives on same inputs. I've checked source code in digest package hoping to get around this issue (for example serialize(..., refhook = ...)). However, my workaround did not work. It seems that the markers to the objects are different even if I used `refhook` to force srcref to be the same. I also tried `removeSource` and `rlang::zap_srcref`. None of them works directly on nested environments with multiple functions.
>
> I wonder how hard it would be to have options to discard source when serializing R objects?
>
> Currently my analyses heavily depend on digest function to generate file caches and automatically schedule pipelines (to update cache) when changes are detected. The pipelines save the hashes of source code, inputs, and outputs together so other people can easily verify the calculation without accessing the original data (which could be sensitive), or running hour-long analyses, or having to buy servers. All of these require `serialize` to produce the same results regardless of how users choose to run the code.
>
> It would be great if this feature could be in the future R. Other pipeline packages such as `targets` and `drake` can also benefit from it.

I don't think such functionality belongs in serialize(). This function is not meant to produce stable results based on the input; the serialized representation may even differ based on properties not seen by users. I think an option to ignore source code would belong in a function that computes the hash, like the corresponding options of identical().

Tomas

> Thanks,
>
> - Dipterix
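The comparison with identical() can be made concrete. identical() already has an `ignore.srcref` argument for closures; the sketch below mirrors that option in a key-producing function. `stable_key` is a hypothetical name, not an existing API of base R, digest, or rlang, and it handles only a top-level function.

```r
# The same function, created with and without source references.
f_src   <- eval(parse(text = "function(x) x + 1", keep.source = TRUE)[[1]],
                envir = globalenv())
f_nosrc <- eval(parse(text = "function(x) x + 1", keep.source = FALSE)[[1]],
                envir = globalenv())

# identical() ignores source references on closures unless told otherwise.
print(identical(f_src, f_nosrc))
print(identical(f_src, f_nosrc, ignore.srcref = FALSE))

# Hypothetical helper: the option lives in the function that produces the
# bytes to be hashed, not in serialize() itself. Every function is passed
# through utils::removeSource() so that all inputs are normalized the
# same way before serialization.
stable_key <- function(x, ignore.srcref = TRUE) {
  if (ignore.srcref && is.function(x)) {
    x <- utils::removeSource(x)
  }
  serialize(x, NULL)
}

print(identical(stable_key(f_src), stable_key(f_nosrc)))
```

This sketch does not descend into lists or environments that contain functions, which is the nested case raised at the start of the thread, and it does not address the bytecode or string-encoding differences mentioned earlier.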