Tomas Kalibera
2024-Jan-17 09:31 UTC
[Rd] Choices to remove `srcref` (and its buddies) when serializing objects
On 1/16/24 20:16, Dipterix Wang wrote:
> Could you recommend any packages/functions that compute a hash such that
> the source references and sexpinfo_struct are ignored? Basically a
> version of `serialize` that converts R objects to raw without storing
> the ancillary source reference and sexpinfo.
> I think most people would think of `digest`, but that package uses
> `serialize` (see discussion
> https://github.com/eddelbuettel/digest/issues/200#issuecomment-1894289875)

I think one could implement hashing on the fly without any
serialization, similarly to how identical works, but I am not aware of
any existing implementation. Again, if that wasn't clear: I don't think
trying to compute a hash of an object from its serialized representation
is a good idea - it is of course convenient, but has problems like the
one you have run into.

In some applications it may still be good enough: if by various tweaks,
such as ensuring source references are off in your case, you reach a
state where false alarms (identical objects having different hashes) are
rare, and hence, say, unnecessary re-computation is rare, maybe it is
good enough.

Tomas

>> On Jan 12, 2024, at 11:33 AM, Tomas Kalibera
>> <tomas.kalibera at gmail.com> wrote:
>>
>> On 1/12/24 06:11, Dipterix Wang wrote:
>>> Dear R devs,
>>>
>>> I was digging into a package issue today when I realized that R's
>>> serialize function does not always generate the same results on
>>> equivalent objects when users choose to run the code differently.
>>> For example, the following code
>>>
>>> serialize(with(new.env(), { function(){} }), NULL, TRUE)
>>>
>>> generates different results when I copy-paste it into the console vs
>>> when I use ctrl+shift+enter to source the file in RStudio.
>>>
>>> Looking deeper into the cause, I found that functions and language
>>> objects get a source reference when getOption("keep.source") is TRUE.
>>> This means the source reference makes the functions different, even
>>> though in most cases keeping the function source does not affect
>>> how a function behaves.
>>>
>>> While it's OK that serialize generates different results,
>>> functions such as `rlang::hash` and `digest::digest`, which depend
>>> on `serialize`, might eventually deliver false positives on the same
>>> inputs. I've checked the source code of the digest package hoping to
>>> get around this issue (for example serialize(..., refhook = ...)).
>>> However, my workaround did not work. It seems that the markers to
>>> the objects are different even if I used `refhook` to force srcref
>>> to be the same. I also tried `removeSource` and `rlang::zap_srcref`.
>>> Neither of them works directly on nested environments with multiple
>>> functions.
>>>
>>> I wonder how hard it would be to have options to discard source when
>>> serializing R objects?
>>>
>>> Currently my analyses heavily depend on the digest function to
>>> generate file caches and automatically schedule pipelines (to update
>>> the cache) when changes are detected. The pipelines save the hashes
>>> of source code, inputs, and outputs together so other people can
>>> easily verify the calculation without accessing the original data
>>> (which could be sensitive), or running hour-long analyses, or having
>>> to buy servers. All of these require `serialize` to produce the same
>>> results regardless of how users choose to run the code.
>>>
>>> It would be great if this feature could be in a future R. Other
>>> pipeline packages such as `targets` and `drake` could also benefit
>>> from it.
>>
>> I don't think such functionality would belong to serialize(). This
>> function is not meant to produce stable results based on the input;
>> the serialized representation may even differ based on properties not
>> seen by users.
>>
>> I think an option to ignore source code would belong to a function
>> that computes the hash, like the other options of identical().
>>
>> Tomas
>>
>>> Thanks,
>>>
>>> - Dipterix
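To make the difference discussed above concrete, here is a minimal sketch (not taken from the thread) that builds the same function with and without source references and compares the two. It assumes a reasonably recent R in which identical() has the `ignore.srcref` argument; the variable names are purely illustrative.

```r
# Build the same function twice: once keeping source references, once not.
code <- "function(x) x + 1"
f_src   <- eval(parse(text = code, keep.source = TRUE),  envir = globalenv())
f_plain <- eval(parse(text = code, keep.source = FALSE), envir = globalenv())

# Only the first function carries a "srcref" attribute.
has_srcref <- !is.null(attr(f_src, "srcref"))    # TRUE
no_srcref  <- is.null(attr(f_plain, "srcref"))   # TRUE

# identical() can ignore or respect source references.
same_ignoring <- identical(f_src, f_plain, ignore.srcref = TRUE)   # TRUE
same_strict   <- identical(f_src, f_plain, ignore.srcref = FALSE)  # FALSE

# serialize() has no such switch, so the raw representations differ.
bytes_differ <- !identical(serialize(f_src, NULL), serialize(f_plain, NULL))

print(c(has_srcref = has_srcref, same_ignoring = same_ignoring,
        same_strict = same_strict, bytes_differ = bytes_differ))
```

This is the behaviour behind the suggestion that an "ignore source" option fits a comparison or hashing function rather than serialize() itself.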
Lionel Henry
2024-Jan-17 10:32 UTC
[Rd] Choices to remove `srcref` (and its buddies) when serializing objects
> I think one could implement hashing on the fly without any
> serialization, similarly to how identical works, but I am not aware of
> any existing implementation

We have one in vctrs but it's not exported:
https://github.com/r-lib/vctrs/blob/main/src/hash.c

The main use is vectorised hashing:

```
# Non-vectorised
vctrs:::obj_hash(1:10)
#> [1] 1e 77 ce 48

# Vectorised
vctrs:::vec_hash(1L)
#> [1] 70 a2 85 ef

vctrs:::vec_hash(1:2)
#> [1] 70 a2 85 ef bf 3c 2c cf

# vctrs semantics so dfs are vectors of rows
length(vctrs:::vec_hash(mtcars)) / 4
#> [1] 32

nrow(mtcars)
#> [1] 32
```

Best,
Lionel
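For the serialization-based workaround mentioned earlier in the thread (making sure source references are off before hashing), the following is a rough sketch of stripping them from functions stored in nested lists and environments. `strip_srcref` is a hypothetical helper, not an existing function in R or in any package named in this thread; it relies only on `utils::removeSource()` and `as.list()` for environments. It assumes there are no cyclic environment references, and it does not descend into the enclosing environments of the functions themselves.

```r
# Hypothetical helper: return a copy of `x` in which every function found
# in nested lists and environments has had its source references removed.
strip_srcref <- function(x) {
  if (is.function(x)) {
    # Primitives carry no source references; closures are cleaned.
    if (is.primitive(x)) return(x)
    return(utils::removeSource(x))
  }
  if (is.environment(x)) {
    # Snapshot the environment as a name-sorted list, so the original
    # environment is left untouched and element order is deterministic.
    x <- as.list(x, all.names = TRUE, sorted = TRUE)
  }
  if (is.list(x)) {
    # Recurse into list elements, keeping names and other attributes.
    x[] <- lapply(x, strip_srcref)
  }
  x
}

# Illustrative input: an environment holding functions at two levels.
make_env <- function(keep) {
  e <- new.env()
  e$f <- eval(parse(text = "function(x) { x + 1 }", keep.source = keep),
              envir = globalenv())
  e$inner <- list(
    g = eval(parse(text = "function(y) y * 2", keep.source = keep),
             envir = globalenv())
  )
  e
}

a <- strip_srcref(make_env(TRUE))
b <- strip_srcref(make_env(FALSE))

# After stripping, the two snapshots compare equal even when identical()
# is told not to ignore source references.
print(identical(a, b, ignore.srcref = FALSE))
```

The cleaned object could then be passed to `serialize()` or `digest::digest()`. Tomas's caveat still applies: the serialized form may differ for reasons unrelated to source references, so this only reduces false alarms rather than guaranteeing stable hashes.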