Dénes Tóth
2020-May-01 21:00 UTC
[Rd] Request: tools::md5sum should accept connections and finally in-memory objects
AFAIK there is no hashing utility in base R which can create hash digests of arbitrary R objects. However, as also described by Henrik Bengtsson in [1], we have tools::md5sum(), which calculates MD5 hashes of files. Calculating hashes of in-memory objects is a very common task in several areas, as demonstrated by the popularity of the 'digest' package (~850,000 downloads/month).

Upon inspection of the relevant files in the R source (e.g., [2] and [3]), it seems all building blocks have already been implemented, so hashing need not be restricted to files. I would like to ask:

1) Why is md5_buffer unused?
In src/library/tools/src/md5.c [see 2], md5_buffer is implemented, which seems to be the counterpart of md5_stream for non-file inputs:

---
#ifdef UNUSED
/* Compute MD5 message digest for LEN bytes beginning at BUFFER.  The
   result is always in little endian byte order, so that a byte-wise
   output yields to the wanted ASCII representation of the message
   digest.  */
static void *
md5_buffer (const char *buffer, size_t len, void *resblock)
{
  struct md5_ctx ctx;

  /* Initialize the computation context.  */
  md5_init_ctx (&ctx);

  /* Process whole buffer but last len % 64 bytes.  */
  md5_process_bytes (buffer, len, &ctx);

  /* Put result in desired memory area.  */
  return md5_finish_ctx (&ctx, resblock);
}
#endif
---

2) How can the R community help so that this feature becomes available in package 'tools'?

Suggestions:
As a first step, it would be great if tools::md5sum supported connections (credit goes to Henrik for the idea). E.g., instead of the signature tools::md5sum(files), we could have tools::md5sum(files, conn = NULL), which would allow:

  x <- runif(10)
  tools::md5sum(conn = rawConnection(serialize(x, NULL)))

To avoid the inconsistency between 'files' (which computes the hash digests in a vectorized manner, that is, one for each file) and 'conn' (which expects a single connection), and to make it easier to extend the hashing to other algorithms without changing the main R interface, a more involved solution would be to introduce tools::hash and tools::hashes, in a similar vein to digest::digest and digest::getVDigest.

Regards,
Denes

[1]: https://github.com/HenrikBengtsson/Wishlist-for-R/issues/21
[2]: https://github.com/wch/r-source/blob/5a156a0865362bb8381dcd69ac335f5174a4f60c/src/library/tools/src/md5.c#L172
[3]: https://github.com/wch/r-source/blob/5a156a0865362bb8381dcd69ac335f5174a4f60c/src/library/tools/src/Rmd5.c#L27
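The proposed 'conn' argument can be approximated at user level with what base R already offers. The sketch below is only an illustration under stated assumptions: md5sum_conn and its chunk_size parameter are hypothetical names, not part of 'tools', and the helper assumes a binary connection that is already open for reading. It spools the bytes to a temporary file and hands that file to the existing tools::md5sum(); a native implementation would presumably pass the bytes to the C routines directly.

```r
# Hypothetical helper, NOT part of the 'tools' package: hash the bytes that
# can be read from an already-open binary connection by spooling them to a
# temporary file and calling the existing tools::md5sum() on that file.
md5sum_conn <- function(conn, chunk_size = 65536L) {
  # An unopened connection would be reopened by every readBin() call,
  # so insist on an open one.
  stopifnot(inherits(conn, "connection"), isOpen(conn))

  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)

  out <- file(tmp, open = "wb")
  tryCatch(
    repeat {
      chunk <- readBin(conn, what = "raw", n = chunk_size)
      if (length(chunk) == 0L) break  # nothing left to read
      writeBin(chunk, out)
    },
    finally = close(out)  # flush to disk before hashing
  )

  unname(tools::md5sum(tmp))
}

# Usage mirroring the call suggested above
x <- runif(10)
conn <- rawConnection(serialize(x, NULL))
print(md5sum_conn(conn))
close(conn)
```

Reading in chunks keeps memory use bounded for large inputs, at the cost of the temporary file that the proposal aims to avoid.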
John Mount
2020-May-01 21:09 UTC
Perhaps use the digest package? Isn't "R the R packages?"

---------------
John Mount
http://www.win-vector.com/
Our book: Practical Data Science with R
http://practicaldatascience.com
Dénes Tóth
2020-May-01 21:35 UTC
On 5/1/20 11:09 PM, John Mount wrote:
> Perhaps use the digest package? Isn't "R the R packages?"

I think it is clear that I am aware of the existence of the digest package and also of other packages with similar functionality, e.g. the fastdigest package. (And I actually do use digest, as I guess 99% of R developers do, at least as an indirect dependency.)

The point is that
a) digest is a wonderful and very stable package, but still, it is a user-contributed package, whereas
b) 'tools' is a base package which is included by default in all R installations, and
c) tools::md5sum already exists, with almost all building blocks to allow its extension to calculate MD5 hashes of R objects, and
d) there is high demand in the R community for being able to calculate hashes.

So yes, if one wants to use all the utilities or the various algorithms that the digest package provides, one should install and load it. But if one can live with MD5 hashes, why not use the built-in R function? (Well, without serializing an object to a file, calling tools::md5sum, and then cleaning up the file.)
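For reference, the workaround mentioned in the parenthetical remark (serialize to a file, call tools::md5sum, clean up) might look roughly like the sketch below. The names md5_object and md5_objects are hypothetical and exist in neither 'tools' nor 'digest'; the second one loosely mirrors the one-object versus many-objects split suggested in the original post. One caveat, stated as an assumption rather than a guarantee: the output of serialize() starts with a header that, as far as I know, records the R version that wrote it, so these digests may not be comparable across R versions.

```r
# Hypothetical helpers, NOT part of base R: MD5 digest of an in-memory
# object, obtained by going through a temporary file.
md5_object <- function(x) {
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)    # clean up the file on exit
  writeBin(serialize(x, NULL), tmp)   # object -> raw vector -> file
  unname(tools::md5sum(tmp))          # drop the file-name label
}

# Vectorized counterpart: one digest per element of a list.
md5_objects <- function(xs) {
  vapply(xs, md5_object, character(1), USE.NAMES = FALSE)
}

print(md5_object(runif(10)))
print(md5_objects(list(1:3, letters, mtcars)))
```

Every call touches the file system, which is exactly the overhead that exposing a buffer-based MD5 routine at the R level would remove.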
Duncan Murdoch
2020-May-01 21:35 UTC
The tools package is not for users, it's for functions that R uses in installing packages, checking them, etc. If you want a function for users, it would belong in utils. But what's wrong with the digest package? What's the argument that R Core should take this on?

Duncan Murdoch
Dénes Tóth
2020-May-01 22:07 UTC
On 5/1/20 11:35 PM, Duncan Murdoch wrote:
> The tools package is not for users, it's for functions that R uses in
> installing packages, checking them, etc.

I think the target group for this functionality is the group of R developers, not regular R users.

> If you want a function for users, it would belong in utils. But what's
> wrong with the digest package? What's the argument that R Core should
> take this on?

There is nothing wrong with the digest package except for being an extra dependency, which could be avoided if an already implemented C function were available at the R level.

I do understand that, given the load on R Core, they take on new features and the related maintenance burden only if it is absolutely necessary. This is why I asked first whether there is a particular reason not to expose an already existing (base R) implementation. I think it is reasonable to assume that 'md5_buffer' exists for a reason, but probably there is also a reason why it never became part of any exported function. I have now checked the history of the md5.c file; it was last edited 8 years ago. Somewhat surprisingly, md5_buffer was already included in the original file (created 17 years ago), but was marked as UNUSED 12 years ago.

Just to clarify: I do not want to suggest that the R Core team should take over all the functionality of the digest package. I really do focus on computing MD5 digests, which is already possible for files. My suggestion for a more general function was meant to keep potential further enhancements in mind.