On 02.02.14 00:48, Chandler Carruth wrote:> On Sat, Feb 1, 2014 at 8:35 AM, Stephan Tolksdorf <st at quanttec.com > <mailto:st at quanttec.com>> wrote: > > Hi, > > Currently the hashing implementation in ADT/Hashing.h produces hash > values on 32-bit platforms that differ from the lower 32-bits of the > hash values produced on 64-bit platforms. It seems the only reason > for this difference is that the uint64_t integer seed is truncated > to size_t. Since the usage of uint64_t and size_t as types for seed > values in the implementation is somewhat inconsistent, I'm > wondering: Was this difference deliberately introduced as a > performance optimization for 32-bit code, or is this maybe just an > implementation artifact? > > I also noted that the implementation relies on implicit conversions > from uint64_t to size_t in several places where hash_code values are > returned, which may trigger compiler warnings on 32-bit platforms. > > > Almost all of this was largely incidental on my part. I didn't spend a > lot of time tuning the 32-bit platform performance. If you have ideas > that you think would make things better, I'm certainly interested. =D > > One thing I would note is that I would prefer to retain the collision > resistance where possible. So I would be leery of reducing the internal > state used. I would more focus on making the outputs friendly for 32-bit > platforms and the inputs consistent.I've attached a patch that makes the hashing implementation consistently use a 64-bit seed everywhere. With this change the implementation should produce the same hash codes modulo SIZE_MAX + 1 for identical values independent of the platform. (Though hash_combine may still produce different results if the size of an argument type differs between platforms). I suspect the negative performance impact on 32-bit platforms should be small, but I didn't do any benchmarking. With atomics one could probably replace the thread safe local static in get_execution_seed with something that has a little less overhead. The patch also removes a FIXME from the set_fixed_execution_seed implementation and rewords the documentation string a bit, since using this function in a multi-threaded program requires some kind of external synchronization anyway. Another attached patch implements llvm::sys::Process::GetRandomNumber() on Windows. (There's already a Unix implementation.) This function could be useful for implementing the missing random per-execution seed functionality. (Though it might take a little care to deal with Process::GetRandomNumberSeed potentially calling back into hash_combine on Unix.) In a little experiment I made, changing the seed per execution seemed to lead to test suite errors for clang's PCH support and other components. - Stephan -------------- next part --------------
On Mon, Feb 3, 2014 at 1:46 PM, Stephan Tolksdorf <st at quanttec.com> wrote:> I've attached a patch that makes the hashing implementation consistently > use a 64-bit seed everywhere. With this change the implementation should > produce the same hash codes modulo SIZE_MAX + 1 for identical values > independent of the platform. (Though hash_combine may still produce > different results if the size of an argument type differs between > platforms). I suspect the negative performance impact on 32-bit platforms > should be small, but I didn't do any benchmarking. With atomics one could > probably replace the thread safe local static in get_execution_seed with > something that has a little less overhead. > > The patch also removes a FIXME from the set_fixed_execution_seed > implementation and rewords the documentation string a bit, since using this > function in a multi-threaded program requires some kind of external > synchronization anyway. >Very cool. A minor comment: /// \brief Form a hash code directly from a numerical value. - hash_code(size_t value) : value(value) {} + /// + /// The argument is casted to and stored as a size_t value. + hash_code(uint64_t value) : value(static_cast<size_t>(value)) {} I think it would be better to store the 64-bit value internally and reduce to size_t when asked. That way we could (perhaps in the future) expose a higher fidelity uint64_t extraction mechanism. My idea is basically that the hashing code should be 64-bit input always, and 64-bit output always, but provide easy methods for consumers which *need* to only get size_t output to truncate safely and consistently. Does that make sense? -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140203/3598881e/attachment.html>
A couple of minor comments about the details of the patch: - hash_state state = state.create(s_begin, seed); + hash_state state = hash_state::create(s_begin, seed); Why this change? The existing code is correct already I think? - return ::llvm::hashing::detail::hash_integer_value(value); + // the cast is necessary for strong enums and avoids compiler warnings Generally LLVM comments should be formed as prose (starting with a capitalized word and ending with appropriate punctuation). I don't think you need a comment here though, or a separate variable. I would just explicitly cast the value in the argument. + const uint64_t n = static_cast<uint64_t>(value); + return ::llvm::hashing::detail::hash_integer_value(n); On Mon, Feb 3, 2014 at 1:46 PM, Stephan Tolksdorf <st at quanttec.com> wrote:> On 02.02.14 00:48, Chandler Carruth wrote: > >> On Sat, Feb 1, 2014 at 8:35 AM, Stephan Tolksdorf <st at quanttec.com >> <mailto:st at quanttec.com>> wrote: >> >> Hi, >> >> Currently the hashing implementation in ADT/Hashing.h produces hash >> values on 32-bit platforms that differ from the lower 32-bits of the >> hash values produced on 64-bit platforms. It seems the only reason >> for this difference is that the uint64_t integer seed is truncated >> to size_t. Since the usage of uint64_t and size_t as types for seed >> values in the implementation is somewhat inconsistent, I'm >> wondering: Was this difference deliberately introduced as a >> performance optimization for 32-bit code, or is this maybe just an >> implementation artifact? >> >> I also noted that the implementation relies on implicit conversions >> from uint64_t to size_t in several places where hash_code values are >> returned, which may trigger compiler warnings on 32-bit platforms. >> >> >> Almost all of this was largely incidental on my part. I didn't spend a >> lot of time tuning the 32-bit platform performance. If you have ideas >> that you think would make things better, I'm certainly interested. =D >> >> One thing I would note is that I would prefer to retain the collision >> resistance where possible. So I would be leery of reducing the internal >> state used. I would more focus on making the outputs friendly for 32-bit >> platforms and the inputs consistent. >> > > I've attached a patch that makes the hashing implementation consistently > use a 64-bit seed everywhere. With this change the implementation should > produce the same hash codes modulo SIZE_MAX + 1 for identical values > independent of the platform. (Though hash_combine may still produce > different results if the size of an argument type differs between > platforms). I suspect the negative performance impact on 32-bit platforms > should be small, but I didn't do any benchmarking. With atomics one could > probably replace the thread safe local static in get_execution_seed with > something that has a little less overhead. > > The patch also removes a FIXME from the set_fixed_execution_seed > implementation and rewords the documentation string a bit, since using this > function in a multi-threaded program requires some kind of external > synchronization anyway. > > Another attached patch implements llvm::sys::Process::GetRandomNumber() > on Windows. (There's already a Unix implementation.) This function could be > useful for implementing the missing random per-execution seed > functionality. (Though it might take a little care to deal with > Process::GetRandomNumberSeed potentially calling back into hash_combine on > Unix.) > > In a little experiment I made, changing the seed per execution seemed to > lead to test suite errors for clang's PCH support and other components. > > - Stephan > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140203/373a6367/attachment.html>
On Mon, Feb 3, 2014 at 1:46 PM, Stephan Tolksdorf <st at quanttec.com> wrote:> Another attached patch implements llvm::sys::Process::GetRandomNumber() > on Windows. (There's already a Unix implementation.) This function could be > useful for implementing the missing random per-execution seed > functionality. (Though it might take a little care to deal with > Process::GetRandomNumberSeed potentially calling back into hash_combine on > Unix.) >I would start another code review thread for this. Totally different set of folks should be looking at it I suspect.> In a little experiment I made, changing the seed per execution seemed to > lead to test suite errors for clang's PCH support and other components. >Yep. This is both why we want the functionality (to catch such bugs) and why it isn't enabled (haven't yet fixed the bugs). -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140203/53258863/attachment.html>
On 14-02-03 23:05, Chandler Carruth wrote:> > Very cool. A minor comment:Thanks!> > /// \brief Form a hash code directly from a numerical value. > - hash_code(size_t value) : value(value) {} > + /// > + /// The argument is casted to and stored as a size_t value. > + hash_code(uint64_t value) : value(static_cast<size_t>(value)) {} > > I think it would be better to store the 64-bit value internally and > reduce to size_t when asked. That way we could (perhaps in the future) > expose a higher fidelity uint64_t extraction mechanism. > > My idea is basically that the hashing code should be 64-bit input > always, and 64-bit output always, but provide easy methods for consumers > which *need* to only get size_t output to truncate safely and consistently. > > Does that make sense?Making hash_code 64-bit would definitely be conceptually cleaner. But I'd be a bit worried that users might directly store `hash_code` values in their data structures, e.g. in hash table buckets, which probably would be a pessimization on 32-bit systems for most use cases. Do you have a good application in mind for an unstable non-cryptographic 64-bit hash code on a 32-bit system? - Stephan
On 14-02-03 23:10, Chandler Carruth wrote:> A couple of minor comments about the details of the patch: > > - hash_state state = state.create(s_begin, seed); > + hash_state state = hash_state::create(s_begin, seed); > > Why this change? The existing code is correct already I think?Is this a common idiom? On first sight this looked like a "self initialization" bug to me, so I thought it might make the code more readable to clarify that create is a static member function of hash_state. But I probably should have put such a code style "cleanup" into a separate patch.> - return ::llvm::hashing::detail::hash_integer_value(value); > + // the cast is necessary for strong enums and avoids compiler warnings > > Generally LLVM comments should be formed as prose (starting with a > capitalized word and ending with appropriate punctuation). I don't think > you need a comment here though, or a separate variable. I would just > explicitly cast the value in the argument. > > + const uint64_t n = static_cast<uint64_t>(value); > + return ::llvm::hashing::detail::hash_integer_value(n);Thanks. Since there previously was no explicit cast I thought it might be useful to explain why I inserted one. I put in the separate variable, so that I'd have a good place to put my comment on without exceeding the column limit :-) Let me know if you'd like me to update my patch. I'm happy to make any changes, but since I can't commit it myself, I'm not sure what would be easier for you.
On 03.02.14 23:11, Chandler Carruth wrote:> I would start another code review thread for this. Totally different set > of folks should be looking at it I suspect.Yeah, I just sent opened a new thread on the commits list. Thanks. - Stephan