thr3ads.net - llvm dev - [LLVMdev] ADT/Hashing.h on 32-bit platforms [Feb 2014]

If this information is useful, please help other people find it:
Share via:

Stephan Tolksdorf

2014-Feb-03 21:46 UTC

[LLVMdev] ADT/Hashing.h on 32-bit platforms

On 02.02.14 00:48, Chandler Carruth wrote:> On Sat, Feb 1, 2014 at 8:35 AM, Stephan Tolksdorf <st at quanttec.com
> <mailto:st at quanttec.com>> wrote:
>
>     Hi,
>
>     Currently the hashing implementation in ADT/Hashing.h produces hash
>     values on 32-bit platforms that differ from the lower 32-bits of the
>     hash values produced on 64-bit platforms. It seems the only reason
>     for this difference is that the uint64_t integer seed is truncated
>     to size_t. Since the usage of uint64_t and size_t as types for seed
>     values in the implementation is somewhat inconsistent, I'm
>     wondering: Was this difference deliberately introduced as a
>     performance optimization for 32-bit code, or is this maybe just an
>     implementation artifact?
>
>     I also noted that the implementation relies on implicit conversions
>     from uint64_t to size_t in several places where hash_code values are
>     returned, which may trigger compiler warnings on 32-bit platforms.
>
>
> Almost all of this was largely incidental on my part. I didn't spend a
> lot of time tuning the 32-bit platform performance. If you have ideas
> that you think would make things better, I'm certainly interested. =D
>
> One thing I would note is that I would prefer to retain the collision
> resistance where possible. So I would be leery of reducing the internal
> state used. I would more focus on making the outputs friendly for 32-bit
> platforms and the inputs consistent.
I've attached a patch that makes the hashing implementation consistently 
use a 64-bit seed everywhere.  With this change the implementation 
should produce the same hash codes modulo SIZE_MAX + 1 for identical 
values independent of the platform. (Though hash_combine may still 
produce different results if the size of an argument type differs 
between platforms). I suspect the negative performance impact on 32-bit 
platforms should be small, but I didn't do any benchmarking. With 
atomics one could probably replace the thread safe local static in 
get_execution_seed with something that has a little less overhead.

The patch also removes a FIXME from the set_fixed_execution_seed 
implementation and rewords the documentation string a bit, since using 
this function in a multi-threaded program requires some kind of external 
synchronization anyway.

Another attached patch implements llvm::sys::Process::GetRandomNumber() 
on Windows. (There's already a Unix implementation.) This function could 
be useful for implementing the missing random per-execution seed 
functionality. (Though it might take a little care to deal with 
Process::GetRandomNumberSeed potentially calling back into hash_combine 
on Unix.)

In a little experiment I made, changing the seed per execution seemed to 
lead to test suite errors for clang's PCH support and other components.

- Stephan

-------------- next part --------------

Chandler Carruth

2014-Feb-03 22:05 UTC

head link

[LLVMdev] ADT/Hashing.h on 32-bit platforms

On Mon, Feb 3, 2014 at 1:46 PM, Stephan Tolksdorf <st at quanttec.com>
wrote:
> I've attached a patch that makes the hashing implementation
consistently
> use a 64-bit seed everywhere.  With this change the implementation should
> produce the same hash codes modulo SIZE_MAX + 1 for identical values
> independent of the platform. (Though hash_combine may still produce
> different results if the size of an argument type differs between
> platforms). I suspect the negative performance impact on 32-bit platforms
> should be small, but I didn't do any benchmarking. With atomics one
could
> probably replace the thread safe local static in get_execution_seed with
> something that has a little less overhead.
>
> The patch also removes a FIXME from the set_fixed_execution_seed
> implementation and rewords the documentation string a bit, since using this
> function in a multi-threaded program requires some kind of external
> synchronization anyway.
>
Very cool. A minor comment:

   /// \brief Form a hash code directly from a numerical value.
-  hash_code(size_t value) : value(value) {}
+  ///
+  /// The argument is casted to and stored as a size_t value.
+  hash_code(uint64_t value) : value(static_cast<size_t>(value)) {}

I think it would be better to store the 64-bit value internally and reduce
to size_t when asked. That way we could (perhaps in the future) expose a
higher fidelity uint64_t extraction mechanism.

My idea is basically that the hashing code should be 64-bit input always,
and 64-bit output always, but provide easy methods for consumers which
*need* to only get size_t output to truncate safely and consistently.

Does that make sense?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140203/3598881e/attachment.html>

Chandler Carruth

2014-Feb-03 22:10 UTC

head link

[LLVMdev] ADT/Hashing.h on 32-bit platforms

A couple of minor comments about the details of the patch:

-  hash_state state = state.create(s_begin, seed);
+  hash_state state = hash_state::create(s_begin, seed);

Why this change? The existing code is correct already I think?

-  return ::llvm::hashing::detail::hash_integer_value(value);
+  // the cast is necessary for strong enums and avoids compiler warnings

Generally LLVM comments should be formed as prose (starting with a
capitalized word and ending with appropriate punctuation). I don't think
you need a comment here though, or a separate variable. I would just
explicitly cast the value in the argument.

+  const uint64_t n = static_cast<uint64_t>(value);
+  return ::llvm::hashing::detail::hash_integer_value(n);




On Mon, Feb 3, 2014 at 1:46 PM, Stephan Tolksdorf <st at quanttec.com>
wrote:
> On 02.02.14 00:48, Chandler Carruth wrote:
>
>> On Sat, Feb 1, 2014 at 8:35 AM, Stephan Tolksdorf <st at
quanttec.com
>> <mailto:st at quanttec.com>> wrote:
>>
>>     Hi,
>>
>>     Currently the hashing implementation in ADT/Hashing.h produces hash
>>     values on 32-bit platforms that differ from the lower 32-bits of
the
>>     hash values produced on 64-bit platforms. It seems the only reason
>>     for this difference is that the uint64_t integer seed is truncated
>>     to size_t. Since the usage of uint64_t and size_t as types for seed
>>     values in the implementation is somewhat inconsistent, I'm
>>     wondering: Was this difference deliberately introduced as a
>>     performance optimization for 32-bit code, or is this maybe just an
>>     implementation artifact?
>>
>>     I also noted that the implementation relies on implicit conversions
>>     from uint64_t to size_t in several places where hash_code values
are
>>     returned, which may trigger compiler warnings on 32-bit platforms.
>>
>>
>> Almost all of this was largely incidental on my part. I didn't
spend a
>> lot of time tuning the 32-bit platform performance. If you have ideas
>> that you think would make things better, I'm certainly interested.
=D
>>
>> One thing I would note is that I would prefer to retain the collision
>> resistance where possible. So I would be leery of reducing the internal
>> state used. I would more focus on making the outputs friendly for
32-bit
>> platforms and the inputs consistent.
>>
>
> I've attached a patch that makes the hashing implementation
consistently
> use a 64-bit seed everywhere.  With this change the implementation should
> produce the same hash codes modulo SIZE_MAX + 1 for identical values
> independent of the platform. (Though hash_combine may still produce
> different results if the size of an argument type differs between
> platforms). I suspect the negative performance impact on 32-bit platforms
> should be small, but I didn't do any benchmarking. With atomics one
could
> probably replace the thread safe local static in get_execution_seed with
> something that has a little less overhead.
>
> The patch also removes a FIXME from the set_fixed_execution_seed
> implementation and rewords the documentation string a bit, since using this
> function in a multi-threaded program requires some kind of external
> synchronization anyway.
>
> Another attached patch implements llvm::sys::Process::GetRandomNumber()
> on Windows. (There's already a Unix implementation.) This function
could be
> useful for implementing the missing random per-execution seed
> functionality. (Though it might take a little care to deal with
> Process::GetRandomNumberSeed potentially calling back into hash_combine on
> Unix.)
>
> In a little experiment I made, changing the seed per execution seemed to
> lead to test suite errors for clang's PCH support and other components.
>
> - Stephan
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140203/373a6367/attachment.html>

Chandler Carruth

2014-Feb-03 22:11 UTC

head link

[LLVMdev] ADT/Hashing.h on 32-bit platforms

On Mon, Feb 3, 2014 at 1:46 PM, Stephan Tolksdorf <st at quanttec.com>
wrote:
> Another attached patch implements llvm::sys::Process::GetRandomNumber()
> on Windows. (There's already a Unix implementation.) This function
could be
> useful for implementing the missing random per-execution seed
> functionality. (Though it might take a little care to deal with
> Process::GetRandomNumberSeed potentially calling back into hash_combine on
> Unix.)
>
I would start another code review thread for this. Totally different set of
folks should be looking at it I suspect.

> In a little experiment I made, changing the seed per execution seemed to
> lead to test suite errors for clang's PCH support and other components.
>
Yep. This is both why we want the functionality (to catch such bugs) and
why it isn't enabled (haven't yet fixed the bugs).
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140203/53258863/attachment.html>

Stephan Tolksdorf

2014-Feb-03 23:39 UTC

head link

[LLVMdev] ADT/Hashing.h on 32-bit platforms

On 14-02-03 23:05, Chandler Carruth wrote:>
> Very cool. A minor comment:
Thanks!
>
>     /// \brief Form a hash code directly from a numerical value.
> -  hash_code(size_t value) : value(value) {}
> +  ///
> +  /// The argument is casted to and stored as a size_t value.
> +  hash_code(uint64_t value) : value(static_cast<size_t>(value)) {}
>
> I think it would be better to store the 64-bit value internally and
> reduce to size_t when asked. That way we could (perhaps in the future)
> expose a higher fidelity uint64_t extraction mechanism.
>
> My idea is basically that the hashing code should be 64-bit input
> always, and 64-bit output always, but provide easy methods for consumers
> which *need* to only get size_t output to truncate safely and consistently.
>
> Does that make sense?
Making hash_code 64-bit would definitely be conceptually cleaner.
But I'd be a bit worried that users might directly store `hash_code` 
values in their data structures, e.g. in hash table buckets, which 
probably would be a pessimization on 32-bit systems for most use cases. 
Do you have a good application in mind for an unstable non-cryptographic 
64-bit hash code on a 32-bit system?

- Stephan

Stephan Tolksdorf

2014-Feb-03 23:47 UTC

head link

[LLVMdev] ADT/Hashing.h on 32-bit platforms

On 14-02-03 23:10, Chandler Carruth wrote:> A couple of minor comments about the details of the patch:
>
> -  hash_state state = state.create(s_begin, seed);
> +  hash_state state = hash_state::create(s_begin, seed);
>
> Why this change? The existing code is correct already I think?
Is this a common idiom? On first sight this looked like a "self 
initialization" bug to me, so I thought it might make the code more 
readable to clarify that create is a static member function of 
hash_state. But I probably should have put such a code style "cleanup"
into a separate patch.
> -  return ::llvm::hashing::detail::hash_integer_value(value);
> +  // the cast is necessary for strong enums and avoids compiler warnings
>
> Generally LLVM comments should be formed as prose (starting with a
> capitalized word and ending with appropriate punctuation). I don't
think
> you need a comment here though, or a separate variable. I would just
> explicitly cast the value in the argument.
>
> +  const uint64_t n = static_cast<uint64_t>(value);
> +  return ::llvm::hashing::detail::hash_integer_value(n);
Thanks. Since there previously was no explicit cast I thought it might 
be useful to explain why I inserted one. I put in the separate variable, 
so that I'd have a good place to put my comment on without exceeding the 
column limit :-)

Let me know if you'd like me to update my patch. I'm happy to make any 
changes, but since I can't commit it myself, I'm not sure what would be 
easier for you.

Stephan Tolksdorf

2014-Feb-04 00:07 UTC

head link

[LLVMdev] ADT/Hashing.h on 32-bit platforms

On 03.02.14 23:11, Chandler Carruth wrote:> I would start another code review thread for this. Totally different set
> of folks should be looking at it I suspect.
Yeah, I just sent opened a new thread on the commits list. Thanks.

- Stephan

llvm dev - Feb 2014 - [LLVMdev] ADT/Hashing.h on 32-bit platforms

[LLVMdev] ADT/Hashing.h on 32-bit platforms

[LLVMdev] ADT/Hashing.h on 32-bit platforms

[LLVMdev] ADT/Hashing.h on 32-bit platforms

[LLVMdev] ADT/Hashing.h on 32-bit platforms

[LLVMdev] ADT/Hashing.h on 32-bit platforms

[LLVMdev] ADT/Hashing.h on 32-bit platforms

[LLVMdev] ADT/Hashing.h on 32-bit platforms