OK here's a patch with the latest, including unit tests. I've also tried to
make the comments clear that this is intended for the case of "generic" key
types, where you are either hashing multiple data types together, or you
don't know in advance what the data type will be - for cases such as strings
where a tailored algorithm is available, you should use that instead of this.

There's a couple of things I don't like: First, there's too many levels of
inlined function calls - my experience is that compilers aren't as good at
inlining nested calls as is often advertised, however I couldn't figure out
a way to get the same effect without the nested calls. Similarly, the
addBitsImpl helper types are a little too complex and magical for my taste,
but again I couldn't think of a better way to automatically detect whether
to use the aligned vs. unaligned hashing routine.

On Wed, Feb 15, 2012 at 12:01 PM, Chris Lattner <clattner at apple.com> wrote:

> On Feb 14, 2012, at 10:47 PM, Talin wrote:
>
>>     /// Add a pointer value
>>     template<typename T>
>>     void add(const T *PtrVal) {
>>       addImpl(
>>           reinterpret_cast<const uint32_t *>(&PtrVal),
>>           reinterpret_cast<const uint32_t *>(&PtrVal + 1));
>>     }
>>
>> This violates TBAA rules and looks pretty dangerous to expose as public
>> API. Is this really needed? Also, addImpl is dereferencing the pointers
>> as uint32_t's, but there is nothing that guarantees that T is a multiple
>> of 4 bytes. The ArrayRef version has the same problem.
>>
>> So as far as hashing pointer values is concerned, I was just copying the
>> code from FoldingSet. Since a lot of the keys that we're going to be
>> dealing with are uniqued pointers, it makes sense to be able to calculate
>> a hash of the bit-value of the pointer, rather than hashing the thing
>> pointed to. That being said, renaming it to "addPointer" and adding a
>> comment might be in order. Similarly, I can make the ArrayRef version
>> 'addPointers' and have it take an ArrayRef<T*>.
>
> Ah, Jay was right, I misread this code!
>
>> Though it is more verbose, I think it would be better to expose a
>> template specialization approach to getting the hash_value of T.
>>
>>     /// Add a float
>>     void add(float Data) {
>>       addImpl(
>>           reinterpret_cast<const uint32_t *>(&Data),
>>           reinterpret_cast<const uint32_t *>(&Data + 1));
>>     }
>>
>>     /// Add a double
>>     void add(double Data) {
>>       addImpl(
>>           reinterpret_cast<const uint32_t *>(&Data),
>>           reinterpret_cast<const uint32_t *>(&Data + 1));
>>     }
>>
>> Similarly, these need to go through a union to avoid TBAA problems.
>>
>> I'm not sure how that works. Can you give an example?
>
> Just use:
>
>     void add(double Data) {
>       union {
>         double D; uint64_t I;
>       };
>       D = Data;
>       add(I);
>     }
>
>>     void add(StringRef StrVal) {
>>       addImpl(StrVal.data(), StrVal.size());
>>     }
>>
>> I'm contradicting my stance above about not caring about the
>> implementation :), but is MurmurHash a good hash for string data?
>> The Bernstein hash function works really well and is much cheaper to
>> compute than Murmur. It is used by HashString (and thus by StringMap).
>>
>> So, MurmurHash is intended for blocks of arbitrary binary data, which may
>> contain character data, integers, or whatever - it's designed to do such
>> a thorough job of mixing the bits that it really doesn't matter what data
>> types you feed it. You are correct that for purely string data, you'd
>> want to use a less expensive algorithm (I'm partial to FNV-1, which is as
>> cheap as the Bernstein hash and is AFAIK more mathematically sound.)
>
> Ok, so what's the answer? :)  We can do different things for
> ArrayRef<char> and StringRef.
>
> -Chris

--
-- Talin

-------------- next part --------------
A non-text attachment was scrubbed...
Name: hashing.patch
Type: application/octet-stream
Size: 11682 bytes
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120217/82cde9c0/attachment.obj>
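For reference, a minimal sketch of the 32-bit FNV-1 string hash that Talin
mentions above as a cheap alternative to Bernstein/Murmur for pure string
data. The constants are the standard FNV-1 offset basis and prime; this is
illustrative only and is not code from the attached patch:

    #include <stddef.h>
    #include <stdint.h>

    // 32-bit FNV-1: multiply by the prime, then XOR in each byte.
    static inline uint32_t fnv1_32(const char *Data, size_t Len) {
      uint32_t Hash = 2166136261u;                   // FNV-1 offset basis
      for (size_t i = 0; i != Len; ++i) {
        Hash *= 16777619u;                           // FNV-1 prime
        Hash ^= static_cast<unsigned char>(Data[i]);
      }
      return Hash;
    }

    // Usage: fnv1_32(Str.data(), Str.size()) for a StringRef-like Str.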
On 17 February 2012 08:26, Talin <viridia at gmail.com> wrote:

> +  // Helper class to detect if the input type is 4-byte aligned.
> +  template<typename T>
> +  struct is_4byte_aligned {
> +    static const bool value = (sizeof(T) & 3) == 0;
> +  };

I don't think this is safe, e.g. for:

    struct {
      char c[4];
    };

Jay.
On 17 February 2012 08:56, Jay Foad <jay.foad at gmail.com> wrote:

> On 17 February 2012 08:26, Talin <viridia at gmail.com> wrote:
>> +  // Helper class to detect if the input type is 4-byte aligned.
>> +  template<typename T>
>> +  struct is_4byte_aligned {
>> +    static const bool value = (sizeof(T) & 3) == 0;
>> +  };
>
> I don't think this is safe, e.g. for:
>
>     struct {
>       char c[4];
>     };

... so you should probably use the stuff in <llvm/Support/AlignOf.h> instead!

Jay.
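A sketch of the kind of check Jay is pointing at, expressed with
llvm/Support/AlignOf.h rather than sizeof alone. The AlignOf<T>::Alignment
member name is taken from that header as of early 2012; treat the exact
spelling as an assumption rather than code from the patch:

    #include "llvm/Support/AlignOf.h"

    // Detect whether T can safely be hashed as a sequence of aligned
    // 32-bit words.  sizeof alone is not enough: struct { char c[4]; }
    // has sizeof == 4 but alignment == 1.
    template<typename T>
    struct is_4byte_aligned {
      static const bool value =
          (llvm::AlignOf<T>::Alignment % 4) == 0 && (sizeof(T) % 4) == 0;
    };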
On Feb 17, 2012, at 12:26 AM, Talin wrote:

> OK here's a patch with the latest, including unit tests. I've also tried
> to make the comments clear that this is intended for the case of "generic"
> key types, where you are either hashing multiple data types together, or
> you don't know in advance what the data type will be - for cases such as
> strings where a tailored algorithm is available, you should use that
> instead of this.

Makes sense.

    +  /// Add a pointer value.
    +  /// Note: this adds pointers to the hash using sizes and endianness
    +  /// that depend on the host. It doesn't matter however, because
    +  /// hashing on pointer values is inherently unstable.

This makes perfect sense.

    +  /// Add an ArrayRef of arbitrary data.
    +  template<typename T>
    +  GeneralHash& add(ArrayRef<T> ArrayVal) {
    +    addBits(ArrayVal.begin(), ArrayVal.end());
    +    return *this;
    +  }

Doesn't this have the same host-specificity problem, except that it will
cause things that *are* stable to vary, such as arrays of char? Or is the
alignment check enough?

    +  /// Add a float
    +  GeneralHash& add(float Data) {

It is worth adding a comment here that this does a bitwise hash, so -0.0
and +0.0 will hash to different values even though they compare equal, and
two identical NaNs will hash to the same value even though they compare
unequal.

The mix logic is inherently a series of 32-bit operations. Would it be
possible, and would it make more sense, to implement them as 64-bit
operations? 64-bit hosts are the vast majority of the machines that run a
compiler these days. OTOH, making it depend on the host brings up host
instability issues.

Actually, how much do we care about host instability here? If this is used
by hashing containers, we can just declare that iteration order is
undefined even for non-pointer values. The real problem, I guess, is that
we'd have to go check all the existing DenseMaps etc. to make sure people
aren't doing well-defined iteration right now, and fix that code. What do
you think?

> There's a couple of things I don't like: First, there's too many levels
> of inlined function calls - my experience is that compilers aren't as
> good at inlining nested calls as is often advertised, however I couldn't
> figure out a way to get the same effect without the nested calls.

Is there a specific observed concern here (e.g. when built with Clang) or
is this a historical problem? Compilers have gotten much better about
inlining stuff that is actually small; if Clang handles it, then I think
we're good. Marking these attribute(always_inline) is massive overkill.

Overall the code is looking great. I'd recommend checking in the new API
separately from switching Constants.cpp to use it, though (just so that any
problems don't cause both to get reverted).

-Chris
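Chris's point about bitwise float hashing can be seen directly from the bit
patterns: +0.0 and -0.0 compare equal but differ in their sign bit, so any
hash of the raw bits separates them. A small standalone illustration, not
taken from the patch:

    #include <stdint.h>
    #include <cassert>
    #include <cstring>

    // Extract the raw bits of a float without violating TBAA.
    static uint32_t floatBits(float F) {
      uint32_t Bits;
      std::memcpy(&Bits, &F, sizeof(Bits));
      return Bits;
    }

    int main() {
      assert(0.0f == -0.0f);                        // equal as floats...
      assert(floatBits(0.0f) != floatBits(-0.0f));  // ...but different bits,
                                                    // so they hash differently
      return 0;
    }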
On Fri, Feb 17, 2012 at 1:32 AM, Chris Lattner <clattner at apple.com> wrote:

> On Feb 17, 2012, at 12:26 AM, Talin wrote:
>
>> OK here's a patch with the latest, including unit tests. I've also tried
>> to make the comments clear that this is intended for the case of
>> "generic" key types, where you are either hashing multiple data types
>> together, or you don't know in advance what the data type will be - for
>> cases such as strings where a tailored algorithm is available, you
>> should use that instead of this.
>
> Makes sense.
>
>     +  /// Add a pointer value.
>     +  /// Note: this adds pointers to the hash using sizes and endianness
>     +  /// that depend on the host. It doesn't matter however, because
>     +  /// hashing on pointer values is inherently unstable.
>
> This makes perfect sense.

It should, as the comment was copied verbatim from FoldingSetNodeID :)

>     +  /// Add an ArrayRef of arbitrary data.
>     +  template<typename T>
>     +  GeneralHash& add(ArrayRef<T> ArrayVal) {
>     +    addBits(ArrayVal.begin(), ArrayVal.end());
>     +    return *this;
>     +  }
>
> Doesn't this have the same host-specificity problem, except that it will
> cause things that *are* stable to vary, such as arrays of char? Or is the
> alignment check enough?

I thought about whether it would be possible to prevent people from passing
in ArrayRefs with unstable things, and I came to the conclusion that
there's no simple way to distinguish between stable and unstable ArrayRefs.
This is why I decided not to make a special "addPointer" method for
pointers, because you could easily subvert it by wrapping it in a singleton
ArrayRef.

>     +  /// Add a float
>     +  GeneralHash& add(float Data) {
>
> It is worth adding a comment here that this does a bitwise hash, so -0.0
> and +0.0 will hash to different values even though they compare equal, and
> two identical NaNs will hash to the same value even though they compare
> unequal.

BTW, are there in fact any maps in LLVM that use floats as keys, other than
uniquing of constants? And in the latter case, would you not want to
distinguish between a -0.0 and +0.0 constant?

> The mix logic is inherently a series of 32-bit operations. Would it be
> possible, and would it make more sense, to implement them as 64-bit
> operations? 64-bit hosts are the vast majority of the machines that run a
> compiler these days. OTOH, making it depend on the host brings up host
> instability issues.
>
> Actually, how much do we care about host instability here? If this is
> used by hashing containers, we can just declare that iteration order is
> undefined even for non-pointer values. The real problem, I guess, is that
> we'd have to go check all the existing DenseMaps etc. to make sure people
> aren't doing well-defined iteration right now, and fix that code. What do
> you think?

I think that you are thinking that existing uses of DenseMap and other ADT
containers will be affected by this. That wasn't my plan - I was going to
basically use the hashing class to create a custom DenseMapInfo for
specific maps which could benefit from the optimization. Other DenseMaps
would remain as they are.

>> There's a couple of things I don't like: First, there's too many levels
>> of inlined function calls - my experience is that compilers aren't as
>> good at inlining nested calls as is often advertised, however I couldn't
>> figure out a way to get the same effect without the nested calls.
>
> Is there a specific observed concern here (e.g. when built with Clang) or
> is this a historical problem? Compilers have gotten much better about
> inlining stuff that is actually small; if Clang handles it, then I think
> we're good. Marking these attribute(always_inline) is massive overkill.

Well, it is historical from about 5 years ago when I was working on EASTL.
The compilers we were using at the time were gcc and MSVC. We found cases
in the standard STL where inlines were nested up to 10 levels deep, and in
some of those cases the compiler just gave up trying to inline things that
deeply.

> Overall the code is looking great. I'd recommend checking in the new API
> separately from switching Constants.cpp to use it, though (just so that
> any problems don't cause both to get reverted).

OK. I'm still working on getting a consensus from the hashing experts :)

> -Chris

--
-- Talin
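A rough sketch of the per-map use Talin describes: a custom DenseMapInfo
whose getHashValue is built on the proposed hashing class. MyKey is a
hypothetical key type, and finish() is an assumed name for the finalizer;
the actual method name in the patch may differ.

    #include <stdint.h>
    #include "llvm/ADT/DenseMap.h"
    // ...plus whatever header declares GeneralHash in the patch.

    struct MyKey {              // hypothetical composite key
      const char *Ptr;
      unsigned Kind;
    };

    namespace llvm {
    template<> struct DenseMapInfo<MyKey> {
      static MyKey getEmptyKey() {
        MyKey K = { reinterpret_cast<const char *>(intptr_t(-1)), 0 };
        return K;
      }
      static MyKey getTombstoneKey() {
        MyKey K = { reinterpret_cast<const char *>(intptr_t(-2)), 0 };
        return K;
      }
      static unsigned getHashValue(const MyKey &K) {
        GeneralHash Hash;
        Hash.add(K.Ptr).add(K.Kind);  // chained adds; the add methods
                                      // return GeneralHash& in the patch
        return Hash.finish();         // assumed finalizer name
      }
      static bool isEqual(const MyKey &A, const MyKey &B) {
        return A.Ptr == B.Ptr && A.Kind == B.Kind;
      }
    };
    } // namespace llvm

Only the maps that opt in via a specialization like this would change
behavior; other DenseMaps would keep their existing hashing.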