thr3ads.net - llvm dev - [LLVMdev] We need better hashing [Feb 2012]


Talin

2012-Feb-13 10:00 UTC

[LLVMdev] We need better hashing

On Mon, Feb 13, 2012 at 1:22 AM, Jay Foad <jay.foad at gmail.com> wrote:
> On 13 February 2012 00:59, Talin <viridia at gmail.com> wrote:
> > Here's my latest version of Hashing.h, which I propose to add to
> > llvm/ADT. Comments welcome and encouraged.
>
> > /// Adapted from MurmurHash2 by Austin Appleby
>
> Just out of curiosity, why not MurmurHash3 ? This page seems to
> suggest that #2 has some flaw, and #3 is better all round:
>
> https://sites.google.com/site/murmurhash/
>
The main reason is because there's no incremental version of 3. If you look
at the source, you'll notice that the very first thing that 3 does is
Hash ^= Length, but for the incremental case we don't know the length until
we're done. Also, 2 has fewer instructions per block hashed than 3; 3 also
requires bit rotations, whereas 2 only uses bit shifts.
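
To make the "incremental" constraint concrete, here is a minimal sketch of a
hasher that is fed one 32-bit block at a time and only learns the total length
when it is finished. The class name IncrementalHash and its methods are
hypothetical, not the proposed Hashing.h. The mixing steps borrow the
multiply-and-shift style and constants of MurmurHash2, but the overall
function is an illustration and will not reproduce MurmurHash2's output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical incremental hasher: data arrives block by block, so the
// total length can only be folded in at the end, never at the start.
class IncrementalHash {
  uint32_t State;
  size_t TotalWords = 0; // only known once every block has been added
  bool Finished = false;

public:
  explicit IncrementalHash(uint32_t Seed = 0) : State(Seed) {}

  // Mix one 32-bit block into the running state using multiplies and
  // shifts only (no rotations).
  void addWord(uint32_t K) {
    assert(!Finished && "cannot add data after finish()");
    const uint32_t M = 0x5bd1e995;
    K *= M;
    K ^= K >> 24;
    K *= M;
    State *= M;
    State ^= K;
    ++TotalWords;
  }

  // Fold the byte length in last, then run a short final avalanche.
  uint32_t finish() {
    Finished = true;
    const uint32_t M = 0x5bd1e995;
    uint32_t H = State ^ static_cast<uint32_t>(TotalWords * 4);
    H ^= H >> 13;
    H *= M;
    H ^= H >> 15;
    return H;
  }
};
```

A caller would construct one IncrementalHash, call addWord() for each field
as it becomes available, and call finish() once at the end.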

Bear in mind that the "flaw" in MurmurHash2 is a fairly esoteric case which
(I suspect) LLVM is unlikely to ever encounter in practice. Austin is
trying to develop the best possible hash for a wide range of key
probability distributions, and his testing methodologies are quite strict.

LLVM's needs, on the other hand, are fairly modest. I'm guessing that most
DenseMaps contain less than a few thousand entries. Even a bad hash
function wouldn't hurt performance that much, and the time taken to
calculate the hash is probably more of a factor in overall performance than
getting the optimum distribution of hash values.

> Would it be possible to use CityHash instead for strings?
>
> http://code.google.com/p/cityhash/

OK by me. My intention is that "Hashing.h" should contain a variety of
different hashing algorithms for various specific needs. Right now, I am
mainly focusing on hashing of mixed data types - that is, you have some
structure which contains pointers, ints, strings, and you want to calculate
a hash of the entire struct. I need this because I'm working on improving
the uniquing of constants and similar data structures. My next target is to
improve the uniquing of MDNodes, but I want to get the hashing stuff
squared away first.
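
As an illustration of the mixed-data use case described above, the sketch
below hashes a key made of a pointer, an int, and a string, and uses it to
unique entries in a set. Everything here is hypothetical: NodeKey, combine,
NodeKeyHash, and UniqueSet are names invented for the example, and the
combiner is a deliberately simple multiply-and-xor rather than the Murmur-based
mixing under discussion. It relies on the standard library instead of LLVM's
ADT classes so that it stands alone.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// Hypothetical uniquing key holding a pointer, an int, and a string.
struct NodeKey {
  const void *Type;
  int Opcode;
  std::string Name;

  bool operator==(const NodeKey &O) const {
    return Type == O.Type && Opcode == O.Opcode && Name == O.Name;
  }
};

// Fold one field's hash into the running value.
inline size_t combine(size_t Seed, size_t FieldHash) {
  return (Seed * 31) ^ FieldHash;
}

// Hash every field of the key; the pointer is hashed by its bit value,
// not by what it points to.
struct NodeKeyHash {
  size_t operator()(const NodeKey &K) const {
    size_t H = 0;
    H = combine(H, std::hash<const void *>()(K.Type));
    H = combine(H, std::hash<int>()(K.Opcode));
    H = combine(H, std::hash<std::string>()(K.Name));
    return H;
  }
};

using UniqueSet = std::unordered_set<NodeKey, NodeKeyHash>;
```

Inserting two keys with identical fields leaves a single entry in the set,
which is the uniquing behaviour the hash has to support.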

It is also my intent that some person who is more of an expert in hashing
than I am can do detailed performance analysis under real-world loads (such
as compiling actual programs with clang) and tweak and optimize the hashing
algorithm, without needing to modify the API and/or all of the places that
call it.

> Thanks,
> Jay.
>
-- 
-- Talin

Chris Lattner

2012-Feb-14 10:44 UTC

[LLVMdev] We need better hashing

On Feb 13, 2012, at 2:00 AM, Talin wrote:
> Just out of curiosity, why not MurmurHash3 ? This page seems to
> suggest that #2 has some flaw, and #3 is better all round:
> 
> https://sites.google.com/site/murmurhash/
> 
> The main reason is because there's no incremental version of 3.

I think that that is a great reason.

> LLVM's needs, on the other hand, are fairly modest. I'm guessing that
> most DenseMaps contain less than a few thousand entries. Even a bad hash
> function wouldn't hurt performance that much, and the time taken to
> calculate the hash is probably more of a factor in overall performance than
> getting the optimum distribution of hash values.

Indeed.  It can't be hard to be better than our existing adhoc stuff :).
That said, please do not change the hash function used by StringMap without
doing careful performance analysis of the clang preprocessor.  The identifier
uniquing is a very hot path in "clang -E" performance.
> 
> Would it be possible to use CityHash instead for strings?
> 
> http://code.google.com/p/cityhash/
> 
> OK by me. My intention is that "Hashing.h" should contain a variety of
> different hashing algorithms for various specific needs. Right now, I am
> mainly focusing on hashing of mixed data types - that is, you have some
> structure which contains pointers, ints, strings, and you want to calculate
> a hash of the entire struct. I need this because I'm working on improving
> the uniquing of constants and similar data structures. My next target is to
> improve the uniquing of MDNodes, but I want to get the hashing stuff
> squared away first.
>
> It is also my intent that some person who is more of an expert in hashing
> than I am can do detailed performance analysis under real-world loads (such
> as compiling actual programs with clang) and tweak and optimize the hashing
> algorithm, without needing to modify the API and/or all of the places that
> call it.

I think that this is a great way to stage it.  I personally care more about the
interface than the implementation.  Someone can tweak and tune it after enough
code is using the API to get reasonable performance numbers.


#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Compiler.h"
#include "llvm/Support/DataTypes.h"
#include "llvm/Support/PointerLikeTypeTraits.h"
#include "llvm/Support/type_traits.h"

Do you actually need all of these includes?  PointerLikeTypeTraits doesn't
seem necessary.  Is type_traits?

  enum {
    BufferSize = 32,

BufferSize is dead.


 /// Add a pointer value
  template<typename T>
  void add(const T *PtrVal) {
    addImpl(
        reinterpret_cast<const uint32_t *>(&PtrVal),
        reinterpret_cast<const uint32_t *>(&PtrVal + 1));
  }

This violates TBAA rules and looks pretty dangerous to expose as public API.  Is
this really needed?  Also, addImpl is dereferencing the pointers as
uint32_t's, but there is nothing that guarantees that T is a multiple of 4
bytes.  The ArrayRef version has the same problem.

Though it is more verbose, I think it would be better to expose a template
specialization approach to getting the hash_value of T.
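
One way to hash the bit value of a pointer without reading it through a
uint32_t lvalue is to copy its bytes into a small word buffer first. The
sketch below assumes a hypothetical WordSink that stands in for the proposal's
addImpl(); its rotate-and-xor mixing is a placeholder, not a real hash. The
static_assert covers the size concern by refusing to compile on a platform
where a pointer is not a whole number of 32-bit words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the hasher's word-consuming back end.
struct WordSink {
  uint32_t State = 0;

  void addWords(const uint32_t *I, const uint32_t *E) {
    for (; I != E; ++I)
      State = ((State << 5) | (State >> 27)) ^ *I;
  }
};

// Hash the bit pattern of the pointer itself, not the object it points to.
template <typename T>
void addPointer(WordSink &S, const T *Ptr) {
  static_assert(sizeof(Ptr) % sizeof(uint32_t) == 0,
                "pointer size must be a whole number of 32-bit words");
  uint32_t Words[sizeof(Ptr) / sizeof(uint32_t)];
  // Copying bytes with memcpy is well defined; the pointer object is never
  // accessed through an lvalue of type uint32_t.
  std::memcpy(Words, &Ptr, sizeof(Ptr));
  S.addWords(Words, Words + sizeof(Ptr) / sizeof(uint32_t));
}
```

Compilers typically turn a fixed-size memcpy like this into plain register
moves, so the safer form should not cost anything measurable.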

  /// Add a float
  void add(float Data) {
    addImpl(
      reinterpret_cast<const uint32_t *>(&Data),
      reinterpret_cast<const uint32_t *>(&Data + 1));
  }

  /// Add a double
  void add(double Data) {
    addImpl(
      reinterpret_cast<const uint32_t *>(&Data),
      reinterpret_cast<const uint32_t *>(&Data + 1));
  }

Similarly, these need to go through a union to avoid TBAA problems.
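
A sketch of what "going through a union" can look like, next to a memcpy
variant. The function names are invented for this example. Reading a union
member other than the one last written is formally undefined in ISO C++,
although GCC documents it as supported; the memcpy form is well defined
everywhere. Both assume float is 32 bits and double is 64 bits wide.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Union-based punning: write the float member, read the integer member.
inline uint32_t floatBitsViaUnion(float F) {
  static_assert(sizeof(float) == sizeof(uint32_t), "expects a 32-bit float");
  union {
    float AsFloat;
    uint32_t AsBits;
  } U;
  U.AsFloat = F;
  return U.AsBits;
}

// memcpy-based alternative: copy the object representation byte for byte.
inline uint32_t floatBitsViaMemcpy(float F) {
  static_assert(sizeof(float) == sizeof(uint32_t), "expects a 32-bit float");
  uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof(Bits));
  return Bits;
}

inline uint64_t doubleBitsViaMemcpy(double D) {
  static_assert(sizeof(double) == sizeof(uint64_t), "expects a 64-bit double");
  uint64_t Bits;
  std::memcpy(&Bits, &D, sizeof(Bits));
  return Bits;
}
```

The resulting integers can then be handed to the hasher as ordinary words.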



 void add(StringRef StrVal) {
    addImpl(StrVal.data(), StrVal.size());
  }

I'm contradicting my stance above about not caring about the implementation
:), but is MurmurHash a good hash for string data?  The Bernstein hash function
works really well and is much cheaper to compute than Murmur.  It is used by
HashString (and thus by StringMap).
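
For reference, a minimal sketch of the Bernstein "times 33" string hash. The
function name bernsteinHash and the default seed of 0 are assumptions made for
this example; other published variants start from 5381, and this is not
presented as the exact code of HashString.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bernstein-style hash: multiply the running value by 33, add the next byte.
inline uint32_t bernsteinHash(const char *Data, size_t Len,
                              uint32_t Seed = 0) {
  assert((Data != nullptr || Len == 0) && "null data with nonzero length");
  uint32_t H = Seed;
  for (size_t I = 0; I != Len; ++I)
    H = H * 33 + static_cast<unsigned char>(Data[I]);
  return H;
}
```

The loop body is one multiply (or a shift and an add) plus one add per byte,
which is why it looks so cheap on paper.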

  // Add a possibly unaligned sequence of bytes.
  void addImpl(const char *I, size_t Length) {

This should probably be moved out of line to avoid code bloat.

Overall, the design of the class is making sense to me!  Thanks for tackling
this!

-Chris



Talin

2012-Feb-15 06:47 UTC

[LLVMdev] We need better hashing

On Tue, Feb 14, 2012 at 2:44 AM, Chris Lattner <clattner at apple.com>
wrote:
> On Feb 13, 2012, at 2:00 AM, Talin wrote:
>
> Just out of curiosity, why not MurmurHash3 ? This page seems to
>> suggest that #2 has some flaw, and #3 is better all round:
>>
>> https://sites.google.com/site/murmurhash/
>>
>> The main reason is because there's no incremental version of 3.
>
>
> I think that that is a great reason.
>
> LLVM's needs, on the other hand, are fairly modest. I'm guessing that most
> DenseMaps contain less than a few thousand entries. Even a bad hash
> function wouldn't hurt performance that much, and the time taken to
> calculate the hash is probably more of a factor in overall performance than
> getting the optimum distribution of hash values.
>
>
> Indeed.  It can't be hard to be better than our existing adhoc stuff :).
>  That said, please do not change the hash function used by StringMap
> without doing careful performance analysis of the clang preprocessor.  The
> identifier uniquing is a very hot path in "clang -E" performance.

I'm not planning on touching StringMap.

>> Would it be possible to use CityHash instead for strings?
>>
>> http://code.google.com/p/cityhash/
>>
> OK by me. My intention is that "Hashing.h" should contain a variety of
> different hashing algorithms for various specific needs. Right now, I am
> mainly focusing on hashing of mixed data types - that is, you have some
> structure which contains pointers, ints, strings, and you want to calculate
> a hash of the entire struct. I need this because I'm working on improving
> the uniquing of constants and similar data structures. My next target is to
> improve the uniquing of MDNodes, but I want to get the hashing stuff
> squared away first.
>
> It is also my intent that some person who is more of an expert in hashing
> than I am can do detailed performance analysis under real-world loads (such
> as compiling actual programs with clang) and tweak and optimize the hashing
> algorithm, without needing to modify the API and/or all of the places that
> call it.
>
>
> I think that this is a great way to stage it.  I personally care more
> about the interface than the implementation.  Someone can tweak and tune it
> after enough code is using the API to get reasonable performance numbers.
>
>
> #include "llvm/ADT/ArrayRef.h"
> #include "llvm/ADT/StringRef.h"
> #include "llvm/Support/Compiler.h"
> #include "llvm/Support/DataTypes.h"
> #include "llvm/Support/PointerLikeTypeTraits.h"
> #include "llvm/Support/type_traits.h"
>
> Do you actually need all of these includes?  PointerLikeTypeTraits doesn't
> seem necessary.  Is type_traits?

Ooops, this was a cut & paste error from FoldingSet.cpp.

>   enum {
>     BufferSize = 32,
>
> BufferSize is dead.
>
>
>  /// Add a pointer value
>   template<typename T>
>   void add(const T *PtrVal) {
>     addImpl(
>         reinterpret_cast<const uint32_t *>(&PtrVal),
>         reinterpret_cast<const uint32_t *>(&PtrVal + 1));
>   }
>
> This violates TBAA rules and looks pretty dangerous to expose as public
> API.  Is this really needed?  Also, addImpl is dereferencing the pointers
> as uint32_t's, but there is nothing that guarantees that T is a multiple of
> 4 bytes.  The ArrayRef version has the same problem.

So as far as hashing pointer values is concerned, I was just copying the
code from FoldingSet. Since a lot of the keys that we're going to be
dealing with are uniqued pointers, it makes sense to be able to calculate a
hash of the bit-value of the pointer, rather than hashing the thing pointed
to. That being said, renaming it to "addPointer" and adding a comment might
be in order. Similarly, I can make the ArrayRef version 'addPointers' and
have it take an ArrayRef<T*>.

Now, as to the 4 bytes issue, I think I can solve that with some clever
template methods.

> Though it is more verbose, I think it would be better to expose a template
> specialization approach to getting the hash_value of T.
>
>   /// Add a float
>   void add(float Data) {
>     addImpl(
>       reinterpret_cast<const uint32_t *>(&Data),
>       reinterpret_cast<const uint32_t *>(&Data + 1));
>   }
>
>   /// Add a double
>   void add(double Data) {
>     addImpl(
>       reinterpret_cast<const uint32_t *>(&Data),
>       reinterpret_cast<const uint32_t *>(&Data + 1));
>   }
>
> Similarly, these need to go through a union to avoid TBAA problems.

I'm not sure how that works. Can you give an example?

>  void add(StringRef StrVal) {
>     addImpl(StrVal.data(), StrVal.size());
>   }
>
> I'm contradicting my stance above about not caring about the
> implementation :), but is MurmurHash a good hash for string data?
>  The Bernstein hash function works really well and is much cheaper to
> compute than Murmur.  It is used by HashString (and thus by StringMap).

So, MurmurHash is intended for blocks of arbitrary binary data, which may
contain character data, integers, or whatever - it's designed to do such a
thorough job of mixing the bits that it really doesn't matter what data
types you feed it. You are correct that for purely string data, you'd want
to use a less expensive algorithm (I'm partial to FNV-1, which is as cheap
as the Bernstein hash and is AFAIK more mathematically sound.)
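
For comparison with the Bernstein hash, here is a minimal sketch of 32-bit
FNV-1 using the published offset basis (2166136261) and prime (16777619). The
function name fnv1Hash32 is chosen for this example. FNV-1 multiplies first
and then xors in the byte; the FNV-1a variant swaps those two steps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 32-bit FNV-1: for each byte, multiply by the FNV prime, then xor the byte.
inline uint32_t fnv1Hash32(const char *Data, size_t Len) {
  assert((Data != nullptr || Len == 0) && "null data with nonzero length");
  uint32_t H = 2166136261u; // FNV offset basis
  for (size_t I = 0; I != Len; ++I) {
    H *= 16777619u; // FNV prime
    H ^= static_cast<unsigned char>(Data[I]);
  }
  return H;
}
```

Like the times-33 hash, the inner loop is one multiply and one cheap
operation per byte, so the per-byte cost is in the same ballpark.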

>   // Add a possibly unaligned sequence of bytes.
>   void addImpl(const char *I, size_t Length) {
>
> This should probably be moved out of line to avoid code bloat.

OK

> Overall, the design of the class is making sense to me!  Thanks for
> tackling this!
>
> -Chris
>
>
>

-- 
-- Talin

Jeffrey Yasskin

2012-Feb-15 08:05 UTC

[LLVMdev] We need better hashing

On Tue, Feb 14, 2012 at 2:44 AM, Chris Lattner <clattner at apple.com> wrote:
> On Feb 13, 2012, at 2:00 AM, Talin wrote:
>>
>> Just out of curiosity, why not MurmurHash3 ? This page seems to
>> suggest that #2 has some flaw, and #3 is better all round:
>>
>> https://sites.google.com/site/murmurhash/
>>
> The main reason is because there's no incremental version of 3.
>
>
> I think that that is a great reason.
>
> LLVM's needs, on the other hand, are fairly modest. I'm guessing that most
> DenseMaps contain less than a few thousand entries. Even a bad hash function
> wouldn't hurt performance that much, and the time taken to calculate the
> hash is probably more of a factor in overall performance than getting the
> optimum distribution of hash values.
>
>
> Indeed.  It can't be hard to be better than our existing adhoc stuff :).
>  That said, please do not change the hash function used by StringMap without
> doing careful performance analysis of the clang preprocessor.  The identifier
> uniquing is a very hot path in "clang -E" performance.
>
>>
>> Would it be possible to use CityHash instead for strings?
>>
>> http://code.google.com/p/cityhash/
>>
> OK by me. My intention is that "Hashing.h" should contain a variety of
> different hashing algorithms for various specific needs. Right now, I am
> mainly focusing on hashing of mixed data types - that is, you have some
> structure which contains pointers, ints, strings, and you want to calculate
> a hash of the entire struct. I need this because I'm working on improving
> the uniquing of constants and similar data structures. My next target is to
> improve the uniquing of MDNodes, but I want to get the hashing stuff squared
> away first.
>
> It is also my intent that some person who is more of an expert in hashing
> than I am can do detailed performance analysis under real-world loads (such
> as compiling actual programs with clang) and tweak and optimize the hashing
> algorithm, without needing to modify the API and/or all of the places that
> call it.
>
>
> I think that this is a great way to stage it.  I personally care more about
> the interface than the implementation.  Someone can tweak and tune it after
> enough code is using the API to get reasonable performance numbers.
>
>
> #include "llvm/ADT/ArrayRef.h"
> #include "llvm/ADT/StringRef.h"
> #include "llvm/Support/Compiler.h"
> #include "llvm/Support/DataTypes.h"
> #include "llvm/Support/PointerLikeTypeTraits.h"
> #include "llvm/Support/type_traits.h"
>
> Do you actually need all of these includes?  PointerLikeTypeTraits doesn't
> seem necessary.  Is type_traits?
>
>   enum {
>     BufferSize = 32,
>
> BufferSize is dead.
>
>
>  /// Add a pointer value
>   template<typename T>
>   void add(const T *PtrVal) {
>     addImpl(
>         reinterpret_cast<const uint32_t *>(&PtrVal),
>         reinterpret_cast<const uint32_t *>(&PtrVal + 1));
>   }
>
> This violates TBAA rules and looks pretty dangerous to expose as public API.
>  Is this really needed?  Also, addImpl is dereferencing the pointers as
> uint32_t's, but there is nothing that guarantees that T is a multiple of 4
> bytes.  The ArrayRef version has the same problem.
>
> Though it is more verbose, I think it would be better to expose a template
> specialization approach to getting the hash_value of T.
>
>   /// Add a float
>   void add(float Data) {
>     addImpl(
>       reinterpret_cast<const uint32_t *>(&Data),
>       reinterpret_cast<const uint32_t *>(&Data + 1));
>   }
>
>   /// Add a double
>   void add(double Data) {
>     addImpl(
>       reinterpret_cast<const uint32_t *>(&Data),
>       reinterpret_cast<const uint32_t *>(&Data + 1));
>   }
>
> Similarly, these need to go through a union to avoid TBAA problems.

These are just wrong: they'll hash +0 and -0 to different values even
though they compare ==.
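
A minimal sketch of one way to keep hashing consistent with operator== for
zero: collapse -0.0 to +0.0 before taking the bits. The function name
hashableBits is hypothetical, and the code assumes an IEEE 754 double that is
64 bits wide.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Return the bits to feed to a hasher so that values comparing equal under
// == (in particular +0.0 and -0.0) produce the same bits.
inline uint64_t hashableBits(double D) {
  static_assert(sizeof(double) == sizeof(uint64_t), "expects a 64-bit double");
  if (D == 0.0)
    D = 0.0; // true for both +0.0 and -0.0; stores +0.0 either way
  uint64_t Bits;
  std::memcpy(&Bits, &D, sizeof(Bits));
  return Bits;
}
```

NaNs need no special handling for this purpose, since a NaN never compares
equal to anything, itself included.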
>
>  void add(StringRef StrVal) {
>     addImpl(StrVal.data(), StrVal.size());
>   }
>
> I'm contradicting my stance above about not caring about the implementation
> :), but is MurmurHash a good hash for string data?  The Bernstein hash
> function works really well and is much cheaper to compute than Murmur.

Have you seen benchmarks saying that, or are you just guessing from
the length of the code? The benchmarks I've seen say the opposite:
http://code.google.com/p/smhasher/wiki/MurmurHash3#Bulk_speed_test,_hashing_an_8-byte-aligned_256k_block
and http://code.google.com/p/cityhash/source/browse/trunk/README.

> It is used by HashString (and thus by StringMap).
>
>   // Add a possibly unaligned sequence of bytes.
>   void addImpl(const char *I, size_t Length) {
>
> This should probably be moved out of line to avoid code bloat.
>
> Overall, the design of the class is making sense to me!  Thanks for tackling
> this!
>
> -Chris
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

Jay Foad

2012-Feb-15 10:15 UTC

[LLVMdev] We need better hashing

On 14 February 2012 10:44, Chris Lattner <clattner at apple.com> wrote:
>  /// Add a pointer value
>   template<typename T>
>   void add(const T *PtrVal) {
>     addImpl(
>         reinterpret_cast<const uint32_t *>(&PtrVal),
>         reinterpret_cast<const uint32_t *>(&PtrVal + 1));
>   }
> Also, addImpl is dereferencing the pointers as
> uint32_t's, but there is nothing that guarantees that T is a multiple of 4
> bytes.

I think you've misread the code. We're not passing PtrVal to addImpl,
we're passing &PtrVal. So the constraint is that the pointer type
"const T *" must be at least as aligned as a uint32_t, which seems
safe.
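
The constraint described here can be written down as a compile-time check.
The sketch below uses a hypothetical helper, pointerWords, that fails to
compile if "const T *" is less aligned than uint32_t or is not a whole number
of 32-bit words, and otherwise reports how many words the pointer occupies. It
checks size and alignment only; the separate aliasing concern is not
addressed by it.

```cpp
#include <cassert>
#include <cstdint>

// Number of 32-bit words in a "const T *", with the alignment and size
// assumptions enforced at compile time.
template <typename T>
constexpr unsigned pointerWords() {
  static_assert(alignof(const T *) >= alignof(uint32_t),
                "pointer type must be at least as aligned as uint32_t");
  static_assert(sizeof(const T *) % sizeof(uint32_t) == 0,
                "pointer size must be a whole number of 32-bit words");
  return sizeof(const T *) / sizeof(uint32_t);
}
```

On common 32-bit and 64-bit targets this yields 1 and 2 respectively.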

Jay.

Chandler Carruth

2012-Feb-17 09:59 UTC

[LLVMdev] We need better hashing

On Tue, Feb 14, 2012 at 2:44 AM, Chris Lattner <clattner at apple.com>
wrote:
> I'm contradicting my stance above about not caring about the
> implementation :), but is MurmurHash a good hash for string data?
>  The Bernstein hash function works really well and is much cheaper to
> compute than Murmur.  It is used by HashString (and thus by StringMap).

If you want a good string hashing function, CityHash is by a fair margin
the best one out there. Look at the comparison done by Craig, Howard, and
several others when discussing what hashing function to use for libc++.

The only downside to CityHash is code size. It is heavily tuned, and that
results in several special case routines to get maximal efficiency and hash
quality for short strings (yep, not just huge ones). That said, the code
size increase was measured carefully for libc++ and it's really quite small.

That said, I have no benchmarks showing this matters for our uses of
StringMap. It reduced collisions, it didn't show up as a hot function, but
the collisions and the hashing simply didn't dominate any profiles I looked
at....
