thr3ads.net - llvm dev - [llvm-dev] Default hashing function for integers (DenseMapInfo.h) [Jan 2017]

If this information is useful, please help other people find it:
Share via:

Francois Fayard via llvm-dev

2017-Jan-10 08:19 UTC

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

> On Jan 10, 2017, at 8:56 AM, Mehdi Amini <mehdi.amini at apple.com>
wrote
> 
> Some tests I can suggest is to replace the hash function with your favorite
and:
Thanks. I’ll give it a try. It will take some time as I need to rewrite DenseMap
if I want to use the Knuth multiplicative hash.
> ultimately I believe real-world impact is the best way to get a change in.
That’s exactly what I think. I was just trying to get some hints on what parts
of the code could benefit from a better hash function for integers, in order to
find a benchmark which is relevant.
> If this is used for having pointers, then it is likely gonna be used
*everywhere*.
> Even simple test like “time to deserialize a large bitcode file” would be a
good start.

It is not used for pointers, as this function would be truly awful for them:
because of alignments, pointers are of the form 2^a * k0. The hash function used
for pointers is (k >> 4) ^ (k >> 9). My guess is that the k >>
4 is here because pointers returned by malloc are 16 bytes aligned. I don’t have
any guess for the 9 though.
So the hashing function for pointers is not flawed like the one for integers.

François Fayard

Bruce Hoult via llvm-dev

2017-Jan-10 08:36 UTC

head link

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

Both are not very sophisticated.

You should also look at the different MurmurHash versions, and descendants
such as CityHash.

They are only very slightly more complex (e.g. two multiplies and a rotate)
but of extremely good quality, approaching MD5 or SHA for randomness and
independence of resulting bits (but not secure against malicious attacks
like they are).

I've successfully used simply k (i.e. c =1) for both integers and pointers
in hash tables with prime number sizes. It works fine, and they usually
have non-overlapping ranges in practice. Hashing structures and strings is
the more interesting challenge.

But if you're using power of two (or at least non-prime) table sizes then
you should munch up the bits a lot more with multipliers with a lot more
bits, not just something like 37.

It may be that these hash tables don't have many entries and don't have
many collisions and 37 is fine, and a more expensive hash function would
slow it down. Only testing can tell you.

On Tue, Jan 10, 2017 at 11:19 AM, Francois Fayard via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> > On Jan 10, 2017, at 8:56 AM, Mehdi Amini <mehdi.amini at
apple.com> wrote
> >
> > Some tests I can suggest is to replace the hash function with your
> favorite and:
>
> Thanks. I’ll give it a try. It will take some time as I need to rewrite
> DenseMap if I want to use the Knuth multiplicative hash.
>
> > ultimately I believe real-world impact is the best way to get a change
> in.
>
> That’s exactly what I think. I was just trying to get some hints on what
> parts of the code could benefit from a better hash function for integers,
> in order to find a benchmark which is relevant.
>
> > If this is used for having pointers, then it is likely gonna be used
> *everywhere*.
> > Even simple test like “time to deserialize a large bitcode file” would
> be a good start.
>
>
> It is not used for pointers, as this function would be truly awful for
> them: because of alignments, pointers are of the form 2^a * k0. The hash
> function used for pointers is (k >> 4) ^ (k >> 9). My guess is
that the k
> >> 4 is here because pointers returned by malloc are 16 bytes
aligned. I
> don’t have any guess for the 9 though.
> So the hashing function for pointers is not flawed like the one for
> integers.
>
> François Fayard
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170110/3b98b529/attachment.html>

Francois Fayard via llvm-dev

2017-Jan-10 09:20 UTC

head link

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

> On Jan 10, 2017, at 9:36 AM, Bruce Hoult <bruce at hoult.org> wrote:
Thanks for your thoughts.
> Both are not very sophisticated.
> You should also look at the different MurmurHash versions, and descendants
such as CityHash.
I’ll give a try at those MurmurHash functions.
> I've successfully used simply k (i.e. c =1) for both integers and
pointers in hash tables with prime number sizes. It works fine, and they usually
have non-overlapping ranges in practice. Hashing structures and strings is the
more interesting challenge.
Using hash(k) = k when the size of the table is a prime number is OK. The bad
news come when you start using 2^k as the size of your hash table and the keys
are of the form 2^a k0 + k1.
> But if you're using power of two (or at least non-prime) table sizes
then you should munch up the bits a lot more with multipliers with a lot more
bits, not just something like 37.
> It may be that these hash tables don't have many entries and don't
have many collisions and 37 is fine, and a more expensive hash function would
slow it down. Only testing can tell you.
I’ll first try to make things worse such as using hash(k) = k for pointers which
should be really bad. See how much difference it does make.

François Fayard

Francois Fayard via llvm-dev

2017-Jan-10 13:20 UTC

head link

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

> On Jan 10, 2017, at 9:36 AM, Bruce Hoult <bruce at hoult.org> wrote:
> 
> Both are not very sophisticated.
> You should also look at the different MurmurHash versions, and descendants
such as CityHash.
I did a few benchmark this morning, trying to tweak the hashing for pointers (as
many people seem to use pointers as keys). The hash function in LLVM is quite
simple, but it gives very good results. The benchmark I did was compiling a C++
file with clang. I did 10 tests for every version of clang/llvm compiled with
the hash function, and here are the best timings:

- Original LLVM: hash(k) = (k >> 4) ^ (k >> 9)
  3.40s
- Stupid hash: hash(k) = k
  3.55s
- A bit less stupid: hash(k) = k >> unused_bits (defined by sizeof(T) =
2^unused_bits)
  3.47s
- Murmurhash3
  3.47s

So different hashing functions make a difference. For pointers, it seems that
the one used by LLVM is quite good. It gives the best performance here. It is
quite surprising that Murmurhash3 does not behave better than the “A bit less
stupid” hash. My guess is that the LLVM hash function is the best because it is
tailored to the distribution of pointers returned by malloc. As I said, the (k
<< 4) seems to be related to the fact that pointers are 16-bytes aligned
on most OS. I still don’t understand the 9.

I’ll do some benchmarks for hashing integers tomorrow. This is the hashing that
really looks suspicious to me but it does not seem to be used that much in LLVM.

François Fayard

Reasonably Related Threads

Search for more seemingly similar threads

llvm dev - Jan 2017 - Default hashing function for integers (DenseMapInfo.h)

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

Reasonably Related Threads