thr3ads.net - llvm dev - [llvm-dev] Default hashing function for integers (DenseMapInfo.h) [Jan 2017]

If this information is useful, please help other people find it:
Share via:

Francois Fayard via llvm-dev

2017-Jan-10 07:46 UTC

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

> On Jan 10, 2017, at 2:31 AM, Chris Lattner <sabre at nondot.org>
wrote:
> 
> As others have pointed out, 37 does have some nice properties (by being
prime), but we’ve also had it from the very early days.  I wouldn’t say that it
has been extremely well considered at all.
> 
> -Chris
Thanks for your reply. But I am not sure to understand the last sentence. Does
it mean that this hash function has been criticized before? Do you also think
that it can be improved?

I am sorry to insist on that, but I don’t see any advantage of using hash(k) = c
* k with c being a primer number. What is important is choosing c such that
gcd(c, m) = 1 for any size m of the hash table. This is the case for c = 1. As a
consequence, I don’t see any benefit in using c = 37 over c = 1. Moreover, all
those hash functions are badly considered for a reason : they behave very poorly
when the keys are of the form d * k0 + k1.

I really think that this hash function could be improved. But to know if
it's worth it, I would like to know if having a better hash for integer
types would benefit LLVM.

François Fayard

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170110/6ae904fc/attachment.html>

Mehdi Amini via llvm-dev

2017-Jan-10 07:56 UTC

head link

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

Hi François,
> On Jan 9, 2017, at 11:46 PM, Francois Fayard via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> 
>> On Jan 10, 2017, at 2:31 AM, Chris Lattner <sabre at nondot.org
<mailto:sabre at nondot.org>> wrote:
>> 
>> As others have pointed out, 37 does have some nice properties (by being
prime), but we’ve also had it from the very early days.  I wouldn’t say that it
has been extremely well considered at all.
>> 
>> -Chris
> 
> Thanks for your reply. But I am not sure to understand the last sentence.
Does it mean that this hash function has been criticized before? Do you also
think that it can be improved?
> 
> I am sorry to insist on that, but I don’t see any advantage of using
hash(k) = c * k with c being a primer number. What is important is choosing c
such that gcd(c, m) = 1 for any size m of the hash table. This is the case for c
= 1. As a consequence, I don’t see any benefit in using c = 37 over c = 1.
Moreover, all those hash functions are badly considered for a reason : they
behave very poorly when the keys are of the form d * k0 + k1.
> 
> I really think that this hash function could be improved. But to know if
it's worth it, I would like to know if having a better hash for integer
types would benefit LLVM.
Some tests I can suggest is to replace the hash function with your favorite and:

1) Test the impact on the performance (is an LTO link of clang faster? Is `clang
-O0` getting faster?), because ultimately I believe real-world impact is the
best way to get a change in.
2) Instrument the DenseMap accessors with counters and measure impact on the
number of collisions when changing the hash function.

— 
Mehdi

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170109/0b3f3c9e/attachment.html>

Francois Fayard via llvm-dev

2017-Jan-10 08:19 UTC

head link

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

> On Jan 10, 2017, at 8:56 AM, Mehdi Amini <mehdi.amini at apple.com>
wrote
> 
> Some tests I can suggest is to replace the hash function with your favorite
and:
Thanks. I’ll give it a try. It will take some time as I need to rewrite DenseMap
if I want to use the Knuth multiplicative hash.
> ultimately I believe real-world impact is the best way to get a change in.
That’s exactly what I think. I was just trying to get some hints on what parts
of the code could benefit from a better hash function for integers, in order to
find a benchmark which is relevant.
> If this is used for having pointers, then it is likely gonna be used
*everywhere*.
> Even simple test like “time to deserialize a large bitcode file” would be a
good start.

It is not used for pointers, as this function would be truly awful for them:
because of alignments, pointers are of the form 2^a * k0. The hash function used
for pointers is (k >> 4) ^ (k >> 9). My guess is that the k >>
4 is here because pointers returned by malloc are 16 bytes aligned. I don’t have
any guess for the 9 though.
So the hashing function for pointers is not flawed like the one for integers.

François Fayard

Joerg Sonnenberger via llvm-dev

2017-Jan-10 10:56 UTC

head link

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

On Tue, Jan 10, 2017 at 08:46:54AM +0100, Francois Fayard via llvm-dev
wrote:> I am sorry to insist on that, but I don’t see any advantage of using
> hash(k) = c * k with c being a primer number. What is important is
> choosing c such that gcd(c, m) = 1 for any size m of the hash table.
> This is the case for c = 1. As a consequence, I don’t see any benefit
> in using c = 37 over c = 1. Moreover, all those hash functions are
> badly considered for a reason : they behave very poorly when the keys
> are of the form d * k0 + k1.
The reason for avoiding c=1 is the trivial aliasing set for power-of-two
differences. Those tend to be fairly common in any data set. Beside
that, 37 works somewhat decent for many small differences, it can be
computed easily without multiplication where necessary and noone has
bothered enough to compare different hash functions across different
architectures. E.g. the current hash is simple, avoids some of the
obvious bad cases and is itself decently fast. Anything else needs
careful analysis of the data.

Joerg

Francois Fayard via llvm-dev

2017-Jan-10 11:12 UTC

head link

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

> The reason for avoiding c=1 is the trivial aliasing set for power-of-two
> differences. Those tend to be fairly common in any data set.
That’s where I completely disagree. Multiplying by 37 won’t help.
Suppose that you have two integers a and b such that a = b (modulo 8). You also
have 37 a = 37 b (modulo 8), and multiplying by 37 does not change anything
here. Prime numbers are useful for the size of the hash table and kill those
trivial aliasing which is why hash(k) = k is a decent hash for those tables. But
I really don’t see any benefit in multiplying by a prime number.
> E.g. the current hash is simple, avoids some of the
> obvious bad cases and is itself decently fast. Anything else needs
> careful analysis of the data.
It disagree on the fact that it avoids obvious bad cases.

François Fayard

llvm dev - Jan 2017 - Default hashing function for integers (DenseMapInfo.h)

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)