Francois Fayard via llvm-dev
2017-Jan-10 07:46 UTC
[llvm-dev] Default hashing function for integers (DenseMapInfo.h)
> On Jan 10, 2017, at 2:31 AM, Chris Lattner <sabre at nondot.org> wrote: > > As others have pointed out, 37 does have some nice properties (by being prime), but we’ve also had it from the very early days. I wouldn’t say that it has been extremely well considered at all. > > -ChrisThanks for your reply. But I am not sure to understand the last sentence. Does it mean that this hash function has been criticized before? Do you also think that it can be improved? I am sorry to insist on that, but I don’t see any advantage of using hash(k) = c * k with c being a primer number. What is important is choosing c such that gcd(c, m) = 1 for any size m of the hash table. This is the case for c = 1. As a consequence, I don’t see any benefit in using c = 37 over c = 1. Moreover, all those hash functions are badly considered for a reason : they behave very poorly when the keys are of the form d * k0 + k1. I really think that this hash function could be improved. But to know if it's worth it, I would like to know if having a better hash for integer types would benefit LLVM. François Fayard -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170110/6ae904fc/attachment.html>
Mehdi Amini via llvm-dev
2017-Jan-10 07:56 UTC
[llvm-dev] Default hashing function for integers (DenseMapInfo.h)
Hi François,> On Jan 9, 2017, at 11:46 PM, Francois Fayard via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > >> On Jan 10, 2017, at 2:31 AM, Chris Lattner <sabre at nondot.org <mailto:sabre at nondot.org>> wrote: >> >> As others have pointed out, 37 does have some nice properties (by being prime), but we’ve also had it from the very early days. I wouldn’t say that it has been extremely well considered at all. >> >> -Chris > > Thanks for your reply. But I am not sure to understand the last sentence. Does it mean that this hash function has been criticized before? Do you also think that it can be improved? > > I am sorry to insist on that, but I don’t see any advantage of using hash(k) = c * k with c being a primer number. What is important is choosing c such that gcd(c, m) = 1 for any size m of the hash table. This is the case for c = 1. As a consequence, I don’t see any benefit in using c = 37 over c = 1. Moreover, all those hash functions are badly considered for a reason : they behave very poorly when the keys are of the form d * k0 + k1. > > I really think that this hash function could be improved. But to know if it's worth it, I would like to know if having a better hash for integer types would benefit LLVM.Some tests I can suggest is to replace the hash function with your favorite and: 1) Test the impact on the performance (is an LTO link of clang faster? Is `clang -O0` getting faster?), because ultimately I believe real-world impact is the best way to get a change in. 2) Instrument the DenseMap accessors with counters and measure impact on the number of collisions when changing the hash function. — Mehdi -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170109/0b3f3c9e/attachment.html>
Francois Fayard via llvm-dev
2017-Jan-10 08:19 UTC
[llvm-dev] Default hashing function for integers (DenseMapInfo.h)
> On Jan 10, 2017, at 8:56 AM, Mehdi Amini <mehdi.amini at apple.com> wrote > > Some tests I can suggest is to replace the hash function with your favorite and:Thanks. I’ll give it a try. It will take some time as I need to rewrite DenseMap if I want to use the Knuth multiplicative hash.> ultimately I believe real-world impact is the best way to get a change in.That’s exactly what I think. I was just trying to get some hints on what parts of the code could benefit from a better hash function for integers, in order to find a benchmark which is relevant.> If this is used for having pointers, then it is likely gonna be used *everywhere*. > Even simple test like “time to deserialize a large bitcode file” would be a good start.It is not used for pointers, as this function would be truly awful for them: because of alignments, pointers are of the form 2^a * k0. The hash function used for pointers is (k >> 4) ^ (k >> 9). My guess is that the k >> 4 is here because pointers returned by malloc are 16 bytes aligned. I don’t have any guess for the 9 though. So the hashing function for pointers is not flawed like the one for integers. François Fayard
Joerg Sonnenberger via llvm-dev
2017-Jan-10 10:56 UTC
[llvm-dev] Default hashing function for integers (DenseMapInfo.h)
On Tue, Jan 10, 2017 at 08:46:54AM +0100, Francois Fayard via llvm-dev wrote:> I am sorry to insist on that, but I don’t see any advantage of using > hash(k) = c * k with c being a primer number. What is important is > choosing c such that gcd(c, m) = 1 for any size m of the hash table. > This is the case for c = 1. As a consequence, I don’t see any benefit > in using c = 37 over c = 1. Moreover, all those hash functions are > badly considered for a reason : they behave very poorly when the keys > are of the form d * k0 + k1.The reason for avoiding c=1 is the trivial aliasing set for power-of-two differences. Those tend to be fairly common in any data set. Beside that, 37 works somewhat decent for many small differences, it can be computed easily without multiplication where necessary and noone has bothered enough to compare different hash functions across different architectures. E.g. the current hash is simple, avoids some of the obvious bad cases and is itself decently fast. Anything else needs careful analysis of the data. Joerg
Francois Fayard via llvm-dev
2017-Jan-10 11:12 UTC
[llvm-dev] Default hashing function for integers (DenseMapInfo.h)
> The reason for avoiding c=1 is the trivial aliasing set for power-of-two > differences. Those tend to be fairly common in any data set.That’s where I completely disagree. Multiplying by 37 won’t help. Suppose that you have two integers a and b such that a = b (modulo 8). You also have 37 a = 37 b (modulo 8), and multiplying by 37 does not change anything here. Prime numbers are useful for the size of the hash table and kill those trivial aliasing which is why hash(k) = k is a decent hash for those tables. But I really don’t see any benefit in multiplying by a prime number.> E.g. the current hash is simple, avoids some of the > obvious bad cases and is itself decently fast. Anything else needs > careful analysis of the data.It disagree on the fact that it avoids obvious bad cases. François Fayard