Francois Fayard via llvm-dev
2017-Jan-09 10:10 UTC
[llvm-dev] Default hashing function for integers (DenseMapInfo.h)
Hi, I’ve been looking at Chandler Carruth talks about internal data structures for LLVM in order to implement my own library. That’s how I ended up looking at the internals of DenseMap and the default hash functions defined in DenseMapInfo. I have been very surprised by the hash function used for integers which is hash(k) = 37 * k. This kind of hashing function does not change the lowest bits of k, and if your keys happen to all be a multiple of 8, you end up not using about 90% of the slot as a direct access in an open addressing hash table. By the way, does anyone know the rationale behind using 37? In order to improve this, it could be nice to use a better hash function such as the Knuth hash which is defined as: std::size_t hash_value(std::size_t val, int p) { assert(p >= 0 && p <= 64); const std::size_t knuth = 11133510565745311; return (val * knuth) >> (64 - p); } The big advantage of this function is that is shuffles all the bits and does not suffer from the problems given above. It is also quite cheap to compute. Unfortunately, one needs to change the signature of the hash_value function as the hash function in order to take an integer k and hash it into a integer in {0, 1, 2, …, 2^p} is no longer in the form hash(k) % 2^p. It needs to be in the form hash(k, p). As I don’t know anything about the LLVM source code, here are a few questions: - Is DenseMap used a lot with integers as a key in LLVM? Do you think it would benefit from a better default hash function? - Is there any benchmark to run that makes a heavy use of those hash function so I can see if my changes would improve performance? Best regards, François Fayard Founder & Consultant - Inside Loop Applied Mathematics & High Performance Computing Tel: +33 (0)6 01 44 06 93 Web: www.insideloop.io -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170109/499c4a39/attachment.html>
Joerg Sonnenberger via llvm-dev
2017-Jan-09 12:37 UTC
[llvm-dev] Default hashing function for integers (DenseMapInfo.h)
On Mon, Jan 09, 2017 at 11:10:05AM +0100, Francois Fayard via llvm-dev wrote:> I have been very surprised by the hash function used for integers which > is hash(k) = 37 * k. This kind of hashing function does not change the > lowest bits of k, and if your keys happen to all be a multiple of 8, > you end up not using about 90% of the slot as a direct access in an > open addressing hash table. By the way, does anyone know the rationale > behind using 37?37 is a prime. Multiples of 8 that are otherwise evenly distributed will map to different slots in the output space.> In order to improve this, it could be nice to use a better hash > function such as the Knuth hash which is defined as:64bit multiplications are very expensive on many 32bit platforms. It is certainly an order of magnitude slower than the existing hash. Joerg
Francois Fayard via llvm-dev
2017-Jan-09 14:34 UTC
[llvm-dev] Default hashing function for integers (DenseMapInfo.h)
> On Jan 9, 2017, at 1:37 PM, Joerg Sonnenberger <joerg at bec.de> wrote: > > 37 is a prime. Multiples of 8 that are otherwise evenly distributed will > map to different slots in the output space.As the hash number is reduced modulo 2^p (the number of slots), if your hash function is of the form hash(k) = c * k, I understand that you want to be able to fill all te slots. One can show that we get this property if c is not a multiple of 2. So it gives a lot of choices: 1, 3, 5, 7, 9, etc. I don’t see any reason not to use c = 1 if you want to use that kind of hash function.> 64bit multiplications are very expensive on many 32bit platforms. It is > certainly an order of magnitude slower than the existing hash.There is also a 32-bit version of the Knuth hash: std::uint32_t hash(std::uint32_t val, int p) { const std::uint32_t knuth = 2654435769; const std::uint32_t y = val; return (y * knuth) >> (32 - p); } Those numbers come from the following fact. Let alpha = (-1 + sqrt(5)) / 2 which is about 0.618034. Multiply this number by 2^32 (or 2^64 for 64 bit hash), and round this number to the nearest integer. You’ll get 2 654 435 769. alpha has been chosen so that it is an irrational number in between 0 and 1 and that it is very difficult to approach it by rational numbers. One can show that all irrational numbers being a root of a degree 2 polynom with integer coefficients with leading coefficient being 1 have this property (X^2 + X - 1 = 0 for alpha). I wrote my own hash table using open addressing and quadratic probing. I have tried to insert n = 2^26 keys given as follow: generate a random number in between 0 and 2^(64-3) and multiply it by 8. Then search all those keys in the order they were inserted. Here are the timings I get: - hash(x) = x : insertion 7.42s, searching 4.09s - hash(x) = 37 * x : insertion 7.49s, searching 5.50s - Knuth hashing: insertion 6.04, searching 2.54s So using Knuth hashing is really worth it if the keys happen to be in the form 2^k * u + v. And multiply by 37 is not a good idea here. And I am still very puzzled and don’t really understand the situations where it could be useful. François Fayard -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170109/ed931d51/attachment.html>
Chris Lattner via llvm-dev
2017-Jan-10 01:31 UTC
[llvm-dev] Default hashing function for integers (DenseMapInfo.h)
> On Jan 9, 2017, at 2:10 AM, Francois Fayard via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > Hi, > > I’ve been looking at Chandler Carruth talks about internal data structures for LLVM in order to implement my own library. That’s how I ended up looking at the internals of DenseMap and the default hash functions defined in DenseMapInfo. > > I have been very surprised by the hash function used for integers which is hash(k) = 37 * k.As others have pointed out, 37 does have some nice properties (by being prime), but we’ve also had it from the very early days. I wouldn’t say that it has been extremely well considered at all. -Chris
Francois Fayard via llvm-dev
2017-Jan-10 07:46 UTC
[llvm-dev] Default hashing function for integers (DenseMapInfo.h)
> On Jan 10, 2017, at 2:31 AM, Chris Lattner <sabre at nondot.org> wrote: > > As others have pointed out, 37 does have some nice properties (by being prime), but we’ve also had it from the very early days. I wouldn’t say that it has been extremely well considered at all. > > -ChrisThanks for your reply. But I am not sure to understand the last sentence. Does it mean that this hash function has been criticized before? Do you also think that it can be improved? I am sorry to insist on that, but I don’t see any advantage of using hash(k) = c * k with c being a primer number. What is important is choosing c such that gcd(c, m) = 1 for any size m of the hash table. This is the case for c = 1. As a consequence, I don’t see any benefit in using c = 37 over c = 1. Moreover, all those hash functions are badly considered for a reason : they behave very poorly when the keys are of the form d * k0 + k1. I really think that this hash function could be improved. But to know if it's worth it, I would like to know if having a better hash for integer types would benefit LLVM. François Fayard -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170110/6ae904fc/attachment.html>
Maybe Matching Threads
- Default hashing function for integers (DenseMapInfo.h)
- Default hashing function for integers (DenseMapInfo.h)
- Default hashing function for integers (DenseMapInfo.h)
- Default hashing function for integers (DenseMapInfo.h)
- [LLVMdev] Proposed implementation of N3333 hashing interfaces for LLVM (and possible libc++)