Francois Fayard via llvm-dev
2017-Jan-10 08:19 UTC
[llvm-dev] Default hashing function for integers (DenseMapInfo.h)
> On Jan 10, 2017, at 8:56 AM, Mehdi Amini <mehdi.amini at apple.com> wrote > > Some tests I can suggest is to replace the hash function with your favorite and:Thanks. I’ll give it a try. It will take some time as I need to rewrite DenseMap if I want to use the Knuth multiplicative hash.> ultimately I believe real-world impact is the best way to get a change in.That’s exactly what I think. I was just trying to get some hints on what parts of the code could benefit from a better hash function for integers, in order to find a benchmark which is relevant.> If this is used for having pointers, then it is likely gonna be used *everywhere*. > Even simple test like “time to deserialize a large bitcode file” would be a good start.It is not used for pointers, as this function would be truly awful for them: because of alignments, pointers are of the form 2^a * k0. The hash function used for pointers is (k >> 4) ^ (k >> 9). My guess is that the k >> 4 is here because pointers returned by malloc are 16 bytes aligned. I don’t have any guess for the 9 though. So the hashing function for pointers is not flawed like the one for integers. François Fayard
Bruce Hoult via llvm-dev
2017-Jan-10 08:36 UTC
[llvm-dev] Default hashing function for integers (DenseMapInfo.h)
Both are not very sophisticated. You should also look at the different MurmurHash versions, and descendants such as CityHash. They are only very slightly more complex (e.g. two multiplies and a rotate) but of extremely good quality, approaching MD5 or SHA for randomness and independence of resulting bits (but not secure against malicious attacks like they are). I've successfully used simply k (i.e. c =1) for both integers and pointers in hash tables with prime number sizes. It works fine, and they usually have non-overlapping ranges in practice. Hashing structures and strings is the more interesting challenge. But if you're using power of two (or at least non-prime) table sizes then you should munch up the bits a lot more with multipliers with a lot more bits, not just something like 37. It may be that these hash tables don't have many entries and don't have many collisions and 37 is fine, and a more expensive hash function would slow it down. Only testing can tell you. On Tue, Jan 10, 2017 at 11:19 AM, Francois Fayard via llvm-dev < llvm-dev at lists.llvm.org> wrote:> > On Jan 10, 2017, at 8:56 AM, Mehdi Amini <mehdi.amini at apple.com> wrote > > > > Some tests I can suggest is to replace the hash function with your > favorite and: > > Thanks. I’ll give it a try. It will take some time as I need to rewrite > DenseMap if I want to use the Knuth multiplicative hash. > > > ultimately I believe real-world impact is the best way to get a change > in. > > That’s exactly what I think. I was just trying to get some hints on what > parts of the code could benefit from a better hash function for integers, > in order to find a benchmark which is relevant. > > > If this is used for having pointers, then it is likely gonna be used > *everywhere*. > > Even simple test like “time to deserialize a large bitcode file” would > be a good start. > > > It is not used for pointers, as this function would be truly awful for > them: because of alignments, pointers are of the form 2^a * k0. The hash > function used for pointers is (k >> 4) ^ (k >> 9). My guess is that the k > >> 4 is here because pointers returned by malloc are 16 bytes aligned. I > don’t have any guess for the 9 though. > So the hashing function for pointers is not flawed like the one for > integers. > > François Fayard > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170110/3b98b529/attachment.html>
Francois Fayard via llvm-dev
2017-Jan-10 09:20 UTC
[llvm-dev] Default hashing function for integers (DenseMapInfo.h)
> On Jan 10, 2017, at 9:36 AM, Bruce Hoult <bruce at hoult.org> wrote:Thanks for your thoughts.> Both are not very sophisticated. > You should also look at the different MurmurHash versions, and descendants such as CityHash.I’ll give a try at those MurmurHash functions.> I've successfully used simply k (i.e. c =1) for both integers and pointers in hash tables with prime number sizes. It works fine, and they usually have non-overlapping ranges in practice. Hashing structures and strings is the more interesting challenge.Using hash(k) = k when the size of the table is a prime number is OK. The bad news come when you start using 2^k as the size of your hash table and the keys are of the form 2^a k0 + k1.> But if you're using power of two (or at least non-prime) table sizes then you should munch up the bits a lot more with multipliers with a lot more bits, not just something like 37. > It may be that these hash tables don't have many entries and don't have many collisions and 37 is fine, and a more expensive hash function would slow it down. Only testing can tell you.I’ll first try to make things worse such as using hash(k) = k for pointers which should be really bad. See how much difference it does make. François Fayard
Francois Fayard via llvm-dev
2017-Jan-10 13:20 UTC
[llvm-dev] Default hashing function for integers (DenseMapInfo.h)
> On Jan 10, 2017, at 9:36 AM, Bruce Hoult <bruce at hoult.org> wrote: > > Both are not very sophisticated. > You should also look at the different MurmurHash versions, and descendants such as CityHash.I did a few benchmark this morning, trying to tweak the hashing for pointers (as many people seem to use pointers as keys). The hash function in LLVM is quite simple, but it gives very good results. The benchmark I did was compiling a C++ file with clang. I did 10 tests for every version of clang/llvm compiled with the hash function, and here are the best timings: - Original LLVM: hash(k) = (k >> 4) ^ (k >> 9) 3.40s - Stupid hash: hash(k) = k 3.55s - A bit less stupid: hash(k) = k >> unused_bits (defined by sizeof(T) = 2^unused_bits) 3.47s - Murmurhash3 3.47s So different hashing functions make a difference. For pointers, it seems that the one used by LLVM is quite good. It gives the best performance here. It is quite surprising that Murmurhash3 does not behave better than the “A bit less stupid” hash. My guess is that the LLVM hash function is the best because it is tailored to the distribution of pointers returned by malloc. As I said, the (k << 4) seems to be related to the fact that pointers are 16-bytes aligned on most OS. I still don’t understand the 9. I’ll do some benchmarks for hashing integers tomorrow. This is the hashing that really looks suspicious to me but it does not seem to be used that much in LLVM. François Fayard