Francois Fayard via llvm-dev
2017-Jan-10 13:20 UTC
[llvm-dev] Default hashing function for integers (DenseMapInfo.h)
> On Jan 10, 2017, at 9:36 AM, Bruce Hoult <bruce at hoult.org> wrote: > > Both are not very sophisticated. > You should also look at the different MurmurHash versions, and descendants such as CityHash.I did a few benchmark this morning, trying to tweak the hashing for pointers (as many people seem to use pointers as keys). The hash function in LLVM is quite simple, but it gives very good results. The benchmark I did was compiling a C++ file with clang. I did 10 tests for every version of clang/llvm compiled with the hash function, and here are the best timings: - Original LLVM: hash(k) = (k >> 4) ^ (k >> 9) 3.40s - Stupid hash: hash(k) = k 3.55s - A bit less stupid: hash(k) = k >> unused_bits (defined by sizeof(T) = 2^unused_bits) 3.47s - Murmurhash3 3.47s So different hashing functions make a difference. For pointers, it seems that the one used by LLVM is quite good. It gives the best performance here. It is quite surprising that Murmurhash3 does not behave better than the “A bit less stupid” hash. My guess is that the LLVM hash function is the best because it is tailored to the distribution of pointers returned by malloc. As I said, the (k << 4) seems to be related to the fact that pointers are 16-bytes aligned on most OS. I still don’t understand the 9. I’ll do some benchmarks for hashing integers tomorrow. This is the hashing that really looks suspicious to me but it does not seem to be used that much in LLVM. François Fayard
mats petersson via llvm-dev
2017-Jan-10 15:03 UTC
[llvm-dev] Default hashing function for integers (DenseMapInfo.h)
On 10 January 2017 at 13:20, Francois Fayard via llvm-dev < llvm-dev at lists.llvm.org> wrote:> > > On Jan 10, 2017, at 9:36 AM, Bruce Hoult <bruce at hoult.org> wrote: > > > > Both are not very sophisticated. > > You should also look at the different MurmurHash versions, and > descendants such as CityHash. > > I did a few benchmark this morning, trying to tweak the hashing for > pointers (as many people seem to use pointers as keys). The hash function > in LLVM is quite simple, but it gives very good results. The benchmark I > did was compiling a C++ file with clang. I did 10 tests for every version > of clang/llvm compiled with the hash function, and here are the best > timings: > > - Original LLVM: hash(k) = (k >> 4) ^ (k >> 9) > 3.40s > - Stupid hash: hash(k) = k > 3.55s > - A bit less stupid: hash(k) = k >> unused_bits (defined by sizeof(T) > 2^unused_bits) > 3.47s > - Murmurhash3 > 3.47s > > So different hashing functions make a difference. For pointers, it seems > that the one used by LLVM is quite good. It gives the best performance > here. It is quite surprising that Murmurhash3 does not behave better than > the “A bit less stupid” hash. My guess is that the LLVM hash function is > the best because it is tailored to the distribution of pointers returned by > malloc. As I said, the (k << 4) seems to be related to the fact that > pointers are 16-bytes aligned on most OS. I still don’t understand the 9. >It is not clear what the test-case is (what source, what compiler options). My suspicion is that your differences are in the noise, and most of the time is spent doing other things than hashing. Did you profile the a run, and check how much of the total time is spent in the hash-function [you may need to tweak the code a bit to not inline the actual hash function]. Also publishing the RANGE for each set of tests, since the variation between "best and worst" may be more important than the overall best time. Counting the number of collisions or some such would probably also help. [I have no useful answer as to why `k >> 9` is used. `k >> 4` because pointers [allocated from the heap] are often aligned to 16 or greater does make sense. I suspect 9 is just some arbitrary value that gives some more entropy in the low part of the value [and since the value is later used in a modulo buckets or some such, having more variation in the lower bits does help - upper bits are discarded, but also often the same for large number of allocations!] - whether someone spent minutes, hours or days picking exactly 9 I have no idea - nor whether it still is the "right" value, or there should be some other number there... -- Mats> > I’ll do some benchmarks for hashing integers tomorrow. This is the hashing > that really looks suspicious to me but it does not seem to be used that > much in LLVM. > > François Fayard > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170110/f252e875/attachment.html>
Francois Fayard via llvm-dev
2017-Jan-10 15:35 UTC
[llvm-dev] Default hashing function for integers (DenseMapInfo.h)
> It is not clear what the test-case is (what source, what compiler options). My suspicion is that your differences are in the noise, and most of the time is spent doing other things than hashing. Did you profile the a run, and check how much of the total time is spent in the hash-function [you may need to tweak the code a bit to not inline the actual hash function]. Also publishing the RANGE for each set of tests, since the variation between "best and worst" may be more important than the overall best time.I agree that I did not present a clear test case, which was just a random .cpp file that benchmarks different hash tables from LLVM/Google/STL compiled with -O3 option. I’ll try to make it better tomorrow, but I can assure you that it was not the noise. I have disabled turboboost and hyper-threading on my machine and ran the benchmark 10 times for every test and the difference was way larger than the observed standard deviation. But anyway, for the time being, the winner for hashing pointers is the current one from of LLVM.> Counting the number of collisions or some such would probably also help.Yes, I could do that. If I want to report all the collision, I might do that in the destructor of DenseMap. Sorry for my ignorance, but is clang/llvm multithreaded? Is it safe to write to a file in a destructor for logging purposes? François Fayard