thr3ads.net - llvm dev - [llvm-dev] Default hashing function for integers (DenseMapInfo.h) [Jan 2017]

If this information is useful, please help other people find it:
Share via:

Francois Fayard via llvm-dev

2017-Jan-10 13:20 UTC

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

> On Jan 10, 2017, at 9:36 AM, Bruce Hoult <bruce at hoult.org> wrote:
> 
> Both are not very sophisticated.
> You should also look at the different MurmurHash versions, and descendants
such as CityHash.
I did a few benchmark this morning, trying to tweak the hashing for pointers (as
many people seem to use pointers as keys). The hash function in LLVM is quite
simple, but it gives very good results. The benchmark I did was compiling a C++
file with clang. I did 10 tests for every version of clang/llvm compiled with
the hash function, and here are the best timings:

- Original LLVM: hash(k) = (k >> 4) ^ (k >> 9)
  3.40s
- Stupid hash: hash(k) = k
  3.55s
- A bit less stupid: hash(k) = k >> unused_bits (defined by sizeof(T) =
2^unused_bits)
  3.47s
- Murmurhash3
  3.47s

So different hashing functions make a difference. For pointers, it seems that
the one used by LLVM is quite good. It gives the best performance here. It is
quite surprising that Murmurhash3 does not behave better than the “A bit less
stupid” hash. My guess is that the LLVM hash function is the best because it is
tailored to the distribution of pointers returned by malloc. As I said, the (k
<< 4) seems to be related to the fact that pointers are 16-bytes aligned
on most OS. I still don’t understand the 9.

I’ll do some benchmarks for hashing integers tomorrow. This is the hashing that
really looks suspicious to me but it does not seem to be used that much in LLVM.

François Fayard

mats petersson via llvm-dev

2017-Jan-10 15:03 UTC

head link

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

On 10 January 2017 at 13:20, Francois Fayard via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
>
> > On Jan 10, 2017, at 9:36 AM, Bruce Hoult <bruce at hoult.org>
wrote:
> >
> > Both are not very sophisticated.
> > You should also look at the different MurmurHash versions, and
> descendants such as CityHash.
>
> I did a few benchmark this morning, trying to tweak the hashing for
> pointers (as many people seem to use pointers as keys). The hash function
> in LLVM is quite simple, but it gives very good results. The benchmark I
> did was compiling a C++ file with clang. I did 10 tests for every version
> of clang/llvm compiled with the hash function, and here are the best
> timings:
>
> - Original LLVM: hash(k) = (k >> 4) ^ (k >> 9)
>   3.40s
> - Stupid hash: hash(k) = k
>   3.55s
> - A bit less stupid: hash(k) = k >> unused_bits (defined by sizeof(T)
> 2^unused_bits)
>   3.47s
> - Murmurhash3
>   3.47s
>
> So different hashing functions make a difference. For pointers, it seems
> that the one used by LLVM is quite good. It gives the best performance
> here. It is quite surprising that Murmurhash3 does not behave better than
> the “A bit less stupid” hash. My guess is that the LLVM hash function is
> the best because it is tailored to the distribution of pointers returned by
> malloc. As I said, the (k << 4) seems to be related to the fact that
> pointers are 16-bytes aligned on most OS. I still don’t understand the 9.
>
It is not clear what the test-case is (what source, what compiler options).
My suspicion is that your differences are in the noise, and most of the
time is spent doing other things than hashing. Did you profile the a run,
and check how much of the total time is spent in the hash-function [you may
need to tweak the code a bit to not inline the actual hash function]. Also
publishing the RANGE for each set of tests, since the variation between
"best and worst" may be more important than the overall best time.

Counting the number of collisions or some such would probably also help.

[I have no useful answer as to why `k >> 9` is used. `k >> 4`
because
pointers [allocated from the heap] are often aligned to 16 or greater does
make sense. I suspect 9 is just some arbitrary value that gives some more
entropy in the low part of the value [and since the value is later used in
a modulo buckets or some such, having more variation in the lower bits does
help - upper bits are discarded, but also often the same for large number
of allocations!] - whether someone spent minutes, hours or days picking
exactly 9 I have no idea - nor whether it still is the "right" value,
or
there should be some other number there...

--
Mats
>
> I’ll do some benchmarks for hashing integers tomorrow. This is the hashing
> that really looks suspicious to me but it does not seem to be used that
> much in LLVM.
>
> François Fayard
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170110/f252e875/attachment.html>

Francois Fayard via llvm-dev

2017-Jan-10 15:35 UTC

head link

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

> It is not clear what the test-case is (what source, what compiler options).
My suspicion is that your differences are in the noise, and most of the time is
spent doing other things than hashing. Did you profile the a run, and check how
much of the total time is spent in the hash-function [you may need to tweak the
code a bit to not inline the actual hash function]. Also publishing the RANGE
for each set of tests, since the variation between "best and worst"
may be more important than the overall best time.
I agree that I did not present a clear test case, which was just a random .cpp
file that benchmarks different hash tables from LLVM/Google/STL compiled with
-O3 option. I’ll try to make it better tomorrow, but I can assure you that it
was not the noise. I have disabled turboboost and hyper-threading on my machine
and ran the benchmark 10 times for every test and the difference was way larger
than the observed standard deviation. But anyway, for the time being, the winner
for hashing pointers is the current one from of LLVM.
> Counting the number of collisions or some such would probably also help.
Yes, I could do that. If I want to report all the collision, I might do that in
the destructor of DenseMap. Sorry for my ignorance, but is clang/llvm
multithreaded? Is it safe to write to a file in a destructor for logging
purposes?

François Fayard

llvm dev - Jan 2017 - Default hashing function for integers (DenseMapInfo.h)

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)

[llvm-dev] Default hashing function for integers (DenseMapInfo.h)