Here's my latest version of Hashing.h, which I propose to add to llvm/ADT. Comments welcome and encouraged.

On Thu, Feb 9, 2012 at 11:23 AM, Talin <viridia at gmail.com> wrote:

> By the way, the reason I'm bringing this up is that a number of folks are currently working on optimizing the use of hash tables within LLVM's code base, and unless we can come up with a common hashing facility, there will be an increasing proliferation of cut & paste copies of hash functions. So feedback would be nice.
>
> On Tue, Feb 7, 2012 at 10:58 PM, Talin <viridia at gmail.com> wrote:
>
>> LLVM currently has a bunch of different hashing algorithms scattered throughout the code base.
>>
>> There's also a number of places in the code where a FoldingSetNodeID is created for the purpose of calculating a hash, and then discarded. From an efficiency standpoint, this isn't all that bad unless the number of individual items being hashed is greater than 32, at which point the SmallVector overflows and memory is allocated.
>>
>> I personally want to see a better approach to hashing because of the cleanup work I've been doing - I've been replacing std::map and FoldingSet with DenseMap in a number of places, and plan to do more of this. The thing is, for complex key types, you really need to have a custom DenseMapInfo, and that's where having a good hash function comes in.
>>
>> There are a bunch of hash functions out there (FNV1, SuperFastHash, and many others). The best overall hash function that I am currently aware of is Austin Appleby's MurmurHash3 (http://code.google.com/p/smhasher/).
>>
>> For LLVM's use, we want a hash function that can handle mixed data - that is, pointers, ints, strings, and so on. Most of the high-performance hash functions will work well on mixed data types, but you have to put everything into a flat buffer - that is, an array of machine words whose starting address is aligned on a machine-word boundary. The inner loops of the hash functions are designed to take advantage of parallelism in the instruction pipeline, and if you try feeding in values one at a time it's possible that you can lose a lot of speed. (Although I am not an expert in this area, so feel free to correct me on this point.) On the other hand, if your input values aren't already arranged into a flat buffer, the cost of writing them to memory has to be taken into account.
>>
>> Also, most of the maps in LLVM are fairly small (<1000 entries), so the speed of the hash function itself is probably more important than getting the best possible mixing of bits.
>>
>> It seems that for LLVM's purposes, something that has an interface similar to FoldingSetNodeID would make for an easy transition. One approach would be to start with something very much like FoldingSetNodeID, except with a fixed-length buffer instead of a SmallVector - the idea is that when you are about to overflow, instead of allocating more space, you would compute an intermediate hash value and then start over at the beginning of the buffer.
>>
>> Another question is whether or not you would want to replace the hash functions in DenseMapInfo, which are designed to be efficient for very small keys - most of the high-performance hash functions have a fairly substantial fixed overhead (usually in the form of a final mixing step) and thus only make sense for larger key sizes.
>>
>> --
>> -- Talin
>
> --
> -- Talin

-- 
-- Talin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120212/664503d3/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Hashing.h
Type: text/x-chdr
Size: 4038 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120212/664503d3/attachment.h>
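The fixed-length-buffer idea sketched in the proposal above (fold the buffer into a running hash when it would overflow, rather than growing a SmallVector) could look roughly like this. This is a hypothetical illustration, not the contents of the attached Hashing.h; the `IncrementalHasher` name and the mixing function are stand-ins:

```cpp
#include <cstdint>

// Hypothetical sketch: accumulate words into a small fixed array, and fold
// the buffer into the running hash whenever it would overflow, instead of
// allocating more space the way FoldingSetNodeID's SmallVector does.
class IncrementalHasher {
  static const unsigned BufferWords = 32;
  uint64_t Buffer[BufferWords];
  unsigned Count = 0;
  uint64_t Hash = 0;

  // Stand-in word-at-a-time mixing step (not MurmurHash); each pass is a
  // bijection on the running state, so no information is discarded.
  void mixBuffer() {
    for (unsigned i = 0; i < Count; ++i) {
      Hash ^= Buffer[i];
      Hash *= 0x9E3779B97F4A7C15ULL; // odd multiplier, so invertible
      Hash ^= Hash >> 29;
    }
    Count = 0;
  }

public:
  void add(uint64_t Value) {
    if (Count == BufferWords)
      mixBuffer(); // compute an intermediate hash instead of growing
    Buffer[Count++] = Value;
  }

  void addPointer(const void *P) { add(reinterpret_cast<uintptr_t>(P)); }

  uint64_t finish() {
    mixBuffer();
    return Hash;
  }
};
```

Because the buffer is folded in word-sized chunks, the inner loop still gets a contiguous run of machine words, which is the pipeline-friendly shape described above.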
On 13 February 2012 00:59, Talin <viridia at gmail.com> wrote:

> Here's my latest version of Hashing.h, which I propose to add to llvm/ADT. Comments welcome and encouraged.

> /// Adapted from MurmurHash2 by Austin Appleby

Just out of curiosity, why not MurmurHash3? This page seems to suggest that #2 has some flaws, and #3 is better all round:

https://sites.google.com/site/murmurhash/

Would it be possible to use CityHash instead for strings?

http://code.google.com/p/cityhash/

Thanks,
Jay.
On 13 February 2012 09:22, Jay Foad <jay.foad at gmail.com> wrote:

> Would it be possible to use CityHash instead for strings?
>
> http://code.google.com/p/cityhash/

Incidentally, there was talk of using CityHash for LLVM's StringMap last year, but I don't think it ever came to anything:

http://lists.cs.uiuc.edu/pipermail/cfe-dev/2011-April/014656.html

Jay.
On Mon, Feb 13, 2012 at 1:22 AM, Jay Foad <jay.foad at gmail.com> wrote:

> On 13 February 2012 00:59, Talin <viridia at gmail.com> wrote:
>> Here's my latest version of Hashing.h, which I propose to add to llvm/ADT. Comments welcome and encouraged.
>
>> /// Adapted from MurmurHash2 by Austin Appleby
>
> Just out of curiosity, why not MurmurHash3? This page seems to suggest that #2 has some flaws, and #3 is better all round:
>
> https://sites.google.com/site/murmurhash/

The main reason is that there's no incremental version of 3. If you look at the source, you'll notice that the very first thing 3 does is Hash ^= Length, but in the incremental case we don't know the length until we're done. Also, 2 has fewer instructions per block hashed than 3; 3 also requires bit rotations, whereas 2 only uses bit shifts.

Bear in mind that the "flaw" in MurmurHash2 is a fairly esoteric case which (I suspect) LLVM is unlikely to ever encounter in practice. Austin is trying to develop the best possible hash for a wide range of key probability distributions, and his testing methodologies are quite strict. LLVM's needs, on the other hand, are fairly modest. I'm guessing that most DenseMaps contain fewer than a few thousand entries. Even a bad hash function wouldn't hurt performance that much, and the time taken to calculate the hash is probably more of a factor in overall performance than getting the optimum distribution of hash values.

> Would it be possible to use CityHash instead for strings?
>
> http://code.google.com/p/cityhash/

OK by me. My intention is that "Hashing.h" should contain a variety of different hashing algorithms for various specific needs. Right now, I am mainly focusing on hashing of mixed data types - that is, you have some structure which contains pointers, ints, strings, and you want to calculate a hash of the entire struct. I need this because I'm working on improving the uniquing of constants and similar data structures. My next target is to improve the uniquing of MDNodes, but I want to get the hashing stuff squared away first.

It is also my intent that some person who is more of an expert in hashing than I am can do detailed performance analysis under real-world loads (such as compiling actual programs with clang) and tweak and optimize the hashing algorithm, without needing to modify the API and/or all of the places that call it.

> Thanks,
> Jay.

-- 
-- Talin
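To make the MurmurHash2-vs-3 trade-off above concrete: the per-word mixing step of MurmurHash2 (32-bit variant, constants m = 0x5bd1e995, r = 24) carries its state in a single word and uses only multiplies and right-shifts, which is why words can be fed in one at a time. The wrapper function below is an illustrative sketch of such incremental use, not the code in the attached Hashing.h:

```cpp
#include <cstdint>

// Per-word mixing step of MurmurHash2: only multiplies and shifts, no
// rotations, and the carried state is a single 32-bit word.
inline uint32_t murmur2Mix(uint32_t h, uint32_t k) {
  const uint32_t m = 0x5bd1e995;
  const int r = 24;
  k *= m;
  k ^= k >> r;
  k *= m;
  h *= m;
  h ^= k;
  return h;
}

// MurmurHash2's final avalanche step - the fixed overhead mentioned
// earlier in the thread, which only pays off for larger keys.
inline uint32_t murmur2Finalize(uint32_t h) {
  const uint32_t m = 0x5bd1e995;
  h ^= h >> 13;
  h *= m;
  h ^= h >> 15;
  return h;
}

// Incremental use: fold words in one at a time, finalize at the end.
// Note the original algorithm also folds the input length into the seed,
// which an incremental adaptation cannot know up front - the same issue
// that rules out MurmurHash3, where Hash ^= Length happens first.
inline uint32_t murmur2Words(const uint32_t *Data, unsigned Len,
                             uint32_t Seed) {
  uint32_t h = Seed;
  for (unsigned i = 0; i != Len; ++i)
    h = murmur2Mix(h, Data[i]);
  return murmur2Finalize(h);
}
```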
Jeffrey and I are working on future standard library functionality for hashing user-defined types:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3333.html

I would much rather have an interface that is close to or mirrors this one. We already have some field experience with it, and using it in LLVM and Clang would provide more. Also, it would be possible to essentially share code between such an implementation and libc++.

We looked closely at 'hasher' objects and using add methods on them, and they tended to have some serious drawbacks:

1) they require some amount of "incrementality", limiting the quality and performance of the hashing algorithm
2) they require more boilerplate
3) they compose recursively less cleanly

Even given your interface, there is no actual requirement for an incremental hash. Simply store intermediate state in the object, and provide a 'finalize' step that produces the final hash code.

On Sun, Feb 12, 2012 at 4:59 PM, Talin <viridia at gmail.com> wrote:

> Here's my latest version of Hashing.h, which I propose to add to llvm/ADT. Comments welcome and encouraged.
> [...]
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120217/be838875/attachment.html>
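The free-function style Chandler advocates composes roughly as follows. This is a loose sketch of the N3333 shape, not the proposal's actual implementation: N3333 specifies a proper `hash_code` class and a variadic `hash_combine`, whereas here `hash_code` is just a typedef and the two-argument combiner is a boost-style stand-in:

```cpp
#include <cstdint>

// Stand-in for N3333's hash_code class; the real one is an opaque type.
typedef uint64_t hash_code;

inline hash_code hash_value(uint64_t V) {
  return V * 0x9E3779B97F4A7C15ULL; // illustrative mixer, not N3333's
}
inline hash_code hash_value(const void *P) {
  return hash_value(static_cast<uint64_t>(reinterpret_cast<uintptr_t>(P)));
}

// Two-argument stand-in for the proposal's variadic hash_combine.
inline hash_code hash_combine(hash_code A, hash_code B) {
  return A ^ (B + 0x9E3779B97F4A7C15ULL + (A << 6) + (A >> 2));
}

// A user-defined type composes recursively: it provides its own
// hash_value() overload built from its members' hash_value() results.
// No hasher object, no add() calls, no client-visible mutable state.
struct MyKey {
  const void *Ty;
  uint64_t Id;
};
inline hash_code hash_value(const MyKey &K) {
  return hash_combine(hash_value(K.Ty), hash_value(K.Id));
}
```

The "no requirement for an incremental hash" point is that an implementation behind this interface is free to buffer internally and run a non-incremental algorithm when it has all the data.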
On Fri, Feb 17, 2012 at 1:53 AM, Chandler Carruth <chandlerc at google.com> wrote:

> Jeffrey and I are working on future standard library functionality for hashing user-defined types:
>
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3333.html
>
> I would much rather have an interface that is close to or mirrors this one. We already have some field experience with it, and using it in LLVM and Clang would provide more. Also, it would be possible to essentially share code between such an implementation and libc++.
>
> We looked closely at 'hasher' objects and using add methods on them, and they tended to have some serious drawbacks:
>
> 1) they require some amount of "incrementality", limiting the quality and performance of the hashing algorithm
> 2) they require more boilerplate
> 3) they compose recursively less cleanly
>
> Even given your interface, there is no actual requirement for an incremental hash. Simply store intermediate state in the object, and provide a 'finalize' step that produces the final hash code.

I'm not sure I understand what you are proposing here. I don't know what you mean by "intermediate state". However, I really do need an incremental hash for the various uniquing maps which I'm attempting to optimize.

Take for example the case of uniquing a constant array - the key consists of a Type* pointer and an array of Constant*. Those data fields are not stored contiguously in memory, so I need to hash them separately and then combine the hashes. Being able to hash the data fields in place (as opposed to copying them to a contiguous buffer) turns out to be a fairly significant win for the uniquing maps - otherwise you end up having to do a malloc just to look up a key, and that's going to be slower than any incremental hash algorithm.

> On Sun, Feb 12, 2012 at 4:59 PM, Talin <viridia at gmail.com> wrote:
>
>> Here's my latest version of Hashing.h, which I propose to add to llvm/ADT. Comments welcome and encouraged.
>> [...]
-- 
-- Talin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120217/5e4332d5/attachment.html>
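The constant-array uniquing scenario described above (a Type* plus a non-contiguous array of Constant*, hashed in place rather than copied into a flat buffer) can be sketched like this. `Type` and `Constant` are empty stand-ins here, and the mixer is illustrative rather than the Hashing.h algorithm:

```cpp
#include <cstddef>
#include <cstdint>

// Stand-ins for llvm::Type and llvm::Constant; only their addresses
// matter for this sketch.
struct Type {};
struct Constant {};

// Illustrative word mixer: xor, odd multiply, xor-shift - each step is
// invertible, so no input bits are silently discarded.
inline uint64_t mixWord(uint64_t H, uint64_t V) {
  H ^= V;
  H *= 0x9E3779B97F4A7C15ULL;
  H ^= H >> 29;
  return H;
}

// Hash the key fields where they already live: the type pointer first,
// then each element pointer, combining incrementally. No temporary
// buffer is built, so a map lookup never needs a malloc.
inline uint64_t hashConstantArrayKey(const Type *Ty,
                                     const Constant *const *Elts,
                                     size_t NumElts) {
  uint64_t H = mixWord(0, reinterpret_cast<uintptr_t>(Ty));
  for (size_t i = 0; i != NumElts; ++i)
    H = mixWord(H, reinterpret_cast<uintptr_t>(Elts[i]));
  return H;
}
```

The alternative - copying the type pointer and element pointers into a contiguous scratch buffer so a block-oriented hash can run over it - is exactly the per-lookup allocation this avoids.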