thr3ads.net - llvm dev - [LLVMdev] We need better hashing [Feb 2012]

If this information is useful, please help other people find it:
Share via:

Talin

2012-Feb-08 06:58 UTC

[LLVMdev] We need better hashing

LLVM currently has a bunch of different hashing algorithms scattered
throughout the code base.

There's also a number of places in the code where a FoldingSetNodeID is
created for the purpose of calculating a hash, and then discarded. From an
efficiency standpoint, this isn't all that bad unless the number of
individual items being hashed > 32, at which point the SmallVector
overflows and memory is allocated.

I personally want to see a better approach to hashing because of the
cleanup work I've been doing - I've been replacing std::map and
FoldingSet
with DenseMap in a number of places, and plan to do more of this. The thing
is, for complex key types, you really need to have a custom DenseMapInfo,
and that's where having a good hash function comes in.

There are a bunch of hash functions out there (FNV1, SuperFastHash, and
many others). The best overall hash function that I am currently aware of
is Austin Appleby's MurmurHash3 (http://code.google.com/p/smhasher/).

For LLVM's use, we want a hash function that can handle mixed data - that
is, pointers, ints, strings, and so on. Most of the high-performance hash
functions will work well on mixed data types, but you have to put
everything into a flat buffer - that is, an array of machine words whose
starting address is aligned on a machine-word boundary. The inner loops of
the hash functions are designed to take advantage of parallelism of the
instruction pipeline, and if you try feeding in values one at a time it's
possible that you can lose a lot of speed. (Although I am not an expert in
this area, so feel free to correct me on this point.) On the other hand, if
your input values aren't already arranged into a flat buffer, the cost of
writing them to memory has to be taken into account.

Also, most of the maps in LLVM are fairly small (<1000 entries), so the
speed of the hash function itself is probably more important than getting
the best possible mixing of bits.

It seems that for LLVM's purposes, something that has an interface similar
to FoldingSetNodeID would make for an easy transition. One approach would
be to start with something very much like FoldingSetNodeID, except with a
fixed-length buffer instead of a SmallVector - the idea is that when you
are about to overflow, instead of allocating more space, you would compute
an intermediate hash value and then start over at the beginning of the
buffer.

Another question is whether or not you would want to replace the hash
functions in DenseMapInfo, which are designed to be efficient for very
small keys - most of the high-performance hash functions have a fairly
substantial fixed overhead (usually in the form of a final mixing step) and
thus only make sense for larger key sizes.

-- 
-- Talin
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20120207/4dc9995f/attachment.html>

Talin

2012-Feb-09 19:23 UTC

head link

[LLVMdev] We need better hashing

By the way, the reason I'm bringing this up is that a number of folks are
currently working on optimizing the use of hash tables within LLVM's code
base, and unless we can come up with a common hashing facility, there will
be an increasing proliferation of cut & paste copies of hash functions. So
feedback would be nice.

On Tue, Feb 7, 2012 at 10:58 PM, Talin <viridia at gmail.com> wrote:
> LLVM currently has a bunch of different hashing algorithms scattered
> throughout the code base.
>
> There's also a number of places in the code where a FoldingSetNodeID is
> created for the purpose of calculating a hash, and then discarded. From an
> efficiency standpoint, this isn't all that bad unless the number of
> individual items being hashed > 32, at which point the SmallVector
> overflows and memory is allocated.
>
> I personally want to see a better approach to hashing because of the
> cleanup work I've been doing - I've been replacing std::map and
FoldingSet
> with DenseMap in a number of places, and plan to do more of this. The thing
> is, for complex key types, you really need to have a custom DenseMapInfo,
> and that's where having a good hash function comes in.
>
> There are a bunch of hash functions out there (FNV1, SuperFastHash, and
> many others). The best overall hash function that I am currently aware of
> is Austin Appleby's MurmurHash3 (http://code.google.com/p/smhasher/).
>
> For LLVM's use, we want a hash function that can handle mixed data -
that
> is, pointers, ints, strings, and so on. Most of the high-performance hash
> functions will work well on mixed data types, but you have to put
> everything into a flat buffer - that is, an array of machine words whose
> starting address is aligned on a machine-word boundary. The inner loops of
> the hash functions are designed to take advantage of parallelism of the
> instruction pipeline, and if you try feeding in values one at a time
it's
> possible that you can lose a lot of speed. (Although I am not an expert in
> this area, so feel free to correct me on this point.) On the other hand, if
> your input values aren't already arranged into a flat buffer, the cost
of
> writing them to memory has to be taken into account.
>
> Also, most of the maps in LLVM are fairly small (<1000 entries), so the
> speed of the hash function itself is probably more important than getting
> the best possible mixing of bits.
>
> It seems that for LLVM's purposes, something that has an interface
similar
> to FoldingSetNodeID would make for an easy transition. One approach would
> be to start with something very much like FoldingSetNodeID, except with a
> fixed-length buffer instead of a SmallVector - the idea is that when you
> are about to overflow, instead of allocating more space, you would compute
> an intermediate hash value and then start over at the beginning of the
> buffer.
>
> Another question is whether or not you would want to replace the hash
> functions in DenseMapInfo, which are designed to be efficient for very
> small keys - most of the high-performance hash functions have a fairly
> substantial fixed overhead (usually in the form of a final mixing step) and
> thus only make sense for larger key sizes.
>
> --
> -- Talin
>


-- 
-- Talin
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20120209/245b0127/attachment.html>

Talin

2012-Feb-13 00:59 UTC

head link

[LLVMdev] We need better hashing

Here's my latest version of Hashing.h, which I propose to add to llvm/ADT.
Comments welcome and encouraged.

On Thu, Feb 9, 2012 at 11:23 AM, Talin <viridia at gmail.com> wrote:
> By the way, the reason I'm bringing this up is that a number of folks
are
> currently working on optimizing the use of hash tables within LLVM's
code
> base, and unless we can come up with a common hashing facility, there will
> be an increasing proliferation of cut & paste copies of hash functions.
So
> feedback would be nice.
>
>
> On Tue, Feb 7, 2012 at 10:58 PM, Talin <viridia at gmail.com> wrote:
>
>> LLVM currently has a bunch of different hashing algorithms scattered
>> throughout the code base.
>>
>> There's also a number of places in the code where a
FoldingSetNodeID is
>> created for the purpose of calculating a hash, and then discarded. From
an
>> efficiency standpoint, this isn't all that bad unless the number of
>> individual items being hashed > 32, at which point the SmallVector
>> overflows and memory is allocated.
>>
>> I personally want to see a better approach to hashing because of the
>> cleanup work I've been doing - I've been replacing std::map and
FoldingSet
>> with DenseMap in a number of places, and plan to do more of this. The
thing
>> is, for complex key types, you really need to have a custom
DenseMapInfo,
>> and that's where having a good hash function comes in.
>>
>> There are a bunch of hash functions out there (FNV1, SuperFastHash, and
>> many others). The best overall hash function that I am currently aware
of
>> is Austin Appleby's MurmurHash3
(http://code.google.com/p/smhasher/).
>>
>> For LLVM's use, we want a hash function that can handle mixed data
- that
>> is, pointers, ints, strings, and so on. Most of the high-performance
hash
>> functions will work well on mixed data types, but you have to put
>> everything into a flat buffer - that is, an array of machine words
whose
>> starting address is aligned on a machine-word boundary. The inner loops
of
>> the hash functions are designed to take advantage of parallelism of the
>> instruction pipeline, and if you try feeding in values one at a time
it's
>> possible that you can lose a lot of speed. (Although I am not an expert
in
>> this area, so feel free to correct me on this point.) On the other
hand, if
>> your input values aren't already arranged into a flat buffer, the
cost of
>> writing them to memory has to be taken into account.
>>
>> Also, most of the maps in LLVM are fairly small (<1000 entries), so
the
>> speed of the hash function itself is probably more important than
getting
>> the best possible mixing of bits.
>>
>> It seems that for LLVM's purposes, something that has an interface
>> similar to FoldingSetNodeID would make for an easy transition. One
approach
>> would be to start with something very much like FoldingSetNodeID,
except
>> with a fixed-length buffer instead of a SmallVector - the idea is that
when
>> you are about to overflow, instead of allocating more space, you would
>> compute an intermediate hash value and then start over at the beginning
of
>> the buffer.
>>
>> Another question is whether or not you would want to replace the hash
>> functions in DenseMapInfo, which are designed to be efficient for very
>> small keys - most of the high-performance hash functions have a fairly
>> substantial fixed overhead (usually in the form of a final mixing step)
and
>> thus only make sense for larger key sizes.
>>
>> --
>> -- Talin
>>
>
>
>
> --
> -- Talin
>


-- 
-- Talin
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20120212/664503d3/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Hashing.h
Type: text/x-chdr
Size: 4038 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20120212/664503d3/attachment.h>

Apparently Analagous Threads

Search for more apparently analagous threads

llvm dev - Feb 2012 - [LLVMdev] We need better hashing

[LLVMdev] We need better hashing

[LLVMdev] We need better hashing

[LLVMdev] We need better hashing

Apparently Analagous Threads