thr3ads.net - llvm dev - [LLVMdev] We need better hashing [Feb 2012]

Talin

2012-Feb-18 06:58 UTC

[LLVMdev] We need better hashing

On Fri, Feb 17, 2012 at 1:53 AM, Chandler Carruth <chandlerc at google.com> wrote:
> Jeffrey and I are working on future standard library functionality for
> hashing user defined types:
>
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3333.html
>
> I would much rather have an interface that is close to or mirrors this
> one. We already have some field experience with it, and using it in LLVM
> and Clang would provide more. Also, it would be possible to essentially
> share code between such an implementation and libc++.
>
> We looked closely at 'hasher' objects and using add methods on them, and
> they tended to have some serious drawbacks:
>
> 1) they require some amount of "incrementality", limiting the quality and
> performance of the hashing algorithm
> 2) they require more boilerplate
> 3) they compose recursively less cleanly
>
> Even given your interface, there is no actual requirement for an
> incremental hash. Simply store intermediate state in the object, and
> provide a 'finalize' step that produces the final hash code.
>
I'm not sure I understand what you are proposing here. I don't know what
you mean by "intermediate state". However, I really do need an incremental
hash for the various uniquing maps which I'm attempting to optimize. Take
for example the case of uniquing a constant array - the key consists of a
type* pointer and an array of constant*. Those data fields are not stored
contiguously in memory, so I need to hash them separately and then combine
the hashes. Being able to hash the data fields in place (as opposed to
copying them to a contiguous buffer) turns out to be a fairly significant
win for the uniquing maps - otherwise you end up having to do a malloc just
to look up a key, and that's going to be slower than any incremental hash
algorithm.
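
To make the requirement concrete, here is a minimal sketch of hashing such a key in place, field by field, with no temporary buffer. The names `Type`, `Constant`, `hashCombine`, `hashPointer`, and `hashConstantArrayKey` are hypothetical stand-ins chosen for illustration; this is not the proposed Hashing.h, and the mixing step is a simple placeholder rather than a tuned hash function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the real LLVM classes.
struct Type {};
struct Constant {};

// Fold one machine word into a running hash value.
// Illustrative mixing step only; a production hash would mix more thoroughly.
inline std::uint64_t hashCombine(std::uint64_t seed, std::uint64_t value) {
  seed ^= value + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2);
  return seed;
}

// Treat a pointer's address as the value to hash.
inline std::uint64_t hashPointer(const void *p) {
  return static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(p));
}

// Hash a (type, elements) key where the two fields live in different places.
// The element array is read in place; nothing is copied or allocated.
inline std::uint64_t hashConstantArrayKey(const Type *ty,
                                          const Constant *const *elems,
                                          std::size_t n) {
  assert(elems != nullptr || n == 0);
  std::uint64_t h = hashCombine(0, hashPointer(ty));
  for (std::size_t i = 0; i < n; ++i)
    h = hashCombine(h, hashPointer(elems[i]));
  return h;
}
```

A lookup in a uniquing map could call `hashConstantArrayKey` directly on the caller's operand array, so no key object has to be built just to compute the hash.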

> On Sun, Feb 12, 2012 at 4:59 PM, Talin <viridia at gmail.com> wrote:
>
>> Here's my latest version of Hashing.h, which I propose to add to
>> llvm/ADT. Comments welcome and encouraged.
>>
>>
>> On Thu, Feb 9, 2012 at 11:23 AM, Talin <viridia at gmail.com> wrote:
>>
>>> By the way, the reason I'm bringing this up is that a number of folks
>>> are currently working on optimizing the use of hash tables within LLVM's
>>> code base, and unless we can come up with a common hashing facility, there
>>> will be an increasing proliferation of cut & paste copies of hash
>>> functions. So feedback would be nice.
>>>
>>>
>>> On Tue, Feb 7, 2012 at 10:58 PM, Talin <viridia at gmail.com> wrote:
>>>
>>>> LLVM currently has a bunch of different hashing algorithms scattered
>>>> throughout the code base.
>>>>
>>>> There's also a number of places in the code where a FoldingSetNodeID is
>>>> created for the purpose of calculating a hash, and then discarded. From an
>>>> efficiency standpoint, this isn't all that bad unless the number of
>>>> individual items being hashed > 32, at which point the SmallVector
>>>> overflows and memory is allocated.
>>>>
>>>> I personally want to see a better approach to hashing because of the
>>>> cleanup work I've been doing - I've been replacing std::map and FoldingSet
>>>> with DenseMap in a number of places, and plan to do more of this. The thing
>>>> is, for complex key types, you really need to have a custom DenseMapInfo,
>>>> and that's where having a good hash function comes in.
>>>>
>>>> There are a bunch of hash functions out there (FNV1, SuperFastHash, and
>>>> many others). The best overall hash function that I am currently aware of
>>>> is Austin Appleby's MurmurHash3 (http://code.google.com/p/smhasher/).
>>>>
>>>> For LLVM's use, we want a hash function that can handle mixed data -
>>>> that is, pointers, ints, strings, and so on. Most of the high-performance
>>>> hash functions will work well on mixed data types, but you have to put
>>>> everything into a flat buffer - that is, an array of machine words whose
>>>> starting address is aligned on a machine-word boundary. The inner loops of
>>>> the hash functions are designed to take advantage of parallelism of the
>>>> instruction pipeline, and if you try feeding in values one at a time it's
>>>> possible that you can lose a lot of speed. (Although I am not an expert in
>>>> this area, so feel free to correct me on this point.) On the other hand, if
>>>> your input values aren't already arranged into a flat buffer, the cost of
>>>> writing them to memory has to be taken into account.
>>>>
>>>> Also, most of the maps in LLVM are fairly small (<1000 entries), so the
>>>> speed of the hash function itself is probably more important than getting
>>>> the best possible mixing of bits.
>>>>
>>>> It seems that for LLVM's purposes, something that has an interface
>>>> similar to FoldingSetNodeID would make for an easy transition. One approach
>>>> would be to start with something very much like FoldingSetNodeID, except
>>>> with a fixed-length buffer instead of a SmallVector - the idea is that when
>>>> you are about to overflow, instead of allocating more space, you would
>>>> compute an intermediate hash value and then start over at the beginning of
>>>> the buffer.
>>>>
>>>> Another question is whether or not you would want to replace the hash
>>>> functions in DenseMapInfo, which are designed to be efficient for very
>>>> small keys - most of the high-performance hash functions have a fairly
>>>> substantial fixed overhead (usually in the form of a final mixing step) and
>>>> thus only make sense for larger key sizes.
>>>>
>>>> --
>>>> -- Talin
>>>>
>>>
>>>
>>>
>>> --
>>> -- Talin
>>>
>>
>>
>>
>> --
>> -- Talin
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>>
>

-- 
-- Talin

Chandler Carruth

2012-Feb-18 08:00 UTC

[LLVMdev] We need better hashing

On Fri, Feb 17, 2012 at 10:58 PM, Talin <viridia at gmail.com> wrote:
> However, I really do need an incremental hash for the various uniquing
> maps which I'm attempting to optimize. Take for example the case of
> uniquing a constant array - the key consists of a type* pointer and an
> array of constant*. Those data fields are not stored contiguously in
> memory, so I need to hash them separately and then combine the hashes.
> Being able to hash the data fields in place (as opposed to copying them to
> a contiguous buffer) turns out to be a fairly significant win for the
> uniquing maps - otherwise you end up having to do a malloc just to look up
> a key, and that's going to be slower than any incremental hash algorithm.

I think you have a different idea of what 'incremental hash' means from
what I do, and that's leading to the confusion, because we're talking about
two very different things.

The term "incremental hash function" usually is a term of art referring to
a hash function with the following property:

Given a series of bytes (or other units if you like) in 's', and using a
python-like syntax:

 H(s) = H(H(s[:-1]), s[-1])

This means that you can re-use a hash for a smaller chunk of data when
computing the hash for a larger chunk. I don't think this is what you're
looking for, and I know that most efficient, high-quality,
non-cryptographic hash functions don't satisfy this property.
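
For illustration, FNV-1a (a byte-at-a-time hash from the same family as the FNV1 mentioned earlier in the thread) does have this property when the hash value is also the running state. The sketch below uses the standard 64-bit FNV offset basis and prime; the function names `fnv1aStep` and `fnv1a` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Standard 64-bit FNV parameters.
constexpr std::uint64_t kFnvOffset = 14695981039346656037ULL;
constexpr std::uint64_t kFnvPrime = 1099511628211ULL;

// Extend an existing hash by one byte: this is the H(h, s[-1]) step.
inline std::uint64_t fnv1aStep(std::uint64_t h, unsigned char byte) {
  return (h ^ byte) * kFnvPrime;
}

// Hash a whole byte sequence by repeatedly applying the step function.
inline std::uint64_t fnv1a(const unsigned char *data, std::size_t len) {
  assert(data != nullptr || len == 0);
  std::uint64_t h = kFnvOffset;
  for (std::size_t i = 0; i < len; ++i)
    h = fnv1aStep(h, data[i]);
  return h;
}
```

With these definitions, `fnv1a(s, n)` equals `fnv1aStep(fnv1a(s, n - 1), s[n - 1])` for any `n > 0`, so the hash of a prefix can be reused. Hashes that process whole blocks and run a final mixing pass generally cannot be extended this way.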

What you're talking about is just being able to hash a non-contiguous set
of data. That is clearly important, but there are a lot of ways to achieve
it.

Most of the functions I'm referring to are essentially block based. For
example, CityHash is based on a 64-byte block hashing system. While this
can necessitate copying the data, it should never require a malloc. You
simply fill a block, and flush it out. The block can be on the stack, and
modern hashing algorithms will find it much faster to memcpy small
(pointer-sized) bits of data into a single contiguous N-byte (64 in the
case of city) block first, so I don't think this "costs" us anything; we
actually want to copy the data in, and do the hashing algorithm once.
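
A rough sketch of the fill-and-flush idea, under loud assumptions: `BlockHasher` is a hypothetical class, and the per-block mixing below is a simple placeholder, not CityHash or any real block function. The point is only the shape: values are memcpy'd into a fixed 64-byte buffer that can live on the stack, the buffer is folded into the state when it fills, and a finalize step produces the hash code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

class BlockHasher {
public:
  static constexpr std::size_t kBlockSize = 64;

  // Copy one small value into the block, flushing first if it would not fit.
  template <typename T> void add(const T &value) {
    static_assert(std::is_trivially_copyable<T>::value, "memcpy-able types only");
    static_assert(sizeof(T) <= kBlockSize, "value must fit in one block");
    if (used_ + sizeof(T) > kBlockSize)
      flush();
    std::memcpy(buffer_ + used_, &value, sizeof(T));
    used_ += sizeof(T);
  }

  // Fold in any partial block and apply a final mixing pass.
  std::uint64_t finalize() {
    if (used_ > 0)
      flush();
    std::uint64_t h = state_;
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    return h;
  }

private:
  // Placeholder block function: zero-pad the tail, then mix 8 bytes at a time.
  void flush() {
    assert(used_ <= kBlockSize);
    std::memset(buffer_ + used_, 0, kBlockSize - used_);
    for (std::size_t i = 0; i < kBlockSize; i += 8) {
      std::uint64_t word;
      std::memcpy(&word, buffer_ + i, sizeof(word));
      state_ = (state_ ^ word) * 0x9e3779b97f4a7c15ULL;
      state_ ^= state_ >> 29;
    }
    state_ ^= used_; // record how many bytes were real data
    used_ = 0;
  }

  unsigned char buffer_[kBlockSize];
  std::size_t used_ = 0;
  std::uint64_t state_ = 0;
};
```

A caller would construct a `BlockHasher` on the stack, call `add` once per pointer or integer field, and then call `finalize`. No heap allocation occurs regardless of how many values are added.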

Note that the current CityHash algorithm (especially in its current
implementation) isn't really set up for this, but we've talked to the author
of it as well as Austin when we were designing the standard interface, and
they seemed very positive on being able to adapt these high-quality hashing
algorithms to the proposed interface.

Talin

2012-Feb-18 08:58 UTC

[LLVMdev] We need better hashing

On Sat, Feb 18, 2012 at 12:00 AM, Chandler Carruth <chandlerc at google.com> wrote:
> On Fri, Feb 17, 2012 at 10:58 PM, Talin <viridia at gmail.com> wrote:
>
>> However, I really do need an incremental hash for the various uniquing
>> maps which I'm attempting to optimize. Take for example the case of
>> uniquing a constant array - the key consists of a type* pointer and an
>> array of constant*. Those data fields are not stored contiguously in
>> memory, so I need to hash them separately and then combine the hashes.
>> Being able to hash the data fields in place (as opposed to copying them to
>> a contiguous buffer) turns out to be a fairly significant win for the
>> uniquing maps - otherwise you end up having to do a malloc just to look up
>> a key, and that's going to be slower than any incremental hash algorithm.
>
>
> I think you have a different idea of what 'incremental hash' means from
> what I do, and that's leading to the confusion, because we're talking about
> two very different things.
>
> The term "incremental hash function" usually is a term of art referring to
> a hash function with the following property:
>
> Given a series of bytes (or other units if you like) in 's', and using a
> python-like syntax:
>
>  H(s) = H(H(s[:-1]), s[-1])
>
> This means that you can re-use a hash for a smaller chunk of data when
> computing the hash for a larger chunk. I don't think this is what you're
> looking for, and I know that most efficient, high-quality,
> non-cryptographic hash functions don't satisfy this property.
>
>
> What you're talking about is just being able to hash a non-contiguous set
> of data. That is clearly important, but there are a lot of ways to achieve
> it.
>
> Most of the functions I'm referring to are essentially block based. For
> example, CityHash is based on a 64-byte block hashing system. While this
> can necessitate copying the data, it should never require a malloc. You
> simply fill a block, and flush it out. The block can be on the stack, and
> modern hashing algorithms will find it much faster to memcpy small
> (pointer-sized) bits of data into a single contiguous N-byte (64 in the
> case of city) block first, so I don't think this "costs" us anything; we
> actually want to copy the data in, and do the hashing algorithm once.
>
> Note that the current CityHash algorithm (especially in its current
> implementation) isn't really set up for this, but we've talked to the author
> of it as well as Austin when we were designing the standard interface, and
> they seemed very positive on being able to adapt these high-quality hashing
> algorithms to the proposed interface.
>
OK. I actually coded something like this in an earlier incarnation of the
patch, and at some point it kind of got optimized away. But that's
unimportant - what I am primarily interested in is the interface, not the
hashing function per se. (Note that the interface didn't change much when I
converted from a block-based version to the current one. Which is not
surprising, as the current interface is essentially isomorphic to the one
used by FoldingSet.)

One of the things I'm focusing on right now is taking old container classes
that were written before the advent of ArrayRef and modernizing them so
that they don't do so much allocating and copying of keys. All of this
hashing stuff is merely yak shaving as far as I am concerned - if someone
has a better idea I'm open to it, as long as they understand what my
requirements are, and the fact that my work on the containers is kind of
blocked until this gets resolved.

-- 
-- Talin