On Tue, Feb 14, 2012 at 2:44 AM, Chris Lattner <clattner at apple.com> wrote:> On Feb 13, 2012, at 2:00 AM, Talin wrote: > > Just out of curiosity, why not MurmurHash3 ? This page seems to >> suggest that #2 has some flaw, and #3 is better all round: >> >> https://sites.google.com/site/murmurhash/ >> >> The main reason is because there's no incremental version of 3. > > > I think that that is a great reason. > > LLVM's needs, on the other hand, are fairly modest. I'm guessing that most > DenseMaps contain less than a few thousand entries. Even a bad hash > function wouldn't hurt performance that much, and the time taken to > calculate the hash is probably more of a factor in overall performance than > getting the optimum distribution of hash values. > > > Indeed. It can't be hard to be better than our existing adhoc stuff :). > That said, please do not change the hash function used by StringMap > without do careful performance analysis of the clang preprocessor. The > identifier uniquing is a very hot path in "clang -E" performance. > > I'm not planning on touching StringMap.> >> Would it be possible to use CityHash instead for strings? >> >> http://code.google.com/p/cityhash/ >> >> OK by me. My intention is that "Hashing.h" should contain a variety of > different hashing algorithms for various specific needs. Right now, I am > mainly focusing on hashing of mixed data types - that is, you have some > structure which contains pointers, ints, strings, and you want to calculate > a hash of the entire struct. I need this because I'm working on improving > the uniquing of constants and similar data structures. My next target is to > improve the uniquing of MDNodes, but I want to get the hashing stuff > squared away first. > > It is also my intent that some person who is more of an expert in hashing > than I am can do detailed performance analysis under real-world loads (such > as compiling actual programs with clang) and tweak and optimize the hashing > algorithm, without needing to modify the API and/or all of the places that > call it. > > > I think that this is a great way to stage it. I personally care more > about the interface than the implementation. Someone can tweak and tune it > after enough code is using the API to get reasonable performance numbers. > > > #include "llvm/ADT/ArrayRef.h" > #include "llvm/ADT/StringRef.h" > #include "llvm/Support/Compiler.h" > #include "llvm/Support/DataTypes.h" > #include "llvm/Support/PointerLikeTypeTraits.h" > #include "llvm/Support/type_traits.h" > > Do you actually need all of these includes? PointerLikeTypeTraits doesn't > seem necessary. Is type_traits? > > Ooops, this was a cut & paste error from FoldingSet.cpp.> enum { > BufferSize = 32, > > BufferSize is dead. > > > /// Add a pointer value > template<typename T> > void add(const T *PtrVal) { > addImpl( > reinterpret_cast<const uint32_t *>(&PtrVal), > reinterpret_cast<const uint32_t *>(&PtrVal + 1)); > } > > This violates TBAA rules and looks pretty dangerous to expose as public > API. Is this really needed? Also, addImpl is dereferencing the pointers > as uint32_t's, but there is nothing that guarantees that T is a multiple of > 4 bytes. The ArrayRef version has the same problem. > > So as far as hashing pointer values is concerned, I was just copying thecode from FoldingSet. Since a lot of the keys that we're going to be dealing with are uniqued pointers, it makes sense to be able to calculate a hash of the bit-value of the pointer, rather than hashing the thing pointed to. That being said, renaming it to "addPointer" and adding a comment might be in order. Similarly, I can make the ArrayRef version 'addPointers' and have it take an ArrayRef<T*>. Now, as to the 4 bytes issue, I think I can solve that with some clever template methods.> Though it is more verbose, I think it would be better to expose a template > specialization approach to getting the hash_value of T. > > /// Add a float > void add(float Data) { > addImpl( > reinterpret_cast<const uint32_t *>(&Data), > reinterpret_cast<const uint32_t *>(&Data + 1)); > } > > /// Add a double > void add(double Data) { > addImpl( > reinterpret_cast<const uint32_t *>(&Data), > reinterpret_cast<const uint32_t *>(&Data + 1)); > } > > Similarly, these need to go through a union to avoid TBAA problems. > > I'm not sure how that works. Can you give an example?> > void add(StringRef StrVal) { > addImpl(StrVal.data(), StrVal.size()); > } > > I'm contradicting my stance above about not caring about the > implementation :), but is MurmurHash a good hash for string data? > The Bernstein hash function works really well and is much cheaper to > compute than Murmur. It is used by HashString (and thus by StringMap). > > So, MurmurHash is intended for blocks of arbitrary binary data, which maycontain character data, integers, or whatever - it's designed to do such a thorough job of mixing the bits that it really doesn't matter what data types you feed it. You are correct that for purely string data, you'd want to use a less expensive algorithm (I'm partial to FNV-1, which is as cheap as the Bernstein hash and is AFAIK more mathematically sound.)> // Add a possibly unaligned sequence of bytes. > void addImpl(const char *I, size_t Length) { > > This should probably be moved out of line to avoid code bloat. >OK> > Overall, the design of the class is making sense to me! Thanks for > tackling this! > > -Chris > > >-- -- Talin -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120214/4dc54ce6/attachment.html>
On Tue, Feb 14, 2012 at 10:47 PM, Talin <viridia at gmail.com> wrote:> On Tue, Feb 14, 2012 at 2:44 AM, Chris Lattner <clattner at apple.com> wrote: > >> On Feb 13, 2012, at 2:00 AM, Talin wrote: >> >> Just out of curiosity, why not MurmurHash3 ? This page seems to >>> suggest that #2 has some flaw, and #3 is better all round: >>> >>> https://sites.google.com/site/murmurhash/ >>> >>> The main reason is because there's no incremental version of 3. >> >> >> I think that that is a great reason. >> >> LLVM's needs, on the other hand, are fairly modest. I'm guessing that >> most DenseMaps contain less than a few thousand entries. Even a bad hash >> function wouldn't hurt performance that much, and the time taken to >> calculate the hash is probably more of a factor in overall performance than >> getting the optimum distribution of hash values. >> >> >> Indeed. It can't be hard to be better than our existing adhoc stuff :). >> That said, please do not change the hash function used by StringMap >> without do careful performance analysis of the clang preprocessor. The >> identifier uniquing is a very hot path in "clang -E" performance. >> >> I'm not planning on touching StringMap. > >> >>> Would it be possible to use CityHash instead for strings? >>> >>> http://code.google.com/p/cityhash/ >>> >>> OK by me. My intention is that "Hashing.h" should contain a variety of >> different hashing algorithms for various specific needs. Right now, I am >> mainly focusing on hashing of mixed data types - that is, you have some >> structure which contains pointers, ints, strings, and you want to calculate >> a hash of the entire struct. I need this because I'm working on improving >> the uniquing of constants and similar data structures. My next target is to >> improve the uniquing of MDNodes, but I want to get the hashing stuff >> squared away first. >> >> It is also my intent that some person who is more of an expert in hashing >> than I am can do detailed performance analysis under real-world loads (such >> as compiling actual programs with clang) and tweak and optimize the hashing >> algorithm, without needing to modify the API and/or all of the places that >> call it. >> >> >> I think that this is a great way to stage it. I personally care more >> about the interface than the implementation. Someone can tweak and tune it >> after enough code is using the API to get reasonable performance numbers. >> >> >> #include "llvm/ADT/ArrayRef.h" >> #include "llvm/ADT/StringRef.h" >> #include "llvm/Support/Compiler.h" >> #include "llvm/Support/DataTypes.h" >> #include "llvm/Support/PointerLikeTypeTraits.h" >> #include "llvm/Support/type_traits.h" >> >> Do you actually need all of these includes? PointerLikeTypeTraits >> doesn't seem necessary. Is type_traits? >> >> Ooops, this was a cut & paste error from FoldingSet.cpp. > > >> enum { >> BufferSize = 32, >> >> BufferSize is dead. >> >> >> /// Add a pointer value >> template<typename T> >> void add(const T *PtrVal) { >> addImpl( >> reinterpret_cast<const uint32_t *>(&PtrVal), >> reinterpret_cast<const uint32_t *>(&PtrVal + 1)); >> } >> >> This violates TBAA rules and looks pretty dangerous to expose as public >> API. Is this really needed? Also, addImpl is dereferencing the pointers >> as uint32_t's, but there is nothing that guarantees that T is a multiple of >> 4 bytes. The ArrayRef version has the same problem. >> >> So as far as hashing pointer values is concerned, I was just copying the > code from FoldingSet. Since a lot of the keys that we're going to be > dealing with are uniqued pointers, it makes sense to be able to calculate a > hash of the bit-value of the pointer, rather than hashing the thing pointed > to. That being said, renaming it to "addPointer" and adding a comment might > be in order. Similarly, I can make the ArrayRef version 'addPointers' and > have it take an ArrayRef<T*>. > > Also, something I forgot to mention. Since it's possible to convert anysingle T into an ArrayRef of size 1, then technically you could just have one method that accepts an ArrayRef<T> which would work for all cases - all of the other methods are just optimizations. In which case, my question is do you still need a union? In other words, would the following work as expected? void add(float Data) { addArray(ArrayRef<float>(Data)); }> Now, as to the 4 bytes issue, I think I can solve that with some clever > template methods. > > >> Though it is more verbose, I think it would be better to expose a >> template specialization approach to getting the hash_value of T. >> >> /// Add a float >> void add(float Data) { >> addImpl( >> reinterpret_cast<const uint32_t *>(&Data), >> reinterpret_cast<const uint32_t *>(&Data + 1)); >> } >> >> /// Add a double >> void add(double Data) { >> addImpl( >> reinterpret_cast<const uint32_t *>(&Data), >> reinterpret_cast<const uint32_t *>(&Data + 1)); >> } >> >> Similarly, these need to go through a union to avoid TBAA problems. >> >> I'm not sure how that works. Can you give an example? > >> >> void add(StringRef StrVal) { >> addImpl(StrVal.data(), StrVal.size()); >> } >> >> I'm contradicting my stance above about not caring about the >> implementation :), but is MurmurHash a good hash for string data? >> The Bernstein hash function works really well and is much cheaper to >> compute than Murmur. It is used by HashString (and thus by StringMap). >> >> So, MurmurHash is intended for blocks of arbitrary binary data, which may > contain character data, integers, or whatever - it's designed to do such a > thorough job of mixing the bits that it really doesn't matter what data > types you feed it. You are correct that for purely string data, you'd want > to use a less expensive algorithm (I'm partial to FNV-1, which is as cheap > as the Bernstein hash and is AFAIK more mathematically sound.) > > >> // Add a possibly unaligned sequence of bytes. >> void addImpl(const char *I, size_t Length) { >> >> This should probably be moved out of line to avoid code bloat. >> > > OK > >> >> Overall, the design of the class is making sense to me! Thanks for >> tackling this! >> >> -Chris >> >> >> > > > -- > -- Talin >-- -- Talin -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120214/38a645ca/attachment.html>
On Feb 14, 2012, at 10:47 PM, Talin wrote:> /// Add a pointer value > template<typename T> > void add(const T *PtrVal) { > addImpl( > reinterpret_cast<const uint32_t *>(&PtrVal), > reinterpret_cast<const uint32_t *>(&PtrVal + 1)); > } > > This violates TBAA rules and looks pretty dangerous to expose as public API. Is this really needed? Also, addImpl is dereferencing the pointers as uint32_t's, but there is nothing that guarantees that T is a multiple of 4 bytes. The ArrayRef version has the same problem. > > So as far as hashing pointer values is concerned, I was just copying the code from FoldingSet. Since a lot of the keys that we're going to be dealing with are uniqued pointers, it makes sense to be able to calculate a hash of the bit-value of the pointer, rather than hashing the thing pointed to. That being said, renaming it to "addPointer" and adding a comment might be in order. Similarly, I can make the ArrayRef version 'addPointers' and have it take an ArrayRef<T*>.Ah, Jay was right, I misread this code!> Though it is more verbose, I think it would be better to expose a template specialization approach to getting the hash_value of T. > > /// Add a float > void add(float Data) { > addImpl( > reinterpret_cast<const uint32_t *>(&Data), > reinterpret_cast<const uint32_t *>(&Data + 1)); > } > > /// Add a double > void add(double Data) { > addImpl( > reinterpret_cast<const uint32_t *>(&Data), > reinterpret_cast<const uint32_t *>(&Data + 1)); > } > > Similarly, these need to go through a union to avoid TBAA problems. > > I'm not sure how that works. Can you give an example?Just use: void add(double Data) { union { double D; uint64_t I; }; D = Data; add(I); }> > void add(StringRef StrVal) { > addImpl(StrVal.data(), StrVal.size()); > } > > I'm contradicting my stance above about not caring about the implementation :), but is MurmurHash a good hash for string data? The Bernstein hash function works really well and is much cheaper to compute than Murmur. It is used by HashString (and thus by StringMap). > > So, MurmurHash is intended for blocks of arbitrary binary data, which may contain character data, integers, or whatever - it's designed to do such a thorough job of mixing the bits that it really doesn't matter what data types you feed it. You are correct that for purely string data, you'd want to use a less expensive algorithm (I'm partial to FNV-1, which is as cheap as the Bernstein hash and is AFAIK more mathematically sound.)Ok, so what's the answer? :) We can do different things for ArrayRef<char> and StringRef. -Chris -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120215/d43c6f03/attachment.html>
OK here's a patch with the latest, including unit tests. I've also tried to make the comments clear that this is intended for the case of "generic" key types, where you are either hashing multiple data types together, or you don't know in advance what the data type will be - for cases such as strings where a tailored algorithm is available, you should use that instead of this. There's a couple of things I don't like: First, there's too many levels of inlined function calls - my experience is that compilers aren't as good at inlining nested calls as is often advertised, however I couldn't figure out a way to get the same effect without the nested calls. Similarly the addBitsImpl helper types are a little too complex and magical for my taste, but again I couldn't think of a better way to automatically detect whether to use the aligned vs. unaligned hashing routine. On Wed, Feb 15, 2012 at 12:01 PM, Chris Lattner <clattner at apple.com> wrote:> On Feb 14, 2012, at 10:47 PM, Talin wrote: > > /// Add a pointer value >> template<typename T> >> void add(const T *PtrVal) { >> addImpl( >> reinterpret_cast<const uint32_t *>(&PtrVal), >> reinterpret_cast<const uint32_t *>(&PtrVal + 1)); >> } >> >> This violates TBAA rules and looks pretty dangerous to expose as public >> API. Is this really needed? Also, addImpl is dereferencing the pointers >> as uint32_t's, but there is nothing that guarantees that T is a multiple of >> 4 bytes. The ArrayRef version has the same problem. >> >> So as far as hashing pointer values is concerned, I was just copying the > code from FoldingSet. Since a lot of the keys that we're going to be > dealing with are uniqued pointers, it makes sense to be able to calculate a > hash of the bit-value of the pointer, rather than hashing the thing pointed > to. That being said, renaming it to "addPointer" and adding a comment might > be in order. Similarly, I can make the ArrayRef version 'addPointers' and > have it take an ArrayRef<T*>. > > > Ah, Jay was right, I misread this code! > > Though it is more verbose, I think it would be better to expose a template >> specialization approach to getting the hash_value of T. >> >> /// Add a float >> void add(float Data) { >> addImpl( >> reinterpret_cast<const uint32_t *>(&Data), >> reinterpret_cast<const uint32_t *>(&Data + 1)); >> } >> >> /// Add a double >> void add(double Data) { >> addImpl( >> reinterpret_cast<const uint32_t *>(&Data), >> reinterpret_cast<const uint32_t *>(&Data + 1)); >> } >> >> Similarly, these need to go through a union to avoid TBAA problems. >> >> I'm not sure how that works. Can you give an example? > > > Just use: > > void add(double Data) { > union { > double D; uint64_t I; > }; > D = Data; > add(I); > } > > >> void add(StringRef StrVal) { >> addImpl(StrVal.data(), StrVal.size()); >> } >> >> I'm contradicting my stance above about not caring about the >> implementation :), but is MurmurHash a good hash for string data? >> The Bernstein hash function works really well and is much cheaper to >> compute than Murmur. It is used by HashString (and thus by StringMap). >> >> So, MurmurHash is intended for blocks of arbitrary binary data, which may > contain character data, integers, or whatever - it's designed to do such a > thorough job of mixing the bits that it really doesn't matter what data > types you feed it. You are correct that for purely string data, you'd want > to use a less expensive algorithm (I'm partial to FNV-1, which is as cheap > as the Bernstein hash and is AFAIK more mathematically sound.) > > > Ok, so what's the answer? :) We can do different things for > ArrayRef<char> and StringRef. > > -Chris > > > >-- -- Talin -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120217/82cde9c0/attachment.html> -------------- next part -------------- A non-text attachment was scrubbed... Name: hashing.patch Type: application/octet-stream Size: 11682 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120217/82cde9c0/attachment.obj>