Justin Lebar via llvm-dev
2016-Oct-14 18:58 UTC
[llvm-dev] RFC: Reducing the number of set classes in ADT
tl;dr: I think we have too many different set classes. They have incompatible APIs and surprising behavior. I think we can reduce their number substantially. Dear all, The following is a subset of the set classes we have in ADT: * DenseSet * SmallDenseSet (https://reviews.llvm.org/D25628) * SetVector * SmallSetVector * SmallSet * SmallPtrSet * StringSet * FoldingSet * ContextualFoldingSet * FoldingSetVector Every one of these classes has quirks. Here are some particularly bad ones: * SmallSet falls back to std::set<T> when it runs out of inline storage (which itself is pretty bad because std::set will pwn your caches). But if T is a pointer, SmallSet is an alias for SmallPtrSet, which has a completely different API, and has completely different performance characteristics. "It's vector<bool> all over again." As a concrete example, I see a lot of code using SmallSet<std::pair<T*, U*>>, and it would be reasonable to assume this is similar to SmallSet<T*>. But it's not. * SetVector's set is a DenseSet, but SmallSetVector's set is a SmallSet, which uses a std::set, not a DenseSet. Using DenseSet for one and std::set for the other means that, in addition to getting different performance characteristics, the set of key types that work with SetVector and SmallSetVector are rather different. For example, today you can use std::string as the key of a SmallSetVector, but not a SetVector. (Although using std::string inside of SmallSetVector is kind of perverse, because unless your strings are all very short, you're going to malloc twice for every string you insert.) * FoldingSet uses chaining, which is likely less memory- and time-efficient than open addressing (which pretty much all of our other set classes use). * SmallPtrSet has an interface that lets you operate on an instance without knowing its inline size (SmallPtrSetImpl), but SmallSetVector and SmallDenseSet don't have this. But more to my point, each of these sets has a different API. Some support limited C++11-isms, others don't. Most don't support the full range of stl set operations, but they all pick and choose a different subset of operations to support. And many add their own API extensions, which are of course all different. I think we can do better than this. It seems to me that ideally we'd need only four set classes, down from ten: * DenseSet * SmallDenseSet * SetVector (uses std::vector and DenseMap) * SmallSetVector (uses SmallVector and SmallDenseMap) Let me try to convince you that this is feasible. * SmallPtrSet is already basically the same as SmallDenseSet. * StringSet is just a DenseSet with a special key type that's an owning string, but where you can do lookups given a StringRef. This is like DenseMap::find_as. * SmallSet is usually used with pointer types, so in those cases is just another name for SmallPtrSet. In other cases, we can use DenseSet (in the worst case by wrapping the key type in a class that adds "is empty" and "is toombstone" bits). * I may be wrong about this one, but at first blush, it seems that FoldingSet is morally equivalent to DenseMap::find_as. I hope the advantages of having fewer set classes is clear, but to be explicit: This would simplify our code and reduce surprising behavior. We wouldn't need to remember all of the API quirks of the different set classes. And it would let us focus our efforts on making one set of great set APIs, instead of adding API improvements piecemeal across all of our classes. More wood, fewer arrows, that sort of thing. As another benefit, having one base hashtable implementation would mean that if someone wanted to try implementing a newfangled hashtable algorithm (cuckoo or Robin Hood hashing, anyone?), we'd reap the benefit across all of llvm's sets. Obviously this is a large change, and I don't propose we make it quickly or all at once. Here's how I currently imagine we'd do it: 1. Switch SmallSetVector to use SmallDenseSet. This is a relatively small change; we just have to fix up some users who use keys that aren't compatible with DenseSet. 2. Get rid of SmallSet, replacing it with SmallDenseSet and SmallPtrSet. This will involve a) making a SmallDenseSetImpl (like SmallPtrSetImpl), so that you can operate on SmallDenseSets without knowing their size, and b) fixing up users whose keys that aren't compatible with DenseSet (like (1)). Part (a) will also give us a SmallDenseMapImpl class, which should be useful. 3. Get rid of SmallPtrSet, replacing it with SmallDenseSet. 4. Get rid of StringSet, replacing it with DenseSet (and improving DenseSet's "heterogeneous lookup" APIs so that this retains approximately as nice an API as StringSet currently has). 5. Get rid of FoldingSet? Again I haven't looked too deeply into this one. These steps are roughly in increasing order of complexity, and I think any prefix of this list is still worth doing. I'm going to start sending out patches for (1), which is a small change that I hope is uncontroversial. I expect even if we agree on the larger plan that this will be a background task for me, so I don't anticipate getting through it all anytime particularly soon. Help would, of course, be appreciated! Regards, -Justin
Krzysztof Parzyszek via llvm-dev
2016-Oct-14 19:55 UTC
[llvm-dev] RFC: Reducing the number of set classes in ADT
On 10/14/2016 1:58 PM, Justin Lebar via llvm-dev wrote:> * SmallSet falls back to std::set<T> when it runs out of inline storage (which > itself is pretty bad because std::set will pwn your caches).Can we use std::unordered_set instead? -Krzysztof -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Justin Lebar via llvm-dev
2016-Oct-15 22:59 UTC
[llvm-dev] RFC: Reducing the number of set classes in ADT
> Can we use std::unordered_set instead?We could, but my preference is just to remove SmallSet, if possible. std::unordered_set is still worse than llvm::DenseSet for most purposes, I expect. FYI I've just sent out a patch queue for part (1) of this plan. D25628 [ADT] Add SmallDenseSet. D25629 [ADT] Add an initializer_list constructor to {Small,}DenseSet. D25643 [IR] Add DenseMapInfo<CallSite>. D25644 [ADT] Move CachedHashString to a own header in ADT, and rename to CachedHashStringRef. D25645 [ADT] Add CachedHashString. D25646 Use CachedHashStringRef instead of CachedHash<StringRef>. D25630 [ADT] Remove CachedHash, which is unused. D25647 [clang-tidy] Don't use a SmallSetVector of an enum. D25648 Switch SmallSetVector to use DenseSet when it overflows its inline space. On Fri, Oct 14, 2016 at 12:55 PM, Krzysztof Parzyszek via llvm-dev <llvm-dev at lists.llvm.org> wrote:> On 10/14/2016 1:58 PM, Justin Lebar via llvm-dev wrote: >> >> * SmallSet falls back to std::set<T> when it runs out of inline storage >> (which >> itself is pretty bad because std::set will pwn your caches). > > > Can we use std::unordered_set instead? > > -Krzysztof > > -- > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by > The Linux Foundation > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev