Justin Lebar via llvm-dev
2016-Oct-14  18:58 UTC
[llvm-dev] RFC: Reducing the number of set classes in ADT
tl;dr: I think we have too many different set classes.  They have incompatible
APIs and surprising behavior. I think we can reduce their number substantially.
Dear all,
The following is a subset of the set classes we have in ADT:
 * DenseSet
 * SmallDenseSet (https://reviews.llvm.org/D25628)
 * SetVector
 * SmallSetVector
 * SmallSet
 * SmallPtrSet
 * StringSet
 * FoldingSet
 * ContextualFoldingSet
 * FoldingSetVector
Every one of these classes has quirks.  Here are some particularly bad ones:
* SmallSet falls back to std::set<T> when it runs out of inline storage
(which
  itself is pretty bad because std::set will pwn your caches).  But if T is a
  pointer, SmallSet is an alias for SmallPtrSet, which has a completely
  different API, and has completely different performance characteristics.
  "It's vector<bool> all over again."
  As a concrete example, I see a lot of code using SmallSet<std::pair<T*,
U*>>,
  and it would be reasonable to assume this is similar to SmallSet<T*>. 
But
  it's not.
* SetVector's set is a DenseSet, but SmallSetVector's set is a SmallSet,
which
  uses a std::set, not a DenseSet.  Using DenseSet for one and std::set for the
  other means that, in addition to getting different performance
  characteristics, the set of key types that work with SetVector and
  SmallSetVector are rather different.
  For example, today you can use std::string as the key of a SmallSetVector,
  but not a SetVector.  (Although using std::string inside of SmallSetVector is
  kind of perverse, because unless your strings are all very short, you're
  going to malloc twice for every string you insert.)
* FoldingSet uses chaining, which is likely less memory- and time-efficient
  than open addressing (which pretty much all of our other set classes use).
* SmallPtrSet has an interface that lets you operate on an instance without
  knowing its inline size (SmallPtrSetImpl), but SmallSetVector and
  SmallDenseSet don't have this.
But more to my point, each of these sets has a different API.  Some support
limited C++11-isms, others don't.  Most don't support the full range of
stl set
operations, but they all pick and choose a different subset of operations to
support.  And many add their own API extensions, which are of course all
different.
I think we can do better than this.
It seems to me that ideally we'd need only four set classes, down from ten:
 * DenseSet
 * SmallDenseSet
 * SetVector (uses std::vector and DenseMap)
 * SmallSetVector (uses SmallVector and SmallDenseMap)
Let me try to convince you that this is feasible.
 * SmallPtrSet is already basically the same as SmallDenseSet.
 * StringSet is just a DenseSet with a special key type that's an owning
   string, but where you can do lookups given a StringRef.  This is like
   DenseMap::find_as.
 * SmallSet is usually used with pointer types, so in those cases is just
   another name for SmallPtrSet.  In other cases, we can use DenseSet (in the
   worst case by wrapping the key type in a class that adds "is empty"
and "is
   toombstone" bits).
 * I may be wrong about this one, but at first blush, it seems that FoldingSet
   is morally equivalent to DenseMap::find_as.
I hope the advantages of having fewer set classes is clear, but to be explicit:
This would simplify our code and reduce surprising behavior.  We wouldn't
need
to remember all of the API quirks of the different set classes.  And it would
let us focus our efforts on making one set of great set APIs, instead of adding
API improvements piecemeal across all of our classes.  More wood, fewer arrows,
that sort of thing.
As another benefit, having one base hashtable implementation would mean that if
someone wanted to try implementing a newfangled hashtable algorithm (cuckoo or
Robin Hood hashing, anyone?), we'd reap the benefit across all of llvm's
sets.
Obviously this is a large change, and I don't propose we make it quickly or
all
at once.  Here's how I currently imagine we'd do it:
1. Switch SmallSetVector to use SmallDenseSet.
   This is a relatively small change; we just have to fix up some users who use
   keys that aren't compatible with DenseSet.
2. Get rid of SmallSet, replacing it with SmallDenseSet and SmallPtrSet.
   This will involve
   a) making a SmallDenseSetImpl (like SmallPtrSetImpl), so that you can
      operate on SmallDenseSets without knowing their size, and
   b) fixing up users whose keys that aren't compatible with DenseSet (like
      (1)).
   Part (a) will also give us a SmallDenseMapImpl class, which should be
   useful.
3. Get rid of SmallPtrSet, replacing it with SmallDenseSet.
4. Get rid of StringSet, replacing it with DenseSet (and improving
DenseSet's
   "heterogeneous lookup" APIs so that this retains approximately as
nice an
   API as StringSet currently has).
5. Get rid of FoldingSet?  Again I haven't looked too deeply into this one.
These steps are roughly in increasing order of complexity, and I think any
prefix of this list is still worth doing.
I'm going to start sending out patches for (1), which is a small change that
I
hope is uncontroversial.  I expect even if we agree on the larger plan that
this will be a background task for me, so I don't anticipate getting through
it
all anytime particularly soon.
Help would, of course, be appreciated!
Regards,
-Justin
Krzysztof Parzyszek via llvm-dev
2016-Oct-14  19:55 UTC
[llvm-dev] RFC: Reducing the number of set classes in ADT
On 10/14/2016 1:58 PM, Justin Lebar via llvm-dev wrote:> * SmallSet falls back to std::set<T> when it runs out of inline storage (which > itself is pretty bad because std::set will pwn your caches).Can we use std::unordered_set instead? -Krzysztof -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Justin Lebar via llvm-dev
2016-Oct-15  22:59 UTC
[llvm-dev] RFC: Reducing the number of set classes in ADT
> Can we use std::unordered_set instead?We could, but my preference is just to remove SmallSet, if possible. std::unordered_set is still worse than llvm::DenseSet for most purposes, I expect. FYI I've just sent out a patch queue for part (1) of this plan. D25628 [ADT] Add SmallDenseSet. D25629 [ADT] Add an initializer_list constructor to {Small,}DenseSet. D25643 [IR] Add DenseMapInfo<CallSite>. D25644 [ADT] Move CachedHashString to a own header in ADT, and rename to CachedHashStringRef. D25645 [ADT] Add CachedHashString. D25646 Use CachedHashStringRef instead of CachedHash<StringRef>. D25630 [ADT] Remove CachedHash, which is unused. D25647 [clang-tidy] Don't use a SmallSetVector of an enum. D25648 Switch SmallSetVector to use DenseSet when it overflows its inline space. On Fri, Oct 14, 2016 at 12:55 PM, Krzysztof Parzyszek via llvm-dev <llvm-dev at lists.llvm.org> wrote:> On 10/14/2016 1:58 PM, Justin Lebar via llvm-dev wrote: >> >> * SmallSet falls back to std::set<T> when it runs out of inline storage >> (which >> itself is pretty bad because std::set will pwn your caches). > > > Can we use std::unordered_set instead? > > -Krzysztof > > -- > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by > The Linux Foundation > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev