escha via llvm-dev
2015-Sep-14  16:37 UTC
[llvm-dev] RFC: speedups with instruction side-data (ADCE, perhaps others?)
I’ve been playing around with optimizing performance various passes and noticed something about ADCE: it keeps an Alive set that requires a lot of upkeep time-wise, but if each instruction had a /single bit/ of side data (to represent liveness, solely within the pass), the set wouldn’t be needed. I tested this out and observed a ~1/3 reduction in time in ADCE: 1454ms to 982ms according to a profile over our test suite (for a total of 0.6% compile time savings in our pipeline). Are there any other passes that could benefit from having a single bit (or similarly small amount) of per-Instruction metadata local to the pass, i.e. to avoid having to keep a side-Set of instructions that’s only used to test membership of the set (i.e. not iterated over)? Is this sort of thing something reasonable (maybe it could be stuffed in the SubclassData short and exposed via a public API?) —escha
escha via llvm-dev
2015-Sep-14  17:29 UTC
[llvm-dev] RFC: speedups with instruction side-data (ADCE, perhaps others?)
> On Sep 14, 2015, at 9:37 AM, escha via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > I’ve been playing around with optimizing performance various passes and noticed something about ADCE: it keeps an Alive set that requires a lot of upkeep time-wise, but if each instruction had a /single bit/ of side data (to represent liveness, solely within the pass), the set wouldn’t be needed. I tested this out and observed a ~1/3 reduction in time in ADCE: 1454ms to 982ms according to a profile over our test suite (for a total of 0.6% compile time savings in our pipeline). > > Are there any other passes that could benefit from having a single bit (or similarly small amount) of per-Instruction metadata local to the pass, i.e. to avoid having to keep a side-Set of instructions that’s only used to test membership of the set (i.e. not iterated over)? Is this sort of thing something reasonable (maybe it could be stuffed in the SubclassData short and exposed via a public API?)Here’s a simple prototype of the kind of thing I mean: diff --git a/include/llvm/IR/Value.h b/include/llvm/IR/Value.h index 9d83cef..d48f8f3 100644 --- a/include/llvm/IR/Value.h +++ b/include/llvm/IR/Value.h @@ -104,13 +104,17 @@ protected: /// /// Note, this should *NOT* be used directly by any class other than User. /// User uses this value to find the Use list. - enum : unsigned { NumUserOperandsBits = 29 }; + enum : unsigned { NumUserOperandsBits = 28 }; unsigned NumUserOperands : NumUserOperandsBits; bool IsUsedByMD : 1; bool HasName : 1; bool HasHungOffUses : 1; + /// \brief A 1-bit marker that can be used freely within passes, e.g. for + /// mark-sweep algorithms like ADCE, to avoid keeping a side data set. + bool Marker : 1; + private: template <typename UseT> // UseT == 'Use' or 'const Use' class use_iterator_impl @@ -388,6 +392,12 @@ public: /// \brief Return true if there is a value handle associated with this value. bool hasValueHandle() const { return HasValueHandle; } + /// \brief Return the current value of the Marker. + bool getMarker() const { return Marker; } + + /// \brief Set the current value of the Marker. + void setMarker(bool Val) { Marker = Val; } + /// \brief Return true if there is metadata referencing this value. bool isUsedByMetadata() const { return IsUsedByMD; } diff --git a/lib/Transforms/Scalar/ADCE.cpp b/lib/Transforms/Scalar/ADCE.cpp index 96a0107..5578860 100644 --- a/lib/Transforms/Scalar/ADCE.cpp +++ b/lib/Transforms/Scalar/ADCE.cpp @@ -53,15 +53,17 @@ bool ADCE::runOnFunction(Function& F) { if (skipOptnoneFunction(F)) return false; - SmallPtrSet<Instruction*, 128> Alive; + // SmallPtrSet<Instruction*, 128> Alive; SmallVector<Instruction*, 128> Worklist; // Collect the set of "root" instructions that are known live. for (Instruction &I : instructions(F)) { if (isa<TerminatorInst>(I) || isa<DbgInfoIntrinsic>(I) || I.isEHPad() || I.mayHaveSideEffects()) { - Alive.insert(&I); + I.setMarker(true); Worklist.push_back(&I); + } else { + I.setMarker(false); } } @@ -70,8 +72,10 @@ bool ADCE::runOnFunction(Function& F) { Instruction *Curr = Worklist.pop_back_val(); for (Use &OI : Curr->operands()) { if (Instruction *Inst = dyn_cast<Instruction>(OI)) - if (Alive.insert(Inst).second) + if (!Inst->getMarker()) { Worklist.push_back(Inst); + Inst->setMarker(true); + } } } @@ -80,7 +84,7 @@ bool ADCE::runOnFunction(Function& F) { // value of the function, and may therefore be deleted safely. // NOTE: We reuse the Worklist vector here for memory efficiency. for (Instruction &I : instructions(F)) { - if (!Alive.count(&I)) { + if (!I.getMarker()) { Worklist.push_back(&I); I.dropAllReferences(); }
Steve King via llvm-dev
2015-Sep-14  18:23 UTC
[llvm-dev] RFC: speedups with instruction side-data (ADCE, perhaps others?)
On Mon, Sep 14, 2015 at 9:37 AM, escha via llvm-dev <llvm-dev at lists.llvm.org> wrote:> Are there any other passes that could benefit from having a single bit (or similarly small amount) of per-Instruction metadata local to the pass, i.e. to avoid having to keep a side-Set of instructions...FWIW, I have a target specific pass that needs exactly this sort of tracking. Once you offer 1 bit, it's a slippery slope to giving a whole pointer. I like your idea regardless.
Daniel Berlin via llvm-dev
2015-Sep-14  18:34 UTC
[llvm-dev] RFC: speedups with instruction side-data (ADCE, perhaps others?)
I did something similar for dominators, for GVN, etc. All see significant speedups. However, the answer i got back when i mentioned this was "things like ptrset and densemap should only have a small performance difference from side data when used and sized right", and i've found this to mostly be true after looking harder. In the case you are looking at, i see: - SmallPtrSet<Instruction*, 128> Alive; This seems ... wrong. In fact, it seems optimally bad. This says the small ptr set size is 128. That is, the smallsize is 128. SmallPtrSet will linear search the array when it's size <= smallsize, and otherwise fall back to building a non-small array and using better algorithms. Linear searching a 128 member array == slow I bet if you change the 128 to 8, you will see significant speedups. On Mon, Sep 14, 2015 at 11:23 AM, Steve King via llvm-dev <llvm-dev at lists.llvm.org> wrote:> On Mon, Sep 14, 2015 at 9:37 AM, escha via llvm-dev > <llvm-dev at lists.llvm.org> wrote: >> Are there any other passes that could benefit from having a single bit (or similarly small amount) of per-Instruction metadata local to the pass, i.e. to avoid having to keep a side-Set of instructions... > > FWIW, I have a target specific pass that needs exactly this sort of > tracking. Once you offer 1 bit, it's a slippery slope to giving a > whole pointer. I like your idea regardless. > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Sean Silva via llvm-dev
2015-Sep-14  21:28 UTC
[llvm-dev] RFC: speedups with instruction side-data (ADCE, perhaps others?)
What is the contract between passes for the marker bit? I.e. do passes need to "clean up" the marker bit to a predetermined value after they run? (or they need to assume that it might be dirty? (like I think you do below)) -- Sean Silva On Mon, Sep 14, 2015 at 9:37 AM, escha via llvm-dev <llvm-dev at lists.llvm.org> wrote:> I’ve been playing around with optimizing performance various passes and > noticed something about ADCE: it keeps an Alive set that requires a lot of > upkeep time-wise, but if each instruction had a /single bit/ of side data > (to represent liveness, solely within the pass), the set wouldn’t be > needed. I tested this out and observed a ~1/3 reduction in time in ADCE: > 1454ms to 982ms according to a profile over our test suite (for a total of > 0.6% compile time savings in our pipeline). > > Are there any other passes that could benefit from having a single bit (or > similarly small amount) of per-Instruction metadata local to the pass, i.e. > to avoid having to keep a side-Set of instructions that’s only used to test > membership of the set (i.e. not iterated over)? Is this sort of thing > something reasonable (maybe it could be stuffed in the SubclassData short > and exposed via a public API?) > > —escha > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150914/d37e2e8f/attachment-0001.html>
Pete Cooper via llvm-dev
2015-Sep-14  21:31 UTC
[llvm-dev] RFC: speedups with instruction side-data (ADCE, perhaps others?)
> On Sep 14, 2015, at 2:28 PM, Sean Silva via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > What is the contract between passes for the marker bit? I.e. do passes need to "clean up" the marker bit to a predetermined value after they run? (or they need to assume that it might be dirty? (like I think you do below))Good questions, and to add to this, what happens if we run passes in parallel? I can imagine a world where we run 2 function passes in parallel so tagging instructions is ok, but what if we then started to run 2 BB passes in parallel, or used the tag to track something like GlobalValue. Going to break horribly in such scenarios. Pete> > -- Sean Silva > > On Mon, Sep 14, 2015 at 9:37 AM, escha via llvm-dev <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: > I’ve been playing around with optimizing performance various passes and noticed something about ADCE: it keeps an Alive set that requires a lot of upkeep time-wise, but if each instruction had a /single bit/ of side data (to represent liveness, solely within the pass), the set wouldn’t be needed. I tested this out and observed a ~1/3 reduction in time in ADCE: 1454ms to 982ms according to a profile over our test suite (for a total of 0.6% compile time savings in our pipeline). > > Are there any other passes that could benefit from having a single bit (or similarly small amount) of per-Instruction metadata local to the pass, i.e. to avoid having to keep a side-Set of instructions that’s only used to test membership of the set (i.e. not iterated over)? Is this sort of thing something reasonable (maybe it could be stuffed in the SubclassData short and exposed via a public API?) > > —escha > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev> > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150914/6ee7a4b8/attachment.html>
escha via llvm-dev
2015-Sep-14  21:47 UTC
[llvm-dev] RFC: speedups with instruction side-data (ADCE, perhaps others?)
I would assume that it’s just considered to be garbage. I feel like any sort of per-pass side data like this should come with absolute minimal contracts, to avoid introducing any more inter-pass complexity. —escha> On Sep 14, 2015, at 2:28 PM, Sean Silva <chisophugis at gmail.com> wrote: > > What is the contract between passes for the marker bit? I.e. do passes need to "clean up" the marker bit to a predetermined value after they run? (or they need to assume that it might be dirty? (like I think you do below)) > > -- Sean Silva > > On Mon, Sep 14, 2015 at 9:37 AM, escha via llvm-dev <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: > I’ve been playing around with optimizing performance various passes and noticed something about ADCE: it keeps an Alive set that requires a lot of upkeep time-wise, but if each instruction had a /single bit/ of side data (to represent liveness, solely within the pass), the set wouldn’t be needed. I tested this out and observed a ~1/3 reduction in time in ADCE: 1454ms to 982ms according to a profile over our test suite (for a total of 0.6% compile time savings in our pipeline). > > Are there any other passes that could benefit from having a single bit (or similarly small amount) of per-Instruction metadata local to the pass, i.e. to avoid having to keep a side-Set of instructions that’s only used to test membership of the set (i.e. not iterated over)? Is this sort of thing something reasonable (maybe it could be stuffed in the SubclassData short and exposed via a public API?) > > —escha > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150914/441edf37/attachment.html>
Chris Lattner via llvm-dev
2015-Sep-17  03:36 UTC
[llvm-dev] RFC: speedups with instruction side-data (ADCE, perhaps others?)
> On Sep 14, 2015, at 4:28 PM, Sean Silva via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > What is the contract between passes for the marker bit? I.e. do passes need to "clean up" the marker bit to a predetermined value after they run? (or they need to assume that it might be dirty? (like I think you do below))This is the key question for this sort of thing (which has come up many times before). For the record, I’m not opposed to something like this, so long as: - It only applies to instructions and basic blocks, not constants, Function, GlobalVariable, etc. We don’t want to make parallelization harder than it already is. - The contract (enforced by the verifier) is that these bits are always zero between passes. - Adding bits doesn’t increase the size of the IR over what it is today. Beyond that, I think that it is really important to design the right data-structure around this to provide these invariants. Here’s a sketch of something that I think could work well, over-simplified to be just a single bit on Instruction to illustrate the point: class Instruction { bool TheBit = 0 // squished into something that already exists. ... public: bool isBitSet() const { return TheBit; } void setBit(BitTracker &BT, bool Value = true) { if (TheBit) return; BT.addToSet(this); TheBIt = Value; } } Where BitTracker is basically just an std::vector<Instruction*> that keeps track of all the instructions which (might) have a bit set, and offers a method to go zero all the bits. An API like this would make it very straight-forward for function passes to preserve the invariant. -Chris
Apparently Analagous Threads
- RFC: speedups with instruction side-data (ADCE, perhaps others?)
- RFC: speedups with instruction side-data (ADCE, perhaps others?)
- RFC: speedups with instruction side-data (ADCE, perhaps others?)
- RFC: speedups with instruction side-data (ADCE, perhaps others?)
- RFC: speedups with instruction side-data (ADCE, perhaps others?)