Reid Kleckner via llvm-dev
2018-Sep-25 21:35 UTC
[llvm-dev] RFC Storing BB order in llvm::Instruction for faster local dominance
On Tue, Sep 25, 2018 at 12:16 PM Sanjoy Das <sanjoy at playingwithpointers.com> wrote:

> Let's not assume a dichotomy between storing an int64 in
> llvm::Instruction and bitwise tricks -- they're both ways of caching
> position within llvm::Instruction. I think we first need to establish
> that we need such a cache in llvm::Instruction/llvm::BasicBlock at
> all.
>
> Do you have a sense of where these expensive domtree queries are
> coming from? Are they from a couple of key places or are they
> scattered throughout the pass pipeline?

When I dug into the profile with WPA, most of the time was spent in DSE and
memcpyopt, which call AAResults::callCapturesBefore, which calls
llvm::PointerMayBeCapturedBefore. Neither pass passes in an
OrderedBasicBlock, so they rebuild the OrderedBasicBlock in linear time on
every query. These passes insert instructions, so it's not correct to
simply create and reuse an OrderedBasicBlock at this level.

As suggested in the bug, if we were to rewrite these passes to use
MemorySSA, this bottleneck would go away. I rebased a patch to do that for
DSE, but finishing it off and enabling it by default is probably out of
scope for me.

These key call sites aside, dominance seems like a pretty common query. We
have several somewhat overlapping mechanisms for checking dominance:

- DominatorTree::dominates(Instruction*, Instruction*), the original
  interface, widely known as a bottleneck
- OrderedBasicBlock, as used by memdep, my motivating example
- OrderedInstructions, which uses OrderedBasicBlock, used by GVN and others
- MemorySSA, which has a similar ordering of memory values that is also
  invalidated when inserting memory defs. I may have misunderstood how this
  works, though.
Reid Kleckner via llvm-dev
2018-Sep-26 03:10 UTC
[llvm-dev] RFC Storing BB order in llvm::Instruction for faster local dominance
On Tue, Sep 25, 2018 at 2:35 PM Reid Kleckner <rnk at google.com> wrote:

> On Tue, Sep 25, 2018 at 12:16 PM Sanjoy Das <
> sanjoy at playingwithpointers.com> wrote:
>
>> Let's not assume a dichotomy between storing an int64 in
>> llvm::Instruction and bitwise tricks -- they're both ways of caching
>> position within llvm::Instruction. I think we first need to establish
>> that we need such a cache in llvm::Instruction/llvm::BasicBlock at
>> all.

After thinking more about this over the day, I thought about the bugs we
get when the cache doesn't live on the BasicBlock or Instruction. We have
one here: https://reviews.llvm.org/D50377

I would say that, while the expensive queries come from just a few passes
(DSE & memcpyopt), OrderedBasicBlock is used by far more than a few
transforms. Putting performance aside, the existence of all those
OrderedBasicBlock caches tells me that yes, we do need to cache this info,
and it is widely used. Maybe it doesn't need to live directly on
Instruction, but the cache should be maintained by BasicBlock.
Chris Lattner via llvm-dev
2018-Sep-26 05:45 UTC
[llvm-dev] RFC Storing BB order in llvm::Instruction for faster local dominance
On Sep 25, 2018, at 2:35 PM, Reid Kleckner <rnk at google.com> wrote:

> On Tue, Sep 25, 2018 at 12:16 PM Sanjoy Das <sanjoy at playingwithpointers.com> wrote:
>
>> Let's not assume a dichotomy between storing an int64 in
>> llvm::Instruction and bitwise tricks -- they're both ways of caching
>> position within llvm::Instruction. I think we first need to establish
>> that we need such a cache in llvm::Instruction/llvm::BasicBlock at
>> all.
>>
>> Do you have a sense of where these expensive domtree queries are
>> coming from? Are they from a couple of key places or are they
>> scattered throughout the pass pipeline?

+1.

> When I dug into the profile with WPA, most of the time was spent in DSE
> and memcpyopt, which call AAResults::callCapturesBefore, which calls
> llvm::PointerMayBeCapturedBefore. Neither pass passes in an
> OrderedBasicBlock, so they rebuild the OrderedBasicBlock in linear time
> on every query. These passes insert instructions, so it's not correct to
> simply create and reuse an OrderedBasicBlock at this level.

So this is one of the reasons I find your patch to be problematic: it isn't
fixing the N^2 behavior, it is merely trading one N^2 situation for
another. AFAICT there are two possible cases in these sorts of
transformations:

1) The transformation is iteratively removing or inserting instructions,
which invalidates the ordering with your approach, causing subsequent
queries to rebuild the equivalent of OrderedBasicBlock.

2) The transformation is not doing anything, in which case this is all
wasted work.

As a next step, can you please instrument the calls from DSE and/or
memcpyopt and see how many of the ones for the huge basic blocks actually
turn into a transformation in the code? If there are zero, then we should
just disable these optimizations for large blocks.
If there are a few improvements, then we should see how to characterize
them and what the right solution is based on the sorts of information they
need.

LLVM is occasionally maligned for magic numbers like "6" that bound the
work in various analyses (e.g. when computing masked bits), but they exist
for very real reasons: the cases in which a large search will actually lead
to benefits are marginal to non-existent, and unbounded queries in core
APIs cause all kinds of problems.

I see the dependence analysis queries as the same sort of situation: until
something like MemorySSA is able to define away these dense queries, we
should be using bounded searches (in this case larger than 6 but less than
11,000 :-).

It would be straightforward to change llvm::BasicBlock to keep track of the
number of instructions in it (doing so would not change any ilist
algorithmic behavior that I'm aware of, given the need to keep instruction
parent pointers updated), and having that info would make it easy to cap
linear passes like this or switch into local search modes.

The cost of such a thing is the risk of performance regressions. We control
that risk by doing some analysis of the payoff of these transformations on
large blocks - if the payoff doesn't exist then it is a simple answer. The
real question here is not whether DSE is able to eliminate a single store,
it is whether eliminating that store actually leads to a measurable
performance improvement in the generated code. Given huge blocks, I suspect
the answer is "definitely not".

> As suggested in the bug, if we were to rewrite these passes to use
> MemorySSA, this bottleneck would go away. I rebased a patch to do that
> for DSE, but finishing it off and enabling it by default is probably out
> of scope for me.

Yes, understood, that would be a major change.
That said, changing the entire architecture of the compiler to work around
this isn't really appealing to me; I'd rather limit the problematic
optimizations on the crazy large cases.

-Chris
Reid Kleckner via llvm-dev
2018-Sep-26 18:55 UTC
[llvm-dev] RFC Storing BB order in llvm::Instruction for faster local dominance
On Tue, Sep 25, 2018 at 10:45 PM Chris Lattner <clattner at nondot.org> wrote:

> So this is one of the reasons I find your patch to be problematic: it
> isn't fixing the N^2 behavior, it is merely trading one N^2 situation
> for another. AFAICT there are two possible cases in these sorts of
> transformations:
>
> 1) The transformation is iteratively removing or inserting instructions,
> which invalidates the ordering with your approach, causing subsequent
> queries to rebuild the equivalent of OrderedBasicBlock.

I would say that this code fixes the quadratic behavior in practice, and
lays the foundation to fix it permanently if it ever becomes a problem.
First, removing instructions doesn't invalidate the ordering, and these
passes mostly delete instructions, so in practice the latent quadratic
behavior is very hard to exercise.

In the future, if profiling shows that "insert; query; insert; query" is a
bottleneck, we can fix it by leaving gaps in the numbering space and
assigning new numbers halfway between adjacent instructions, renumbering
as required to maintain amortized O(1) insertion complexity. I think it's
unlikely that we will ever need to implement this fancy renumbering
algorithm, and people on the bug agree (https://llvm.org/pr38829#c2).
Existing code that uses OrderedBasicBlock doesn't implement it either;
judging by that code, invalidating on modification is good enough.

> 2) The transformation is not doing anything, in which case this is all
> wasted work.
>
> As a next step, can you please instrument the calls from DSE and/or
> memcpyopt and see how many of the ones for the huge basic blocks
> actually turn into a transformation in the code? If there are zero, then
> we should just disable these optimizations for large blocks. If there
> are a few improvements, then we should see how to characterize them and
> what the right solution is based on the sorts of information they need.
> LLVM is occasionally maligned for magic numbers like "6" that bound the
> work in various analyses (e.g. when computing masked bits), but they
> exist for very real reasons: the cases in which a large search will
> actually lead to benefits are marginal to non-existent, and unbounded
> queries in core APIs cause all kinds of problems.
>
> I see the dependence analysis queries as the same sort of situation:
> until something like MemorySSA is able to define away these dense
> queries, we should be using bounded searches (in this case larger than 6
> but less than 11,000 :-).
>
> It would be straightforward to change llvm::BasicBlock to keep track of
> the number of instructions in it (doing so would not change any ilist
> algorithmic behavior that I'm aware of, given the need to keep
> instruction parent pointers updated), and having that info would make it
> easy to cap linear passes like this or switch into local search modes.
>
> The cost of such a thing is the risk of performance regressions. We
> control that risk by doing some analysis of the payoff of these
> transformations on large blocks - if the payoff doesn't exist then it is
> a simple answer. The real question here is not whether DSE is able to
> eliminate a single store, it is whether eliminating that store actually
> leads to a measurable performance improvement in the generated code.
> Given huge blocks, I suspect the answer is "definitely not".
>
>> As suggested in the bug, if we were to rewrite these passes to use
>> MemorySSA, this bottleneck would go away. I rebased a patch to do that
>> for DSE, but finishing it off and enabling it by default is probably
>> out of scope for me.
>
> Yes, understood, that would be a major change. That said, changing the
> entire architecture of the compiler to work around this isn't really
> appealing to me; I'd rather limit the problematic optimizations on the
> crazy large cases.
You're probably right: these passes probably aren't actually firing. But it
sounds like we have two options:

1. Add cutoffs to poorly understood slow passes that are misbehaving.
2. Make a core data structure change that makes dominance calculations
   faster and simpler for all transforms.

I can do this analysis and add these cutoffs, but I wouldn't feel very good
about it. It adds code complexity, we'll have to test it, and tomorrow
someone will add new quadratic dominance queries. I don't see how
heuristically limiting misbehaving optimizations builds a better foundation
for LLVM tomorrow. Dominance has been and will always be an important
analysis in LLVM.

This patch is basically pushing an analysis cache down into the IR. Would
it be accurate to say that your objection to this approach has more to do
with analysis data living in IR data structures? If I added more
infrastructure to invert the dependency, store these instruction numbers in
DominatorTree, and make BasicBlock notify DominatorTree of instruction
insertion and deletion, would that be preferable? That's very similar to
the code we're writing today, where the transform takes responsibility for
marrying its modifications to its invalidations. The question is: do we
want to keep this stuff up to date automatically, like we do for use lists,
or is this something we want to maintain on the side? I think dominance is
something we might want to start pushing down into the IR.