Finkel, Hal J. via llvm-dev
2018-Sep-27 17:36 UTC
[llvm-dev] RFC Storing BB order in llvm::Instruction for faster local dominance
On 09/27/2018 12:24 AM, Chris Lattner via llvm-dev wrote:

>> On Sep 26, 2018, at 11:55 AM, Reid Kleckner <rnk at google.com> wrote:
>>> As suggested in the bug, if we were to rewrite these passes to use MemorySSA, this bottleneck would go away. I rebased a patch to do that for DSE, but finishing it off and enabling it by default is probably out of scope for me.
>>
>> Yes, understood, that would be a major change. That said, changing the entire architecture of the compiler to work around this isn't really appealing to me; I'd rather limit the problematic optimizations on the crazy large cases.
>>
>> You're probably right; these passes probably aren't actually firing. But it sounds like we have two options:
>> 1. Add cutoffs to poorly understood slow passes that are misbehaving.
>> 2. Make a core data structure change to make dominance calculations faster and simpler for all transforms.
>>
>> I can do this analysis and add these cutoffs, but I wouldn't feel very good about it. It adds code complexity, we'll have to test it, and tomorrow someone will add new quadratic dominance queries. I don't see how heuristically limiting misbehaving optimizations builds a better foundation for LLVM tomorrow.
>
> My answer to this is that the long-term path is to move these passes to MemorySSA, which doesn't have these problems. At which point, the core IR change you are proposing becomes really the wrong thing.

Maybe I'm missing something, but why do you say this? MemorySSA certainly collects the memory references in a basic block into a set of use/def chains, but determining dominance still requires walking those chains. This might help reduce the constant factor on the O(N), because it skips the non-memory-access instructions, but the underlying complexity problem remains. Maybe MemorySSA should cache a local numbering, but...

MemorySSA only helps if you're fine with skipping non-aliasing accesses. It's not clear to me that this is always the case. For example, imagine that we're trying to do something like SLP vectorization, and so we follow the use-def chain of an address calculation and find two adjacent, but non-aliasing, loads in the same basic block. We want to convert them into a vector load, so we need to place the new vector load at the position of the first (dominating) load. MemorySSA will help determine the legality of the transformation, but it won't help determine the dominance, because it will have the MemorySSA nodes for both loads tied, potentially, to the same MemorySSA definition node (i.e., you can't walk from one to the other, by design, which is why it helps with the legality determination).

>> Dominance has been and will always be an important analysis in LLVM. This patch is basically pushing an analysis cache down into IR. Would it be accurate to say that your objection to this approach has more to do with analysis data living in IR data structures? If I added more infrastructure to invert the dependency, store these instruction numbers in DominatorTree, and make BasicBlock notify DominatorTree of instruction insertion and deletion, would that be preferable? That's very similar to the code we're writing today, where the transform takes responsibility for marrying its modifications to its invalidations. The question is: do we want to keep this stuff up to date automatically, like we do for use lists, or is this something we want to maintain on the side? I think dominance is something we might want to start pushing down into IR.
> I have several objections to caches like this, and we have been through many failed attempts at making analyses "autoadapt" to loosely coupled transformations (e.g. the CallbackVH stuff, sigh). My objections are things like:
>
> 1) Caching and invalidation are very difficult to get right.
> 2) Invalidation logic slows down everyone, even those who aren't using the cache (your "check to see if I need to invalidate" when permuting the ilist).
> 3) We try very hard not to put analysis/xform-specific gunk into core IR types, because it punishes everyone but benefits only a few passes.

Dominance is a fundamental construct of an SSA-form IR. I certainly agree with you in general, but I'm not convinced by this reasoning in this case.

> 4) LLVM is used for a LOT of widely varying use cases - clang is just one client, and many of them don't run these passes. This change pessimizes all those clients, particularly the most sensitive 32-bit systems.

I don't see why this has anything to do with Clang. As I understood it, the self-host compile time of a large Clang source file was being used only as an example. This change may very well help many clients.

Thanks again,
Hal

> 5) If done wrong, these caches can break invariants that LLVM optimization passes are supposed to maintain. For example, if you do "sparse numbering", then the actual numbering will depend on the series of transformations that happen, and if someone accidentally uses the numbers in the wrong way, you could make "opt -pass1 -pass2" behave differently than "opt -pass1 | opt -pass2".
>
> Putting this stuff into DominatorTree could make sense, but then clients of DominatorTree would have to invalidate it when doing transformations. If you put the invalidation logic into ilist, then many of the problems above reoccur. I don't think (but don't know) that it is unreasonable to make all clients of DT invalidate this manually.
>
> -Chris

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
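For concreteness, the proposal amounts to something like the following minimal C++ sketch. The names here (Order, OrderValid, renumber) are hypothetical simplifications, not the actual patch: each instruction carries an integer position in its block, mutation merely marks the order invalid, and the block renumbers itself lazily on the next query, so a local dominance check becomes an integer compare instead of an O(N) walk.

// Illustrative sketch only; names and layout are hypothetical
// simplifications of the RFC, not LLVM's actual API.
#include <cassert>
#include <cstdint>
#include <list>

struct BasicBlock;

struct Instruction {
  BasicBlock *Parent = nullptr;
  uint32_t Order = 0; // meaningful only while the parent's order is valid

  bool comesBefore(const Instruction *Other) const;
};

struct BasicBlock {
  std::list<Instruction *> Insts;
  mutable bool OrderValid = false;

  // Called from insert/erase paths: O(1), just drops the cached order.
  void invalidateOrder() { OrderValid = false; }

  // One O(N) renumbering pass, amortized over any number of O(1) queries.
  void renumber() const {
    uint32_t N = 0;
    for (Instruction *I : Insts)
      I->Order = N++;
    OrderValid = true;
  }
};

// Without a cache, "does this instruction come before that one?" needs a
// linear walk of the block: O(N) per query, and O(N^2) when a pass asks
// pairwise. With the cached numbering it is a single integer compare.
bool Instruction::comesBefore(const Instruction *Other) const {
  assert(Parent && Parent == Other->Parent &&
         "local order is only defined within a single block");
  if (!Parent->OrderValid)
    Parent->renumber();
  return Order < Other->Order;
}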
Chris Lattner via llvm-dev
2018-Sep-27 18:48 UTC
[llvm-dev] RFC Storing BB order in llvm::Instruction for faster local dominance
On Sep 27, 2018, at 10:36 AM, Finkel, Hal J. <hfinkel at anl.gov> wrote:

> On 09/27/2018 12:24 AM, Chris Lattner via llvm-dev wrote:
>>> On Sep 26, 2018, at 11:55 AM, Reid Kleckner <rnk at google.com> wrote:
>>>> As suggested in the bug, if we were to rewrite these passes to use MemorySSA, this bottleneck would go away. I rebased a patch to do that for DSE, but finishing it off and enabling it by default is probably out of scope for me.
>>>
>>> Yes, understood, that would be a major change. That said, changing the entire architecture of the compiler to work around this isn't really appealing to me; I'd rather limit the problematic optimizations on the crazy large cases.
>>>
>>> You're probably right; these passes probably aren't actually firing. But it sounds like we have two options:
>>> 1. Add cutoffs to poorly understood slow passes that are misbehaving.
>>> 2. Make a core data structure change to make dominance calculations faster and simpler for all transforms.
>>>
>>> I can do this analysis and add these cutoffs, but I wouldn't feel very good about it. It adds code complexity, we'll have to test it, and tomorrow someone will add new quadratic dominance queries. I don't see how heuristically limiting misbehaving optimizations builds a better foundation for LLVM tomorrow.
>>
>> My answer to this is that the long-term path is to move these passes to MemorySSA, which doesn't have these problems. At which point, the core IR change you are proposing becomes really the wrong thing.
>
> Maybe I'm missing something, but why do you say this? MemorySSA certainly collects the memory references in a basic block into a set of use/def chains, but determining dominance still requires walking those chains. This might help reduce the constant factor on the O(N), because it skips the non-memory-access instructions, but the underlying complexity problem remains. Maybe MemorySSA should cache a local numbering, but...

memcpyopt isn't (and shouldn't be) querying dominance of random instructions within a block; it is finding memcpys by scanning scalar def-use chains and trying to back-patch relationships between them. You'd use a fundamentally different approach with MemorySSA, where you just walk the code, find memcpys, and ask what they depend on.

This is all sparse and doesn't require dominance calls. InstCombine does conceptually similar things, and because it is based on use/def chains it doesn't need to do any of this: the dominance relationship is implicit.

> MemorySSA only helps if you're fine with skipping non-aliasing accesses. It's not clear to me that this is always the case. For example, imagine that we're trying to do something like SLP vectorization, and so we follow the use-def chain of an address calculation and find two adjacent, but non-aliasing, loads in the same basic block.

To be clear, I was only referring to the DSE/memcpyopt cases, which appear to be the primary motivators for this line of work. Specific transformations with specific needs can continue to use OrderedBB if beneficial.

>> 1) Caching and invalidation are very difficult to get right.
>> 2) Invalidation logic slows down everyone, even those who aren't using the cache (your "check to see if I need to invalidate" when permuting the ilist).
>> 3) We try very hard not to put analysis/xform-specific gunk into core IR types, because it punishes everyone but benefits only a few passes.
>
> Dominance is a fundamental construct of an SSA-form IR.
> I certainly agree with you in general, but I'm not convinced by this reasoning in this case.

I agree it is fundamental, but so are instruction ordering and a lot of other things that we already encode in one way in the IR. This is proposing a redundant encoding of information, and we're all just discussing the SW engineering tradeoffs of various approaches.

>> 4) LLVM is used for a LOT of widely varying use cases - clang is just one client, and many of them don't run these passes. This change pessimizes all those clients, particularly the most sensitive 32-bit systems.
>
> I don't see why this has anything to do with Clang. As I understood it, the self-host compile time of a large Clang source file was being used only as an example. This change may very well help many clients.

Clang uses memcpyopt and DSE, but (e.g.) a graphics shader compiler would not. Making the proposed change would pessimize memory use for that client while providing very little benefit.

-Chris
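The sparse shape Chris describes can be sketched as follows, assuming MemorySSA's existing walker interface (getWalker and getClobberingMemoryAccess); this is not the real MemCpyOptimizer, and the actual rewrite would of course carry real transformation logic.

// Sketch of a sparse, MemorySSA-driven scan: walk the code, find
// memcpys, and ask what they depend on; no block scans, no local
// dominance queries. Transform logic and error handling are elided.
#include "llvm/Analysis/MemorySSA.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IntrinsicInst.h"

using namespace llvm;

static void scanMemCpys(Function &F, MemorySSA &MSSA) {
  MemorySSAWalker *Walker = MSSA.getWalker();
  for (BasicBlock &BB : F) {
    for (Instruction &I : BB) {
      auto *MemCpy = dyn_cast<MemCpyInst>(&I);
      if (!MemCpy)
        continue;
      // Jump straight to the memory access this memcpy depends on.
      MemoryAccess *Clobber = Walker->getClobberingMemoryAccess(
          MSSA.getMemoryAccess(MemCpy));
      if (auto *Def = dyn_cast<MemoryDef>(Clobber)) {
        // A real pass would now check whether Def's instruction is a
        // memset/memcpy it can fold with, and rewrite accordingly.
        (void)Def;
      }
    }
  }
}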
Reid Kleckner via llvm-dev
2018-Oct-02 01:30 UTC
[llvm-dev] RFC Storing BB order in llvm::Instruction for faster local dominance
On Thu, Sep 27, 2018 at 11:48 AM Chris Lattner <clattner at nondot.org> wrote:

> memcpyopt isn't (and shouldn't be) querying dominance of random instructions within a block; it is finding memcpys by scanning scalar def-use chains and trying to back-patch relationships between them. You'd use a fundamentally different approach with MemorySSA, where you just walk the code, find memcpys, and ask what they depend on.
>
> This is all sparse and doesn't require dominance calls. InstCombine does conceptually similar things, and because it is based on use/def chains it doesn't need to do any of this: the dominance relationship is implicit.

Maybe. I didn't raise this issue to become an expert in DSE or memcpyopt, and unfortunately I don't have time to dig in and really understand it. I'm just trying to do some basic performance engineering.

>> MemorySSA only helps if you're fine with skipping non-aliasing accesses. It's not clear to me that this is always the case. For example, imagine that we're trying to do something like SLP vectorization, and so we follow the use-def chain of an address calculation and find two adjacent, but non-aliasing, loads in the same basic block.
>
> To be clear, I was only referring to the DSE/memcpyopt cases, which appear to be the primary motivators for this line of work. Specific transformations with specific needs can continue to use OrderedBB if beneficial.
I think the point is that we already know we have lots of users of OrderedBasicBlock and OrderedInstructions, and we are likely to grow more as time goes on. With the current design, every user of OrderedBasicBlock has to take responsibility for invalidating the analysis as it adds and removes instructions, introducing bugs and fixing them as each transform develops independently. If we instead make the core data structures efficiently expose facts that they already know about themselves, everyone wins. (A sketch of the client-side burden appears at the end of this message.)

I think one of the most compelling use cases for this information, and one we're likely to want more of as time goes on, is ImplicitControlFlowTracking. This analysis uses OrderedInstructions to figure out whether a given instruction is truly "must execute", i.e. whether we can assume it executes whenever we reach its basic block. You can use it to turn LLVM's extended, not-quite-basic blocks into basic blocks, where you can assume that every instruction in the block executes, or that every instruction in the block post-dominates the instructions that come before it. (This, too, is sketched below.)

> Clang uses memcpyopt and DSE, but (e.g.) a graphics shader compiler would not. Making the proposed change would pessimize memory use for that client while providing very little benefit.

I think the numbers are pretty compelling: it's 1% memory overhead on LTO. There are many other things that we spend memory on that not all clients use. For example, Instruction::DbgLoc. Without it, tracking source locations was painfully inefficient, but not everyone uses debug info. Use lists themselves are technically an analysis that we don't really need for -O0 codegen, but we maintain them anyway for simplicity.
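The invalidation burden Reid mentions looks roughly like this in client code. This is a deliberately simplified, hypothetical stand-in for OrderedBasicBlock, not the real llvm/Analysis interface; the client obligation it illustrates is the same.

// Hypothetical stand-in for OrderedBasicBlock: the cache numbers
// instructions lazily, and every client that mutates the block must
// remember to invalidate it by hand.
#include <cstdint>
#include <map>

namespace sketch {

struct Instruction {};

class OrderedBB {
  std::map<const Instruction *, uint64_t> Numbers; // lazily assigned

public:
  bool dominates(const Instruction *A, const Instruction *B) {
    // (elided: walk the underlying block once, numbering instructions
    // that have not been seen yet)
    return Numbers[A] < Numbers[B];
  }

  // Must be called by the client after every insertion or removal.
  void invalidate() { Numbers.clear(); }
};

// Typical transform shape today: the pass owns both the mutation and the
// matching invalidation. Forgetting the second line is exactly the class
// of bug described above, and nothing fails loudly when it happens.
inline void eraseDeadStore(OrderedBB &OBB, Instruction *Dead) {
  (void)Dead; // (elided: unlink and delete Dead from its block)
  OBB.invalidate();
}

} // namespace sketch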
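And the "must execute" question that ImplicitControlFlowTracking answers can be sketched with the linear scan below, using isGuaranteedToTransferExecutionToSuccessor from ValueTracking; the real class differs in structure and caches its results. A cached intra-block order reduces each query to a single ordering comparison against the first implicit-control-flow instruction in the block.

// Conceptual sketch only; LLVM's actual ImplicitControlFlowTracking
// is organized differently.
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

// Returns true if I is guaranteed to execute whenever its block is
// entered, i.e. no earlier instruction in the block can throw or
// otherwise fail to transfer control to its successor.
static bool isGuaranteedToExecuteInBlock(const Instruction &I) {
  const BasicBlock *BB = I.getParent();
  for (const Instruction &Cur : *BB) {
    if (&Cur == &I)
      return true; // reached I before any implicit control flow
    if (!isGuaranteedToTransferExecutionToSuccessor(&Cur))
      return false; // e.g. a call that may throw or never return
  }
  return false; // not reached: I is always found in its own block
}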