Philip Reames
2015-Jan-08 01:19 UTC
[LLVMdev] Separating loop nests based on profile information?
I've been playing with approaches to getting better optimization of loops which contain infrequently executed slow paths. I've gotten as far as throwing together a proof of concept implementation of a profile guided optimization to separate a single loop with multiple latches into a loop nest, but I want to get feedback from interested parties before investing much more effort. The primary case I'm concerned about looks something like this C example: while( some_condition ) //do lots of work here if (!some_rare_condition) { <-- This is loop variant helper(); } } The key point here is that the 'do lots of work' section contains things we could lift out of the loop, but our knowledge of memory gets completely destoryed by the 'helper' call which doesn't get inlined. This ends up crippling LICM in particular, and GVN to a lessor degree. The approach I've been playing with would create this loop structure: goto inner while (true) { outer: helper(); inner: while( some_condition ) //do lots of work here if (!some_rare_condition) { <-- This is loop variant goto outer; } } } (Excuse the psuedo-C. Hopefully the idea is clear.) You can see my stalking horse implementation here: http://reviews.llvm.org/D6872 This is not an actual patch per se; it's just intended to make the discussion more concrete and thus hopefully easier to follow. The basic idea of this patch is to use profiling information about the frequency of a backedges to separate a loop with multiple latches into a loop nest with the rare backedge being the outer loop. We already use a set of static heuristics based on non-varying arguments to PHI nodes to do the same type of thing. The motivation is that doing this pulls rarely executed code out of the innermost loop and tends to greatly simplify analysis and optimization of that loop. In particular, it can enable substantial LICM gains when there are clobbering calls down rare paths. The original motivating case for this was the slow path of a safepoint poll, but I believe the possible benefit is more general as well. Points for discussion: * Is using profile information for this purpose even a reasonable thing to do? * I chose to implement this without relying on the existing block frequency analysis. My reasoning was that a) this is a rarely taken case and adding an expensive analysis dependency probably wasn't worthwhile and b) that block frequency analysis was more expensive/precise than I really needed. Is this reasonable? * If so, is the notion of 'rareness' of a loop block something that's worth extracting out on it's own and reusing? Are there other similar uses anyone can think of? * Currently, I'm only supporting a fairly small set of controlling conditions. Are there important cases I'm not considering? * Since the rarest latch is often deep in a loop - with other "if (X) continue;" (i.e. latches) before it - this tends to create loops with multiple exiting blocks. Some of the existing passes might not deal with this well, is that a major concern? Suggestions for how to analysis and validate? * Currently, I've structured this as pulling off the rarest latch as an outer iteration. I could also pull off the most popular latch as an inner iteration. This might give different tradeoffs; thoughts? Generally, any thoughts anyone have on the problem or approach are welcome. I'm not particular attached to the particular approach laid out here and if there's a more advantageous approach, all the better. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150107/83c3990d/attachment.html>
Chandler Carruth
2015-Jan-08 01:33 UTC
[LLVMdev] Separating loop nests based on profile information?
On Wed, Jan 7, 2015 at 5:19 PM, Philip Reames <listmail at philipreames.com> wrote:> I've been playing with approaches to getting better optimization of loops > which contain infrequently executed slow paths. I've gotten as far as > throwing together a proof of concept implementation of a profile guided > optimization to separate a single loop with multiple latches into a loop > nest, but I want to get feedback from interested parties before investing > much more effort. > > The primary case I'm concerned about looks something like this C example: > while( some_condition ) > //do lots of work here > if (!some_rare_condition) { <-- This is loop variant > helper(); > } > } > > The key point here is that the 'do lots of work' section contains things > we could lift out of the loop, but our knowledge of memory gets completely > destoryed by the 'helper' call which doesn't get inlined. This ends up > crippling LICM in particular, and GVN to a lessor degree. > > The approach I've been playing with would create this loop structure: > goto inner > while (true) { > outer: > helper(); > inner: > while( some_condition ) > //do lots of work here > if (!some_rare_condition) { <-- This is loop variant > goto outer; > } > } > } > > (Excuse the psuedo-C. Hopefully the idea is clear.) >Yep. I've not thought about this a lot, but I have two initial questions that maybe you can answer: How does this compare with classical approaches of loop peeling, partitioning, fission, or whatever you might call it? Is there any literature behind this approach or some literature it should be compared with? (I genuinely don't know this area that well, so I'm of little help here...) Some of your points I have quick feedback on:> Points for discussion: > > - Is using profile information for this purpose even a reasonable > thing to do? > > Yes!> > - I chose to implement this without relying on the existing block > frequency analysis. My reasoning was that a) this is a rarely taken case > and adding an expensive analysis dependency probably wasn't worthwhile and > b) that block frequency analysis was more expensive/precise than I really > needed. Is this reasonable? > > I think we should always use the analyses. Either BlockFrequency orBranchProbability. I think probably both in the common joint usage (frequency of the loop header combined with probability of the cold region).> > - If so, is the notion of 'rareness' of a loop block something that's > worth extracting out on it's own and reusing? Are there other similar uses > anyone can think of? > - Currently, I'm only supporting a fairly small set of controlling > conditions. Are there important cases I'm not considering? > > To both of these, I think the general combination to use is to identifythe set of blocks dominated by a block which is in the loop body of a hot loop, and is cold relative to the other successors of its predecessor(s). These form cold "regions" as I think of them without requiring the complexity of the region analysis.> > - Since the rarest latch is often deep in a loop - with other "if (X) > continue;" (i.e. latches) before it - this tends to create loops with > multiple exiting blocks. Some of the existing passes might not deal with > this well, is that a major concern? Suggestions for how to analysis and > validate? > > I'm somewhat concerned about this, but want to think more about thefundamental transformation.> > - Currently, I've structured this as pulling off the rarest latch as > an outer iteration. I could also pull off the most popular latch as an > inner iteration. This might give different tradeoffs; thoughts? > > Generally, any thoughts anyone have on the problem or approach are > welcome. I'm not particular attached to the particular approach laid out > here and if there's a more advantageous approach, all the better. >Thanks for pushing on this! ;] Now I need to go and ponder a lot so i can reply more deeply on the actual transform. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150107/88bb4a04/attachment.html>
Adam Nemet
2015-Jan-08 02:42 UTC
[LLVMdev] Separating loop nests based on profile information?
> On Jan 7, 2015, at 5:33 PM, Chandler Carruth <chandlerc at google.com> wrote: > > > On Wed, Jan 7, 2015 at 5:19 PM, Philip Reames <listmail at philipreames.com <mailto:listmail at philipreames.com>> wrote: > I've been playing with approaches to getting better optimization of loops which contain infrequently executed slow paths. I've gotten as far as throwing together a proof of concept implementation of a profile guided optimization to separate a single loop with multiple latches into a loop nest, but I want to get feedback from interested parties before investing much more effort. > > The primary case I'm concerned about looks something like this C example: > while( some_condition ) > //do lots of work here > if (!some_rare_condition) { <-- This is loop variant > helper(); > } > } > > The key point here is that the 'do lots of work' section contains things we could lift out of the loop, but our knowledge of memory gets completely destoryed by the 'helper' call which doesn't get inlined. This ends up crippling LICM in particular, and GVN to a lessor degree. > > The approach I've been playing with would create this loop structure: > goto inner > while (true) { > outer: > helper(); > inner: > while( some_condition ) > //do lots of work here > if (!some_rare_condition) { <-- This is loop variant > goto outer; > } > } > } > > (Excuse the psuedo-C. Hopefully the idea is clear.) > > Yep. > > I've not thought about this a lot, but I have two initial questions that maybe you can answer: > > How does this compare with classical approaches of loop peeling, partitioning, fission, or whatever you might call it? Is there any literature behind this approach or some literature it should be compared with? (I genuinely don't know this area that well, so I'm of little help here…)I think it’s pretty different from loop distribution/fission. In loop distribution, in order to split the loop into multiple consecutive loops, you need to reason about the memory references so having a function call makes that difficult/impossible. Phillip’s idea works around this exact issue. I explored this space quite a bit recently because we’re working on a proposal to add loop distribution to LLVM. I hope to send out the proposal this week. So I see no issue with trying to handle loops with low-probablity function calls with this and fully self-contained loops with classical loop distribution. Thanks, Adam> Some of your points I have quick feedback on: > Points for discussion: > > Is using profile information for this purpose even a reasonable thing to do? > Yes! > > I chose to implement this without relying on the existing block frequency analysis. My reasoning was that a) this is a rarely taken case and adding an expensive analysis dependency probably wasn't worthwhile and b) that block frequency analysis was more expensive/precise than I really needed. Is this reasonable? > I think we should always use the analyses. Either BlockFrequency or BranchProbability. I think probably both in the common joint usage (frequency of the loop header combined with probability of the cold region). > If so, is the notion of 'rareness' of a loop block something that's worth extracting out on it's own and reusing? Are there other similar uses anyone can think of? > Currently, I'm only supporting a fairly small set of controlling conditions. Are there important cases I'm not considering? > To both of these, I think the general combination to use is to identify the set of blocks dominated by a block which is in the loop body of a hot loop, and is cold relative to the other successors of its predecessor(s). These form cold "regions" as I think of them without requiring the complexity of the region analysis. > > Since the rarest latch is often deep in a loop - with other "if (X) continue;" (i.e. latches) before it - this tends to create loops with multiple exiting blocks. Some of the existing passes might not deal with this well, is that a major concern? Suggestions for how to analysis and validate? > I'm somewhat concerned about this, but want to think more about the fundamental transformation. > > Currently, I've structured this as pulling off the rarest latch as an outer iteration. I could also pull off the most popular latch as an inner iteration. This might give different tradeoffs; thoughts? > Generally, any thoughts anyone have on the problem or approach are welcome. I'm not particular attached to the particular approach laid out here and if there's a more advantageous approach, all the better. > > Thanks for pushing on this! ;] Now I need to go and ponder a lot so i can reply more deeply on the actual transform. > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu> http://llvm.cs.uiuc.edu <http://llvm.cs.uiuc.edu/> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev <http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150107/683bfe08/attachment.html>
Philip Reames
2015-Jan-08 19:37 UTC
[LLVMdev] Separating loop nests based on profile information?
On 01/07/2015 05:33 PM, Chandler Carruth wrote:> > On Wed, Jan 7, 2015 at 5:19 PM, Philip Reames > <listmail at philipreames.com <mailto:listmail at philipreames.com>> wrote: > > I've been playing with approaches to getting better optimization > of loops which contain infrequently executed slow paths. I've > gotten as far as throwing together a proof of concept > implementation of a profile guided optimization to separate a > single loop with multiple latches into a loop nest, but I want to > get feedback from interested parties before investing much more > effort. > > The primary case I'm concerned about looks something like this C > example: > while( some_condition ) > //do lots of work here > if (!some_rare_condition) { <-- This is loop variant > helper(); > } > } > > The key point here is that the 'do lots of work' section contains > things we could lift out of the loop, but our knowledge of memory > gets completely destoryed by the 'helper' call which doesn't get > inlined. This ends up crippling LICM in particular, and GVN to a > lessor degree. > > The approach I've been playing with would create this loop structure: > goto inner > while (true) { > outer: > helper(); > inner: > while( some_condition ) > //do lots of work here > if (!some_rare_condition) { <-- This is loop variant > goto outer; > } > } > } > > (Excuse the psuedo-C. Hopefully the idea is clear.) > > > Yep. > > I've not thought about this a lot, but I have two initial questions > that maybe you can answer: > > How does this compare with classical approaches of loop peeling, > partitioning, fission, or whatever you might call it? Is there any > literature behind this approach or some literature it should be > compared with? (I genuinely don't know this area that well, so I'm of > little help here...)I'm not aware of any literature that covers the specific transform I'm suggesting here. This is probably a lack of awareness on my part, not a lack of literature though. While I have some basic background in the area, I can't claim to have made an extensive study of loop optimization techniques. *Particularly* profile guided ones. w.r.t. the tradoffs in practice, I'm going to answer this in a separate response just to keep this one short. I've been playing with both approaches and there appears to be appeal in both. I'm currently leaning towards a mixture of both approaches.> > Some of your points I have quick feedback on: > > Points for discussion: > > * Is using profile information for this purpose even a > reasonable thing to do? > > Yes! > > * I chose to implement this without relying on the existing > block frequency analysis. My reasoning was that a) this is a > rarely taken case and adding an expensive analysis dependency > probably wasn't worthwhile and b) that block frequency > analysis was more expensive/precise than I really needed. Is > this reasonable? > > I think we should always use the analyses. Either BlockFrequency or > BranchProbability. I think probably both in the common joint usage > (frequency of the loop header combined with probability of the cold > region).Let me ask a basic question: what do BlockFrequency and BranchProbability compute and roughly what mental cost model should I have? Once I have a BlockFrequency, what does it mean? Some of this is documented in BlockFrequencyImpl.h, but the high level bits are mostly missing. Particularly for BranchProbability. I chose to write a custom analysis in large part because I didn't understand the existing analysis and my (possibly completely wrong) perception is that they're overly precise/expensive for what I need. I wanted to focus on the writing the transform since that's the point I actually wanted to discuss. :)> * If so, is the notion of 'rareness' of a loop block something > that's worth extracting out on it's own and reusing? Are there > other similar uses anyone can think of? > * Currently, I'm only supporting a fairly small set of > controlling conditions. Are there important cases I'm not > considering? > > To both of these, I think the general combination to use is to > identify the set of blocks dominated by a block which is in the loop > body of a hot loop, and is cold relative to the other successors of > its predecessor(s). These form cold "regions" as I think of them > without requiring the complexity of the region analysis.I get why you find this a useful mental model, but the transform is far harder to express this way in the couple of cases I've looked at. It's much easier to start with a latch, reason about it's hotness, and then let the generic loop code construct the cold "region". Then again, if you were to ask for the set of latches which were dominated by blocks in the cold-with-hot-predecessors list, that might be interesting. But again, I'd really like to start from the latches since that's what controls the legality/profitability of the transform. Hm, we could walk back along the dom tree from the latch looking for such a cwhp block... This might be a better way to think about the profitability of the transform, even starting from the latch. Is that what you were trying to get at to start with? Or were you looking for something more general?> * Since the rarest latch is often deep in a loop - with other > "if (X) continue;" (i.e. latches) before it - this tends to > create loops with multiple exiting blocks. Some of the > existing passes might not deal with this well, is that a major > concern? Suggestions for how to analysis and validate? > > I'm somewhat concerned about this, but want to think more about the > fundamental transformation. > > * Currently, I've structured this as pulling off the rarest > latch as an outer iteration. I could also pull off the most > popular latch as an inner iteration. This might give different > tradeoffs; thoughts? > > Generally, any thoughts anyone have on the problem or approach are > welcome. I'm not particular attached to the particular approach > laid out here and if there's a more advantageous approach, all the > better. > > Thanks for pushing on this! ;] Now I need to go and ponder a lot so i > can reply more deeply on the actual transform.You're welcome. :) This might be an area where it's profitable to talk in person, are you going to be at the social tonight? Philip -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150108/bd67cbe3/attachment.html>
Philip Reames
2015-Jan-08 20:33 UTC
[LLVMdev] Separating loop nests based on profile information?
On 01/07/2015 05:33 PM, Chandler Carruth wrote:> How does this compare with classical approaches of loop peeling, > partitioning, fission, or whatever you might call it?I'm still developing a good sense for this, but let me lay out some observations. Fair warning, this is going to be long and rambling. Let's start with a toy example: while(c) { x = this->x; y = this->y; if (x == y) { rare(); } } If we could tell x and y were loop invariant, we could unswitch this loop. However, the rare call clobbers our view of memory, so LICM fails, and thus unswitch fails. We'd like to apply PRE here and push the reload into the loop preheader and the rare case. This currently fails for two reasons: 1) We don't allow code duplication during PRE, and 2) the canonical loop-simplify form of having a single latch means that PRE doesn't actually see the rare block at all, it sees the preheader and the join point after the if block. I think both of the previous are fixable: 1) This calls for profile guided duplication. If we can trade one hot load for two cold ones, that's great. Also, there might be static heuristics we could construct. (e.g. if it's not available in the preheader and one of many latches...) 2) We need PRE to look back through merge blocks. This should possibly even live in MemoryDependenceAnalysis. Alternatively, we could create a "deconstruct loop simplify form" pass to run right before GVN to side step this issue. Neither is entirely simple to implement, but both are doable. Probably. In theory, we've now solved the same problem my loop nest transform does. There's two snags: one minor(ish), one major. The minor one is that many of my rare conditions involve volatile loads. MemDepAnalysis currently completely gives up when faced with volatile loads. This is mostly fixable, but given the structure of the code may be fairly invasive. Oddly, LICM actually performs better in this case. For the major one, let's tweak our example slightly: while(c) { x = this->x; if (x == null) return; y = this->y; if (y == null) return; if (x == y) { rare(); } } Let's assume that PRE worked as designed for 'x'. We now need to perform a transform analogous to loop unswitch, but a bit different. We need to move the condition loop exits into each of the predecessors where the answers not available. I don't believe we have anything like this today, and it involves a non-trivial bit of code. Even if we get that done, we're left with a pass ordering problem where we need to run PRE, then branch-PRE, then PRE, then branch-PRE, etc.. (In practice, these chains can be quite long due to null checks and range checks.) If we'd split this into a loop nest structure, we'd still have the chain, but a) that's easier to control with a custom pass order since LICM and LoopUnswitch are far cheaper than GVN/PRE and b) having LICM do a bit of trivial loop unswitch for the terminator of the header appears quite tractable. One other point worth mentioning is that LLVM does not currently have a loop peeling pass. LoopRotate seems to get us most of the way there in common cases, but not always. If we did peel the above loop once, we'd have the same merge block problem I mentioned previously (this time in the preheader *and* the latch), but the profitability becomes slightly more clear without the need for profile information or loop specific heuristics. To be clear, I'm not opposed to fixing some of the issues above. I think that's a great idea, I'm just trying to find my way through to practical performance improvements with minimal invested effort. I probably will in fact fix some of these since they show up in a number of other cases as well. Ok, so that's that. Let's move on to why my proposed loop nest separation change is a bit scary. While the loop nest does allow things to be lifted from the inner loop to the outer loop - without any of the complex PRE or PGO reasoning required above - it suffers from all of the limitations of the existing loop optimization passes. In particular, LICM may not actually be able to do much if the inner loop is still complicated enough to defeat either it's fairly simple alias analysis or fault analysis. GVN/PRE is strictly more powerful - in theory if not always practice - here than what we have in LICM. We've also potentially created a loop nest which is harder to analyse than the original loop with two latches. Consider this example: for(int i = 0; i < 20000) { i++; if(i % 20 != 0) continue; a[i] = 0; } (This is not a real example of anything useful. The important point is the early return is the common case.) This loop clearly runs exactly 20000 times. ScalarEvolution almost certainly gets this right. (I haven't checked.) After separation, we'd have this: int i = 0; while(true) { while(i < 20000) { i++; if(i % 20 == 0) { // now a loop exit a[i] = 0; goto next: } } break: next: } It's not clear to me that SE will always be able to tell that both the inner and outer loop runs a maximum of 20000 times. (In this particular case it might actually optimize the inner loop substantially and thus get a more precise answer, but let's ignore that.) We could potentially loose information about the loop by doing this transformation. Of course, this concern applies to the *existing* loop nest separation heuristics too. I don't have any particular reason to believe one heuristic is worse than the other, but I also have no reason not to believe that either. As a final point, the inner loop now has two exiting blocks not one. I know at least some of our loop transforms give up on anything with more than one loop exit. I'm less concerned by that since a) we should fix them, and b) the cases I've looked at don't look too hard. And that's everything that occurs to me at the moment. I'll probably have to append in further responses as I remember more details, I know I'm forgetting some pieces. Philip
Krzysztof Parzyszek
2015-Jan-12 19:00 UTC
[LLVMdev] Separating loop nests based on profile information?
On 1/7/2015 7:19 PM, Philip Reames wrote:> > The approach I've been playing with would create this loop structure: > goto inner > while (true) { > outer: > helper(); > inner: > while( some_condition ) > //do lots of work here > if (!some_rare_condition) { <-- This is loop variant > goto outer; > } > } > } >Anything that changes loop structure into something like this, is a fundamentally bad idea. This would make the outer loop irreducible, which should be avoided. A lot better approach would be to privatize the invariant objects, and write them back to memory before/after the "helper" is invoked: T1 = p->x; T2 = P->y; while( some_condition ) //do lots of work here, use T1 and T2 if (some_rare_condition) { <-- This is loop variant p->x = T1; p->y = T2; helper(); T1 = p->x; T2 = p->y; } } p->x = T1; p->y = T2; -Krzysztof -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Krzysztof Parzyszek
2015-Jan-12 19:40 UTC
[LLVMdev] Separating loop nests based on profile information?
Upon closer inspection, the loop will not become irreducible. However, the result will be equivalent to reloading the hoisted values after calling "helper". In fact, that's what may happen if some CFG optimization cleans this up. -Krzysztof On 1/12/2015 1:00 PM, Krzysztof Parzyszek wrote:> On 1/7/2015 7:19 PM, Philip Reames wrote: >> >> The approach I've been playing with would create this loop structure: >> goto inner >> while (true) { >> outer: >> helper(); >> inner: >> while( some_condition ) >> //do lots of work here >> if (!some_rare_condition) { <-- This is loop variant >> goto outer; >> } >> } >> } >> > > Anything that changes loop structure into something like this, is a > fundamentally bad idea. This would make the outer loop irreducible, > which should be avoided. > > A lot better approach would be to privatize the invariant objects, and > write them back to memory before/after the "helper" is invoked: > > T1 = p->x; > T2 = P->y; > while( some_condition ) > //do lots of work here, use T1 and T2 > if (some_rare_condition) { <-- This is loop variant > p->x = T1; > p->y = T2; > helper(); > T1 = p->x; > T2 = p->y; > } > } > p->x = T1; > p->y = T2; > > > -Krzysztof >-- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Xinliang David Li
2015-Jan-13 03:27 UTC
[LLVMdev] Separating loop nests based on profile information?
On Wed, Jan 7, 2015 at 5:19 PM, Philip Reames <listmail at philipreames.com> wrote:> I've been playing with approaches to getting better optimization of loops > which contain infrequently executed slow paths. I've gotten as far as > throwing together a proof of concept implementation of a profile guided > optimization to separate a single loop with multiple latches into a loop > nest, but I want to get feedback from interested parties before investing > much more effort. > > The primary case I'm concerned about looks something like this C example: > while( some_condition ) > //do lots of work here > if (!some_rare_condition) { <-- This is loop variant > helper(); > } > } > > The key point here is that the 'do lots of work' section contains things > we could lift out of the loop, but our knowledge of memory gets completely > destoryed by the 'helper' call which doesn't get inlined. This ends up > crippling LICM in particular, and GVN to a lessor degree. > > The approach I've been playing with would create this loop structure: > goto inner > while (true) { > outer: > helper(); > inner: > while( some_condition ) > //do lots of work here > if (!some_rare_condition) { <-- This is loop variant > goto outer; > } > } > } > > (Excuse the psuedo-C. Hopefully the idea is clear.) > > You can see my stalking horse implementation here: > http://reviews.llvm.org/D6872 > > This is not an actual patch per se; it's just intended to make the > discussion more concrete and thus hopefully easier to follow. > > The basic idea of this patch is to use profiling information about the > frequency of a backedges to separate a loop with multiple latches into a > loop nest with the rare backedge being the outer loop. We already use a set > of static heuristics based on non-varying arguments to PHI nodes to do the > same type of thing. > > The motivation is that doing this pulls rarely executed code out of the > innermost loop and tends to greatly simplify analysis and optimization of > that loop. >This is good thing to do from the code/block placement point of view -- to improve icache utilization. Code layout is one of the most important passes that can benefit from profile data.> In particular, it can enable substantial LICM gains when there are > clobbering calls down rare paths. >How can this enable more LICM?> The original motivating case for this was the slow path of a safepoint > poll, but I believe the possible benefit is more general as well. > > Points for discussion: > > - Is using profile information for this purpose even a reasonable > thing to do? > >yes.> > - I chose to implement this without relying on the existing block > frequency analysis. My reasoning was that a) this is a rarely taken case > and adding an expensive analysis dependency probably wasn't worthwhile and > b) that block frequency analysis was more expensive/precise than I really > needed. Is this reasonable? > >IMO no. Remember your pass should also work with real profile data (instrumentation or sample based) -- you should rely on existing profile infrastructure to provide what you need. (Ideally you should treat it as something that is always available for query).> > - If so, is the notion of 'rareness' of a loop block something that's > worth extracting out on it's own and reusing? Are there other similar uses > anyone can think of? > >Please do not introduce new notions for profile information -- there are already many:) You have block 'frequency' to represent intra-function relative hotness. Currently missing, but if we need to represent inter-procedural global hotness, something like bb execution 'count' may be needed. thanks, David> > - Currently, I'm only supporting a fairly small set of controlling > conditions. Are there important cases I'm not considering? > - Since the rarest latch is often deep in a loop - with other "if (X) > continue;" (i.e. latches) before it - this tends to create loops with > multiple exiting blocks. Some of the existing passes might not deal with > this well, is that a major concern? Suggestions for how to analysis and > validate? > - Currently, I've structured this as pulling off the rarest latch as > an outer iteration. I could also pull off the most popular latch as an > inner iteration. This might give different tradeoffs; thoughts? > > Generally, any thoughts anyone have on the problem or approach are > welcome. I'm not particular attached to the particular approach laid out > here and if there's a more advantageous approach, all the better. > > > > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150112/295c2791/attachment.html>
Xinliang David Li
2015-Jan-13 03:39 UTC
[LLVMdev] Separating loop nests based on profile information?
On Wed, Jan 7, 2015 at 5:33 PM, Chandler Carruth <chandlerc at google.com> wrote:> > On Wed, Jan 7, 2015 at 5:19 PM, Philip Reames <listmail at philipreames.com> > wrote: > >> I've been playing with approaches to getting better optimization of loops >> which contain infrequently executed slow paths. I've gotten as far as >> throwing together a proof of concept implementation of a profile guided >> optimization to separate a single loop with multiple latches into a loop >> nest, but I want to get feedback from interested parties before investing >> much more effort. >> >> The primary case I'm concerned about looks something like this C example: >> while( some_condition ) >> //do lots of work here >> if (!some_rare_condition) { <-- This is loop variant >> helper(); >> } >> } >> >> The key point here is that the 'do lots of work' section contains things >> we could lift out of the loop, but our knowledge of memory gets completely >> destoryed by the 'helper' call which doesn't get inlined. This ends up >> crippling LICM in particular, and GVN to a lessor degree. >> >> The approach I've been playing with would create this loop structure: >> goto inner >> while (true) { >> outer: >> helper(); >> inner: >> while( some_condition ) >> //do lots of work here >> if (!some_rare_condition) { <-- This is loop variant >> goto outer; >> } >> } >> } >> >> (Excuse the psuedo-C. Hopefully the idea is clear.) >> > > Yep. > > I've not thought about this a lot, but I have two initial questions that > maybe you can answer: > > How does this compare with classical approaches of loop peeling, > partitioning, fission, or whatever you might call it? Is there any > literature behind this approach or some literature it should be compared > with? (I genuinely don't know this area that well, so I'm of little help > here...) >It is not any of those loop transformations. It does not even change the control flow. It is more a code layout optimization -- move the cold trace in the loop out of the body.> > > Some of your points I have quick feedback on: > >> Points for discussion: >> >> - Is using profile information for this purpose even a reasonable >> thing to do? >> >> Yes! > > >> >> - I chose to implement this without relying on the existing block >> frequency analysis. My reasoning was that a) this is a rarely taken case >> and adding an expensive analysis dependency probably wasn't worthwhile and >> b) that block frequency analysis was more expensive/precise than I really >> needed. Is this reasonable? >> >> I think we should always use the analyses. Either BlockFrequency or > BranchProbability. I think probably both in the common joint usage > (frequency of the loop header combined with probability of the cold region). >yes.> >> - If so, is the notion of 'rareness' of a loop block something that's >> worth extracting out on it's own and reusing? Are there other similar uses >> anyone can think of? >> - Currently, I'm only supporting a fairly small set of controlling >> conditions. Are there important cases I'm not considering? >> >> To both of these, I think the general combination to use is to identify > the set of blocks dominated by a block which is in the loop body of a hot > loop, and is cold relative to the other successors of its predecessor(s). > These form cold "regions" as I think of them without requiring the > complexity of the region analysis. > >Static prediction should handle it -- call heuristics or some combination with other heuristics (error handler recognition etc). David> >> - Since the rarest latch is often deep in a loop - with other "if (X) >> continue;" (i.e. latches) before it - this tends to create loops with >> multiple exiting blocks. Some of the existing passes might not deal with >> this well, is that a major concern? Suggestions for how to analysis and >> validate? >> >> I'm somewhat concerned about this, but want to think more about the > fundamental transformation. > > >> >> - Currently, I've structured this as pulling off the rarest latch as >> an outer iteration. I could also pull off the most popular latch as an >> inner iteration. This might give different tradeoffs; thoughts? >> >> Generally, any thoughts anyone have on the problem or approach are >> welcome. I'm not particular attached to the particular approach laid out >> here and if there's a more advantageous approach, all the better. >> > Thanks for pushing on this! ;] Now I need to go and ponder a lot so i can > reply more deeply on the actual transform. > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150112/69bd8bbb/attachment.html>
Philip Reames
2015-Jan-20 01:41 UTC
[LLVMdev] Separating loop nests based on profile information?
On 01/12/2015 07:27 PM, Xinliang David Li wrote:> > > On Wed, Jan 7, 2015 at 5:19 PM, Philip Reames > <listmail at philipreames.com <mailto:listmail at philipreames.com>> wrote: > > The basic idea of this patch is to use profiling information about > the frequency of a backedges to separate a loop with multiple > latches into a loop nest with the rare backedge being the outer > loop. We already use a set of static heuristics based on > non-varying arguments to PHI nodes to do the same type of thing. > > The motivation is that doing this pulls rarely executed code out > of the innermost loop and tends to greatly simplify analysis and > optimization of that loop. > > > This is good thing to do from the code/block placement point of view > -- to improve icache utilization. Code layout is one of the most > important passes that can benefit from profile data.We already try to do that using profile information at a later point. I don't know how effective that is and how much room we might have to improve, but that really wasn't a motivation for this approach.> > In particular, it can enable substantial LICM gains when there are > clobbering calls down rare paths. > > > How can this enable more LICM?Given you now have a loop nest with a simpler inner loop, we can sometimes perform LICM on that inner loop. Note that this is equivalent to running a smarter PRE on the original loop structure.> > * I chose to implement this without relying on the existing > block frequency analysis. My reasoning was that a) this is a > rarely taken case and adding an expensive analysis dependency > probably wasn't worthwhile and b) that block frequency > analysis was more expensive/precise than I really needed. Is > this reasonable? > > > IMO no. Remember your pass should also work with real profile data > (instrumentation or sample based) -- you should rely on existing > profile infrastructure to provide what you need. (Ideally you should > treat it as something that is always available for query).There's a distinction here between 'source of information' and 'source of heuristics'. By accessing the metadata directly, I'm still using whatever profile information the user provided. I'm currently exploring other approaches to this problem, but if I do end up exploring this further, I'll likely rewrite to use one the existing analyses. Philip -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150119/b0202be1/attachment.html>
Maybe Matching Threads
- [LLVMdev] Separating loop nests based on profile information?
- [LLVMdev] RFC: Should we have (something like) -extra-vectorizer-passes in -O2?
- [LLVMdev] IR Passes and TargetTransformInfo: Straw Man
- [LLVMdev] IR Passes and TargetTransformInfo: Straw Man
- Separate LoopVectorize LLVM pass