thr3ads.net - llvm dev - [LLVMdev] Separating loop nests based on profile information? [Jan 2015]

If this information is useful, please help other people find it:
Share via:

Philip Reames

2015-Jan-08 01:19 UTC

[LLVMdev] Separating loop nests based on profile information?

I've been playing with approaches to getting better optimization of 
loops which contain infrequently executed slow paths.  I've gotten as 
far as throwing together a proof of concept implementation of a profile 
guided optimization to separate a single loop with multiple latches into 
a loop nest, but I want to get feedback from interested parties before 
investing much more effort.

The primary case I'm concerned about looks something like this C example:
while( some_condition )
   //do lots of work here
   if (!some_rare_condition) { <-- This is loop variant
     helper();
   }
}

The key point here is that the 'do lots of work' section contains things
we could lift out of the loop, but our knowledge of memory gets 
completely destoryed by the 'helper' call which doesn't get inlined.
This ends up crippling LICM in particular, and GVN to a lessor degree.

The approach I've been playing with would create this loop structure:
goto inner
while (true) {
   outer:
   helper();
   inner:
   while( some_condition )
     //do lots of work here
     if (!some_rare_condition) { <-- This is loop variant
       goto outer;
     }
   }
}

(Excuse the psuedo-C.  Hopefully the idea is clear.)

You can see my stalking horse implementation here:
http://reviews.llvm.org/D6872

This is not an actual patch per se; it's just intended to make the 
discussion more concrete and thus hopefully easier to follow.

The basic idea of this patch is to use profiling information about the 
frequency of a backedges to separate a loop with multiple latches into a 
loop nest with the rare backedge being the outer loop. We already use a 
set of static heuristics based on non-varying arguments to PHI nodes to 
do the same type of thing.

The motivation is that doing this pulls rarely executed code out of the 
innermost loop and tends to greatly simplify analysis and optimization 
of that loop. In particular, it can enable substantial LICM gains when 
there are clobbering calls down rare paths. The original motivating case 
for this was the slow path of a safepoint poll, but I believe the 
possible benefit is more general as well.

Points for discussion:

  * Is using profile information for this purpose even a reasonable
    thing to do?
  * I chose to implement this without relying on the existing block
    frequency analysis. My reasoning was that a) this is a rarely taken
    case and adding an expensive analysis dependency probably wasn't
    worthwhile and b) that block frequency analysis was more
    expensive/precise than I really needed. Is this reasonable?
  * If so, is the notion of 'rareness' of a loop block something
that's
    worth extracting out on it's own and reusing? Are there other
    similar uses anyone can think of?
  * Currently, I'm only supporting a fairly small set of controlling
    conditions. Are there important cases I'm not considering?
  * Since the rarest latch is often deep in a loop - with other "if (X)
    continue;" (i.e. latches) before it - this tends to create loops
    with multiple exiting blocks. Some of the existing passes might not
    deal with this well, is that a major concern? Suggestions for how to
    analysis and validate?
  * Currently, I've structured this as pulling off the rarest latch as
    an outer iteration. I could also pull off the most popular latch as
    an inner iteration. This might give different tradeoffs; thoughts?

Generally, any thoughts anyone have on the problem or approach are 
welcome. I'm not particular attached to the particular approach laid out 
here and if there's a more advantageous approach, all the better.





-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150107/83c3990d/attachment.html>

Chandler Carruth

2015-Jan-08 01:33 UTC

head link

[LLVMdev] Separating loop nests based on profile information?

On Wed, Jan 7, 2015 at 5:19 PM, Philip Reames <listmail at
philipreames.com>
wrote:
> I've been playing with approaches to getting better optimization of
loops
> which contain infrequently executed slow paths.  I've gotten as far as
> throwing together a proof of concept implementation of a profile guided
> optimization to separate a single loop with multiple latches into a loop
> nest, but I want to get feedback from interested parties before investing
> much more effort.
>
> The primary case I'm concerned about looks something like this C
example:
> while( some_condition )
>   //do lots of work here
>   if (!some_rare_condition) { <-- This is loop variant
>     helper();
>   }
> }
>
> The key point here is that the 'do lots of work' section contains
things
> we could lift out of the loop, but our knowledge of memory gets completely
> destoryed by the 'helper' call which doesn't get inlined.  This
ends up
> crippling LICM in particular, and GVN to a lessor degree.
>
> The approach I've been playing with would create this loop structure:
> goto inner
> while (true) {
>   outer:
>   helper();
>   inner:
>   while( some_condition )
>     //do lots of work here
>     if (!some_rare_condition) { <-- This is loop variant
>       goto outer;
>     }
>   }
> }
>
> (Excuse the psuedo-C.  Hopefully the idea is clear.)
>
Yep.

I've not thought about this a lot, but I have two initial questions that
maybe you can answer:

How does this compare with classical approaches of loop peeling,
partitioning, fission, or whatever you might call it? Is there any
literature behind this approach or some literature it should be compared
with? (I genuinely don't know this area that well, so I'm of little help
here...)


Some of your points I have quick feedback on:
> Points for discussion:
>
>    - Is using profile information for this purpose even a reasonable
>    thing to do?
>
> Yes!
>
>    - I chose to implement this without relying on the existing block
>    frequency analysis. My reasoning was that a) this is a rarely taken case
>    and adding an expensive analysis dependency probably wasn't
worthwhile and
>    b) that block frequency analysis was more expensive/precise than I
really
>    needed. Is this reasonable?
>
> I think we should always use the analyses. Either BlockFrequency orBranchProbability. I think probably both in the common joint usage
(frequency of the loop header combined with probability of the cold region).
>
>    - If so, is the notion of 'rareness' of a loop block something
that's
>    worth extracting out on it's own and reusing? Are there other
similar uses
>    anyone can think of?
>    - Currently, I'm only supporting a fairly small set of controlling
>    conditions. Are there important cases I'm not considering?
>
> To both of these, I think the general combination to use is to identifythe set of blocks dominated by a block which is in the loop body of a hot
loop, and is cold relative to the other successors of its predecessor(s).
These form cold "regions" as I think of them without requiring the
complexity of the region analysis.

>
>    - Since the rarest latch is often deep in a loop - with other "if
(X)
>    continue;" (i.e. latches) before it - this tends to create loops
with
>    multiple exiting blocks. Some of the existing passes might not deal with
>    this well, is that a major concern? Suggestions for how to analysis and
>    validate?
>
> I'm somewhat concerned about this, but want to think more about thefundamental transformation.

>
>    - Currently, I've structured this as pulling off the rarest latch as
>    an outer iteration. I could also pull off the most popular latch as an
>    inner iteration. This might give different tradeoffs; thoughts?
>
> Generally, any thoughts anyone have on the problem or approach are
> welcome. I'm not particular attached to the particular approach laid
out
> here and if there's a more advantageous approach, all the better.
>Thanks for pushing on this! ;] Now I need to go and ponder a lot so i can
reply more deeply on the actual transform.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150107/88bb4a04/attachment.html>

Adam Nemet

2015-Jan-08 02:42 UTC

head link

[LLVMdev] Separating loop nests based on profile information?

> On Jan 7, 2015, at 5:33 PM, Chandler Carruth <chandlerc at
google.com> wrote:
> 
> 
> On Wed, Jan 7, 2015 at 5:19 PM, Philip Reames <listmail at
philipreames.com <mailto:listmail at philipreames.com>> wrote:
> I've been playing with approaches to getting better optimization of
loops which contain infrequently executed slow paths.  I've gotten as far as
throwing together a proof of concept implementation of a profile guided
optimization to separate a single loop with multiple latches into a loop nest,
but I want to get feedback from interested parties before investing much more
effort.
> 
> The primary case I'm concerned about looks something like this C
example:
> while( some_condition )
>   //do lots of work here
>   if (!some_rare_condition) { <-- This is loop variant
>     helper();
>   }
> }
> 
> The key point here is that the 'do lots of work' section contains
things we could lift out of the loop, but our knowledge of memory gets
completely destoryed by the 'helper' call which doesn't get inlined.
This ends up crippling LICM in particular, and GVN to a lessor degree.
> 
> The approach I've been playing with would create this loop structure:
> goto inner
> while (true) {
>   outer:
>   helper();
>   inner:
>   while( some_condition )
>     //do lots of work here
>     if (!some_rare_condition) { <-- This is loop variant
>       goto outer;
>     }
>   }
> }
> 
> (Excuse the psuedo-C.  Hopefully the idea is clear.)
> 
> Yep.
> 
> I've not thought about this a lot, but I have two initial questions
that maybe you can answer:
> 
> How does this compare with classical approaches of loop peeling,
partitioning, fission, or whatever you might call it? Is there any literature
behind this approach or some literature it should be compared with? (I genuinely
don't know this area that well, so I'm of little help here…)
I think it’s pretty different from loop distribution/fission.  In loop
distribution, in order to split the loop into multiple consecutive loops, you
need to reason about the memory references so having a function call makes that
difficult/impossible.  Phillip’s idea works around this exact issue.

I explored this space quite a bit recently because we’re working on a proposal
to add loop distribution to LLVM.  I hope to send out the proposal this week.

So I see no issue with trying to handle loops with low-probablity function calls
with this and fully self-contained loops with classical loop distribution.

Thanks,
Adam
> Some of your points I have quick feedback on:
> Points for discussion:
> 
> Is using profile information for this purpose even a reasonable thing to
do?
> Yes!
>  
> I chose to implement this without relying on the existing block frequency
analysis. My reasoning was that a) this is a rarely taken case and adding an
expensive analysis dependency probably wasn't worthwhile and b) that block
frequency analysis was more expensive/precise than I really needed. Is this
reasonable?
> I think we should always use the analyses. Either BlockFrequency or
BranchProbability. I think probably both in the common joint usage (frequency of
the loop header combined with probability of the cold region).
> If so, is the notion of 'rareness' of a loop block something
that's worth extracting out on it's own and reusing? Are there other
similar uses anyone can think of?
> Currently, I'm only supporting a fairly small set of controlling
conditions. Are there important cases I'm not considering?
> To both of these, I think the general combination to use is to identify the
set of blocks dominated by a block which is in the loop body of a hot loop, and
is cold relative to the other successors of its predecessor(s). These form cold
"regions" as I think of them without requiring the complexity of the
region analysis.
>  
> Since the rarest latch is often deep in a loop - with other "if (X)
continue;" (i.e. latches) before it - this tends to create loops with
multiple exiting blocks. Some of the existing passes might not deal with this
well, is that a major concern? Suggestions for how to analysis and validate?
> I'm somewhat concerned about this, but want to think more about the
fundamental transformation.
>  
> Currently, I've structured this as pulling off the rarest latch as an
outer iteration. I could also pull off the most popular latch as an inner
iteration. This might give different tradeoffs; thoughts?
> Generally, any thoughts anyone have on the problem or approach are welcome.
I'm not particular attached to the particular approach laid out here and if
there's a more advantageous approach, all the better.
> 
> Thanks for pushing on this! ;] Now I need to go and ponder a lot so i can
reply more deeply on the actual transform.
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu>        
http://llvm.cs.uiuc.edu <http://llvm.cs.uiuc.edu/>
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
<http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150107/683bfe08/attachment.html>

Philip Reames

2015-Jan-08 19:37 UTC

head link

[LLVMdev] Separating loop nests based on profile information?

On 01/07/2015 05:33 PM, Chandler Carruth wrote:>
> On Wed, Jan 7, 2015 at 5:19 PM, Philip Reames 
> <listmail at philipreames.com <mailto:listmail at
philipreames.com>> wrote:
>
>     I've been playing with approaches to getting better optimization
>     of loops which contain infrequently executed slow paths.  I've
>     gotten as far as throwing together a proof of concept
>     implementation of a profile guided optimization to separate a
>     single loop with multiple latches into a loop nest, but I want to
>     get feedback from interested parties before investing much more
>     effort.
>
>     The primary case I'm concerned about looks something like this C
>     example:
>     while( some_condition )
>       //do lots of work here
>       if (!some_rare_condition) { <-- This is loop variant
>         helper();
>       }
>     }
>
>     The key point here is that the 'do lots of work' section
contains
>     things we could lift out of the loop, but our knowledge of memory
>     gets completely destoryed by the 'helper' call which
doesn't get
>     inlined.  This ends up crippling LICM in particular, and GVN to a
>     lessor degree.
>
>     The approach I've been playing with would create this loop
structure:
>     goto inner
>     while (true) {
>       outer:
>       helper();
>       inner:
>       while( some_condition )
>         //do lots of work here
>         if (!some_rare_condition) { <-- This is loop variant
>           goto outer;
>         }
>       }
>     }
>
>     (Excuse the psuedo-C.  Hopefully the idea is clear.)
>
>
> Yep.
>
> I've not thought about this a lot, but I have two initial questions 
> that maybe you can answer:
>
> How does this compare with classical approaches of loop peeling, 
> partitioning, fission, or whatever you might call it? Is there any 
> literature behind this approach or some literature it should be 
> compared with? (I genuinely don't know this area that well, so I'm
of
> little help here...)I'm not aware of any literature that covers the specific transform I'm 
suggesting here.  This is probably a lack of awareness on my part, not a 
lack of literature though.  While I have some basic background in the 
area, I can't claim to have made an extensive study of loop optimization 
techniques.  *Particularly* profile guided ones.

w.r.t. the tradoffs in practice, I'm going to answer this in a separate 
response just to keep this one short.  I've been playing with both 
approaches and there appears to be appeal in both.  I'm currently 
leaning towards a mixture of both approaches.
>
> Some of your points I have quick feedback on:
>
>     Points for discussion:
>
>       * Is using profile information for this purpose even a
>         reasonable thing to do?
>
> Yes!
>
>       * I chose to implement this without relying on the existing
>         block frequency analysis. My reasoning was that a) this is a
>         rarely taken case and adding an expensive analysis dependency
>         probably wasn't worthwhile and b) that block frequency
>         analysis was more expensive/precise than I really needed. Is
>         this reasonable?
>
> I think we should always use the analyses. Either BlockFrequency or 
> BranchProbability. I think probably both in the common joint usage 
> (frequency of the loop header combined with probability of the cold 
> region).Let me ask a basic question: what do BlockFrequency and 
BranchProbability compute and roughly what mental cost model should I 
have?  Once I have a BlockFrequency, what does it mean?  Some of this is 
documented in BlockFrequencyImpl.h, but the high level bits are mostly 
missing.  Particularly for BranchProbability.

I chose to write a custom analysis in large part because I didn't 
understand the existing analysis and my (possibly completely wrong) 
perception is that they're overly precise/expensive for what I need.  I 
wanted to focus on the writing the transform since that's the point I 
actually wanted to discuss.  :)
>       * If so, is the notion of 'rareness' of a loop block
something
>         that's worth extracting out on it's own and reusing? Are
there
>         other similar uses anyone can think of?
>       * Currently, I'm only supporting a fairly small set of
>         controlling conditions. Are there important cases I'm not
>         considering?
>
> To both of these, I think the general combination to use is to 
> identify the set of blocks dominated by a block which is in the loop 
> body of a hot loop, and is cold relative to the other successors of 
> its predecessor(s). These form cold "regions" as I think of them 
> without requiring the complexity of the region analysis.I get why you find this a useful mental model, but the transform is far 
harder to express this way in the couple of cases I've looked at.  It's 
much easier to start with a latch, reason about it's hotness, and then 
let the generic loop code construct the cold "region".

Then again, if you were to ask for the set of latches which were 
dominated by blocks in the cold-with-hot-predecessors list, that might 
be interesting.  But again, I'd really like to start from the latches 
since that's what controls the legality/profitability of the transform.  
Hm, we could walk back along the dom tree from the latch looking for 
such a cwhp block... This might be a better way to think about the 
profitability of the transform, even starting from the latch.  Is that 
what you were trying to get at to start with?  Or were you looking for 
something more general?
>       * Since the rarest latch is often deep in a loop - with other
>         "if (X) continue;" (i.e. latches) before it - this tends
to
>         create loops with multiple exiting blocks. Some of the
>         existing passes might not deal with this well, is that a major
>         concern? Suggestions for how to analysis and validate?
>
> I'm somewhat concerned about this, but want to think more about the 
> fundamental transformation.
>
>       * Currently, I've structured this as pulling off the rarest
>         latch as an outer iteration. I could also pull off the most
>         popular latch as an inner iteration. This might give different
>         tradeoffs; thoughts?
>
>     Generally, any thoughts anyone have on the problem or approach are
>     welcome. I'm not particular attached to the particular approach
>     laid out here and if there's a more advantageous approach, all the
>     better.
>
> Thanks for pushing on this! ;] Now I need to go and ponder a lot so i 
> can reply more deeply on the actual transform.You're welcome.  :)

This might be an area where it's profitable to talk in person, are you 
going to be at the social tonight?

Philip

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150108/bd67cbe3/attachment.html>

Philip Reames

2015-Jan-08 20:33 UTC

head link

[LLVMdev] Separating loop nests based on profile information?

On 01/07/2015 05:33 PM, Chandler Carruth wrote:> How does this compare with classical approaches of loop peeling, 
> partitioning, fission, or whatever you might call it?I'm still developing a good sense for this, but let me lay out some 
observations.  Fair warning, this is going to be long and rambling.

Let's start with a toy example:
while(c) {
   x = this->x;
   y = this->y;
   if (x == y) {
     rare();
   }
}

If we could tell x and y were loop invariant, we could unswitch this 
loop.  However, the rare call clobbers our view of memory, so LICM 
fails, and thus unswitch fails.

We'd like to apply PRE here and push the reload into the loop preheader 
and the rare case.  This currently fails for two reasons: 1) We don't 
allow code duplication during PRE, and 2) the canonical loop-simplify 
form of having a single latch means that PRE doesn't actually see the 
rare block at all, it sees the preheader and the join point after the if 
block.

I think both of the previous are fixable:
1) This calls for profile guided duplication.  If we can trade one hot 
load for two cold ones, that's great.  Also, there might be static 
heuristics we could construct.  (e.g. if it's not available in the 
preheader and one of many latches...)
2) We need PRE to look back through merge blocks.  This should possibly 
even live in MemoryDependenceAnalysis.  Alternatively, we could create a 
"deconstruct loop simplify form" pass to run right before GVN to side 
step this issue.

Neither is entirely simple to implement, but both are doable. Probably.

In theory, we've now solved the same problem my loop nest transform 
does.  There's two snags: one minor(ish), one major.

The minor one is that many of my rare conditions involve volatile 
loads.  MemDepAnalysis currently completely gives up when faced with 
volatile loads.  This is mostly fixable, but given the structure of the 
code may be fairly invasive.  Oddly, LICM actually performs better in 
this case.

For the major one, let's tweak our example slightly:
while(c) {
   x = this->x;
   if (x == null) return;
   y = this->y;
   if (y == null) return;
   if (x == y) {
     rare();
   }
}

Let's assume that PRE worked as designed for 'x'.  We now need to 
perform a transform analogous to loop unswitch, but a bit different.  We 
need to move the condition loop exits into each of the predecessors 
where the answers not available.  I don't believe we have anything like 
this today, and it involves a non-trivial bit of code.  Even if we get 
that done, we're left with a pass ordering problem where we need to run 
PRE, then branch-PRE, then PRE, then branch-PRE, etc..  (In practice, 
these chains can be quite long due to null checks and range checks.)

If we'd split this into a loop nest structure, we'd still have the 
chain, but a) that's easier to control with a custom pass order since 
LICM and LoopUnswitch are far cheaper than GVN/PRE and b) having LICM do 
a bit of trivial loop unswitch for the terminator of the header appears 
quite tractable.

One other point worth mentioning is that LLVM does not currently have a 
loop peeling pass.  LoopRotate seems to get us most of the way there in 
common cases, but not always.  If we did peel the above loop once, we'd 
have the same merge block problem I mentioned previously (this time in 
the preheader *and* the latch), but the profitability becomes slightly 
more clear without the need for profile information or loop specific 
heuristics.

To be clear, I'm not opposed to fixing some of the issues above.  I 
think that's a great idea, I'm just trying to find my way through to 
practical performance improvements with minimal invested effort.  I 
probably will in fact fix some of these since they show up in a number 
of other cases as well.


Ok, so that's that.  Let's move on to why my proposed loop nest 
separation change is a bit scary.

While the loop nest does allow things to be lifted from the inner loop 
to the outer loop - without any of the complex PRE or PGO reasoning 
required above - it suffers from all of the limitations of the existing 
loop optimization passes.  In particular, LICM may not actually be able 
to do much if the inner loop is still complicated enough to defeat 
either it's fairly simple alias analysis or fault analysis.  GVN/PRE is 
strictly more powerful - in theory if not always practice - here than 
what we have in LICM.

We've also potentially created a loop nest which is harder to analyse 
than the original loop with two latches.  Consider this example:
for(int i = 0; i < 20000) {
    i++;
    if(i % 20 != 0) continue;
    a[i] = 0;
}
(This is not a real example of anything useful.  The important point is 
the early return is the common case.)

This loop clearly runs exactly 20000 times.  ScalarEvolution almost 
certainly gets this right. (I haven't checked.)

After separation, we'd have this:

int i = 0;
while(true) {
   while(i < 20000) {
      i++;
      if(i % 20 == 0) { // now a loop exit
        a[i] = 0;
        goto next:
      }
}
break:
   next:
}

It's not clear to me that SE will always be able to tell that both the 
inner and outer loop runs a maximum of 20000 times.  (In this particular 
case it might actually optimize the inner loop substantially and thus 
get a more precise answer, but let's ignore that.)  We could potentially 
loose information about the loop by doing this transformation.  Of 
course, this concern applies to the *existing* loop nest separation 
heuristics too.  I don't have any particular reason to believe one 
heuristic is worse than the other, but I also have no reason not to 
believe that either.

As a final point, the inner loop now has two exiting blocks not one.  I 
know at least some of our loop transforms give up on anything with more 
than one loop exit.  I'm less concerned by that since a) we should fix 
them, and b) the cases I've looked at don't look too hard.


And that's everything that occurs to me at the moment.  I'll probably 
have to append in further responses as I remember more details, I know 
I'm forgetting some pieces.

Philip

Krzysztof Parzyszek

2015-Jan-12 19:00 UTC

head link

[LLVMdev] Separating loop nests based on profile information?

On 1/7/2015 7:19 PM, Philip Reames wrote:>
> The approach I've been playing with would create this loop structure:
> goto inner
> while (true) {
>    outer:
>    helper();
>    inner:
>    while( some_condition )
>      //do lots of work here
>      if (!some_rare_condition) { <-- This is loop variant
>        goto outer;
>      }
>    }
> }
>
Anything that changes loop structure into something like this, is a 
fundamentally bad idea.  This would make the outer loop irreducible, 
which should be avoided.

A lot better approach would be to privatize the invariant objects, and 
write them back to memory before/after the "helper" is invoked:

     T1 = p->x;
     T2 = P->y;
     while( some_condition )
       //do lots of work here, use T1 and T2
       if (some_rare_condition) { <-- This is loop variant
         p->x = T1;
         p->y = T2;
         helper();
         T1 = p->x;
         T2 = p->y;
       }
     }
     p->x = T1;
     p->y = T2;


-Krzysztof

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, 
hosted by The Linux Foundation

Krzysztof Parzyszek

2015-Jan-12 19:40 UTC

head link

[LLVMdev] Separating loop nests based on profile information?

Upon closer inspection, the loop will not become irreducible.  However, 
the result will be equivalent to reloading the hoisted values after 
calling "helper".  In fact, that's what may happen if some CFG 
optimization cleans this up.

-Krzysztof


On 1/12/2015 1:00 PM, Krzysztof Parzyszek wrote:> On 1/7/2015 7:19 PM, Philip Reames wrote:
>>
>> The approach I've been playing with would create this loop
structure:
>> goto inner
>> while (true) {
>>    outer:
>>    helper();
>>    inner:
>>    while( some_condition )
>>      //do lots of work here
>>      if (!some_rare_condition) { <-- This is loop variant
>>        goto outer;
>>      }
>>    }
>> }
>>
>
> Anything that changes loop structure into something like this, is a
> fundamentally bad idea.  This would make the outer loop irreducible,
> which should be avoided.
>
> A lot better approach would be to privatize the invariant objects, and
> write them back to memory before/after the "helper" is invoked:
>
>      T1 = p->x;
>      T2 = P->y;
>      while( some_condition )
>        //do lots of work here, use T1 and T2
>        if (some_rare_condition) { <-- This is loop variant
>          p->x = T1;
>          p->y = T2;
>          helper();
>          T1 = p->x;
>          T2 = p->y;
>        }
>      }
>      p->x = T1;
>      p->y = T2;
>
>
> -Krzysztof
>

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, 
hosted by The Linux Foundation

Xinliang David Li

2015-Jan-13 03:27 UTC

head link

[LLVMdev] Separating loop nests based on profile information?

On Wed, Jan 7, 2015 at 5:19 PM, Philip Reames <listmail at
philipreames.com>
wrote:
>  I've been playing with approaches to getting better optimization of
loops
> which contain infrequently executed slow paths.  I've gotten as far as
> throwing together a proof of concept implementation of a profile guided
> optimization to separate a single loop with multiple latches into a loop
> nest, but I want to get feedback from interested parties before investing
> much more effort.
>
> The primary case I'm concerned about looks something like this C
example:
> while( some_condition )
>   //do lots of work here
>   if (!some_rare_condition) { <-- This is loop variant
>     helper();
>   }
> }
>
> The key point here is that the 'do lots of work' section contains
things
> we could lift out of the loop, but our knowledge of memory gets completely
> destoryed by the 'helper' call which doesn't get inlined.  This
ends up
> crippling LICM in particular, and GVN to a lessor degree.
>
> The approach I've been playing with would create this loop structure:
> goto inner
> while (true) {
>   outer:
>   helper();
>   inner:
>   while( some_condition )
>     //do lots of work here
>     if (!some_rare_condition) { <-- This is loop variant
>       goto outer;
>     }
>   }
> }
>
> (Excuse the psuedo-C.  Hopefully the idea is clear.)
>
> You can see my stalking horse implementation here:
> http://reviews.llvm.org/D6872
>
> This is not an actual patch per se; it's just intended to make the
> discussion more concrete and thus hopefully easier to follow.
>
> The basic idea of this patch is to use profiling information about the
> frequency of a backedges to separate a loop with multiple latches into a
> loop nest with the rare backedge being the outer loop. We already use a set
> of static heuristics based on non-varying arguments to PHI nodes to do the
> same type of thing.
>
> The motivation is that doing this pulls rarely executed code out of the
> innermost loop and tends to greatly simplify analysis and optimization of
> that loop.
>
This is good thing to do from the code/block placement point of view -- to
improve icache utilization. Code layout is one of the most important passes
that can benefit from profile data.


> In particular, it can enable substantial LICM gains when there are
> clobbering calls down rare paths.
>
How can this enable more LICM?

> The original motivating case for this was the slow path of a safepoint
> poll, but I believe the possible benefit is more general as well.
>
> Points for discussion:
>
>    - Is using profile information for this purpose even a reasonable
>    thing to do?
>
>yes.

>
>    - I chose to implement this without relying on the existing block
>    frequency analysis. My reasoning was that a) this is a rarely taken case
>    and adding an expensive analysis dependency probably wasn't
worthwhile and
>    b) that block frequency analysis was more expensive/precise than I
really
>    needed. Is this reasonable?
>
>IMO no. Remember your pass should also work with real profile data
(instrumentation or sample based) -- you should  rely on existing profile
infrastructure to provide what you need. (Ideally you should treat it as
something that is always available for query).

>
>    - If so, is the notion of 'rareness' of a loop block something
that's
>    worth extracting out on it's own and reusing? Are there other
similar uses
>    anyone can think of?
>
>Please do not introduce new notions for profile information -- there are
already many:)  You have block 'frequency' to represent intra-function
relative hotness. Currently missing, but if we need to represent
inter-procedural global hotness, something like bb execution 'count' may
be
needed.

thanks,

David

>
>    - Currently, I'm only supporting a fairly small set of controlling
>    conditions. Are there important cases I'm not considering?
>    - Since the rarest latch is often deep in a loop - with other "if
(X)
>    continue;" (i.e. latches) before it - this tends to create loops
with
>    multiple exiting blocks. Some of the existing passes might not deal with
>    this well, is that a major concern? Suggestions for how to analysis and
>    validate?
>    - Currently, I've structured this as pulling off the rarest latch as
>    an outer iteration. I could also pull off the most popular latch as an
>    inner iteration. This might give different tradeoffs; thoughts?
>
> Generally, any thoughts anyone have on the problem or approach are
> welcome. I'm not particular attached to the particular approach laid
out
> here and if there's a more advantageous approach, all the better.
>
>
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150112/295c2791/attachment.html>

Xinliang David Li

2015-Jan-13 03:39 UTC

head link

[LLVMdev] Separating loop nests based on profile information?

On Wed, Jan 7, 2015 at 5:33 PM, Chandler Carruth <chandlerc at google.com>
wrote:
>
> On Wed, Jan 7, 2015 at 5:19 PM, Philip Reames <listmail at
philipreames.com>
> wrote:
>
>> I've been playing with approaches to getting better optimization of
loops
>> which contain infrequently executed slow paths.  I've gotten as far
as
>> throwing together a proof of concept implementation of a profile guided
>> optimization to separate a single loop with multiple latches into a
loop
>> nest, but I want to get feedback from interested parties before
investing
>> much more effort.
>>
>> The primary case I'm concerned about looks something like this C
example:
>> while( some_condition )
>>   //do lots of work here
>>   if (!some_rare_condition) { <-- This is loop variant
>>     helper();
>>   }
>> }
>>
>> The key point here is that the 'do lots of work' section
contains things
>> we could lift out of the loop, but our knowledge of memory gets
completely
>> destoryed by the 'helper' call which doesn't get inlined. 
This ends up
>> crippling LICM in particular, and GVN to a lessor degree.
>>
>> The approach I've been playing with would create this loop
structure:
>> goto inner
>> while (true) {
>>   outer:
>>   helper();
>>   inner:
>>   while( some_condition )
>>     //do lots of work here
>>     if (!some_rare_condition) { <-- This is loop variant
>>       goto outer;
>>     }
>>   }
>> }
>>
>> (Excuse the psuedo-C.  Hopefully the idea is clear.)
>>
>
> Yep.
>
> I've not thought about this a lot, but I have two initial questions
that
> maybe you can answer:
>
> How does this compare with classical approaches of loop peeling,
> partitioning, fission, or whatever you might call it? Is there any
> literature behind this approach or some literature it should be compared
> with? (I genuinely don't know this area that well, so I'm of little
help
> here...)
>

It is not any of those loop transformations. It does not even change the
control flow. It is more a code layout optimization -- move the cold trace
in the loop out of the body.

>
>
> Some of your points I have quick feedback on:
>
>> Points for discussion:
>>
>>    - Is using profile information for this purpose even a reasonable
>>    thing to do?
>>
>> Yes!
>
>
>>
>>    - I chose to implement this without relying on the existing block
>>    frequency analysis. My reasoning was that a) this is a rarely taken
case
>>    and adding an expensive analysis dependency probably wasn't
worthwhile and
>>    b) that block frequency analysis was more expensive/precise than I
really
>>    needed. Is this reasonable?
>>
>> I think we should always use the analyses. Either BlockFrequency or
> BranchProbability. I think probably both in the common joint usage
> (frequency of the loop header combined with probability of the cold
region).
>
yes.

>
>>    - If so, is the notion of 'rareness' of a loop block
something that's
>>    worth extracting out on it's own and reusing? Are there other
similar uses
>>    anyone can think of?
>>    - Currently, I'm only supporting a fairly small set of
controlling
>>    conditions. Are there important cases I'm not considering?
>>
>> To both of these, I think the general combination to use is to identify
> the set of blocks dominated by a block which is in the loop body of a hot
> loop, and is cold relative to the other successors of its predecessor(s).
> These form cold "regions" as I think of them without requiring
the
> complexity of the region analysis.
>
>
Static prediction should handle it  --  call heuristics or some combination
with other heuristics (error handler recognition etc).


David

>
>>    - Since the rarest latch is often deep in a loop - with other
"if (X)
>>    continue;" (i.e. latches) before it - this tends to create
loops with
>>    multiple exiting blocks. Some of the existing passes might not deal
with
>>    this well, is that a major concern? Suggestions for how to analysis
and
>>    validate?
>>
>> I'm somewhat concerned about this, but want to think more about the
> fundamental transformation.
>
>
>>
>>    - Currently, I've structured this as pulling off the rarest
latch as
>>    an outer iteration. I could also pull off the most popular latch as
an
>>    inner iteration. This might give different tradeoffs; thoughts?
>>
>> Generally, any thoughts anyone have on the problem or approach are
>> welcome. I'm not particular attached to the particular approach
laid out
>> here and if there's a more advantageous approach, all the better.
>>
> Thanks for pushing on this! ;] Now I need to go and ponder a lot so i can
> reply more deeply on the actual transform.
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150112/69bd8bbb/attachment.html>

Philip Reames

2015-Jan-20 01:41 UTC

head link

[LLVMdev] Separating loop nests based on profile information?

On 01/12/2015 07:27 PM, Xinliang David Li wrote:>
>
> On Wed, Jan 7, 2015 at 5:19 PM, Philip Reames 
> <listmail at philipreames.com <mailto:listmail at
philipreames.com>> wrote:
>
>     The basic idea of this patch is to use profiling information about
>     the frequency of a backedges to separate a loop with multiple
>     latches into a loop nest with the rare backedge being the outer
>     loop. We already use a set of static heuristics based on
>     non-varying arguments to PHI nodes to do the same type of thing.
>
>     The motivation is that doing this pulls rarely executed code out
>     of the innermost loop and tends to greatly simplify analysis and
>     optimization of that loop.
>
>
> This is good thing to do from the code/block placement point of view 
> -- to improve icache utilization. Code layout is one of the most 
> important passes that can benefit from profile data.We already try to do that using profile information at a later point.  I 
don't know how effective that is and how much room we might have to 
improve, but that really wasn't a motivation for this
approach.>
>     In particular, it can enable substantial LICM gains when there are
>     clobbering calls down rare paths.
>
>
> How can this enable more LICM?Given you now have a loop nest with a simpler inner loop, we can 
sometimes perform LICM on that inner loop.

Note that this is equivalent to running a smarter PRE on the original 
loop structure.>
>       * I chose to implement this without relying on the existing
>         block frequency analysis. My reasoning was that a) this is a
>         rarely taken case and adding an expensive analysis dependency
>         probably wasn't worthwhile and b) that block frequency
>         analysis was more expensive/precise than I really needed. Is
>         this reasonable?
>
>
> IMO no. Remember your pass should also work with real profile data 
> (instrumentation or sample based) -- you should  rely on existing 
> profile infrastructure to provide what you need. (Ideally you should 
> treat it as something that is always available for query).There's a distinction here between 'source of information' and
'source
of heuristics'.  By accessing the metadata directly, I'm still using 
whatever profile information the user provided.

I'm currently exploring other approaches to this problem, but if I do 
end up exploring this further, I'll likely rewrite to use one the 
existing analyses.

Philip
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150119/b0202be1/attachment.html>

Reasonably Related Threads

Search for more possibly parallel threads

llvm dev - Jan 2015 - [LLVMdev] Separating loop nests based on profile information?

[LLVMdev] Separating loop nests based on profile information?

[LLVMdev] Separating loop nests based on profile information?

[LLVMdev] Separating loop nests based on profile information?

[LLVMdev] Separating loop nests based on profile information?

[LLVMdev] Separating loop nests based on profile information?

[LLVMdev] Separating loop nests based on profile information?

[LLVMdev] Separating loop nests based on profile information?

[LLVMdev] Separating loop nests based on profile information?

[LLVMdev] Separating loop nests based on profile information?

[LLVMdev] Separating loop nests based on profile information?

Reasonably Related Threads