thr3ads.net - llvm dev - [LLVMdev] Separating loop nests based on profile information? [Jan 2015]

If this information is useful, please help other people find it:
Share via:

Philip Reames

2015-Jan-08 20:33 UTC

[LLVMdev] Separating loop nests based on profile information?

On 01/07/2015 05:33 PM, Chandler Carruth wrote:> How does this compare with classical approaches of loop peeling, 
> partitioning, fission, or whatever you might call it?I'm still developing a good sense for this, but let me lay out some 
observations.  Fair warning, this is going to be long and rambling.

Let's start with a toy example:
while(c) {
   x = this->x;
   y = this->y;
   if (x == y) {
     rare();
   }
}

If we could tell x and y were loop invariant, we could unswitch this 
loop.  However, the rare call clobbers our view of memory, so LICM 
fails, and thus unswitch fails.

We'd like to apply PRE here and push the reload into the loop preheader 
and the rare case.  This currently fails for two reasons: 1) We don't 
allow code duplication during PRE, and 2) the canonical loop-simplify 
form of having a single latch means that PRE doesn't actually see the 
rare block at all, it sees the preheader and the join point after the if 
block.

I think both of the previous are fixable:
1) This calls for profile guided duplication.  If we can trade one hot 
load for two cold ones, that's great.  Also, there might be static 
heuristics we could construct.  (e.g. if it's not available in the 
preheader and one of many latches...)
2) We need PRE to look back through merge blocks.  This should possibly 
even live in MemoryDependenceAnalysis.  Alternatively, we could create a 
"deconstruct loop simplify form" pass to run right before GVN to side 
step this issue.

Neither is entirely simple to implement, but both are doable. Probably.

In theory, we've now solved the same problem my loop nest transform 
does.  There's two snags: one minor(ish), one major.

The minor one is that many of my rare conditions involve volatile 
loads.  MemDepAnalysis currently completely gives up when faced with 
volatile loads.  This is mostly fixable, but given the structure of the 
code may be fairly invasive.  Oddly, LICM actually performs better in 
this case.

For the major one, let's tweak our example slightly:
while(c) {
   x = this->x;
   if (x == null) return;
   y = this->y;
   if (y == null) return;
   if (x == y) {
     rare();
   }
}

Let's assume that PRE worked as designed for 'x'.  We now need to 
perform a transform analogous to loop unswitch, but a bit different.  We 
need to move the condition loop exits into each of the predecessors 
where the answers not available.  I don't believe we have anything like 
this today, and it involves a non-trivial bit of code.  Even if we get 
that done, we're left with a pass ordering problem where we need to run 
PRE, then branch-PRE, then PRE, then branch-PRE, etc..  (In practice, 
these chains can be quite long due to null checks and range checks.)

If we'd split this into a loop nest structure, we'd still have the 
chain, but a) that's easier to control with a custom pass order since 
LICM and LoopUnswitch are far cheaper than GVN/PRE and b) having LICM do 
a bit of trivial loop unswitch for the terminator of the header appears 
quite tractable.

One other point worth mentioning is that LLVM does not currently have a 
loop peeling pass.  LoopRotate seems to get us most of the way there in 
common cases, but not always.  If we did peel the above loop once, we'd 
have the same merge block problem I mentioned previously (this time in 
the preheader *and* the latch), but the profitability becomes slightly 
more clear without the need for profile information or loop specific 
heuristics.

To be clear, I'm not opposed to fixing some of the issues above.  I 
think that's a great idea, I'm just trying to find my way through to 
practical performance improvements with minimal invested effort.  I 
probably will in fact fix some of these since they show up in a number 
of other cases as well.


Ok, so that's that.  Let's move on to why my proposed loop nest 
separation change is a bit scary.

While the loop nest does allow things to be lifted from the inner loop 
to the outer loop - without any of the complex PRE or PGO reasoning 
required above - it suffers from all of the limitations of the existing 
loop optimization passes.  In particular, LICM may not actually be able 
to do much if the inner loop is still complicated enough to defeat 
either it's fairly simple alias analysis or fault analysis.  GVN/PRE is 
strictly more powerful - in theory if not always practice - here than 
what we have in LICM.

We've also potentially created a loop nest which is harder to analyse 
than the original loop with two latches.  Consider this example:
for(int i = 0; i < 20000) {
    i++;
    if(i % 20 != 0) continue;
    a[i] = 0;
}
(This is not a real example of anything useful.  The important point is 
the early return is the common case.)

This loop clearly runs exactly 20000 times.  ScalarEvolution almost 
certainly gets this right. (I haven't checked.)

After separation, we'd have this:

int i = 0;
while(true) {
   while(i < 20000) {
      i++;
      if(i % 20 == 0) { // now a loop exit
        a[i] = 0;
        goto next:
      }
}
break:
   next:
}

It's not clear to me that SE will always be able to tell that both the 
inner and outer loop runs a maximum of 20000 times.  (In this particular 
case it might actually optimize the inner loop substantially and thus 
get a more precise answer, but let's ignore that.)  We could potentially 
loose information about the loop by doing this transformation.  Of 
course, this concern applies to the *existing* loop nest separation 
heuristics too.  I don't have any particular reason to believe one 
heuristic is worse than the other, but I also have no reason not to 
believe that either.

As a final point, the inner loop now has two exiting blocks not one.  I 
know at least some of our loop transforms give up on anything with more 
than one loop exit.  I'm less concerned by that since a) we should fix 
them, and b) the cases I've looked at don't look too hard.


And that's everything that occurs to me at the moment.  I'll probably 
have to append in further responses as I remember more details, I know 
I'm forgetting some pieces.

Philip

Philip Reames

2015-Jan-09 19:39 UTC

head link

[LLVMdev] Separating loop nests based on profile information?

On 01/08/2015 12:33 PM, Philip Reames wrote:>
> On 01/07/2015 05:33 PM, Chandler Carruth wrote:
>> How does this compare with classical approaches of loop peeling, 
>> partitioning, fission, or whatever you might call it?
> I'm still developing a good sense for this, but let me lay out some 
> observations.  Fair warning, this is going to be long and rambling.
>
> Let's start with a toy example:
> while(c) {
>   x = this->x;
>   y = this->y;
>   if (x == y) {
>     rare();
>   }
> }
>
> If we could tell x and y were loop invariant, we could unswitch this 
> loop.  However, the rare call clobbers our view of memory, so LICM 
> fails, and thus unswitch fails.
>
> We'd like to apply PRE here and push the reload into the loop 
> preheader and the rare case.  This currently fails for two reasons: 1) 
> We don't allow code duplication during PRE, and 2) the canonical 
> loop-simplify form of having a single latch means that PRE doesn't 
> actually see the rare block at all, it sees the preheader and the join 
> point after the if block.
>
> I think both of the previous are fixable:
> 1) This calls for profile guided duplication.  If we can trade one hot 
> load for two cold ones, that's great.  Also, there might be static 
> heuristics we could construct.  (e.g. if it's not available in the 
> preheader and one of many latches...)
> 2) We need PRE to look back through merge blocks.  This should 
> possibly even live in MemoryDependenceAnalysis.  Alternatively, we 
> could create a "deconstruct loop simplify form" pass to run right
> before GVN to side step this issue.Point 2 here may be inaccurate.  I've definitely seen problems here in 
the past, but when I sat down to write some test cases, it looks like 
the simple merge block for latch (with the load being considered in the 
loop header) works just fine.  I suspect my original example is actually 
blocked by something else; I'm just not sure what that is yet.

One case where I can get PRE to fail is if there's a load in one of two 
paths through the loop, in the preheader, but not in the header.  Here's 
the IR:
; Function Attrs: readonly
declare void @hold(i32) #0

declare void @clobber()

define i32 @test1(i1 %cnd, i32* %p) {
entry:
   %v2 = load i32* %p
   call void @hold(i32 %v2)
   br label %header

header:                                           ; preds = %merge, %entry
   br i1 %cnd, label %bb1, label %bb2

bb1:                                              ; preds = %header
   %v1 = load i32* %p
   call void @hold(i32 %v1)
   br label %merge

bb2:                                              ; preds = %header
   call void @clobber()
   br label %merge

merge:                                            ; preds = %bb2, %bb1
   br label %header
}

I haven't looked to see what the issue here is yet.

>
> Neither is entirely simple to implement, but both are doable. Probably.
>
> In theory, we've now solved the same problem my loop nest transform 
> does.  There's two snags: one minor(ish), one major.
>
> The minor one is that many of my rare conditions involve volatile 
> loads.  MemDepAnalysis currently completely gives up when faced with 
> volatile loads.  This is mostly fixable, but given the structure of 
> the code may be fairly invasive.  Oddly, LICM actually performs better 
> in this case.
>
> For the major one, let's tweak our example slightly:
> while(c) {
>   x = this->x;
>   if (x == null) return;
>   y = this->y;
>   if (y == null) return;
>   if (x == y) {
>     rare();
>   }
> }
>
> Let's assume that PRE worked as designed for 'x'.  We now need
to
> perform a transform analogous to loop unswitch, but a bit different.  
> We need to move the condition loop exits into each of the predecessors 
> where the answers not available.  I don't believe we have anything 
> like this today, and it involves a non-trivial bit of code.  Even if 
> we get that done, we're left with a pass ordering problem where we 
> need to run PRE, then branch-PRE, then PRE, then branch-PRE, etc..  
> (In practice, these chains can be quite long due to null checks and 
> range checks.)
>
> If we'd split this into a loop nest structure, we'd still have the 
> chain, but a) that's easier to control with a custom pass order since 
> LICM and LoopUnswitch are far cheaper than GVN/PRE and b) having LICM 
> do a bit of trivial loop unswitch for the terminator of the header 
> appears quite tractable.
>
> One other point worth mentioning is that LLVM does not currently have 
> a loop peeling pass.  LoopRotate seems to get us most of the way there 
> in common cases, but not always.  If we did peel the above loop once, 
> we'd have the same merge block problem I mentioned previously (this 
> time in the preheader *and* the latch), but the profitability becomes 
> slightly more clear without the need for profile information or loop 
> specific heuristics.
>
> To be clear, I'm not opposed to fixing some of the issues above. I 
> think that's a great idea, I'm just trying to find my way through
to
> practical performance improvements with minimal invested effort.  I 
> probably will in fact fix some of these since they show up in a number 
> of other cases as well.
>
>
> Ok, so that's that.  Let's move on to why my proposed loop nest 
> separation change is a bit scary.
>
> While the loop nest does allow things to be lifted from the inner loop 
> to the outer loop - without any of the complex PRE or PGO reasoning 
> required above - it suffers from all of the limitations of the 
> existing loop optimization passes.  In particular, LICM may not 
> actually be able to do much if the inner loop is still complicated 
> enough to defeat either it's fairly simple alias analysis or fault 
> analysis.  GVN/PRE is strictly more powerful - in theory if not always 
> practice - here than what we have in LICM.
>
> We've also potentially created a loop nest which is harder to analyse 
> than the original loop with two latches.  Consider this example:
> for(int i = 0; i < 20000) {
>    i++;
>    if(i % 20 != 0) continue;
>    a[i] = 0;
> }
> (This is not a real example of anything useful.  The important point 
> is the early return is the common case.)
>
> This loop clearly runs exactly 20000 times.  ScalarEvolution almost 
> certainly gets this right. (I haven't checked.)
>
> After separation, we'd have this:
>
> int i = 0;
> while(true) {
>   while(i < 20000) {
>      i++;
>      if(i % 20 == 0) { // now a loop exit
>        a[i] = 0;
>        goto next:
>      }
> }
> break:
>   next:
> }
>
> It's not clear to me that SE will always be able to tell that both the 
> inner and outer loop runs a maximum of 20000 times.  (In this 
> particular case it might actually optimize the inner loop 
> substantially and thus get a more precise answer, but let's ignore 
> that.)  We could potentially loose information about the loop by doing 
> this transformation.  Of course, this concern applies to the 
> *existing* loop nest separation heuristics too.  I don't have any 
> particular reason to believe one heuristic is worse than the other, 
> but I also have no reason not to believe that either.
>
> As a final point, the inner loop now has two exiting blocks not one.  
> I know at least some of our loop transforms give up on anything with 
> more than one loop exit.  I'm less concerned by that since a) we 
> should fix them, and b) the cases I've looked at don't look too
hard.
>
>
> And that's everything that occurs to me at the moment.  I'll
probably
> have to append in further responses as I remember more details, I know 
> I'm forgetting some pieces.
>
> Philip
>
>
>

Daniel Berlin

2015-Jan-12 06:17 UTC

head link

[LLVMdev] Separating loop nests based on profile information?

On Thu, Jan 8, 2015 at 3:33 PM, Philip Reames <listmail at
philipreames.com>
wrote:
>
> On 01/07/2015 05:33 PM, Chandler Carruth wrote:
>
>> How does this compare with classical approaches of loop peeling,
>> partitioning, fission, or whatever you might call it?
>>
> I'm still developing a good sense for this, but let me lay out some
> observations.  Fair warning, this is going to be long and rambling.
>
> Let's start with a toy example:
> while(c) {
>   x = this->x;
>   y = this->y;
>   if (x == y) {
>     rare();
>   }
> }
>
>
> If we could tell x and y were loop invariant, we could unswitch this
> loop.  However, the rare call clobbers our view of memory, so LICM fails,
> and thus unswitch fails.
>
>
> We'd like to apply PRE here and push the reload into the loop preheader
> and the rare case.  This currently fails for two reasons: 1) We don't
allow
> code duplication during PRE,

?????
If we don't, we aren't doing real PRE. So i'm not entirely sure what
you
mean here.


> and 2) the canonical loop-simplify form of having a single latch means
> that PRE doesn't actually see the rare block at all, it sees the
preheader
> and the join point after the if block.


>
> I think both of the previous are fixable:
>

GCC's PRE already does the above.
It doesn't do profile guided duplication.
We aren't doing anything special with these blocks.

here is the code I used to test:

struct a{
int x;
int y;
};
extern void rare();
int mai(int c, struct a *this)
{
int d = 0;
while(c) {
int x = this->x;
int y = this->y;
d += x + y;
if (x == y) {
rare();
}
}
return d;
}


It will do exactly what you expect, it is transformed into:

struct a{
int x;
int y;
};
extern void rare();
int mai(int c, struct a *this)
{
int d = 0;
        int pretemp1 = this->x
        int pretemp2 = this->y

while(c) {
                pretemp1phi = phi(rare block pretemp1, preheader pretemp1).
                pretemp2phi = phi(rare block pretemp2, preheader pretemp2)

d += pretemp1phi + pretemp2phi
if (x == y) {
rare();
                        pretemp1 = this->x;
                        pretemp2 = this->y;

}
}
return d;
}
I don't see why profile guided duplication is necessary here. This is a
basic load PRE case.  It is handled by the first version of GVN-based Load
PRE I wrote for GCC.  It is always a win.

Looking at what LLVM does, the failing on the PRE side is that our PRE/GVN
models are not strong enough to handle this. I'm not sure at all why we
think anything else is necessary.  It's certainly not requiring special
code duplication heuristics, etc.

So either you are thinking of a case that is different from the above, or I
am seriously confused :)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150112/9e2fa48d/attachment.html>

Daniel Berlin

2015-Jan-12 06:35 UTC

head link

[LLVMdev] Separating loop nests based on profile information?

>
>
> Looking at what LLVM does, the failing on the PRE side is that our PRE/GVN
> models are not strong enough to handle this. I'm not sure at all why we
> think anything else is necessary.
>
By this i mean we don't actually perform real PRE for loads. It's a
known
deficiency and one that does not require any duplicate heuristics to solve.
Right now we perform some availability based removal that happens be
willing to do an insertion here or there to try to move a load in some
cases (it will only do a single insertion).  It's not also based on real
availability, but on finding nearest dependencies of loads and squinting at
them (which is not even really the same as finding where a load stops being
available) funny.

It's also not real PRE, just some code that is willing to insert one load
to move a load.

If you want this case to work, what you need is a real Load PRE
implementation, not one based on simple memdep based load moving.

Trying to hack this one up with profile guided code duplication, etc, to
get it to kinda sorta work seems to me to be a bad idea.

> It's certainly not requiring special code duplication heuristics, etc.
>
> So either you are thinking of a case that is different from the above, or
> I am seriously confused :)
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150112/90307f6f/attachment.html>

Philip Reames

2015-Jan-12 17:48 UTC

head link

[LLVMdev] Separating loop nests based on profile information?

On 01/11/2015 10:17 PM, Daniel Berlin wrote:>
>
> On Thu, Jan 8, 2015 at 3:33 PM, Philip Reames 
> <listmail at philipreames.com <mailto:listmail at
philipreames.com>> wrote:
>
>
>     On 01/07/2015 05:33 PM, Chandler Carruth wrote:
>
>         How does this compare with classical approaches of loop
>         peeling, partitioning, fission, or whatever you might call it?
>
>     I'm still developing a good sense for this, but let me lay out
>     some observations.  Fair warning, this is going to be long and
>     rambling.
>
>     Let's start with a toy example:
>     while(c) {
>       x = this->x;
>       y = this->y;
>       if (x == y) {
>         rare();
>       }
>     }
>
>
>     If we could tell x and y were loop invariant, we could unswitch
>     this loop.  However, the rare call clobbers our view of memory, so
>     LICM fails, and thus unswitch fails.
>
>
>     We'd like to apply PRE here and push the reload into the loop
>     preheader and the rare case.  This currently fails for two
>     reasons: 1) We don't allow code duplication during PRE, 
>
>
> ?????
> If we don't, we aren't doing real PRE. So i'm not entirely sure
what
> you mean here.As I think you comment elsewhere in your response, we currently do not 
duplicate loads; we only move them.  The current structure is based 
around finding a legal point which doesn't introduce a load along any 
path where one didn't exist previously.  If we have a path which has two 
copies of the load, we try to find a place to move one of them such that 
all paths still have the available load and we've removed the extra load 
along the one path.

(Note: The above explanation is rough and does not parallel how the code 
is organized.)

>
>     and 2) the canonical loop-simplify form of having a single latch
>     means that PRE doesn't actually see the rare block at all, it sees
>     the preheader and the join point after the if block.
>
>
>
>
>     I think both of the previous are fixable:
>
>
>
> GCC's PRE already does the above.
> It doesn't do profile guided duplication.
> We aren't doing anything special with these blocks.
>
> here is the code I used to test:
>
> struct a{
> int x;
> int y;
> };
> extern void rare();
> int mai(int c, struct a *this)
> {
> int d = 0;
> while(c) {
> int x = this->x;
> int y = this->y;
> d += x + y;
> if (x == y) {
> rare();
> }
> }
> return d;
> }
>
>
> It will do exactly what you expect, it is transformed into:
>
> struct a{
> int x;
> int y;
> };
> extern void rare();
> int mai(int c, struct a *this)
> {
> int d = 0;
>         int pretemp1 = this->x
>         int pretemp2 = this->y
> while(c) {
>                 pretemp1phi = phi(rare block pretemp1, preheader 
> pretemp1).
>                 pretemp2phi = phi(rare block pretemp2, preheader pretemp2)
>
> d += pretemp1phi + pretemp2phi
> if (x == y) {
> rare();
>                         pretemp1 = this->x;
>                         pretemp2 = this->y;
>
> }
> }
> return d;
> }
> I don't see why profile guided duplication is necessary here. This is 
> a basic load PRE case.  It is handled by the first version of 
> GVN-based Load PRE I wrote for GCC.  It is always a win.Well, to be pedantic, it is not always profitable.  If this loop runs 
exactly one iteration, this is both a code size pessimization and 
possibly a runtime execution pessimization.  While in this particular 
case, you probably don't care, I'd worry that an algorithm which gets 
this might also have other pessimization cases which are more worrying.

However, I think I've actually come to the same underlying conclusion.  
This is a case that our PRE algorithm should handle without needing to 
resort to profiling data.

I have to admit that I am unfamiliar with the GCC implementation. Could 
you outline the algorithm and it's important concepts?  (links are fine too)
>
> Looking at what LLVM does, the failing on the PRE side is that our 
> PRE/GVN models are not strong enough to handle this. I'm not sure at 
> all why we think anything else is necessary.  It's certainly not 
> requiring special code duplication heuristics, etc.
>
> So either you are thinking of a case that is different from the above, 
> or I am seriously confused :)Well, I think we've gotten to the same point here.  An improved PRE 
implementation should handle most of the cases I've actually 
encountered.  Having said that, I'm not yet willing to say that the 
profile guided loop splitting isn't *also* worth exploring.  From a 
practical perspective, a lot of our existing optimizations are 
structured on loops.  Letting those kick in without falling back to 
(expensive) GVN might be worthwhile from a practical perspective. This 
is as much a pass ordering problem as anything else.

p.s. I've started to post patches which improve the results given by our 
current PRE to the llvm-commits list.  So far, it's just simple stuff, 
but that's the direction I'm current working in.  I'm going to defer
looking into the loop nest splitting until I've pushed PRE as far as I 
easily can.

Philip

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150112/32ba8923/attachment.html>

Xinliang David Li

2015-Jan-13 04:28 UTC

head link

[LLVMdev] Separating loop nests based on profile information?

On Thu, Jan 8, 2015 at 12:33 PM, Philip Reames <listmail at
philipreames.com>
wrote:
>
> On 01/07/2015 05:33 PM, Chandler Carruth wrote:
>
>> How does this compare with classical approaches of loop peeling,
>> partitioning, fission, or whatever you might call it?
>>
> I'm still developing a good sense for this, but let me lay out some
> observations.  Fair warning, this is going to be long and rambling.
>
> Let's start with a toy example:
> while(c) {
>   x = this->x;
>   y = this->y;
>   if (x == y) {
>     rare();
>   }
> }
>
>With profile info, speculative PRE can be performed for memory access that
are not downsafe:

temp_c = c;
if (temp_c)
{
    x = this->x;
    y = this->y;
}
while (temp_c) {
   if (x == y)
     {
          rare();
          x = this->x;
          y = this->y;
      }
   temp_c = c;
}

If you can prove this->x etc are safe control speculative (e.g, seen a
dominating memory access with the same base and offset), it can be

x = this->x;
y = this->y;
while (c) {
   if (x == y) {
      rare();
      x = this->x;
      y = this->y;
    }
 }



> If we could tell x and y were loop invariant, we could unswitch this
> loop.  However, the rare call clobbers our view of memory, so LICM fails,
> and thus unswitch fails.
>
> We'd like to apply PRE here and push the reload into the loop preheader
> and the rare case.  This currently fails for two reasons: 1) We don't
allow
> code duplication during PRE, and 2) the canonical loop-simplify form of
> having a single latch means that PRE doesn't actually see the rare
block at
> all, it sees the preheader and the join point after the if block.
>
> I think both of the previous are fixable:
> 1) This calls for profile guided duplication.  If we can trade one hot
> load for two cold ones, that's great.  Also, there might be static
> heuristics we could construct.  (e.g. if it's not available in the
> preheader and one of many latches...)
> 2) We need PRE to look back through merge blocks.  This should possibly
> even live in MemoryDependenceAnalysis.  Alternatively, we could create a
> "deconstruct loop simplify form" pass to run right before GVN to
side step
> this issue.
>
> Neither is entirely simple to implement, but both are doable. Probably.
>
> In theory, we've now solved the same problem my loop nest transform
does.
> There's two snags: one minor(ish), one major.
>
> The minor one is that many of my rare conditions involve volatile loads.
> MemDepAnalysis currently completely gives up when faced with volatile
> loads.  This is mostly fixable, but given the structure of the code may be
> fairly invasive.  Oddly, LICM actually performs better in this case.
>
> For the major one, let's tweak our example slightly:
> while(c) {
>   x = this->x;
>   if (x == null) return;
>   y = this->y;
>   if (y == null) return;
>   if (x == y) {
>     rare();
>   }
> }
>
> Let's assume that PRE worked as designed for 'x'.  We now need
to perform
> a transform analogous to loop unswitch, but a bit different.  We need to
> move the condition loop exits into each of the predecessors where the
> answers not available.  I don't believe we have anything like this
today,
> and it involves a non-trivial bit of code.  Even if we get that done,
we're
> left with a pass ordering problem where we need to run PRE, then
> branch-PRE, then PRE, then branch-PRE, etc..  (In practice, these chains
> can be quite long due to null checks and range checks.)
>
> If we'd split this into a loop nest structure, we'd still have the
chain,
> but a) that's easier to control with a custom pass order since LICM and
> LoopUnswitch are far cheaper than GVN/PRE and b) having LICM do a bit of
> trivial loop unswitch for the terminator of the header appears quite
> tractable.
>

if (c)
{
   x = this->x;
   if (!x) return;
   y = this->y;
   if (!y) return;
}
while (c) {
  if (x == y) {
      rare();
      if (!x) return;
      if (!y) return;
   }
}

The 'branch PRE' (hoisting, sinking, assertion prop and branch
elimination)
part is a little tricky to do though.

David


>
> One other point worth mentioning is that LLVM does not currently have a
> loop peeling pass.  LoopRotate seems to get us most of the way there in
> common cases, but not always.  If we did peel the above loop once, we'd
> have the same merge block problem I mentioned previously (this time in the
> preheader *and* the latch), but the profitability becomes slightly more
> clear without the need for profile information or loop specific heuristics.
>
> To be clear, I'm not opposed to fixing some of the issues above.  I
think
> that's a great idea, I'm just trying to find my way through to
practical
> performance improvements with minimal invested effort.  I probably will in
> fact fix some of these since they show up in a number of other cases as
> well.
>
>
> Ok, so that's that.  Let's move on to why my proposed loop nest
separation
> change is a bit scary.
>
> While the loop nest does allow things to be lifted from the inner loop to
> the outer loop - without any of the complex PRE or PGO reasoning required
> above - it suffers from all of the limitations of the existing loop
> optimization passes.  In particular, LICM may not actually be able to do
> much if the inner loop is still complicated enough to defeat either
it's
> fairly simple alias analysis or fault analysis.  GVN/PRE is strictly more
> powerful - in theory if not always practice - here than what we have in
> LICM.
>
> We've also potentially created a loop nest which is harder to analyse
than
> the original loop with two latches.  Consider this example:
> for(int i = 0; i < 20000) {
>    i++;
>    if(i % 20 != 0) continue;
>    a[i] = 0;
> }
> (This is not a real example of anything useful.  The important point is
> the early return is the common case.)
>
> This loop clearly runs exactly 20000 times.  ScalarEvolution almost
> certainly gets this right. (I haven't checked.)
>
> After separation, we'd have this:
>
> int i = 0;
> while(true) {
>   while(i < 20000) {
>      i++;
>      if(i % 20 == 0) { // now a loop exit
>        a[i] = 0;
>        goto next:
>      }
> }
> break:
>   next:
> }
>
> It's not clear to me that SE will always be able to tell that both the
> inner and outer loop runs a maximum of 20000 times.  (In this particular
> case it might actually optimize the inner loop substantially and thus get a
> more precise answer, but let's ignore that.)  We could potentially
loose
> information about the loop by doing this transformation.  Of course, this
> concern applies to the *existing* loop nest separation heuristics too.  I
> don't have any particular reason to believe one heuristic is worse than
the
> other, but I also have no reason not to believe that either.
>
> As a final point, the inner loop now has two exiting blocks not one.  I
> know at least some of our loop transforms give up on anything with more
> than one loop exit.  I'm less concerned by that since a) we should fix
> them, and b) the cases I've looked at don't look too hard.
>
>
> And that's everything that occurs to me at the moment.  I'll
probably have
> to append in further responses as I remember more details, I know I'm
> forgetting some pieces.
>
> Philip
>
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150112/227d39a3/attachment.html>

Philip Reames

2015-Jan-20 00:44 UTC

head link

[LLVMdev] Separating loop nests based on profile information?

(Splitting this out from "Re: [LLVMdev] Separating loop nests based on 
profile information?" since it's now a fairly different topic.)

On 01/09/2015 11:39 AM, Philip Reames wrote:> On 01/08/2015 12:33 PM, Philip Reames wrote:
>> 2) We need PRE to look back through merge blocks.  This should 
>> possibly even live in MemoryDependenceAnalysis.  Alternatively, we 
>> could create a "deconstruct loop simplify form" pass to run
right
>> before GVN to side step this issue.
> Point 2 here may be inaccurate.  I've definitely seen problems here in 
> the past, but when I sat down to write some test cases, it looks like 
> the simple merge block for latch (with the load being considered in 
> the loop header) works just fine.  I suspect my original example is 
> actually blocked by something else; I'm just not sure what that is yet.I went back and looked.  One - of possibly several - cases that were 
giving me trouble was when we had a loop with multiple latches before 
loop-simplify, loop simplify creates a unified latch, and we had not 
peeled the first iteration of the loop when we get to PRE. This creates 
a tree of merges that PRE doesn't deal with.

I took a stab at addressing this with a PRE patch.  It can be found 
here: http://reviews.llvm.org/D7061

As I should have expected, it turned out to be quite a bit more involved 
that it first looked.  :)

Philip

Philip Reames

2015-Jan-20 00:45 UTC

head link

[LLVMdev] Load PRE in loops with 'mostly' invariant loads

(Splitting this out from "Re: [LLVMdev] Separating loop nests based on 
profile information?" since it's now a fairly different topic.)

On 01/09/2015 11:39 AM, Philip Reames wrote:> On 01/08/2015 12:33 PM, Philip Reames wrote:
>> 2) We need PRE to look back through merge blocks.  This should 
>> possibly even live in MemoryDependenceAnalysis.  Alternatively, we 
>> could create a "deconstruct loop simplify form" pass to run
right
>> before GVN to side step this issue.
> Point 2 here may be inaccurate.  I've definitely seen problems here in 
> the past, but when I sat down to write some test cases, it looks like 
> the simple merge block for latch (with the load being considered in 
> the loop header) works just fine.  I suspect my original example is 
> actually blocked by something else; I'm just not sure what that is yet.I went back and looked.  One - of possibly several - cases that were 
giving me trouble was when we had a loop with multiple latches before 
loop-simplify, loop simplify creates a unified latch, and we had not 
peeled the first iteration of the loop when we get to PRE. This creates 
a tree of merges that PRE doesn't deal with.

I took a stab at addressing this with a PRE patch.  It can be found 
here: http://reviews.llvm.org/D7061.  Comments and discussion welcome.

As I should have expected, it turned out to be quite a bit more involved 
that it first looked.  :)

Philip

Philip Reames

2015-Jan-20 01:55 UTC

head link

[LLVMdev] Separating loop nests based on profile information?

On 01/12/2015 08:28 PM, Xinliang David Li wrote:>
>
> On Thu, Jan 8, 2015 at 12:33 PM, Philip Reames 
> <listmail at philipreames.com <mailto:listmail at
philipreames.com>> wrote:
>
>
>     Let's start with a toy example:
>     while(c) {
>       x = this->x;
>       y = this->y;
>       if (x == y) {
>         rare();
>       }
>     }
>
>
> With profile info, speculative PRE can be performed for memory access 
> that are not downsafe:Could you define the term "downsafe"?  I think I know what you mean,
but
just to be clear.>
> temp_c = c;
> if (temp_c)
> {
>     x = this->x;
>     y = this->y;
> }
> while (temp_c) {
>    if (x == y)
>      {
>           rare();
>           x = this->x;
>           y = this->y;
>       }
>    temp_c = c;
> }
>
> If you can prove this->x etc are safe control speculative (e.g, seen a 
> dominating memory access with the same base and offset), it can be
>
> x = this->x;
> y = this->y;
> while (c) {
>    if (x == y) {
>       rare();
>       x = this->x;
>       y = this->y;
>     }
>  }Yep.  In LLVM, this basically requires that this->FIELD be known 
deferenceable.  I filled one simple bug for this here: 
http://llvm.org/bugs/show_bug.cgi?id=22266

I've also looked a more general rewrite of our PRE algorithm when a 
pointer is known dereferenceable, but I haven't figured out to judge 
profitability accurately (yet).  The general approach was to just take 
the cut provided by MDA, apply "obvious" improvements (i.e. local
merge
points), the insert loads in all the unavailable blocks if it looked 
profitable.

The alternate approach did have the appeal of being "easy" in cases 
where our current approach (work backwards from load) is "hard".

One challenge is in making sure the two algorithms together generate a 
stable final form.  :)

That last part is where I stalled.  I'm trying to see what else I can do 
with simpler things before returning to that approach.

Nick Lewicky also pointed out that PHITranslation is problematic in the 
current code.  Oddly, I've never seen that in practice.  We clearly have 
different workloads, but I haven't figured out which are the important 
different properties yet.

>
>
>
>     If we'd split this into a loop nest structure, we'd still have
the
>     chain, but a) that's easier to control with a custom pass order
>     since LICM and LoopUnswitch are far cheaper than GVN/PRE and b)
>     having LICM do a bit of trivial loop unswitch for the terminator
>     of the header appears quite tractable.
>
>
>
> if (c)
> {
>    x = this->x;
>    if (!x) return;
>    y = this->y;
>    if (!y) return;
> }
> while (c) {
>   if (x == y) {
>       rare();
>       if (!x) return;
>       if (!y) return;
>    }
> }
>
> The 'branch PRE' (hoisting, sinking, assertion prop and branch 
> elimination) part is a little tricky to do though.I think "a little tricky" might be an understatement here.

At least to start with, I'd probably try to handle simple cases. Even 
just doing trivial loop unswitch in the loop with LoadPRE would unlock a 
lot.  (Right now, I'm just running LICM and Unswitch in a loop.  It's 
not that expensive, gets a lot of cases, but doesn't get everything.)

Philip
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150119/de54e231/attachment.html>

llvm dev - Jan 2015 - [LLVMdev] Separating loop nests based on profile information?

[LLVMdev] Separating loop nests based on profile information?

[LLVMdev] Separating loop nests based on profile information?

[LLVMdev] Separating loop nests based on profile information?

[LLVMdev] Separating loop nests based on profile information?

[LLVMdev] Separating loop nests based on profile information?

[LLVMdev] Separating loop nests based on profile information?

[LLVMdev] Separating loop nests based on profile information?

[LLVMdev] Load PRE in loops with 'mostly' invariant loads

[LLVMdev] Separating loop nests based on profile information?