thr3ads.net - llvm dev - [LLVMdev] [LSR] hoisting loop invariants in reverse order [May 2015]

If this information is useful, please help other people find it:
Share via:

Jingyue Wu

2015-May-09 05:22 UTC

[LLVMdev] [LSR] hoisting loop invariants in reverse order

Hi,

I was tracking down a performance regression and noticed that
LoopStrengthReduce hoists loop invariants (e.g., the initial formulae of
indvars) in the reverse order of how they appear in the loop.

This reverse order creates troubles for the StraightLineStrengthReduce pass
I recently add. While I understand ultimately SLSR should be able to sort
independent candidates in an optimal order, does it make sense to
canonicalizing the order of LSR-hoisted loop invariants back to the
"natural" order? IMO, the optimized code should in general resemble
the
original code unless intentionally changed otherwise.

More specifically, I ran LSR on

void foo(float *input, int a, int b, int c, int n) {
  for (int node_x = 0; node_x < n; ++node_x) {
    int pixel_idx = (a + node_x) * b; // {a*b, +, b}
    bar(pixel_idx * c); // {a*b*c, +, b*c}
    bar((pixel_idx + 1) * c); // {(a*b+1)*c, +, b*c}
    bar((pixel_idx + 2) * c); // {(a*b+2)*c, +, b*c}
    bar((pixel_idx + 3) * c); // {(a*b+3)*c, +, b*c}
  }
}

and LSR produced

void foo(float *input, int a, int b, int c, int n) {
*  int arg3 = (a * b + 3) * c;*
*  int arg2 = (a * b + 2) * c;*
*  int arg1 = (a * b + 1) * c;*
*  int arg0 = a * b * c;*
  for (int node_x = 0; node_x < n; ++node_x) {
    bar(arg0);
    bar(arg1);
    bar(arg2);
    bar(arg3);
    arg0 += b * c;
    arg1 += b * c;
    arg2 += b * c;
    arg3 += b * c;
  }
}
(with obvious redundant operations, i.e. a * b and b * c, combined). Note
that the order of arg0~3 is reversed.

Reversing the order of arg0~3 is not intentional. The user list of
pixel_idx happens to have pixel_idx+3, pixel_idx+2, and pixel_idx+1 in this
order, so LSR simply follows this order when collecting the LSRFixups.

This creates troubles for SLSR. Given the current order of arg0~arg3

  int arg3 = (a * b + 3) * c;
  int arg2 = (a * b + 2) * c;
  int arg1 = (a * b + 1) * c;
  int arg0 = a * b * c;

SLSR optimizes them to

  int arg3 = (a * b + 3) * c;
  int arg2 = arg3 - c;
  int arg1 = arg2 - c;
  int arg0 = arg1 - c;
  // 2 muls and 4 adds/subs

In contrast, if arg0~3 were in the order of

  int arg0 = a * b * c;
  int arg1 = (a * b + 1) * c;
  int arg2 = (a * b + 2) * c;
  int arg3 = (a * b + 3) * c;

SLSR would optimize them to

  int arg0 = a * b * c;
  int arg1 = arg0 + c;
  int arg2 = arg1 + c;
  int arg3 = arg2 + c;
  // 2 muls and 3 adds/subs. 1 add/sub less than with the reversed order

I have a proof-of-concept patch (
http://reviews.llvm.org/differential/diff/25402/) that has
CollectFixupsAndInitialFormulae to sort initial formulae in a dominance
order (i.e. if A.getUser() dominates B.getUser(), then we put A before B).
It breaks some tests that are too sensitive to order changes; besides that,
I don't see significant issues.

Jingyue
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150508/112a0bef/attachment.html>

Jingyue Wu

2015-May-18 17:04 UTC

head link

[LLVMdev] [LSR] hoisting loop invariants in reverse order

Thoughts?

On Fri, May 8, 2015 at 10:22 PM, Jingyue Wu <jingyue at google.com> wrote:
> Hi,
>
> I was tracking down a performance regression and noticed that
> LoopStrengthReduce hoists loop invariants (e.g., the initial formulae of
> indvars) in the reverse order of how they appear in the loop.
>
> This reverse order creates troubles for the StraightLineStrengthReduce
> pass I recently add. While I understand ultimately SLSR should be able to
> sort independent candidates in an optimal order, does it make sense to
> canonicalizing the order of LSR-hoisted loop invariants back to the
> "natural" order? IMO, the optimized code should in general
resemble the
> original code unless intentionally changed otherwise.
>
> More specifically, I ran LSR on
>
> void foo(float *input, int a, int b, int c, int n) {
>   for (int node_x = 0; node_x < n; ++node_x) {
>     int pixel_idx = (a + node_x) * b; // {a*b, +, b}
>     bar(pixel_idx * c); // {a*b*c, +, b*c}
>     bar((pixel_idx + 1) * c); // {(a*b+1)*c, +, b*c}
>     bar((pixel_idx + 2) * c); // {(a*b+2)*c, +, b*c}
>     bar((pixel_idx + 3) * c); // {(a*b+3)*c, +, b*c}
>   }
> }
>
> and LSR produced
>
> void foo(float *input, int a, int b, int c, int n) {
> *  int arg3 = (a * b + 3) * c;*
> *  int arg2 = (a * b + 2) * c;*
> *  int arg1 = (a * b + 1) * c;*
> *  int arg0 = a * b * c;*
>   for (int node_x = 0; node_x < n; ++node_x) {
>     bar(arg0);
>     bar(arg1);
>     bar(arg2);
>     bar(arg3);
>     arg0 += b * c;
>     arg1 += b * c;
>     arg2 += b * c;
>     arg3 += b * c;
>   }
> }
> (with obvious redundant operations, i.e. a * b and b * c, combined). Note
> that the order of arg0~3 is reversed.
>
> Reversing the order of arg0~3 is not intentional. The user list of
> pixel_idx happens to have pixel_idx+3, pixel_idx+2, and pixel_idx+1 in this
> order, so LSR simply follows this order when collecting the LSRFixups.
>
> This creates troubles for SLSR. Given the current order of arg0~arg3
>
>   int arg3 = (a * b + 3) * c;
>   int arg2 = (a * b + 2) * c;
>   int arg1 = (a * b + 1) * c;
>   int arg0 = a * b * c;
>
> SLSR optimizes them to
>
>   int arg3 = (a * b + 3) * c;
>   int arg2 = arg3 - c;
>   int arg1 = arg2 - c;
>   int arg0 = arg1 - c;
>   // 2 muls and 4 adds/subs
>
> In contrast, if arg0~3 were in the order of
>
>   int arg0 = a * b * c;
>   int arg1 = (a * b + 1) * c;
>   int arg2 = (a * b + 2) * c;
>   int arg3 = (a * b + 3) * c;
>
> SLSR would optimize them to
>
>   int arg0 = a * b * c;
>   int arg1 = arg0 + c;
>   int arg2 = arg1 + c;
>   int arg3 = arg2 + c;
>   // 2 muls and 3 adds/subs. 1 add/sub less than with the reversed order
>
> I have a proof-of-concept patch (
> http://reviews.llvm.org/differential/diff/25402/) that has
> CollectFixupsAndInitialFormulae to sort initial formulae in a dominance
> order (i.e. if A.getUser() dominates B.getUser(), then we put A before B).
> It breaks some tests that are too sensitive to order changes; besides that,
> I don't see significant issues.
>
> Jingyue
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150518/9261a76c/attachment.html>

Daniel Berlin

2015-May-18 17:10 UTC

head link

[LLVMdev] [LSR] hoisting loop invariants in reverse order

On Fri, May 8, 2015 at 10:22 PM, Jingyue Wu <jingyue at google.com>
wrote:> Hi,
>
> I was tracking down a performance regression and noticed that
> LoopStrengthReduce hoists loop invariants (e.g., the initial formulae of
> indvars) in the reverse order of how they appear in the loop.It has to to get maximized hoisting.
>
> This reverse order creates troubles for the StraightLineStrengthReduce pass
> I recently add. While I understand ultimately SLSR should be able to sort
> independent candidates in an optimal order, does it make sense to
> canonicalizing the order of LSR-hoisted loop invariants back to the
> "natural" order? IMO, the optimized code should in general
resemble the
> original code unless intentionally changed otherwise.
This is likely a mistake, unless i'm missing something.
I suspect the insertion point is set to the default after, so that it
ends up reversing the order, instead of before, so that it ends up in
the original order.
>
> More specifically, I ran LSR on
>
> void foo(float *input, int a, int b, int c, int n) {
>   for (int node_x = 0; node_x < n; ++node_x) {
>     int pixel_idx = (a + node_x) * b; // {a*b, +, b}
>     bar(pixel_idx * c); // {a*b*c, +, b*c}
>     bar((pixel_idx + 1) * c); // {(a*b+1)*c, +, b*c}
>     bar((pixel_idx + 2) * c); // {(a*b+2)*c, +, b*c}
>     bar((pixel_idx + 3) * c); // {(a*b+3)*c, +, b*c}
>   }
> }
>
> and LSR produced
>
> void foo(float *input, int a, int b, int c, int n) {
>   int arg3 = (a * b + 3) * c;
>   int arg2 = (a * b + 2) * c;
>   int arg1 = (a * b + 1) * c;
>   int arg0 = a * b * c;
>   for (int node_x = 0; node_x < n; ++node_x) {
>     bar(arg0);
>     bar(arg1);
>     bar(arg2);
>     bar(arg3);
>     arg0 += b * c;
>     arg1 += b * c;
>     arg2 += b * c;
>     arg3 += b * c;
>   }
> }
> (with obvious redundant operations, i.e. a * b and b * c, combined). Note
> that the order of arg0~3 is reversed.
>
> Reversing the order of arg0~3 is not intentional. The user list of
pixel_idx
> happens to have pixel_idx+3, pixel_idx+2, and pixel_idx+1 in this order, so
> LSR simply follows this order when collecting the LSRFixups.
>
> This creates troubles for SLSR. Given the current order of arg0~arg3
>
>   int arg3 = (a * b + 3) * c;
>   int arg2 = (a * b + 2) * c;
>   int arg1 = (a * b + 1) * c;
>   int arg0 = a * b * c;
>
> SLSR optimizes them to
>
>   int arg3 = (a * b + 3) * c;
>   int arg2 = arg3 - c;
>   int arg1 = arg2 - c;
>   int arg0 = arg1 - c;
>   // 2 muls and 4 adds/subs
>
> In contrast, if arg0~3 were in the order of
>
>   int arg0 = a * b * c;
>   int arg1 = (a * b + 1) * c;
>   int arg2 = (a * b + 2) * c;
>   int arg3 = (a * b + 3) * c;
>
> SLSR would optimize them to
>
>   int arg0 = a * b * c;
>   int arg1 = arg0 + c;
>   int arg2 = arg1 + c;
>   int arg3 = arg2 + c;
>   // 2 muls and 3 adds/subs. 1 add/sub less than with the reversed order
>
> I have a proof-of-concept patch
> (http://reviews.llvm.org/differential/diff/25402/) that has
> CollectFixupsAndInitialFormulae to sort initial formulae in a dominance
> order (i.e. if A.getUser() dominates B.getUser(), then we put A before B).
Is there a reason to not just put it back in the original order?
IE is dominance order better?

Andrew Trick

2015-May-18 17:13 UTC

head link

[LLVMdev] [LSR] hoisting loop invariants in reverse order

> On May 8, 2015, at 10:22 PM, Jingyue Wu <jingyue at google.com>
wrote:
> 
> Hi, 
> 
> I was tracking down a performance regression and noticed that
LoopStrengthReduce hoists loop invariants (e.g., the initial formulae of
indvars) in the reverse order of how they appear in the loop.
> 
> This reverse order creates troubles for the StraightLineStrengthReduce pass
I recently add. While I understand ultimately SLSR should be able to sort
independent candidates in an optimal order, does it make sense to canonicalizing
the order of LSR-hoisted loop invariants back to the "natural" order?
IMO, the optimized code should in general resemble the original code unless
intentionally changed otherwise.
I dislike passes that shuffle code for no reason, so regardless of SLSR I agree
that LSR should preserve the order of IV users if it’s not too hard. I’m
guessing this is just LSR sloppiness.

Andy
> 
> More specifically, I ran LSR on
> 
> void foo(float *input, int a, int b, int c, int n) {
>   for (int node_x = 0; node_x < n; ++node_x) {
>     int pixel_idx = (a + node_x) * b; // {a*b, +, b}
>     bar(pixel_idx * c); // {a*b*c, +, b*c}
>     bar((pixel_idx + 1) * c); // {(a*b+1)*c, +, b*c}
>     bar((pixel_idx + 2) * c); // {(a*b+2)*c, +, b*c}
>     bar((pixel_idx + 3) * c); // {(a*b+3)*c, +, b*c}
>   }
> }
> 
> and LSR produced
> 
> void foo(float *input, int a, int b, int c, int n) {
>   int arg3 = (a * b + 3) * c;
>   int arg2 = (a * b + 2) * c;
>   int arg1 = (a * b + 1) * c;
>   int arg0 = a * b * c;
>   for (int node_x = 0; node_x < n; ++node_x) {
>     bar(arg0);
>     bar(arg1);
>     bar(arg2);
>     bar(arg3);
>     arg0 += b * c;
>     arg1 += b * c;
>     arg2 += b * c;
>     arg3 += b * c;
>   }
> }
> (with obvious redundant operations, i.e. a * b and b * c, combined). Note
that the order of arg0~3 is reversed.
> 
> Reversing the order of arg0~3 is not intentional. The user list of
pixel_idx happens to have pixel_idx+3, pixel_idx+2, and pixel_idx+1 in this
order, so LSR simply follows this order when collecting the LSRFixups.
> 
> This creates troubles for SLSR. Given the current order of arg0~arg3
> 
>   int arg3 = (a * b + 3) * c;
>   int arg2 = (a * b + 2) * c;
>   int arg1 = (a * b + 1) * c;
>   int arg0 = a * b * c;
> 
> SLSR optimizes them to
> 
>   int arg3 = (a * b + 3) * c;
>   int arg2 = arg3 - c;
>   int arg1 = arg2 - c;
>   int arg0 = arg1 - c;
>   // 2 muls and 4 adds/subs
> 
> In contrast, if arg0~3 were in the order of 
> 
>   int arg0 = a * b * c;
>   int arg1 = (a * b + 1) * c;
>   int arg2 = (a * b + 2) * c;
>   int arg3 = (a * b + 3) * c;
> 
> SLSR would optimize them to
> 
>   int arg0 = a * b * c;
>   int arg1 = arg0 + c;
>   int arg2 = arg1 + c;
>   int arg3 = arg2 + c;
>   // 2 muls and 3 adds/subs. 1 add/sub less than with the reversed order
> 
> I have a proof-of-concept patch
(http://reviews.llvm.org/differential/diff/25402/
<http://reviews.llvm.org/differential/diff/25402/>) that has
CollectFixupsAndInitialFormulae to sort initial formulae in a dominance order
(i.e. if A.getUser() dominates B.getUser(), then we put A before B). It breaks
some tests that are too sensitive to order changes; besides that, I don't
see significant issues.
> 
> Jingyue
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150518/17f2a6d1/attachment.html>

Jingyue Wu

2015-May-18 17:17 UTC

head link

[LLVMdev] [LSR] hoisting loop invariants in reverse order

It's not caused by "the insertion point is set to the default
after".

I should mention the reason somewhere earlier. "Reversing the order of
arg0~3 is not intentional. The user list of pixel_idx happens to have
pixel_idx+3, pixel_idx+2, and pixel_idx+1 in this order, so LSR simply
follows this order when collecting the LSRFixups."

I'm not an expert on uselist orders, but after skimming Duncan Smith's
recent work on preserving uselist orders in assembly, these orders are
deterministic but arbitrary. So blindly following these orders sometimes
cause funny behavior (as in my example).

On Mon, May 18, 2015 at 10:10 AM, Daniel Berlin <dberlin at dberlin.org>
wrote:
> On Fri, May 8, 2015 at 10:22 PM, Jingyue Wu <jingyue at google.com>
wrote:
> > Hi,
> >
> > I was tracking down a performance regression and noticed that
> > LoopStrengthReduce hoists loop invariants (e.g., the initial formulae
of
> > indvars) in the reverse order of how they appear in the loop.
> It has to to get maximized hoisting.
>
> >
> > This reverse order creates troubles for the StraightLineStrengthReduce
> pass
> > I recently add. While I understand ultimately SLSR should be able to
sort
> > independent candidates in an optimal order, does it make sense to
> > canonicalizing the order of LSR-hoisted loop invariants back to the
> > "natural" order? IMO, the optimized code should in general
resemble the
> > original code unless intentionally changed otherwise.
>
> This is likely a mistake, unless i'm missing something.
> I suspect the insertion point is set to the default after, so that it
> ends up reversing the order, instead of before, so that it ends up in
> the original order.
>
> >
> > More specifically, I ran LSR on
> >
> > void foo(float *input, int a, int b, int c, int n) {
> >   for (int node_x = 0; node_x < n; ++node_x) {
> >     int pixel_idx = (a + node_x) * b; // {a*b, +, b}
> >     bar(pixel_idx * c); // {a*b*c, +, b*c}
> >     bar((pixel_idx + 1) * c); // {(a*b+1)*c, +, b*c}
> >     bar((pixel_idx + 2) * c); // {(a*b+2)*c, +, b*c}
> >     bar((pixel_idx + 3) * c); // {(a*b+3)*c, +, b*c}
> >   }
> > }
> >
> > and LSR produced
> >
> > void foo(float *input, int a, int b, int c, int n) {
> >   int arg3 = (a * b + 3) * c;
> >   int arg2 = (a * b + 2) * c;
> >   int arg1 = (a * b + 1) * c;
> >   int arg0 = a * b * c;
> >   for (int node_x = 0; node_x < n; ++node_x) {
> >     bar(arg0);
> >     bar(arg1);
> >     bar(arg2);
> >     bar(arg3);
> >     arg0 += b * c;
> >     arg1 += b * c;
> >     arg2 += b * c;
> >     arg3 += b * c;
> >   }
> > }
> > (with obvious redundant operations, i.e. a * b and b * c, combined).
Note
> > that the order of arg0~3 is reversed.
> >
> > Reversing the order of arg0~3 is not intentional. The user list of
> pixel_idx
> > happens to have pixel_idx+3, pixel_idx+2, and pixel_idx+1 in this
order,
> so
> > LSR simply follows this order when collecting the LSRFixups.
> >
> > This creates troubles for SLSR. Given the current order of arg0~arg3
> >
> >   int arg3 = (a * b + 3) * c;
> >   int arg2 = (a * b + 2) * c;
> >   int arg1 = (a * b + 1) * c;
> >   int arg0 = a * b * c;
> >
> > SLSR optimizes them to
> >
> >   int arg3 = (a * b + 3) * c;
> >   int arg2 = arg3 - c;
> >   int arg1 = arg2 - c;
> >   int arg0 = arg1 - c;
> >   // 2 muls and 4 adds/subs
> >
> > In contrast, if arg0~3 were in the order of
> >
> >   int arg0 = a * b * c;
> >   int arg1 = (a * b + 1) * c;
> >   int arg2 = (a * b + 2) * c;
> >   int arg3 = (a * b + 3) * c;
> >
> > SLSR would optimize them to
> >
> >   int arg0 = a * b * c;
> >   int arg1 = arg0 + c;
> >   int arg2 = arg1 + c;
> >   int arg3 = arg2 + c;
> >   // 2 muls and 3 adds/subs. 1 add/sub less than with the reversed
order
> >
> > I have a proof-of-concept patch
> > (http://reviews.llvm.org/differential/diff/25402/) that has
> > CollectFixupsAndInitialFormulae to sort initial formulae in a
dominance
> > order (i.e. if A.getUser() dominates B.getUser(), then we put A before
> B).
>
> Is there a reason to not just put it back in the original order?
> IE is dominance order better?
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150518/67378805/attachment.html>

Possibly Parallel Threads

Search for more maybe matching threads

llvm dev - May 2015 - [LLVMdev] [LSR] hoisting loop invariants in reverse order

[LLVMdev] [LSR] hoisting loop invariants in reverse order

[LLVMdev] [LSR] hoisting loop invariants in reverse order

[LLVMdev] [LSR] hoisting loop invariants in reverse order

[LLVMdev] [LSR] hoisting loop invariants in reverse order

[LLVMdev] [LSR] hoisting loop invariants in reverse order

Possibly Parallel Threads