thr3ads.net - llvm dev - [llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines [Jan 2017]

If this information is useful, please help other people find it:
Share via:

Mehdi Amini via llvm-dev

2017-Jan-24 16:24 UTC

[llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines

> On Jan 24, 2017, at 7:18 AM, Sanjay Patel <spatel at rotateright.com>
wrote:
> 
> 
> 
> On Mon, Jan 23, 2017 at 10:53 PM, Mehdi Amini <mehdi.amini at apple.com
<mailto:mehdi.amini at apple.com>> wrote:
> 
>> On Jan 23, 2017, at 3:48 PM, Sanjay Patel via llvm-dev <llvm-dev at
lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
>> 
>> All targets are likely affected in some way by the icmp+shl fold
introduced with r292492. It's a basic pattern that occurs in lots of code.
Did you see any perf wins on your targets with this commit?
>> 
>> Sadly, it is also likely that many (all?) targets are negatively
impacted on the particular test (SingleSource/Benchmarks/Shootout/sieve) that
you have pointed out here because the IR is now decidedly worse.
>> 
>> IMO, we should not revert the commit because it exposed shortcomings in
the optimizer. It's an "obvious" fold/canonicalization, and the
related 'nuw' variant of this fold has existed in trunk since:
>> https://reviews.llvm.org/rL285729
<https://reviews.llvm.org/rL285729>
>> 
>> We need to dissect what analysis/folds are missing to restore the IR to
the better form that existed before, but this is probably going to be a long
process because we treat min/max like an optimization fence.
> 
> If this is gonna be a long process to recover, this looks like something to
be reverted in the 4.0 branch (unless I missed that there is a correctness fix
involved?).
> 
> 
> Nope - this is just about perf, not correctness. Of course, the intent was
that this transform should only improve perf, so I wonder if we can pin any
other perf changes from this commit.
> 
> I'm new to using the LNT site, but this should be the full set of
results for the A53 machine in question with a baseline (r292491) before this
patch and current (r292522) :
> http://llvm.org/perf/db_default/v4/nts/107364
<http://llvm.org/perf/db_default/v4/nts/107364>
> 
> If these are reliable results, we have 2 perf wins (puzzle, gramschmidt) on
the A53 machine. How do we determine the importance of the sieve benchmark vs.
the rest of the suite?
> 
> An x86 machine doesn't show any regressions from this change:
> http://llvm.org/perf/db_default/v4/nts/107353
<http://llvm.org/perf/db_default/v4/nts/107353>
> 
> Are there target-scope-based guidelines for when something is bad enough to
revert?
I don’t think we have any guidelines.

I think my suggestion was more about other regression that we would discover
after the release, it was more of a “maturity” call: if we just notice a problem
with the commit right before the release, it may not have been in tree long
enough to get enough scrutiny.

(Also I thought this thread included a compile time regression, which on re-read
it doesn’t).

— 
Mehdi

> 
> Also, we've absolutely destroyed perf (-48%) on the sieve benchmark on
that A53 target since the baseline (r256803). There are multiple things to fix
before we can truly recover?
> 
> Regardless of whether we revert or not, I am looking at how to clawback the
IR from the r292492 regression. Here's one step towards that:
> https://reviews.llvm.org/D29053 <https://reviews.llvm.org/D29053>
> 
> If we get lucky, we may be able to sidestep the min/max problem by folding
harder before we reach that point in the optimization pipeline.
> 
> 
>  
> — 
> Mehdi
> 
> 
> 
>> 
>> 
>> 
>> On Mon, Jan 23, 2017 at 11:13 AM, Evgeny Astigeevich
<Evgeny.Astigeevich at arm.com <mailto:Evgeny.Astigeevich at
arm.com>> wrote:
>> Confirm there is no change in IR if the hack is disabled in the
sources.
>> 
>> David wrote that these instructions are created by SCEV.
>> 
>> Are other targets affected by the changes, e.g. X86?
>> 
>>  
>> 
>> Kind regards,
>> Evgeny Astigeevich
>> Senior Compiler Engineer
>> Compilation Tools
>> ARM
>> 
>>  
>> 
>> From: Sanjay Patel [mailto:spatel at rotateright.com <mailto:spatel
at rotateright.com>]
>> Sent: Sunday, January 22, 2017 10:45 PM
>> 
>> 
>> To: Evgeny Astigeevich
>> Cc: llvm-dev; nd
>> Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and caused
17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
>> 
>>  
>> 
>> I tried an experiment to remove the integer min/max bailouts from
InstCombine, and it doesn't appear to change the IR in the attachment, so I
doubt there's going to be any improvement.
>> 
>> If I haven't messed up this example, this is amazing:
>> https://godbolt.org/g/yzoxeY <https://godbolt.org/g/yzoxeY>
>>  
>> 
>> On Sun, Jan 22, 2017 at 1:06 PM, Evgeny Astigeevich
<Evgeny.Astigeevich at arm.com <mailto:Evgeny.Astigeevich at
arm.com>> wrote:
>> 
>> Thank you for information.
>> 
>> I’ll build clang without the hack and re-run the benchmark tomorrow.
>> 
>>  
>> 
>> -Evgeny
>> 
>>  
>> 
>> From: Sanjay Patel [mailto:spatel at rotateright.com <mailto:spatel
at rotateright.com>]
>> Sent: Sunday, January 22, 2017 8:00 PM
>> To: Evgeny Astigeevich
>> Cc: llvm-dev; nd
>> 
>> 
>> Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and caused
17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
>> 
>>  
>> 
>> > Do you mean to remove the hack in InstCombiner::visitICmpInst()?
>> 
>>  
>> 
>> Yes. Although (this just came up in D28625 too) we might need to remove
multiple versions of that in order to unlock optimization:
>> 
>>
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4338
<https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4338>
>>
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L470
<https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L470>
>>
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstructionCombining.cpp#L803
<https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstructionCombining.cpp#L803>
>>
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp#L409
<https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp#L409>
>> 
>> Similar for FP:
>>
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4780
<https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4780>
>>
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L1376
<https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L1376>
>>  
>> 
>> On Sun, Jan 22, 2017 at 12:40 PM, Evgeny Astigeevich
<Evgeny.Astigeevich at arm.com <mailto:Evgeny.Astigeevich at
arm.com>> wrote:
>> 
>> Hi Sanjay,
>> 
>>  
>> 
>> The benchmark source file:
http://www.llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/Benchmarks/Shootout/sieve.c?view=markup
<http://www.llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/Benchmarks/Shootout/sieve.c?view=markup>
>> Clang options used to produce the initial IR: clang -DNDEBUG  -O3
-DNDEBUG -mcpu=cortex-a53 -fomit-frame-pointer -O3 -DNDEBUG   -w
-Werror=date-time -c sieve.c -S -emit-llvm -mllvm -disable-llvm-optzns
--target=aarch64-arm-linux
>> 
>> Opt options: opt -O3 -o /dev/null -print-before-all -print-after-all
sieve.ll >& sieve.log
>> 
>>  
>> 
>> I used the IR (in attached sieve.zip) created with the r292487 version.
>> 
>> The attached sieve contains the output of ‘-print-before-all
-print-after-all’  for r292487 and rL292492.
>> 
>>  
>> 
>> > If it's possible, can you remove that check locally, rebuild,
>> 
>> > and try the benchmark again on your system? I'd love to know
>> 
>> > if that change alone would solve the problem.
>> 
>>  
>> 
>> Do you mean to remove the hack in InstCombiner::visitICmpInst()?
>> 
>>  
>> 
>> Kind regards,
>> Evgeny Astigeevich
>> Senior Compiler Engineer
>> Compilation Tools
>> ARM
>> 
>>  
>> 
>> From: Sanjay Patel [mailto:spatel at rotateright.com <mailto:spatel
at rotateright.com>]
>> Sent: Friday, January 20, 2017 6:16 PM
>> To: Evgeny Astigeevich
>> Cc: llvm-dev; Renato Golin; t.p.northover at gmail.com
<mailto:t.p.northover at gmail.com>; hfinkel at anl.gov <mailto:hfinkel
at anl.gov>
>> Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and caused
17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
>> 
>>  
>> 
>> Thanks for letting me know about this problem!
>> 
>> There's no 'shl nsw' visible in the earlier (r292487) code,
so it would be better to see exactly what the IR looks like before that added
transform fires.
>> 
>> 
>> But I see a red flag:
>>   %smax = select i1 %11, i64 %10, i64 8193
>> 
>> The new icmp transform allowed us to create an smax, but we have this
hack in InstCombiner::visitICmpInst():
>> 
>>   // Test if the ICmpInst instruction is used exclusively by a select
as
>>   // part of a minimum or maximum operation. If so, refrain from doing
>>   // any other folding. This helps out other analyses which understand
>>   // non-obfuscated minimum and maximum idioms, such as ScalarEvolution
>>   // and CodeGen. And in this case, at least one of the comparison
>>   // operands has at least one user besides the compare (the select),
>>   // which would often largely negate the benefit of folding anyway.
>> 
>> ...so that prevented folding the icmp into the earlier math.
>> 
>> I am actively working on trying to get rid of that bail-out by
improving min/max value tracking and icmp/select folding. In fact, we might be
able to remove it right now, but I don't know the history of that code or
what cases it was supposed to help.
>> 
>>  
>> 
>> If it's possible, can you remove that check locally, rebuild, and
try the benchmark again on your system? I'd love to know if that change
alone would solve the problem.
>> 
>>  
>> 
>>  
>> 
>> On Fri, Jan 20, 2017 at 10:11 AM, Evgeny Astigeevich
<Evgeny.Astigeevich at arm.com <mailto:Evgeny.Astigeevich at
arm.com>> wrote:
>> 
>> Hi,
>> 
>> We found that today's 17.30%/11.37% performance regressions in LNT
SingleSource/Benchmarks/Shootout/sieve on LNT-AArch64-A53-O3__clang_DEV__aarch64
and LNT-Thumb2v7-A15-O3__clang_DEV__thumbv7
(http://llvm.org/perf/db_default/v4/nts/daily_report/2017/1/20?filter-machine-regex=aarch64%7Carm%7Cthumb%7Cgreen
<http://llvm.org/perf/db_default/v4/nts/daily_report/2017/1/20?filter-machine-regex=aarch64%7Carm%7Cthumb%7Cgreen>)
are caused by changes [rL292492] in InstCombine:
>> 
>> https://reviews.llvm.org/D28406 <https://reviews.llvm.org/D28406>
"[InstCombine] icmp sgt (shl nsw X, C1), C0 --> icmp sgt X, C0 >>
C1"
>> 
>> The Loop Vectorizer generates code with more instructions:
>> 
>> ==== Loop Vectorizer from rL292492  ===>> for.body5:             
; preds = %for.inc16.for.body5_crit_edge, %for.cond.preheader
>>   %indvar = phi i64 [ %indvar.next, %for.inc16.for.body5_crit_edge ], [
0, %for.cond.preheader ]
>>   %1 = phi i8 [ %.pre, %for.inc16.for.body5_crit_edge ], [ 1,
%for.cond.preheader ]
>>   %count.122 = phi i32 [ %count.2, %for.inc16.for.body5_crit_edge ], [
0, %for.cond.preheader ]
>>   %i.119 = phi i64 [ %inc17, %for.inc16.for.body5_crit_edge ], [ 2,
%for.cond.preheader ]
>>   %2 = add i64 %indvar, 2
>>   %3 = shl i64 %indvar, 1
>>   %4 = add i64 %3, 4
>>   %5 = add i64 %indvar, 2
>>   %6 = shl i64 %indvar, 1
>>   %7 = add i64 %6, 4
>>   %8 = add i64 %indvar, 2
>>   %9 = mul i64 %indvar, 3
>>   %10 = add i64 %9, 6
>>   %11 = icmp sgt i64 %10, 8193
>>   %smax = select i1 %11, i64 %10, i64 8193
>>   %12 = mul i64 %indvar, -2
>>   %13 = add i64 %12, -5
>>   %14 = add i64 %smax, %13
>>   %15 = add i64 %indvar, 2
>>   %16 = udiv i64 %14, %15
>>   %17 = add i64 %16, 1
>>   %tobool7 = icmp eq i8 %1, 0
>>   br i1 %tobool7, label %for.inc16, label %if.then
>> ===============================>> 
>> The code generated by the Loop Vectorizer before the changes:
>> 
>> ==== Loop Vectorizer from rL292487 ===>> for.body5:              
; preds = %for.inc16.for.body5_crit_edge, %for.cond.preheader
>>   %indvar = phi i64 [ %indvar.next, %for.inc16.for.body5_crit_edge ], [
0, %for.cond.preheader ]
>>   %1 = phi i8 [ %.pre, %for.inc16.for.body5_crit_edge ], [ 1,
%for.cond.preheader ]
>>   %count.122 = phi i32 [ %count.2, %for.inc16.for.body5_crit_edge ], [
0, %for.cond.preheader ]
>>   %i.119 = phi i64 [ %inc17, %for.inc16.for.body5_crit_edge ], [ 2,
%for.cond.preheader ]
>>   %2 = add i64 %indvar, 2
>>   %3 = shl i64 %indvar, 1
>>   %4 = add i64 %3, 4
>>   %5 = add i64 %indvar, 2
>>   %6 = shl i64 %indvar, 1
>>   %7 = add i64 %6, 4
>>   %8 = add i64 %indvar, 2
>>   %9 = mul i64 %indvar, -2
>>   %10 = add i64 %9, 8188
>>   %11 = add i64 %indvar, 2
>>   %12 = udiv i64 %10, %11
>>   %13 = add i64 %12, 1
>>   %tobool7 = icmp eq i8 %1, 0
>>   br i1 %tobool7, label %for.inc16, label %if.then
>> ===============================>> 
>> I have not investigated yet why the behaviour of the Vectorizer is
changed.
>> 
>> Kind regards,
>> Evgeny Astigeevich
>> Senior Compiler Engineer
>> Compilation Tools
>> ARM
>> 
>>  
>> 
>>  
>> 
>>  
>> 
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170124/bb637578/attachment.html>

Sanjay Patel via llvm-dev

2017-Jan-24 16:55 UTC

head link

[llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines

On Tue, Jan 24, 2017 at 9:24 AM, Mehdi Amini <mehdi.amini at apple.com>
wrote:
>
> On Jan 24, 2017, at 7:18 AM, Sanjay Patel <spatel at rotateright.com>
wrote:
>
>
>
> On Mon, Jan 23, 2017 at 10:53 PM, Mehdi Amini <mehdi.amini at
apple.com>
> wrote:
>
>>
>> On Jan 23, 2017, at 3:48 PM, Sanjay Patel via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>> All targets are likely affected in some way by the icmp+shl fold
>> introduced with r292492. It's a basic pattern that occurs in lots
of code.
>> Did you see any perf wins on your targets with this commit?
>>
>> Sadly, it is also likely that many (all?) targets are negatively
impacted
>> on the particular test (SingleSource/Benchmarks/Shootout/sieve) that
you
>> have pointed out here because the IR is now decidedly worse.
>>
>> IMO, we should not revert the commit because it exposed shortcomings in
>> the optimizer. It's an "obvious" fold/canonicalization,
and the related
>> 'nuw' variant of this fold has existed in trunk since:
>> https://reviews.llvm.org/rL285729
>>
>> We need to dissect what analysis/folds are missing to restore the IR to
>> the better form that existed before, but this is probably going to be a
>> long process because we treat min/max like an optimization fence.
>>
>>
>> If this is gonna be a long process to recover, this looks like
something
>> to be reverted in the 4.0 branch (unless I missed that there is a
>> correctness fix involved?).
>>
>>
> Nope - this is just about perf, not correctness. Of course, the intent was
> that this transform should only improve perf, so I wonder if we can pin any
> other perf changes from this commit.
>
> I'm new to using the LNT site, but this should be the full set of
results
> for the A53 machine in question with a baseline (r292491) before this patch
> and current (r292522) :
> http://llvm.org/perf/db_default/v4/nts/107364
>
> If these are reliable results, we have 2 perf wins (puzzle, gramschmidt)
> on the A53 machine. How do we determine the importance of the sieve
> benchmark vs. the rest of the suite?
>
> An x86 machine doesn't show any regressions from this change:
> http://llvm.org/perf/db_default/v4/nts/107353
>
> Are there target-scope-based guidelines for when something is bad enough
> to revert?
>
>
> I don’t think we have any guidelines.
>
> I think my suggestion was more about other regression that we would
> discover after the release, it was more of a “maturity” call: if we just
> notice a problem with the commit right before the release, it may not have
> been in tree long enough to get enough scrutiny.
>
That makes sense. I have no stake in any particular branch, so I have no
objection to revert from the release branch if that's what people would
like to do. My preference is to keep it in trunk though because it should
be a win in theory and reverting there would make it harder to find and
debug problems like this one.




>
> (Also I thought this thread included a compile time regression, which on
> re-read it doesn’t).
>
> —
> Mehdi
>
>
>
> Also, we've absolutely destroyed perf (-48%) on the sieve benchmark on
> that A53 target since the baseline (r256803). There are multiple things to
> fix before we can truly recover?
>
> Regardless of whether we revert or not, I am looking at how to clawback
> the IR from the r292492 regression. Here's one step towards that:
> https://reviews.llvm.org/D29053
>
> If we get lucky, we may be able to sidestep the min/max problem by folding
> harder before we reach that point in the optimization pipeline.
>
>
>
>
>> —
>> Mehdi
>>
>>
>>
>>
>>
>>
>> On Mon, Jan 23, 2017 at 11:13 AM, Evgeny Astigeevich <Evgeny.
>> Astigeevich at arm.com> wrote:
>>
>>> Confirm there is no change in IR if the hack is disabled in the
sources.
>>>
>>> David wrote that these instructions are created by SCEV.
>>>
>>> Are other targets affected by the changes, e.g. X86?
>>>
>>>
>>>
>>> Kind regards,
>>> Evgeny Astigeevich
>>> Senior Compiler Engineer
>>> Compilation Tools
>>> ARM
>>>
>>>
>>>
>>> *From:* Sanjay Patel [mailto:spatel at rotateright.com]
>>> *Sent:* Sunday, January 22, 2017 10:45 PM
>>>
>>> *To:* Evgeny Astigeevich
>>> *Cc:* llvm-dev; nd
>>> *Subject:* Re: [InstCombine] rL292492 affected LoopVectorizer and
>>> caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT
machines
>>>
>>>
>>>
>>> I tried an experiment to remove the integer min/max bailouts from
>>> InstCombine, and it doesn't appear to change the IR in the
attachment, so I
>>> doubt there's going to be any improvement.
>>>
>>> If I haven't messed up this example, this is amazing:
>>> https://godbolt.org/g/yzoxeY
>>>
>>>
>>>
>>> On Sun, Jan 22, 2017 at 1:06 PM, Evgeny Astigeevich <
>>> Evgeny.Astigeevich at arm.com> wrote:
>>>
>>> Thank you for information.
>>>
>>> I’ll build clang without the hack and re-run the benchmark
tomorrow.
>>>
>>>
>>>
>>> -Evgeny
>>>
>>>
>>>
>>> *From:* Sanjay Patel [mailto:spatel at rotateright.com]
>>> *Sent:* Sunday, January 22, 2017 8:00 PM
>>> *To:* Evgeny Astigeevich
>>> *Cc:* llvm-dev; nd
>>>
>>>
>>> *Subject:* Re: [InstCombine] rL292492 affected LoopVectorizer and
>>> caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT
machines
>>>
>>>
>>>
>>> > Do you mean to remove the hack in
InstCombiner::visitICmpInst()?
>>>
>>>
>>>
>>> Yes. Although (this just came up in D28625 too) we might need to
remove
>>> multiple versions of that in order to unlock optimization:
>>>
>>> https://github.com/llvm-mirror/llvm/blob/master/lib/Transfor
>>> ms/InstCombine/InstCombineCompares.cpp#L4338
>>>
>>> https://github.com/llvm-mirror/llvm/blob/master/lib/Transfor
>>> ms/InstCombine/InstCombineCasts.cpp#L470
>>>
>>> https://github.com/llvm-mirror/llvm/blob/master/lib/Transfor
>>> ms/InstCombine/InstructionCombining.cpp#L803
>>>
>>> https://github.com/llvm-mirror/llvm/blob/master/lib/Transfor
>>> ms/InstCombine/InstCombineSimplifyDemanded.cpp#L409
>>>
>>>
>>> Similar for FP:
>>> https://github.com/llvm-mirror/llvm/blob/master/lib/Transfor
>>> ms/InstCombine/InstCombineCompares.cpp#L4780
>>> https://github.com/llvm-mirror/llvm/blob/master/lib/Transfor
>>> ms/InstCombine/InstCombineCasts.cpp#L1376
>>>
>>>
>>>
>>> On Sun, Jan 22, 2017 at 12:40 PM, Evgeny Astigeevich <
>>> Evgeny.Astigeevich at arm.com> wrote:
>>>
>>> Hi Sanjay,
>>>
>>>
>>>
>>> The benchmark source file:
http://www.llvm.org/viewvc/llvm-project/test-
>>> suite/trunk/SingleSource/Benchmarks/Shootout/sieve.c?view=markup
>>>
>>> Clang options used to produce the initial IR: clang -DNDEBUG  -O3
>>> -DNDEBUG -mcpu=cortex-a53 -fomit-frame-pointer -O3 -DNDEBUG   -w
>>> -Werror=date-time -c sieve.c -S -emit-llvm -mllvm
-disable-llvm-optzns
>>> --target=aarch64-arm-linux
>>>
>>> Opt options: opt -O3 -o /dev/null -print-before-all
-print-after-all
>>> sieve.ll >& sieve.log
>>>
>>>
>>>
>>> I used the IR (in attached sieve.zip) created with the r292487
version.
>>>
>>> The attached sieve contains the output of ‘-print-before-all
>>> -print-after-all’  for r292487 and rL292492.
>>>
>>>
>>>
>>> > If it's possible, can you remove that check locally,
rebuild,
>>>
>>> > and try the benchmark again on your system? I'd love to
know
>>>
>>> > if that change alone would solve the problem.
>>>
>>>
>>>
>>> Do you mean to remove the hack in InstCombiner::visitICmpInst()?
>>>
>>>
>>>
>>> Kind regards,
>>> Evgeny Astigeevich
>>> Senior Compiler Engineer
>>> Compilation Tools
>>> ARM
>>>
>>>
>>>
>>> *From:* Sanjay Patel [mailto:spatel at rotateright.com]
>>> *Sent:* Friday, January 20, 2017 6:16 PM
>>> *To:* Evgeny Astigeevich
>>> *Cc:* llvm-dev; Renato Golin; t.p.northover at gmail.com; hfinkel
at anl.gov
>>> *Subject:* Re: [InstCombine] rL292492 affected LoopVectorizer and
>>> caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT
machines
>>>
>>>
>>>
>>> Thanks for letting me know about this problem!
>>>
>>> There's no 'shl nsw' visible in the earlier (r292487)
code, so it would
>>> be better to see exactly what the IR looks like before that added
transform
>>> fires.
>>>
>>>
>>> But I see a red flag:
>>>   %smax = select i1 %11, i64 %10, i64 8193
>>>
>>> The new icmp transform allowed us to create an smax, but we have
this
>>> hack in InstCombiner::visitICmpInst():
>>>
>>>   // Test if the ICmpInst instruction is used exclusively by a
select as
>>>   // part of a minimum or maximum operation. If so, refrain from
doing
>>>   // any other folding. This helps out other analyses which
understand
>>>   // non-obfuscated minimum and maximum idioms, such as
ScalarEvolution
>>>   // and CodeGen. And in this case, at least one of the comparison
>>>   // operands has at least one user besides the compare (the
select),
>>>   // which would often largely negate the benefit of folding
anyway.
>>>
>>> ...so that prevented folding the icmp into the earlier math.
>>>
>>> I am actively working on trying to get rid of that bail-out by
improving
>>> min/max value tracking and icmp/select folding. In fact, we might
be able
>>> to remove it right now, but I don't know the history of that
code or what
>>> cases it was supposed to help.
>>>
>>>
>>>
>>> If it's possible, can you remove that check locally, rebuild,
and try
>>> the benchmark again on your system? I'd love to know if that
change alone
>>> would solve the problem.
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Jan 20, 2017 at 10:11 AM, Evgeny Astigeevich <
>>> Evgeny.Astigeevich at arm.com> wrote:
>>>
>>> Hi,
>>>
>>> We found that today's 17.30%/11.37% performance regressions in
LNT
>>> SingleSource/Benchmarks/Shootout/sieve on
LNT-AArch64-A53-O3__clang_DEV__aarch64
>>> and LNT-Thumb2v7-A15-O3__clang_DEV__thumbv7 (
>>> http://llvm.org/perf/db_default/v4/nts/daily_report/2017/1/
>>> 20?filter-machine-regex=aarch64%7Carm%7Cthumb%7Cgreen) are caused
by
>>> changes [rL292492] in InstCombine:
>>>
>>> https://reviews.llvm.org/D28406 "[InstCombine] icmp sgt (shl
nsw X,
>>> C1), C0 --> icmp sgt X, C0 >> C1"
>>>
>>> The Loop Vectorizer generates code with more instructions:
>>>
>>> ==== Loop Vectorizer from rL292492  ===>>> for.body5:     
; preds >>> %for.inc16.for.body5_crit_edge, %for.cond.preheader
>>>   %indvar = phi i64 [ %indvar.next, %for.inc16.for.body5_crit_edge
], [
>>> 0, %for.cond.preheader ]
>>>   %1 = phi i8 [ %.pre, %for.inc16.for.body5_crit_edge ], [ 1,
>>> %for.cond.preheader ]
>>>   %count.122 = phi i32 [ %count.2, %for.inc16.for.body5_crit_edge
], [
>>> 0, %for.cond.preheader ]
>>>   %i.119 = phi i64 [ %inc17, %for.inc16.for.body5_crit_edge ], [ 2,
>>> %for.cond.preheader ]
>>>   %2 = add i64 %indvar, 2
>>>   %3 = shl i64 %indvar, 1
>>>   %4 = add i64 %3, 4
>>>   %5 = add i64 %indvar, 2
>>>   %6 = shl i64 %indvar, 1
>>>   %7 = add i64 %6, 4
>>>   %8 = add i64 %indvar, 2
>>>   %9 = mul i64 %indvar, 3
>>>   %10 = add i64 %9, 6
>>>   %11 = icmp sgt i64 %10, 8193
>>>   %smax = select i1 %11, i64 %10, i64 8193
>>>   %12 = mul i64 %indvar, -2
>>>   %13 = add i64 %12, -5
>>>   %14 = add i64 %smax, %13
>>>   %15 = add i64 %indvar, 2
>>>   %16 = udiv i64 %14, %15
>>>   %17 = add i64 %16, 1
>>>   %tobool7 = icmp eq i8 %1, 0
>>>   br i1 %tobool7, label %for.inc16, label %if.then
>>> ===============================>>>
>>> The code generated by the Loop Vectorizer before the changes:
>>>
>>> ==== Loop Vectorizer from rL292487 ===>>> for.body5:      
; preds >>> %for.inc16.for.body5_crit_edge, %for.cond.preheader
>>>   %indvar = phi i64 [ %indvar.next, %for.inc16.for.body5_crit_edge
], [
>>> 0, %for.cond.preheader ]
>>>   %1 = phi i8 [ %.pre, %for.inc16.for.body5_crit_edge ], [ 1,
>>> %for.cond.preheader ]
>>>   %count.122 = phi i32 [ %count.2, %for.inc16.for.body5_crit_edge
], [
>>> 0, %for.cond.preheader ]
>>>   %i.119 = phi i64 [ %inc17, %for.inc16.for.body5_crit_edge ], [ 2,
>>> %for.cond.preheader ]
>>>   %2 = add i64 %indvar, 2
>>>   %3 = shl i64 %indvar, 1
>>>   %4 = add i64 %3, 4
>>>   %5 = add i64 %indvar, 2
>>>   %6 = shl i64 %indvar, 1
>>>   %7 = add i64 %6, 4
>>>   %8 = add i64 %indvar, 2
>>>   %9 = mul i64 %indvar, -2
>>>   %10 = add i64 %9, 8188
>>>   %11 = add i64 %indvar, 2
>>>   %12 = udiv i64 %10, %11
>>>   %13 = add i64 %12, 1
>>>   %tobool7 = icmp eq i8 %1, 0
>>>   br i1 %tobool7, label %for.inc16, label %if.then
>>> ===============================>>>
>>> I have not investigated yet why the behaviour of the Vectorizer is
>>> changed.
>>>
>>> Kind regards,
>>> Evgeny Astigeevich
>>> Senior Compiler Engineer
>>> Compilation Tools
>>> ARM
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170124/479fdd8b/attachment-0001.html>

Mehdi Amini via llvm-dev

2017-Jan-24 16:56 UTC

head link

[llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines

> On Jan 24, 2017, at 8:55 AM, Sanjay Patel <spatel at rotateright.com>
wrote:
> 
> 
> 
> On Tue, Jan 24, 2017 at 9:24 AM, Mehdi Amini <mehdi.amini at apple.com
<mailto:mehdi.amini at apple.com>> wrote:
> 
>> On Jan 24, 2017, at 7:18 AM, Sanjay Patel <spatel at rotateright.com
<mailto:spatel at rotateright.com>> wrote:
>> 
>> 
>> 
>> On Mon, Jan 23, 2017 at 10:53 PM, Mehdi Amini <mehdi.amini at
apple.com <mailto:mehdi.amini at apple.com>> wrote:
>> 
>>> On Jan 23, 2017, at 3:48 PM, Sanjay Patel via llvm-dev <llvm-dev
at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
>>> 
>>> All targets are likely affected in some way by the icmp+shl fold
introduced with r292492. It's a basic pattern that occurs in lots of code.
Did you see any perf wins on your targets with this commit?
>>> 
>>> Sadly, it is also likely that many (all?) targets are negatively
impacted on the particular test (SingleSource/Benchmarks/Shootout/sieve) that
you have pointed out here because the IR is now decidedly worse.
>>> 
>>> IMO, we should not revert the commit because it exposed
shortcomings in the optimizer. It's an "obvious"
fold/canonicalization, and the related 'nuw' variant of this fold has
existed in trunk since:
>>> https://reviews.llvm.org/rL285729
<https://reviews.llvm.org/rL285729>
>>> 
>>> We need to dissect what analysis/folds are missing to restore the
IR to the better form that existed before, but this is probably going to be a
long process because we treat min/max like an optimization fence.
>> 
>> If this is gonna be a long process to recover, this looks like
something to be reverted in the 4.0 branch (unless I missed that there is a
correctness fix involved?).
>> 
>> 
>> Nope - this is just about perf, not correctness. Of course, the intent
was that this transform should only improve perf, so I wonder if we can pin any
other perf changes from this commit.
>> 
>> I'm new to using the LNT site, but this should be the full set of
results for the A53 machine in question with a baseline (r292491) before this
patch and current (r292522) :
>> http://llvm.org/perf/db_default/v4/nts/107364
<http://llvm.org/perf/db_default/v4/nts/107364>
>> 
>> If these are reliable results, we have 2 perf wins (puzzle,
gramschmidt) on the A53 machine. How do we determine the importance of the sieve
benchmark vs. the rest of the suite?
>> 
>> An x86 machine doesn't show any regressions from this change:
>> http://llvm.org/perf/db_default/v4/nts/107353
<http://llvm.org/perf/db_default/v4/nts/107353>
>> 
>> Are there target-scope-based guidelines for when something is bad
enough to revert?
> 
> I don’t think we have any guidelines.
> 
> I think my suggestion was more about other regression that we would
discover after the release, it was more of a “maturity” call: if we just notice
a problem with the commit right before the release, it may not have been in tree
long enough to get enough scrutiny.
> 
> That makes sense. I have no stake in any particular branch, so I have no
objection to revert from the release branch if that's what people would like
to do.
+Hans for advising on what to on clang-4.0.
> My preference is to keep it in trunk though because it should be a win in
theory and reverting there would make it harder to find and debug problems like
this one.
Right, trunk seems fine to me at this point.

— 
Mehdi




> 
> 
> 
>  
> 
> (Also I thought this thread included a compile time regression, which on
re-read it doesn’t).
> 
> — 
> Mehdi
> 
> 
>> 
>> Also, we've absolutely destroyed perf (-48%) on the sieve benchmark
on that A53 target since the baseline (r256803). There are multiple things to
fix before we can truly recover?
>> 
>> Regardless of whether we revert or not, I am looking at how to clawback
the IR from the r292492 regression. Here's one step towards that:
>> https://reviews.llvm.org/D29053 <https://reviews.llvm.org/D29053>
>> 
>> If we get lucky, we may be able to sidestep the min/max problem by
folding harder before we reach that point in the optimization pipeline.
>> 
>> 
>>  
>> — 
>> Mehdi
>> 
>> 
>> 
>>> 
>>> 
>>> 
>>> On Mon, Jan 23, 2017 at 11:13 AM, Evgeny Astigeevich
<Evgeny.Astigeevich at arm.com <mailto:Evgeny.Astigeevich at
arm.com>> wrote:
>>> Confirm there is no change in IR if the hack is disabled in the
sources.
>>> 
>>> David wrote that these instructions are created by SCEV.
>>> 
>>> Are other targets affected by the changes, e.g. X86?
>>> 
>>>  
>>> 
>>> Kind regards,
>>> Evgeny Astigeevich
>>> Senior Compiler Engineer
>>> Compilation Tools
>>> ARM
>>> 
>>>  
>>> 
>>> From: Sanjay Patel [mailto:spatel at rotateright.com
<mailto:spatel at rotateright.com>]
>>> Sent: Sunday, January 22, 2017 10:45 PM
>>> 
>>> 
>>> To: Evgeny Astigeevich
>>> Cc: llvm-dev; nd
>>> Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and
caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
>>> 
>>>  
>>> 
>>> I tried an experiment to remove the integer min/max bailouts from
InstCombine, and it doesn't appear to change the IR in the attachment, so I
doubt there's going to be any improvement.
>>> 
>>> If I haven't messed up this example, this is amazing:
>>> https://godbolt.org/g/yzoxeY <https://godbolt.org/g/yzoxeY>
>>>  
>>> 
>>> On Sun, Jan 22, 2017 at 1:06 PM, Evgeny Astigeevich
<Evgeny.Astigeevich at arm.com <mailto:Evgeny.Astigeevich at
arm.com>> wrote:
>>> 
>>> Thank you for information.
>>> 
>>> I’ll build clang without the hack and re-run the benchmark
tomorrow.
>>> 
>>>  
>>> 
>>> -Evgeny
>>> 
>>>  
>>> 
>>> From: Sanjay Patel [mailto:spatel at rotateright.com
<mailto:spatel at rotateright.com>]
>>> Sent: Sunday, January 22, 2017 8:00 PM
>>> To: Evgeny Astigeevich
>>> Cc: llvm-dev; nd
>>> 
>>> 
>>> Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and
caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
>>> 
>>>  
>>> 
>>> > Do you mean to remove the hack in
InstCombiner::visitICmpInst()?
>>> 
>>>  
>>> 
>>> Yes. Although (this just came up in D28625 too) we might need to
remove multiple versions of that in order to unlock optimization:
>>> 
>>>
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4338
<https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4338>
>>>
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L470
<https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L470>
>>>
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstructionCombining.cpp#L803
<https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstructionCombining.cpp#L803>
>>>
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp#L409
<https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp#L409>
>>> 
>>> Similar for FP:
>>>
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4780
<https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4780>
>>>
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L1376
<https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L1376>
>>>  
>>> 
>>> On Sun, Jan 22, 2017 at 12:40 PM, Evgeny Astigeevich
<Evgeny.Astigeevich at arm.com <mailto:Evgeny.Astigeevich at
arm.com>> wrote:
>>> 
>>> Hi Sanjay,
>>> 
>>>  
>>> 
>>> The benchmark source file:
http://www.llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/Benchmarks/Shootout/sieve.c?view=markup
<http://www.llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/Benchmarks/Shootout/sieve.c?view=markup>
>>> Clang options used to produce the initial IR: clang -DNDEBUG  -O3
-DNDEBUG -mcpu=cortex-a53 -fomit-frame-pointer -O3 -DNDEBUG   -w
-Werror=date-time -c sieve.c -S -emit-llvm -mllvm -disable-llvm-optzns
--target=aarch64-arm-linux
>>> 
>>> Opt options: opt -O3 -o /dev/null -print-before-all
-print-after-all sieve.ll >& sieve.log
>>> 
>>>  
>>> 
>>> I used the IR (in attached sieve.zip) created with the r292487
version.
>>> 
>>> The attached sieve contains the output of ‘-print-before-all
-print-after-all’  for r292487 and rL292492.
>>> 
>>>  
>>> 
>>> > If it's possible, can you remove that check locally,
rebuild,
>>> 
>>> > and try the benchmark again on your system? I'd love to
know
>>> 
>>> > if that change alone would solve the problem.
>>> 
>>>  
>>> 
>>> Do you mean to remove the hack in InstCombiner::visitICmpInst()?
>>> 
>>>  
>>> 
>>> Kind regards,
>>> Evgeny Astigeevich
>>> Senior Compiler Engineer
>>> Compilation Tools
>>> ARM
>>> 
>>>  
>>> 
>>> From: Sanjay Patel [mailto:spatel at rotateright.com
<mailto:spatel at rotateright.com>]
>>> Sent: Friday, January 20, 2017 6:16 PM
>>> To: Evgeny Astigeevich
>>> Cc: llvm-dev; Renato Golin; t.p.northover at gmail.com
<mailto:t.p.northover at gmail.com>; hfinkel at anl.gov <mailto:hfinkel
at anl.gov>
>>> Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and
caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
>>> 
>>>  
>>> 
>>> Thanks for letting me know about this problem!
>>> 
>>> There's no 'shl nsw' visible in the earlier (r292487)
code, so it would be better to see exactly what the IR looks like before that
added transform fires.
>>> 
>>> 
>>> But I see a red flag:
>>>   %smax = select i1 %11, i64 %10, i64 8193
>>> 
>>> The new icmp transform allowed us to create an smax, but we have
this hack in InstCombiner::visitICmpInst():
>>> 
>>>   // Test if the ICmpInst instruction is used exclusively by a
select as
>>>   // part of a minimum or maximum operation. If so, refrain from
doing
>>>   // any other folding. This helps out other analyses which
understand
>>>   // non-obfuscated minimum and maximum idioms, such as
ScalarEvolution
>>>   // and CodeGen. And in this case, at least one of the comparison
>>>   // operands has at least one user besides the compare (the
select),
>>>   // which would often largely negate the benefit of folding
anyway.
>>> 
>>> ...so that prevented folding the icmp into the earlier math.
>>> 
>>> I am actively working on trying to get rid of that bail-out by
improving min/max value tracking and icmp/select folding. In fact, we might be
able to remove it right now, but I don't know the history of that code or
what cases it was supposed to help.
>>> 
>>>  
>>> 
>>> If it's possible, can you remove that check locally, rebuild,
and try the benchmark again on your system? I'd love to know if that change
alone would solve the problem.
>>> 
>>>  
>>> 
>>>  
>>> 
>>> On Fri, Jan 20, 2017 at 10:11 AM, Evgeny Astigeevich
<Evgeny.Astigeevich at arm.com <mailto:Evgeny.Astigeevich at
arm.com>> wrote:
>>> 
>>> Hi,
>>> 
>>> We found that today's 17.30%/11.37% performance regressions in
LNT SingleSource/Benchmarks/Shootout/sieve on
LNT-AArch64-A53-O3__clang_DEV__aarch64 and
LNT-Thumb2v7-A15-O3__clang_DEV__thumbv7
(http://llvm.org/perf/db_default/v4/nts/daily_report/2017/1/20?filter-machine-regex=aarch64%7Carm%7Cthumb%7Cgreen
<http://llvm.org/perf/db_default/v4/nts/daily_report/2017/1/20?filter-machine-regex=aarch64%7Carm%7Cthumb%7Cgreen>)
are caused by changes [rL292492] in InstCombine:
>>> 
>>> https://reviews.llvm.org/D28406
<https://reviews.llvm.org/D28406> "[InstCombine] icmp sgt (shl nsw X,
C1), C0 --> icmp sgt X, C0 >> C1"
>>> 
>>> The Loop Vectorizer generates code with more instructions:
>>> 
>>> ==== Loop Vectorizer from rL292492  ===>>> for.body5:     
; preds = %for.inc16.for.body5_crit_edge, %for.cond.preheader
>>>   %indvar = phi i64 [ %indvar.next, %for.inc16.for.body5_crit_edge
], [ 0, %for.cond.preheader ]
>>>   %1 = phi i8 [ %.pre, %for.inc16.for.body5_crit_edge ], [ 1,
%for.cond.preheader ]
>>>   %count.122 = phi i32 [ %count.2, %for.inc16.for.body5_crit_edge
], [ 0, %for.cond.preheader ]
>>>   %i.119 = phi i64 [ %inc17, %for.inc16.for.body5_crit_edge ], [ 2,
%for.cond.preheader ]
>>>   %2 = add i64 %indvar, 2
>>>   %3 = shl i64 %indvar, 1
>>>   %4 = add i64 %3, 4
>>>   %5 = add i64 %indvar, 2
>>>   %6 = shl i64 %indvar, 1
>>>   %7 = add i64 %6, 4
>>>   %8 = add i64 %indvar, 2
>>>   %9 = mul i64 %indvar, 3
>>>   %10 = add i64 %9, 6
>>>   %11 = icmp sgt i64 %10, 8193
>>>   %smax = select i1 %11, i64 %10, i64 8193
>>>   %12 = mul i64 %indvar, -2
>>>   %13 = add i64 %12, -5
>>>   %14 = add i64 %smax, %13
>>>   %15 = add i64 %indvar, 2
>>>   %16 = udiv i64 %14, %15
>>>   %17 = add i64 %16, 1
>>>   %tobool7 = icmp eq i8 %1, 0
>>>   br i1 %tobool7, label %for.inc16, label %if.then
>>> ===============================>>> 
>>> The code generated by the Loop Vectorizer before the changes:
>>> 
>>> ==== Loop Vectorizer from rL292487 ===>>> for.body5:      
; preds = %for.inc16.for.body5_crit_edge, %for.cond.preheader
>>>   %indvar = phi i64 [ %indvar.next, %for.inc16.for.body5_crit_edge
], [ 0, %for.cond.preheader ]
>>>   %1 = phi i8 [ %.pre, %for.inc16.for.body5_crit_edge ], [ 1,
%for.cond.preheader ]
>>>   %count.122 = phi i32 [ %count.2, %for.inc16.for.body5_crit_edge
], [ 0, %for.cond.preheader ]
>>>   %i.119 = phi i64 [ %inc17, %for.inc16.for.body5_crit_edge ], [ 2,
%for.cond.preheader ]
>>>   %2 = add i64 %indvar, 2
>>>   %3 = shl i64 %indvar, 1
>>>   %4 = add i64 %3, 4
>>>   %5 = add i64 %indvar, 2
>>>   %6 = shl i64 %indvar, 1
>>>   %7 = add i64 %6, 4
>>>   %8 = add i64 %indvar, 2
>>>   %9 = mul i64 %indvar, -2
>>>   %10 = add i64 %9, 8188
>>>   %11 = add i64 %indvar, 2
>>>   %12 = udiv i64 %10, %11
>>>   %13 = add i64 %12, 1
>>>   %tobool7 = icmp eq i8 %1, 0
>>>   br i1 %tobool7, label %for.inc16, label %if.then
>>> ===============================>>> 
>>> I have not investigated yet why the behaviour of the Vectorizer is
changed.
>>> 
>>> Kind regards,
>>> Evgeny Astigeevich
>>> Senior Compiler Engineer
>>> Compilation Tools
>>> ARM
>>> 
>>>  
>>> 
>>>  
>>> 
>>>  
>>> 
>>> 
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170124/8f8d9114/attachment.html>

Evgeny Astigeevich via llvm-dev

2017-Jan-24 20:55 UTC

head link

[llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines

Hi Sanjay,

Thank you for your analysis. It’s interesting why the x86 machine is not
affected. Maybe the x86 backend is smarter than the AArch64 backend, or it might
be micro-architectural differences.
I don’t mind to keep the changes on trunk.
What I’d like to see is who will/should be involved in solving the issue. What
kind of help/support is needed? Should we (ARM Compilation Tools) start digging
into the issue and fix it because the issue affects our future ARM Compiler 6
releases?
We can provide help with validating that patches fix the issue and don’t
introduce new ones.

Kind regards,
Evgeny Astigeevich
Senior Compiler Engineer
Compilation Tools
ARM

From: Sanjay Patel [mailto:spatel at rotateright.com]
Sent: Tuesday, January 24, 2017 4:55 PM
To: Mehdi Amini
Cc: Evgeny Astigeevich; llvm-dev; nd
Subject: Re: [llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and
caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines

On Tue, Jan 24, 2017 at 9:24 AM, Mehdi Amini <mehdi.amini at
apple.com<mailto:mehdi.amini at apple.com>> wrote:

On Jan 24, 2017, at 7:18 AM, Sanjay Patel <spatel at
rotateright.com<mailto:spatel at rotateright.com>> wrote:

On Mon, Jan 23, 2017 at 10:53 PM, Mehdi Amini <mehdi.amini at
apple.com<mailto:mehdi.amini at apple.com>> wrote:

On Jan 23, 2017, at 3:48 PM, Sanjay Patel via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:

All targets are likely affected in some way by the icmp+shl fold introduced with
r292492. It's a basic pattern that occurs in lots of code. Did you see any
perf wins on your targets with this commit?
Sadly, it is also likely that many (all?) targets are negatively impacted on the
particular test (SingleSource/Benchmarks/Shootout/sieve) that you have pointed
out here because the IR is now decidedly worse.

IMO, we should not revert the commit because it exposed shortcomings in the
optimizer. It's an "obvious" fold/canonicalization, and the
related 'nuw' variant of this fold has existed in trunk since:
https://reviews.llvm.org/rL285729
We need to dissect what analysis/folds are missing to restore the IR to the
better form that existed before, but this is probably going to be a long process
because we treat min/max like an optimization fence.

If this is gonna be a long process to recover, this looks like something to be
reverted in the 4.0 branch (unless I missed that there is a correctness fix
involved?).

Nope - this is just about perf, not correctness. Of course, the intent was that
this transform should only improve perf, so I wonder if we can pin any other
perf changes from this commit.

I'm new to using the LNT site, but this should be the full set of results
for the A53 machine in question with a baseline (r292491) before this patch and
current (r292522) :
http://llvm.org/perf/db_default/v4/nts/107364

If these are reliable results, we have 2 perf wins (puzzle, gramschmidt) on the
A53 machine. How do we determine the importance of the sieve benchmark vs. the
rest of the suite?
An x86 machine doesn't show any regressions from this change:
http://llvm.org/perf/db_default/v4/nts/107353

Are there target-scope-based guidelines for when something is bad enough to
revert?

I don’t think we have any guidelines.

I think my suggestion was more about other regression that we would discover
after the release, it was more of a “maturity” call: if we just notice a problem
with the commit right before the release, it may not have been in tree long
enough to get enough scrutiny.

That makes sense. I have no stake in any particular branch, so I have no
objection to revert from the release branch if that's what people would like
to do. My preference is to keep it in trunk though because it should be a win in
theory and reverting there would make it harder to find and debug problems like
this one.

(Also I thought this thread included a compile time regression, which on re-read
it doesn’t).

—
Mehdi

Also, we've absolutely destroyed perf (-48%) on the sieve benchmark on that
A53 target since the baseline (r256803). There are multiple things to fix before
we can truly recover?

Regardless of whether we revert or not, I am looking at how to clawback the IR
from the r292492 regression. Here's one step towards that:
https://reviews.llvm.org/D29053

If we get lucky, we may be able to sidestep the min/max problem by folding
harder before we reach that point in the optimization pipeline.

—
Mehdi

On Mon, Jan 23, 2017 at 11:13 AM, Evgeny Astigeevich <Evgeny.Astigeevich at
arm.com<mailto:Evgeny.Astigeevich at arm.com>> wrote:
Confirm there is no change in IR if the hack is disabled in the sources.
David wrote that these instructions are created by SCEV.
Are other targets affected by the changes, e.g. X86?

Kind regards,
Evgeny Astigeevich
Senior Compiler Engineer
Compilation Tools
ARM

From: Sanjay Patel [mailto:spatel at rotateright.com<mailto:spatel at
rotateright.com>]
Sent: Sunday, January 22, 2017 10:45 PM

To: Evgeny Astigeevich
Cc: llvm-dev; nd
Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and caused
17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines

I tried an experiment to remove the integer min/max bailouts from InstCombine,
and it doesn't appear to change the IR in the attachment, so I doubt
there's going to be any improvement.

If I haven't messed up this example, this is amazing:
https://godbolt.org/g/yzoxeY

On Sun, Jan 22, 2017 at 1:06 PM, Evgeny Astigeevich <Evgeny.Astigeevich at
arm.com<mailto:Evgeny.Astigeevich at arm.com>> wrote:
Thank you for information.
I’ll build clang without the hack and re-run the benchmark tomorrow.

-Evgeny

From: Sanjay Patel [mailto:spatel at rotateright.com<mailto:spatel at
rotateright.com>]
Sent: Sunday, January 22, 2017 8:00 PM
To: Evgeny Astigeevich
Cc: llvm-dev; nd

Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and caused
17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
> Do you mean to remove the hack in InstCombiner::visitICmpInst()?
Yes. Although (this just came up in D28625 too) we might need to remove multiple
versions of that in order to unlock optimization:
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4338
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L470
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstructionCombining.cpp#L803
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp#L409

Similar for FP:
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4780
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L1376

On Sun, Jan 22, 2017 at 12:40 PM, Evgeny Astigeevich <Evgeny.Astigeevich at
arm.com<mailto:Evgeny.Astigeevich at arm.com>> wrote:
Hi Sanjay,

The benchmark source file:
http://www.llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/Benchmarks/Shootout/sieve.c?view=markup
Clang options used to produce the initial IR: clang -DNDEBUG  -O3 -DNDEBUG
-mcpu=cortex-a53 -fomit-frame-pointer -O3 -DNDEBUG   -w -Werror=date-time -c
sieve.c -S -emit-llvm -mllvm -disable-llvm-optzns --target=aarch64-arm-linux
Opt options: opt -O3 -o /dev/null -print-before-all -print-after-all sieve.ll
>& sieve.log

I used the IR (in attached sieve.zip) created with the r292487 version.
The attached sieve contains the output of ‘-print-before-all -print-after-all’ 
for r292487 and rL292492.
> If it's possible, can you remove that check locally, rebuild,
> and try the benchmark again on your system? I'd love to know
> if that change alone would solve the problem.
Do you mean to remove the hack in InstCombiner::visitICmpInst()?

Kind regards,
Evgeny Astigeevich
Senior Compiler Engineer
Compilation Tools
ARM

From: Sanjay Patel [mailto:spatel at rotateright.com<mailto:spatel at
rotateright.com>]
Sent: Friday, January 20, 2017 6:16 PM
To: Evgeny Astigeevich
Cc: llvm-dev; Renato Golin; t.p.northover at gmail.com<mailto:t.p.northover
at gmail.com>; hfinkel at anl.gov<mailto:hfinkel at anl.gov>
Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and caused
17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines

Thanks for letting me know about this problem!

There's no 'shl nsw' visible in the earlier (r292487) code, so it
would be better to see exactly what the IR looks like before that added
transform fires.

But I see a red flag:
  %smax = select i1 %11, i64 %10, i64 8193
The new icmp transform allowed us to create an smax, but we have this hack in
InstCombiner::visitICmpInst():

  // Test if the ICmpInst instruction is used exclusively by a select as
  // part of a minimum or maximum operation. If so, refrain from doing
  // any other folding. This helps out other analyses which understand
  // non-obfuscated minimum and maximum idioms, such as ScalarEvolution
  // and CodeGen. And in this case, at least one of the comparison
  // operands has at least one user besides the compare (the select),
  // which would often largely negate the benefit of folding anyway.
...so that prevented folding the icmp into the earlier math.
I am actively working on trying to get rid of that bail-out by improving min/max
value tracking and icmp/select folding. In fact, we might be able to remove it
right now, but I don't know the history of that code or what cases it was
supposed to help.

If it's possible, can you remove that check locally, rebuild, and try the
benchmark again on your system? I'd love to know if that change alone would
solve the problem.

On Fri, Jan 20, 2017 at 10:11 AM, Evgeny Astigeevich <Evgeny.Astigeevich at
arm.com<mailto:Evgeny.Astigeevich at arm.com>> wrote:
Hi,

We found that today's 17.30%/11.37% performance regressions in LNT
SingleSource/Benchmarks/Shootout/sieve on LNT-AArch64-A53-O3__clang_DEV__aarch64
and LNT-Thumb2v7-A15-O3__clang_DEV__thumbv7
(http://llvm.org/perf/db_default/v4/nts/daily_report/2017/1/20?filter-machine-regex=aarch64%7Carm%7Cthumb%7Cgreen)
are caused by changes [rL292492] in InstCombine:

https://reviews.llvm.org/D28406 "[InstCombine] icmp sgt (shl nsw X, C1), C0
--> icmp sgt X, C0 >> C1"

The Loop Vectorizer generates code with more instructions:

==== Loop Vectorizer from rL292492  ===for.body5:                               
; preds = %for.inc16.for.body5_crit_edge, %for.cond.preheader
  %indvar = phi i64 [ %indvar.next, %for.inc16.for.body5_crit_edge ], [ 0,
%for.cond.preheader ]
  %1 = phi i8 [ %.pre, %for.inc16.for.body5_crit_edge ], [ 1,
%for.cond.preheader ]
  %count.122 = phi i32 [ %count.2, %for.inc16.for.body5_crit_edge ], [ 0,
%for.cond.preheader ]
  %i.119 = phi i64 [ %inc17, %for.inc16.for.body5_crit_edge ], [ 2,
%for.cond.preheader ]
  %2 = add i64 %indvar, 2
  %3 = shl i64 %indvar, 1
  %4 = add i64 %3, 4
  %5 = add i64 %indvar, 2
  %6 = shl i64 %indvar, 1
  %7 = add i64 %6, 4
  %8 = add i64 %indvar, 2
  %9 = mul i64 %indvar, 3
  %10 = add i64 %9, 6
  %11 = icmp sgt i64 %10, 8193
  %smax = select i1 %11, i64 %10, i64 8193
  %12 = mul i64 %indvar, -2
  %13 = add i64 %12, -5
  %14 = add i64 %smax, %13
  %15 = add i64 %indvar, 2
  %16 = udiv i64 %14, %15
  %17 = add i64 %16, 1
  %tobool7 = icmp eq i8 %1, 0
  br i1 %tobool7, label %for.inc16, label %if.then
===============================
The code generated by the Loop Vectorizer before the changes:

==== Loop Vectorizer from rL292487 ===for.body5:                                
; preds = %for.inc16.for.body5_crit_edge, %for.cond.preheader
  %indvar = phi i64 [ %indvar.next, %for.inc16.for.body5_crit_edge ], [ 0,
%for.cond.preheader ]
  %1 = phi i8 [ %.pre, %for.inc16.for.body5_crit_edge ], [ 1,
%for.cond.preheader ]
  %count.122 = phi i32 [ %count.2, %for.inc16.for.body5_crit_edge ], [ 0,
%for.cond.preheader ]
  %i.119 = phi i64 [ %inc17, %for.inc16.for.body5_crit_edge ], [ 2,
%for.cond.preheader ]
  %2 = add i64 %indvar, 2
  %3 = shl i64 %indvar, 1
  %4 = add i64 %3, 4
  %5 = add i64 %indvar, 2
  %6 = shl i64 %indvar, 1
  %7 = add i64 %6, 4
  %8 = add i64 %indvar, 2
  %9 = mul i64 %indvar, -2
  %10 = add i64 %9, 8188
  %11 = add i64 %indvar, 2
  %12 = udiv i64 %10, %11
  %13 = add i64 %12, 1
  %tobool7 = icmp eq i8 %1, 0
  br i1 %tobool7, label %for.inc16, label %if.then
===============================
I have not investigated yet why the behaviour of the Vectorizer is changed.

Kind regards,
Evgeny Astigeevich
Senior Compiler Engineer
Compilation Tools
ARM

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170124/4087b73d/attachment.html>

llvm dev - Jan 2017 - [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines

[llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines

[llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines

[llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines

[llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines