Evgeny Astigeevich via llvm-dev
2017-Jan-23 18:13 UTC
[llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
Confirm there is no change in IR if the hack is disabled in the sources. David wrote that these instructions are created by SCEV. Are other targets affected by the changes, e.g. X86? Kind regards, Evgeny Astigeevich Senior Compiler Engineer Compilation Tools ARM From: Sanjay Patel [mailto:spatel at rotateright.com] Sent: Sunday, January 22, 2017 10:45 PM To: Evgeny Astigeevich Cc: llvm-dev; nd Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines I tried an experiment to remove the integer min/max bailouts from InstCombine, and it doesn't appear to change the IR in the attachment, so I doubt there's going to be any improvement. If I haven't messed up this example, this is amazing: https://godbolt.org/g/yzoxeY On Sun, Jan 22, 2017 at 1:06 PM, Evgeny Astigeevich <Evgeny.Astigeevich at arm.com<mailto:Evgeny.Astigeevich at arm.com>> wrote: Thank you for information. I’ll build clang without the hack and re-run the benchmark tomorrow. -Evgeny From: Sanjay Patel [mailto:spatel at rotateright.com<mailto:spatel at rotateright.com>] Sent: Sunday, January 22, 2017 8:00 PM To: Evgeny Astigeevich Cc: llvm-dev; nd Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines> Do you mean to remove the hack in InstCombiner::visitICmpInst()?Yes. Although (this just came up in D28625 too) we might need to remove multiple versions of that in order to unlock optimization: https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4338 https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L470 https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstructionCombining.cpp#L803 https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp#L409 Similar for FP: https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4780 https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L1376 On Sun, Jan 22, 2017 at 12:40 PM, Evgeny Astigeevich <Evgeny.Astigeevich at arm.com<mailto:Evgeny.Astigeevich at arm.com>> wrote: Hi Sanjay, The benchmark source file: http://www.llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/Benchmarks/Shootout/sieve.c?view=markup Clang options used to produce the initial IR: clang -DNDEBUG -O3 -DNDEBUG -mcpu=cortex-a53 -fomit-frame-pointer -O3 -DNDEBUG -w -Werror=date-time -c sieve.c -S -emit-llvm -mllvm -disable-llvm-optzns --target=aarch64-arm-linux Opt options: opt -O3 -o /dev/null -print-before-all -print-after-all sieve.ll >& sieve.log I used the IR (in attached sieve.zip) created with the r292487 version. The attached sieve contains the output of ‘-print-before-all -print-after-all’ for r292487 and rL292492.> If it's possible, can you remove that check locally, rebuild, > and try the benchmark again on your system? I'd love to know > if that change alone would solve the problem.Do you mean to remove the hack in InstCombiner::visitICmpInst()? Kind regards, Evgeny Astigeevich Senior Compiler Engineer Compilation Tools ARM From: Sanjay Patel [mailto:spatel at rotateright.com<mailto:spatel at rotateright.com>] Sent: Friday, January 20, 2017 6:16 PM To: Evgeny Astigeevich Cc: llvm-dev; Renato Golin; t.p.northover at gmail.com<mailto:t.p.northover at gmail.com>; hfinkel at anl.gov<mailto:hfinkel at anl.gov> Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines Thanks for letting me know about this problem! There's no 'shl nsw' visible in the earlier (r292487) code, so it would be better to see exactly what the IR looks like before that added transform fires. But I see a red flag: %smax = select i1 %11, i64 %10, i64 8193 The new icmp transform allowed us to create an smax, but we have this hack in InstCombiner::visitICmpInst(): // Test if the ICmpInst instruction is used exclusively by a select as // part of a minimum or maximum operation. If so, refrain from doing // any other folding. This helps out other analyses which understand // non-obfuscated minimum and maximum idioms, such as ScalarEvolution // and CodeGen. And in this case, at least one of the comparison // operands has at least one user besides the compare (the select), // which would often largely negate the benefit of folding anyway. ...so that prevented folding the icmp into the earlier math. I am actively working on trying to get rid of that bail-out by improving min/max value tracking and icmp/select folding. In fact, we might be able to remove it right now, but I don't know the history of that code or what cases it was supposed to help. If it's possible, can you remove that check locally, rebuild, and try the benchmark again on your system? I'd love to know if that change alone would solve the problem. On Fri, Jan 20, 2017 at 10:11 AM, Evgeny Astigeevich <Evgeny.Astigeevich at arm.com<mailto:Evgeny.Astigeevich at arm.com>> wrote: Hi, We found that today's 17.30%/11.37% performance regressions in LNT SingleSource/Benchmarks/Shootout/sieve on LNT-AArch64-A53-O3__clang_DEV__aarch64 and LNT-Thumb2v7-A15-O3__clang_DEV__thumbv7 (http://llvm.org/perf/db_default/v4/nts/daily_report/2017/1/20?filter-machine-regex=aarch64%7Carm%7Cthumb%7Cgreen) are caused by changes [rL292492] in InstCombine: https://reviews.llvm.org/D28406 "[InstCombine] icmp sgt (shl nsw X, C1), C0 --> icmp sgt X, C0 >> C1" The Loop Vectorizer generates code with more instructions: ==== Loop Vectorizer from rL292492 ===for.body5: ; preds = %for.inc16.for.body5_crit_edge, %for.cond.preheader %indvar = phi i64 [ %indvar.next, %for.inc16.for.body5_crit_edge ], [ 0, %for.cond.preheader ] %1 = phi i8 [ %.pre, %for.inc16.for.body5_crit_edge ], [ 1, %for.cond.preheader ] %count.122 = phi i32 [ %count.2, %for.inc16.for.body5_crit_edge ], [ 0, %for.cond.preheader ] %i.119 = phi i64 [ %inc17, %for.inc16.for.body5_crit_edge ], [ 2, %for.cond.preheader ] %2 = add i64 %indvar, 2 %3 = shl i64 %indvar, 1 %4 = add i64 %3, 4 %5 = add i64 %indvar, 2 %6 = shl i64 %indvar, 1 %7 = add i64 %6, 4 %8 = add i64 %indvar, 2 %9 = mul i64 %indvar, 3 %10 = add i64 %9, 6 %11 = icmp sgt i64 %10, 8193 %smax = select i1 %11, i64 %10, i64 8193 %12 = mul i64 %indvar, -2 %13 = add i64 %12, -5 %14 = add i64 %smax, %13 %15 = add i64 %indvar, 2 %16 = udiv i64 %14, %15 %17 = add i64 %16, 1 %tobool7 = icmp eq i8 %1, 0 br i1 %tobool7, label %for.inc16, label %if.then =============================== The code generated by the Loop Vectorizer before the changes: ==== Loop Vectorizer from rL292487 ===for.body5: ; preds = %for.inc16.for.body5_crit_edge, %for.cond.preheader %indvar = phi i64 [ %indvar.next, %for.inc16.for.body5_crit_edge ], [ 0, %for.cond.preheader ] %1 = phi i8 [ %.pre, %for.inc16.for.body5_crit_edge ], [ 1, %for.cond.preheader ] %count.122 = phi i32 [ %count.2, %for.inc16.for.body5_crit_edge ], [ 0, %for.cond.preheader ] %i.119 = phi i64 [ %inc17, %for.inc16.for.body5_crit_edge ], [ 2, %for.cond.preheader ] %2 = add i64 %indvar, 2 %3 = shl i64 %indvar, 1 %4 = add i64 %3, 4 %5 = add i64 %indvar, 2 %6 = shl i64 %indvar, 1 %7 = add i64 %6, 4 %8 = add i64 %indvar, 2 %9 = mul i64 %indvar, -2 %10 = add i64 %9, 8188 %11 = add i64 %indvar, 2 %12 = udiv i64 %10, %11 %13 = add i64 %12, 1 %tobool7 = icmp eq i8 %1, 0 br i1 %tobool7, label %for.inc16, label %if.then =============================== I have not investigated yet why the behaviour of the Vectorizer is changed. Kind regards, Evgeny Astigeevich Senior Compiler Engineer Compilation Tools ARM -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170123/db564b43/attachment-0001.html>
Sanjay Patel via llvm-dev
2017-Jan-23 23:48 UTC
[llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
All targets are likely affected in some way by the icmp+shl fold introduced with r292492. It's a basic pattern that occurs in lots of code. Did you see any perf wins on your targets with this commit? Sadly, it is also likely that many (all?) targets are negatively impacted on the particular test (SingleSource/Benchmarks/Shootout/sieve) that you have pointed out here because the IR is now decidedly worse. IMO, we should not revert the commit because it exposed shortcomings in the optimizer. It's an "obvious" fold/canonicalization, and the related 'nuw' variant of this fold has existed in trunk since: https://reviews.llvm.org/rL285729 We need to dissect what analysis/folds are missing to restore the IR to the better form that existed before, but this is probably going to be a long process because we treat min/max like an optimization fence. On Mon, Jan 23, 2017 at 11:13 AM, Evgeny Astigeevich < Evgeny.Astigeevich at arm.com> wrote:> Confirm there is no change in IR if the hack is disabled in the sources. > > David wrote that these instructions are created by SCEV. > > Are other targets affected by the changes, e.g. X86? > > > > Kind regards, > Evgeny Astigeevich > Senior Compiler Engineer > Compilation Tools > ARM > > > > *From:* Sanjay Patel [mailto:spatel at rotateright.com] > *Sent:* Sunday, January 22, 2017 10:45 PM > > *To:* Evgeny Astigeevich > *Cc:* llvm-dev; nd > *Subject:* Re: [InstCombine] rL292492 affected LoopVectorizer and caused > 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines > > > > I tried an experiment to remove the integer min/max bailouts from > InstCombine, and it doesn't appear to change the IR in the attachment, so I > doubt there's going to be any improvement. > > If I haven't messed up this example, this is amazing: > https://godbolt.org/g/yzoxeY > > > > On Sun, Jan 22, 2017 at 1:06 PM, Evgeny Astigeevich < > Evgeny.Astigeevich at arm.com> wrote: > > Thank you for information. > > I’ll build clang without the hack and re-run the benchmark tomorrow. > > > > -Evgeny > > > > *From:* Sanjay Patel [mailto:spatel at rotateright.com] > *Sent:* Sunday, January 22, 2017 8:00 PM > *To:* Evgeny Astigeevich > *Cc:* llvm-dev; nd > > > *Subject:* Re: [InstCombine] rL292492 affected LoopVectorizer and caused > 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines > > > > > Do you mean to remove the hack in InstCombiner::visitICmpInst()? > > > > Yes. Although (this just came up in D28625 too) we might need to remove > multiple versions of that in order to unlock optimization: > > https://github.com/llvm-mirror/llvm/blob/master/lib/ > Transforms/InstCombine/InstCombineCompares.cpp#L4338 > > https://github.com/llvm-mirror/llvm/blob/master/lib/ > Transforms/InstCombine/InstCombineCasts.cpp#L470 > > https://github.com/llvm-mirror/llvm/blob/master/lib/ > Transforms/InstCombine/InstructionCombining.cpp#L803 > > https://github.com/llvm-mirror/llvm/blob/master/lib/ > Transforms/InstCombine/InstCombineSimplifyDemanded.cpp#L409 > > > Similar for FP: > https://github.com/llvm-mirror/llvm/blob/master/lib/ > Transforms/InstCombine/InstCombineCompares.cpp#L4780 > https://github.com/llvm-mirror/llvm/blob/master/lib/ > Transforms/InstCombine/InstCombineCasts.cpp#L1376 > > > > On Sun, Jan 22, 2017 at 12:40 PM, Evgeny Astigeevich < > Evgeny.Astigeevich at arm.com> wrote: > > Hi Sanjay, > > > > The benchmark source file: http://www.llvm.org/viewvc/ > llvm-project/test-suite/trunk/SingleSource/Benchmarks/ > Shootout/sieve.c?view=markup > > Clang options used to produce the initial IR: clang -DNDEBUG -O3 -DNDEBUG > -mcpu=cortex-a53 -fomit-frame-pointer -O3 -DNDEBUG -w -Werror=date-time > -c sieve.c -S -emit-llvm -mllvm -disable-llvm-optzns > --target=aarch64-arm-linux > > Opt options: opt -O3 -o /dev/null -print-before-all -print-after-all > sieve.ll >& sieve.log > > > > I used the IR (in attached sieve.zip) created with the r292487 version. > > The attached sieve contains the output of ‘-print-before-all > -print-after-all’ for r292487 and rL292492. > > > > > If it's possible, can you remove that check locally, rebuild, > > > and try the benchmark again on your system? I'd love to know > > > if that change alone would solve the problem. > > > > Do you mean to remove the hack in InstCombiner::visitICmpInst()? > > > > Kind regards, > Evgeny Astigeevich > Senior Compiler Engineer > Compilation Tools > ARM > > > > *From:* Sanjay Patel [mailto:spatel at rotateright.com] > *Sent:* Friday, January 20, 2017 6:16 PM > *To:* Evgeny Astigeevich > *Cc:* llvm-dev; Renato Golin; t.p.northover at gmail.com; hfinkel at anl.gov > *Subject:* Re: [InstCombine] rL292492 affected LoopVectorizer and caused > 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines > > > > Thanks for letting me know about this problem! > > There's no 'shl nsw' visible in the earlier (r292487) code, so it would be > better to see exactly what the IR looks like before that added transform > fires. > > > But I see a red flag: > %smax = select i1 %11, i64 %10, i64 8193 > > The new icmp transform allowed us to create an smax, but we have this hack > in InstCombiner::visitICmpInst(): > > // Test if the ICmpInst instruction is used exclusively by a select as > // part of a minimum or maximum operation. If so, refrain from doing > // any other folding. This helps out other analyses which understand > // non-obfuscated minimum and maximum idioms, such as ScalarEvolution > // and CodeGen. And in this case, at least one of the comparison > // operands has at least one user besides the compare (the select), > // which would often largely negate the benefit of folding anyway. > > ...so that prevented folding the icmp into the earlier math. > > I am actively working on trying to get rid of that bail-out by improving > min/max value tracking and icmp/select folding. In fact, we might be able > to remove it right now, but I don't know the history of that code or what > cases it was supposed to help. > > > > If it's possible, can you remove that check locally, rebuild, and try the > benchmark again on your system? I'd love to know if that change alone would > solve the problem. > > > > > > On Fri, Jan 20, 2017 at 10:11 AM, Evgeny Astigeevich < > Evgeny.Astigeevich at arm.com> wrote: > > Hi, > > We found that today's 17.30%/11.37% performance regressions in LNT > SingleSource/Benchmarks/Shootout/sieve on LNT-AArch64-A53-O3__clang_DEV__aarch64 > and LNT-Thumb2v7-A15-O3__clang_DEV__thumbv7 (http://llvm.org/perf/db_ > default/v4/nts/daily_report/2017/1/20?filter-machine- > regex=aarch64%7Carm%7Cthumb%7Cgreen) are caused by changes [rL292492] in > InstCombine: > > https://reviews.llvm.org/D28406 "[InstCombine] icmp sgt (shl nsw X, C1), > C0 --> icmp sgt X, C0 >> C1" > > The Loop Vectorizer generates code with more instructions: > > ==== Loop Vectorizer from rL292492 ===> for.body5: ; preds > %for.inc16.for.body5_crit_edge, %for.cond.preheader > %indvar = phi i64 [ %indvar.next, %for.inc16.for.body5_crit_edge ], [ 0, > %for.cond.preheader ] > %1 = phi i8 [ %.pre, %for.inc16.for.body5_crit_edge ], [ 1, > %for.cond.preheader ] > %count.122 = phi i32 [ %count.2, %for.inc16.for.body5_crit_edge ], [ 0, > %for.cond.preheader ] > %i.119 = phi i64 [ %inc17, %for.inc16.for.body5_crit_edge ], [ 2, > %for.cond.preheader ] > %2 = add i64 %indvar, 2 > %3 = shl i64 %indvar, 1 > %4 = add i64 %3, 4 > %5 = add i64 %indvar, 2 > %6 = shl i64 %indvar, 1 > %7 = add i64 %6, 4 > %8 = add i64 %indvar, 2 > %9 = mul i64 %indvar, 3 > %10 = add i64 %9, 6 > %11 = icmp sgt i64 %10, 8193 > %smax = select i1 %11, i64 %10, i64 8193 > %12 = mul i64 %indvar, -2 > %13 = add i64 %12, -5 > %14 = add i64 %smax, %13 > %15 = add i64 %indvar, 2 > %16 = udiv i64 %14, %15 > %17 = add i64 %16, 1 > %tobool7 = icmp eq i8 %1, 0 > br i1 %tobool7, label %for.inc16, label %if.then > ===============================> > The code generated by the Loop Vectorizer before the changes: > > ==== Loop Vectorizer from rL292487 ===> for.body5: ; preds > %for.inc16.for.body5_crit_edge, %for.cond.preheader > %indvar = phi i64 [ %indvar.next, %for.inc16.for.body5_crit_edge ], [ 0, > %for.cond.preheader ] > %1 = phi i8 [ %.pre, %for.inc16.for.body5_crit_edge ], [ 1, > %for.cond.preheader ] > %count.122 = phi i32 [ %count.2, %for.inc16.for.body5_crit_edge ], [ 0, > %for.cond.preheader ] > %i.119 = phi i64 [ %inc17, %for.inc16.for.body5_crit_edge ], [ 2, > %for.cond.preheader ] > %2 = add i64 %indvar, 2 > %3 = shl i64 %indvar, 1 > %4 = add i64 %3, 4 > %5 = add i64 %indvar, 2 > %6 = shl i64 %indvar, 1 > %7 = add i64 %6, 4 > %8 = add i64 %indvar, 2 > %9 = mul i64 %indvar, -2 > %10 = add i64 %9, 8188 > %11 = add i64 %indvar, 2 > %12 = udiv i64 %10, %11 > %13 = add i64 %12, 1 > %tobool7 = icmp eq i8 %1, 0 > br i1 %tobool7, label %for.inc16, label %if.then > ===============================> > I have not investigated yet why the behaviour of the Vectorizer is changed. > > Kind regards, > Evgeny Astigeevich > Senior Compiler Engineer > Compilation Tools > ARM > > > > > > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170123/e5a06486/attachment.html>
Mehdi Amini via llvm-dev
2017-Jan-24 05:53 UTC
[llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
> On Jan 23, 2017, at 3:48 PM, Sanjay Patel via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > All targets are likely affected in some way by the icmp+shl fold introduced with r292492. It's a basic pattern that occurs in lots of code. Did you see any perf wins on your targets with this commit? > > Sadly, it is also likely that many (all?) targets are negatively impacted on the particular test (SingleSource/Benchmarks/Shootout/sieve) that you have pointed out here because the IR is now decidedly worse. > > IMO, we should not revert the commit because it exposed shortcomings in the optimizer. It's an "obvious" fold/canonicalization, and the related 'nuw' variant of this fold has existed in trunk since: > https://reviews.llvm.org/rL285729 <https://reviews.llvm.org/rL285729> > > We need to dissect what analysis/folds are missing to restore the IR to the better form that existed before, but this is probably going to be a long process because we treat min/max like an optimization fence.If this is gonna be a long process to recover, this looks like something to be reverted in the 4.0 branch (unless I missed that there is a correctness fix involved?). — Mehdi> > > > On Mon, Jan 23, 2017 at 11:13 AM, Evgeny Astigeevich <Evgeny.Astigeevich at arm.com <mailto:Evgeny.Astigeevich at arm.com>> wrote: > Confirm there is no change in IR if the hack is disabled in the sources. > > David wrote that these instructions are created by SCEV. > > Are other targets affected by the changes, e.g. X86? > > > > Kind regards, > Evgeny Astigeevich > Senior Compiler Engineer > Compilation Tools > ARM > > > > From: Sanjay Patel [mailto:spatel at rotateright.com <mailto:spatel at rotateright.com>] > Sent: Sunday, January 22, 2017 10:45 PM > > > To: Evgeny Astigeevich > Cc: llvm-dev; nd > Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines > > > > I tried an experiment to remove the integer min/max bailouts from InstCombine, and it doesn't appear to change the IR in the attachment, so I doubt there's going to be any improvement. > > If I haven't messed up this example, this is amazing: > https://godbolt.org/g/yzoxeY <https://godbolt.org/g/yzoxeY> > > > On Sun, Jan 22, 2017 at 1:06 PM, Evgeny Astigeevich <Evgeny.Astigeevich at arm.com <mailto:Evgeny.Astigeevich at arm.com>> wrote: > > Thank you for information. > > I’ll build clang without the hack and re-run the benchmark tomorrow. > > > > -Evgeny > > > > From: Sanjay Patel [mailto:spatel at rotateright.com <mailto:spatel at rotateright.com>] > Sent: Sunday, January 22, 2017 8:00 PM > To: Evgeny Astigeevich > Cc: llvm-dev; nd > > > Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines > > > > > Do you mean to remove the hack in InstCombiner::visitICmpInst()? > > > > Yes. Although (this just came up in D28625 too) we might need to remove multiple versions of that in order to unlock optimization: > > https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4338 <https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4338> > https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L470 <https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L470> > https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstructionCombining.cpp#L803 <https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstructionCombining.cpp#L803> > https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp#L409 <https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp#L409> > > Similar for FP: > https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4780 <https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4780> > https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L1376 <https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L1376> > > > On Sun, Jan 22, 2017 at 12:40 PM, Evgeny Astigeevich <Evgeny.Astigeevich at arm.com <mailto:Evgeny.Astigeevich at arm.com>> wrote: > > Hi Sanjay, > > > > The benchmark source file: http://www.llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/Benchmarks/Shootout/sieve.c?view=markup <http://www.llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/Benchmarks/Shootout/sieve.c?view=markup> > Clang options used to produce the initial IR: clang -DNDEBUG -O3 -DNDEBUG -mcpu=cortex-a53 -fomit-frame-pointer -O3 -DNDEBUG -w -Werror=date-time -c sieve.c -S -emit-llvm -mllvm -disable-llvm-optzns --target=aarch64-arm-linux > > Opt options: opt -O3 -o /dev/null -print-before-all -print-after-all sieve.ll >& sieve.log > > > > I used the IR (in attached sieve.zip) created with the r292487 version. > > The attached sieve contains the output of ‘-print-before-all -print-after-all’ for r292487 and rL292492. > > > > > If it's possible, can you remove that check locally, rebuild, > > > and try the benchmark again on your system? I'd love to know > > > if that change alone would solve the problem. > > > > Do you mean to remove the hack in InstCombiner::visitICmpInst()? > > > > Kind regards, > Evgeny Astigeevich > Senior Compiler Engineer > Compilation Tools > ARM > > > > From: Sanjay Patel [mailto:spatel at rotateright.com <mailto:spatel at rotateright.com>] > Sent: Friday, January 20, 2017 6:16 PM > To: Evgeny Astigeevich > Cc: llvm-dev; Renato Golin; t.p.northover at gmail.com <mailto:t.p.northover at gmail.com>; hfinkel at anl.gov <mailto:hfinkel at anl.gov> > Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines > > > > Thanks for letting me know about this problem! > > There's no 'shl nsw' visible in the earlier (r292487) code, so it would be better to see exactly what the IR looks like before that added transform fires. > > > But I see a red flag: > %smax = select i1 %11, i64 %10, i64 8193 > > The new icmp transform allowed us to create an smax, but we have this hack in InstCombiner::visitICmpInst(): > > // Test if the ICmpInst instruction is used exclusively by a select as > // part of a minimum or maximum operation. If so, refrain from doing > // any other folding. This helps out other analyses which understand > // non-obfuscated minimum and maximum idioms, such as ScalarEvolution > // and CodeGen. And in this case, at least one of the comparison > // operands has at least one user besides the compare (the select), > // which would often largely negate the benefit of folding anyway. > > ...so that prevented folding the icmp into the earlier math. > > I am actively working on trying to get rid of that bail-out by improving min/max value tracking and icmp/select folding. In fact, we might be able to remove it right now, but I don't know the history of that code or what cases it was supposed to help. > > > > If it's possible, can you remove that check locally, rebuild, and try the benchmark again on your system? I'd love to know if that change alone would solve the problem. > > > > > > On Fri, Jan 20, 2017 at 10:11 AM, Evgeny Astigeevich <Evgeny.Astigeevich at arm.com <mailto:Evgeny.Astigeevich at arm.com>> wrote: > > Hi, > > We found that today's 17.30%/11.37% performance regressions in LNT SingleSource/Benchmarks/Shootout/sieve on LNT-AArch64-A53-O3__clang_DEV__aarch64 and LNT-Thumb2v7-A15-O3__clang_DEV__thumbv7 (http://llvm.org/perf/db_default/v4/nts/daily_report/2017/1/20?filter-machine-regex=aarch64%7Carm%7Cthumb%7Cgreen <http://llvm.org/perf/db_default/v4/nts/daily_report/2017/1/20?filter-machine-regex=aarch64%7Carm%7Cthumb%7Cgreen>) are caused by changes [rL292492] in InstCombine: > > https://reviews.llvm.org/D28406 <https://reviews.llvm.org/D28406> "[InstCombine] icmp sgt (shl nsw X, C1), C0 --> icmp sgt X, C0 >> C1" > > The Loop Vectorizer generates code with more instructions: > > ==== Loop Vectorizer from rL292492 ===> for.body5: ; preds = %for.inc16.for.body5_crit_edge, %for.cond.preheader > %indvar = phi i64 [ %indvar.next, %for.inc16.for.body5_crit_edge ], [ 0, %for.cond.preheader ] > %1 = phi i8 [ %.pre, %for.inc16.for.body5_crit_edge ], [ 1, %for.cond.preheader ] > %count.122 = phi i32 [ %count.2, %for.inc16.for.body5_crit_edge ], [ 0, %for.cond.preheader ] > %i.119 = phi i64 [ %inc17, %for.inc16.for.body5_crit_edge ], [ 2, %for.cond.preheader ] > %2 = add i64 %indvar, 2 > %3 = shl i64 %indvar, 1 > %4 = add i64 %3, 4 > %5 = add i64 %indvar, 2 > %6 = shl i64 %indvar, 1 > %7 = add i64 %6, 4 > %8 = add i64 %indvar, 2 > %9 = mul i64 %indvar, 3 > %10 = add i64 %9, 6 > %11 = icmp sgt i64 %10, 8193 > %smax = select i1 %11, i64 %10, i64 8193 > %12 = mul i64 %indvar, -2 > %13 = add i64 %12, -5 > %14 = add i64 %smax, %13 > %15 = add i64 %indvar, 2 > %16 = udiv i64 %14, %15 > %17 = add i64 %16, 1 > %tobool7 = icmp eq i8 %1, 0 > br i1 %tobool7, label %for.inc16, label %if.then > ===============================> > The code generated by the Loop Vectorizer before the changes: > > ==== Loop Vectorizer from rL292487 ===> for.body5: ; preds = %for.inc16.for.body5_crit_edge, %for.cond.preheader > %indvar = phi i64 [ %indvar.next, %for.inc16.for.body5_crit_edge ], [ 0, %for.cond.preheader ] > %1 = phi i8 [ %.pre, %for.inc16.for.body5_crit_edge ], [ 1, %for.cond.preheader ] > %count.122 = phi i32 [ %count.2, %for.inc16.for.body5_crit_edge ], [ 0, %for.cond.preheader ] > %i.119 = phi i64 [ %inc17, %for.inc16.for.body5_crit_edge ], [ 2, %for.cond.preheader ] > %2 = add i64 %indvar, 2 > %3 = shl i64 %indvar, 1 > %4 = add i64 %3, 4 > %5 = add i64 %indvar, 2 > %6 = shl i64 %indvar, 1 > %7 = add i64 %6, 4 > %8 = add i64 %indvar, 2 > %9 = mul i64 %indvar, -2 > %10 = add i64 %9, 8188 > %11 = add i64 %indvar, 2 > %12 = udiv i64 %10, %11 > %13 = add i64 %12, 1 > %tobool7 = icmp eq i8 %1, 0 > br i1 %tobool7, label %for.inc16, label %if.then > ===============================> > I have not investigated yet why the behaviour of the Vectorizer is changed. > > Kind regards, > Evgeny Astigeevich > Senior Compiler Engineer > Compilation Tools > ARM > > > > > > > > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170123/556d49f2/attachment.html>
Maybe Matching Threads
- [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
- [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
- [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
- [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
- [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines