Michael Kuperstein via llvm-dev
2017-Jan-24 21:46 UTC
[llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
On Tue, Jan 24, 2017 at 1:20 PM, Sanjay Patel via llvm-dev < llvm-dev at lists.llvm.org> wrote:> > I started looking at the log files that you attached, and I'm confused. > The code that is supposedly causing the perf regression is created by the > loop vectorizer, right? Except the bad code is not in the "vector.body", so > is there something peculiar about this benchmark that the hot loop is not > the vector loop? But there's another mystery: there are no vector ops in > the "vector.body"! > >I haven't looked at this particular example, but this isn't very rare. 1) The vectorizer actually doubles as an unroller, under some circumstances. So it's conceivable to have a vector.body without vector instructions. 2) From what was pasted above, the "bad" code is in the runtime checks that LV generates. The general problem is that if we don't know what the loop count of a loop is going to be, we assume it's high, so the cost of the runtime checks is negligible. This fails if we have a loop with a low (but statically unknown) trip-count nested inside a hot loop with a high trip-count. In this case, the overhead from the runtime checks, that end up being inside the outer loop, may cost more than what we gain from vectorization. To try to avoid this, we have a threshold that prevents us from generating too costly runtime checks, but I guess this particular check is not considered complex enough. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170124/959d5a70/attachment.html>
Michael Kuperstein via llvm-dev
2017-Jan-24 21:50 UTC
[llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
Also, the difference between X86 and AArch64 is that on X86, the vectorizer decides not to unroll. On Tue, Jan 24, 2017 at 1:46 PM, Michael Kuperstein <mkuper at google.com> wrote:> > On Tue, Jan 24, 2017 at 1:20 PM, Sanjay Patel via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > >> >> I started looking at the log files that you attached, and I'm confused. >> The code that is supposedly causing the perf regression is created by the >> loop vectorizer, right? Except the bad code is not in the "vector.body", so >> is there something peculiar about this benchmark that the hot loop is not >> the vector loop? But there's another mystery: there are no vector ops in >> the "vector.body"! >> >> > I haven't looked at this particular example, but this isn't very rare. > > 1) The vectorizer actually doubles as an unroller, under some > circumstances. So it's conceivable to have a vector.body without vector > instructions. > > 2) From what was pasted above, the "bad" code is in the runtime checks > that LV generates. > The general problem is that if we don't know what the loop count of a loop > is going to be, we assume it's high, so the cost of the runtime checks is > negligible. This fails if we have a loop with a low (but statically > unknown) trip-count nested inside a hot loop with a high trip-count. In > this case, the overhead from the runtime checks, that end up being inside > the outer loop, may cost more than what we gain from vectorization. > To try to avoid this, we have a threshold that prevents us from generating > too costly runtime checks, but I guess this particular check is not > considered complex enough. > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170124/0c6e7d9c/attachment.html>
Evgeny Astigeevich via llvm-dev
2017-Jan-25 17:53 UTC
[llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
Hi Michael, Thank you for information. You’ve gave me ideas what to look at. It looks like SCEV might reconstruct things InstCombine have optimized out to create runtime checks. Thanks, Evgeny From: Michael Kuperstein [mailto:mkuper at google.com] Sent: Tuesday, January 24, 2017 9:46 PM To: Sanjay Patel Cc: Evgeny Astigeevich; llvm-dev; nd Subject: Re: [llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines On Tue, Jan 24, 2017 at 1:20 PM, Sanjay Patel via llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote: I started looking at the log files that you attached, and I'm confused. The code that is supposedly causing the perf regression is created by the loop vectorizer, right? Except the bad code is not in the "vector.body", so is there something peculiar about this benchmark that the hot loop is not the vector loop? But there's another mystery: there are no vector ops in the "vector.body"! I haven't looked at this particular example, but this isn't very rare. 1) The vectorizer actually doubles as an unroller, under some circumstances. So it's conceivable to have a vector.body without vector instructions. 2) From what was pasted above, the "bad" code is in the runtime checks that LV generates. The general problem is that if we don't know what the loop count of a loop is going to be, we assume it's high, so the cost of the runtime checks is negligible. This fails if we have a loop with a low (but statically unknown) trip-count nested inside a hot loop with a high trip-count. In this case, the overhead from the runtime checks, that end up being inside the outer loop, may cost more than what we gain from vectorization. To try to avoid this, we have a threshold that prevents us from generating too costly runtime checks, but I guess this particular check is not considered complex enough. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170125/27c83277/attachment.html>
Evgeny Astigeevich via llvm-dev
2017-Feb-01 18:23 UTC
[llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
Hi, It looks I have found what causes the insertion of the instructions. *** IR in rL292487 before Loop Vectorize ***: if.then: ; preds = %for.body5 %add = shl nsw i64 %i.119, 1 %cmp917 = icmp slt i64 %add, 8193 br i1 %cmp917, label %for.body10.preheader, label %for.end14 for.body10.preheader: ; preds = %if.then br label %for.body10 for.body10: ; preds = %for.body10.preheader, %for.body10 %k.018 = phi i64 [ %add13, %for.body10 ], [ %add, %for.body10.preheader ] %arrayidx11 = getelementptr inbounds [8193 x i8], [8193 x i8]* @main.flags, i64 0, i64 %k.018 store i8 0, i8* %arrayidx11, align 1, !tbaa !5 %add13 = add nuw nsw i64 %k.018, %i.119 %cmp9 = icmp slt i64 %add13, 8193 br i1 %cmp9, label %for.body10, label %for.end14.loopexit for.end14.loopexit: ; preds = %for.body10 br label %for.end14 for.end14: ; preds = %for.end14.loopexit, %if.then %inc15 = add nsw i32 %count.122, 1 br label %for.inc16 *** IR in rL292492 before Loop Vectorize ***: if.then: ; preds = %for.body5 %cmp917 = icmp slt i64 %i.119, 4097 br i1 %cmp917, label %for.body10.preheader, label %for.end14 for.body10.preheader: ; preds = %if.then %add = shl nsw i64 %i.119, 1 br label %for.body10 for.body10: ; preds = %for.body10.preheader, %for.body10 %k.018 = phi i64 [ %add13, %for.body10 ], [ %add, %for.body10.preheader ] %arrayidx11 = getelementptr inbounds [8193 x i8], [8193 x i8]* @main.flags, i64 0, i64 %k.018 store i8 0, i8* %arrayidx11, align 1, !tbaa !5 %add13 = add nuw nsw i64 %k.018, %i.119 %cmp9 = icmp slt i64 %add13, 8193 br i1 %cmp9, label %for.body10, label %for.end14.loopexit for.end14.loopexit: ; preds = %for.body10 br label %for.end14 for.end14: ; preds = %for.end14.loopexit, %if.then %inc15 = add nsw i32 %count.122, 1 br label %for.inc16 **************************************** ‘%cmp917 = icmp slt i64 %add, 8193’ is changed to ‘%cmp917 = icmp slt i64 %i.119, 4097’ in ‘if.then’ basic block. In ScalarEvolution.cpp there is the following code which checks if a loop is guarded by a basic block (in our case the ‘if.then’ basic block): {code} ScalarEvolution::ExitLimit ScalarEvolution::howManyLessThans(const SCEV *LHS, const SCEV *RHS, const Loop *L, bool IsSigned, bool ControlsExit, bool AllowPredicates) { … // If the loop entry is guarded by the result of the backedge test of the // first loop iteration, then we know the backedge will be taken at least // once and so the backedge taken count is as above. If not then we use the // expression (max(End,Start)-Start)/Stride to describe the backedge count, // as if the backedge is taken at least once max(End,Start) is End and so the // result is as above, and if not max(End,Start) is Start so we get a backedge // count of zero. const SCEV *BECount; if (isLoopEntryGuardedByCond(L, Cond, getMinusSCEV(Start, Stride), RHS)) BECount = BECountIfBackedgeTaken; else { End = IsSigned ? getSMaxExpr(RHS, Start) : getUMaxExpr(RHS, Start); BECount = computeBECount(getMinusSCEV(End, Start), Stride, false); } {code} As ICMP instructions in the ‘if.then’ basic block and in the ‘for.body10’ basic block use different constants (the constants are the same 8193 in rL292487) ScalarEvolution thinks the ‘if.then’ basic block is not a guardian. This causes the insertion of SMaxExpr. In fact the execution semantic is not changed. So we need to update ScalarEvolution to support such cases. Kind regards, Evgeny Astigeevich Senior Compiler Engineer Compilation Tools ARM From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Evgeny Astigeevich via llvm-dev Sent: Wednesday, January 25, 2017 5:53 PM To: Michael Kuperstein Cc: llvm-dev; nd Subject: Re: [llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines Hi Michael, Thank you for information. You’ve gave me ideas what to look at. It looks like SCEV might reconstruct things InstCombine have optimized out to create runtime checks. Thanks, Evgeny From: Michael Kuperstein [mailto:mkuper at google.com] Sent: Tuesday, January 24, 2017 9:46 PM To: Sanjay Patel Cc: Evgeny Astigeevich; llvm-dev; nd Subject: Re: [llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines On Tue, Jan 24, 2017 at 1:20 PM, Sanjay Patel via llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote: I started looking at the log files that you attached, and I'm confused. The code that is supposedly causing the perf regression is created by the loop vectorizer, right? Except the bad code is not in the "vector.body", so is there something peculiar about this benchmark that the hot loop is not the vector loop? But there's another mystery: there are no vector ops in the "vector.body"! I haven't looked at this particular example, but this isn't very rare. 1) The vectorizer actually doubles as an unroller, under some circumstances. So it's conceivable to have a vector.body without vector instructions. 2) From what was pasted above, the "bad" code is in the runtime checks that LV generates. The general problem is that if we don't know what the loop count of a loop is going to be, we assume it's high, so the cost of the runtime checks is negligible. This fails if we have a loop with a low (but statically unknown) trip-count nested inside a hot loop with a high trip-count. In this case, the overhead from the runtime checks, that end up being inside the outer loop, may cost more than what we gain from vectorization. To try to avoid this, we have a threshold that prevents us from generating too costly runtime checks, but I guess this particular check is not considered complex enough. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170201/d87878c5/attachment.html>