Mikael Holmén via llvm-dev
2020-May-05 13:25 UTC
[llvm-dev] Missing vectorization of loop due to load late in the loop
Hi, TL;DR: A loop doesn't get vectorized due to the interaction of loop- rotate, licm and instcombine. What to do about it? Full story: In the benchmarks for our out-of-tree target we have a case that we would like to get vectorized, but currently it isn't. I've done some digging to see why and have some kind of idea what prevents it, but I don't know what the best way to fix it would be so I thought I'd share a reduced version of it to see if anyone here have ideas. So, what happens can be reproduced on trunk with opt -O3 -S -o - bbi-39227-reduced.ll The input program consists of two nested loops where the inner loop loads a value and does some calculations and then the outer loop writes the calculated value somewhere. With -debug-only=loop-vectorize I see this printout from the vectorizer for the inner loop: LV: Not vectorizing: Found an unidentified PHI %h.15 = phi i32 [ %h.11, %inner.cond.preheader ], [ %h.1, %inner.body ] When we run the vectorizer the inner loop looks like inner.body: ; preds %inner.cond.preheader, %inner.body %h.15 = phi i32 [ %h.11, %inner.cond.preheader ], [ %h.1, %inner.body ] %h.pn4 = phi i32* [ %h, %inner.cond.preheader ], [ %hp.1, %inner.body ] %j.03 = phi i16 [ 0, %inner.cond.preheader ], [ %j.1, %inner.body ] %real.02 = phi i32 [ 0, %inner.cond.preheader ], [ %sub, %inner.body ] %hp.1 = getelementptr inbounds i32, i32* %h.pn4, i64 1 %0 = shl i32 %h.15, 16 %conv7 = ashr exact i32 %0, 16 %add = sub i32 %real.02, %h.15 %sub = add i32 %add, %conv7 %j.1 = add nuw nsw i16 %j.03, 1 %h.1 = load i32, i32* %hp.1, align 1, !tbaa !4 %cmp3 = icmp ult i16 %j.03, 99 br i1 %cmp3, label %inner.body, label %inner.end, !llvm.loop !8 And the vectorizer bails out since the load is placed "late" in the loop, so RecurrenceDescriptor::isFirstOrderRecurrence returns false. If we just move the load before the definition of %0 in the vectorizer input, then we instead get LV: We can vectorize this loop! Originally the loop had two loads, and then one of them was actually placed early in the loop block. That was done by instcombine, by InstCombiner::FoldPHIArgLoadIntoPHI. The reason this doesn't happen for the load we see in the reduced example is because after loop rotation licm hoists the start value of the PHI, so when instcombine tries to do FoldPHIArgLoadIntoPHI, the start value isn't placed in the direct predecessor block of the PHI, and the folding is aborted. Therefore I tried to squeeze in an additional run of instcombine before the licm run that does that hoisting, and then instcombine does the folding and the load in the inner loop is done early instead of late in the loop. The vectorizer then is happy and accepts to vectorize the loop. I have no idea if inserting another run of instcombine is a good idea though and I see some mixed results in other benchmarks. So, the question is what really to do about this... Should the vectorizer vectorize the loop anyway in this case? Should licm not hoist the initial value of the PHI (but hoisting is in general nice, so...). Should instcombine realize it can do something about this case even if the initial value of the PHI is "far" from the PHI. Something completely different? Any ideas? Thanks, Mikael -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: bbi-39227-reduced.ll URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200505/78af719b/attachment.ksh>
Finkel, Hal J. via llvm-dev
2020-May-06 22:35 UTC
[llvm-dev] Missing vectorization of loop due to load late in the loop
________________________________ From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Mikael Holmén via llvm-dev <llvm-dev at lists.llvm.org> Sent: Tuesday, May 5, 2020 8:25 AM To: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org> Subject: [llvm-dev] Missing vectorization of loop due to load late in the loop Hi, TL;DR: A loop doesn't get vectorized due to the interaction of loop- rotate, licm and instcombine. What to do about it? Full story: In the benchmarks for our out-of-tree target we have a case that we would like to get vectorized, but currently it isn't. I've done some digging to see why and have some kind of idea what prevents it, but I don't know what the best way to fix it would be so I thought I'd share a reduced version of it to see if anyone here have ideas. So, what happens can be reproduced on trunk with opt -O3 -S -o - bbi-39227-reduced.ll The input program consists of two nested loops where the inner loop loads a value and does some calculations and then the outer loop writes the calculated value somewhere. With -debug-only=loop-vectorize I see this printout from the vectorizer for the inner loop: LV: Not vectorizing: Found an unidentified PHI %h.15 = phi i32 [ %h.11, %inner.cond.preheader ], [ %h.1, %inner.body ] When we run the vectorizer the inner loop looks like inner.body: ; preds %inner.cond.preheader, %inner.body %h.15 = phi i32 [ %h.11, %inner.cond.preheader ], [ %h.1, %inner.body ] %h.pn4 = phi i32* [ %h, %inner.cond.preheader ], [ %hp.1, %inner.body ] %j.03 = phi i16 [ 0, %inner.cond.preheader ], [ %j.1, %inner.body ] %real.02 = phi i32 [ 0, %inner.cond.preheader ], [ %sub, %inner.body ] %hp.1 = getelementptr inbounds i32, i32* %h.pn4, i64 1 %0 = shl i32 %h.15, 16 %conv7 = ashr exact i32 %0, 16 %add = sub i32 %real.02, %h.15 %sub = add i32 %add, %conv7 %j.1 = add nuw nsw i16 %j.03, 1 %h.1 = load i32, i32* %hp.1, align 1, !tbaa !4 %cmp3 = icmp ult i16 %j.03, 99 br i1 %cmp3, label %inner.body, label %inner.end, !llvm.loop !8 And the vectorizer bails out since the load is placed "late" in the loop, so RecurrenceDescriptor::isFirstOrderRecurrence returns false. If we just move the load before the definition of %0 in the vectorizer input, then we instead get LV: We can vectorize this loop! That just seems like a bug. Unless I'm missing something, the load depends only on %hp.1, there are no use-def or memory dependencies in between the load and the phi. The semantics are exactly the same between the two placements of the load instruction. The order of independent SSA calculations should not affect whether or not the vectorizer is able to vectorize the loop. -Hal Originally the loop had two loads, and then one of them was actually placed early in the loop block. That was done by instcombine, by InstCombiner::FoldPHIArgLoadIntoPHI. The reason this doesn't happen for the load we see in the reduced example is because after loop rotation licm hoists the start value of the PHI, so when instcombine tries to do FoldPHIArgLoadIntoPHI, the start value isn't placed in the direct predecessor block of the PHI, and the folding is aborted. Therefore I tried to squeeze in an additional run of instcombine before the licm run that does that hoisting, and then instcombine does the folding and the load in the inner loop is done early instead of late in the loop. The vectorizer then is happy and accepts to vectorize the loop. I have no idea if inserting another run of instcombine is a good idea though and I see some mixed results in other benchmarks. So, the question is what really to do about this... Should the vectorizer vectorize the loop anyway in this case? Should licm not hoist the initial value of the PHI (but hoisting is in general nice, so...). Should instcombine realize it can do something about this case even if the initial value of the PHI is "far" from the PHI. Something completely different? Any ideas? Thanks, Mikael -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200506/edc9dfa8/attachment.html>
Mikael Holmén via llvm-dev
2020-May-07 06:03 UTC
[llvm-dev] Missing vectorization of loop due to load late in the loop
Hi, On Wed, 2020-05-06 at 22:35 +0000, Finkel, Hal J. wrote:> > From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of > > Mikael Holmén via llvm-dev <llvm-dev at lists.llvm.org> > > Sent: Tuesday, May 5, 2020 8:25 AM > > To: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org> > > Subject: [llvm-dev] Missing vectorization of loop due to load late > > in the loop > > > > Hi, > > > > TL;DR: A loop doesn't get vectorized due to the interaction of > > loop- > > rotate, licm and instcombine. What to do about it? > > > > > > Full story: > > > > In the benchmarks for our out-of-tree target we have a case that we > > would like to get vectorized, but currently it isn't. I've done > > some > > digging to see why and have some kind of idea what prevents it, but > > I > > don't know what the best way to fix it would be so I thought I'd > > share > > a reduced version of it to see if anyone here have ideas. > > > > So, what happens can be reproduced on trunk with > > opt -O3 -S -o - bbi-39227-reduced.ll > > > > The input program consists of two nested loops where the inner loop > > loads a value and does some calculations and then the outer loop > > writes > > the calculated value somewhere. > > > > With > > -debug-only=loop-vectorize > > > > I see this printout from the vectorizer for the inner loop: > > LV: Not vectorizing: Found an unidentified PHI %h.15 = phi i32 [ > > %h.11, %inner.cond.preheader ], [ %h.1, %inner.body ] > > > > When we run the vectorizer the inner loop looks like > > > > inner.body: ; preds > > %inner.cond.preheader, %inner.body > > %h.15 = phi i32 [ %h.11, %inner.cond.preheader ], [ %h.1, > > %inner.body > > ] > > %h.pn4 = phi i32* [ %h, %inner.cond.preheader ], [ %hp.1, > > %inner.body > > ] > > %j.03 = phi i16 [ 0, %inner.cond.preheader ], [ %j.1, %inner.body > > ] > > %real.02 = phi i32 [ 0, %inner.cond.preheader ], [ %sub, > > %inner.body > > ] > > %hp.1 = getelementptr inbounds i32, i32* %h.pn4, i64 1 > > %0 = shl i32 %h.15, 16 > > %conv7 = ashr exact i32 %0, 16 > > %add = sub i32 %real.02, %h.15 > > %sub = add i32 %add, %conv7 > > %j.1 = add nuw nsw i16 %j.03, 1 > > %h.1 = load i32, i32* %hp.1, align 1, !tbaa !4 > > %cmp3 = icmp ult i16 %j.03, 99 > > br i1 %cmp3, label %inner.body, label %inner.end, !llvm.loop !8 > > > > And the vectorizer bails out since the load is placed "late" in the > > loop, so > > RecurrenceDescriptor::isFirstOrderRecurrence > > returns false. > > > > If we just move the load before the definition of %0 in the > > vectorizer > > input, then we instead get > > LV: We can vectorize this loop! > > That just seems like a bug. Unless I'm missing something, the load > depends only on %hp.1, there are no use-def or memory dependencies in > between the load and the phi. The semantics are exactly the same > between the two placements of the load instruction. The order of > independent SSA calculations should not affect whether or not the > vectorizer is able to vectorize the loop.Great! That's what I thought too, but as I don't know the vectorizer I'm not sure what requirements are reasonable on its input. When the load is placed early, it's this piece of code in LoopVectorizationLegality::canVectorizeInstrs() that saves us: if (RecurrenceDescriptor::isFirstOrderRecurrence(Phi, TheLoop, SinkAfter, DT)) { AllowedExit.insert(Phi); FirstOrderRecurrences.insert(Phi); continue; } but when it's placed late it doesn't and we end up with "Found an unidentified PHI". The vectorizer behavior can be reproduced with opt -loop-vectorize -S -o - bbi-39227-lv.ll I'll write a PR about this then. And if anyone has ideas about how the LoopVectorizationLegality checks can be relaxed to allow this case too, please chime in. Thanks! Mikael> > -Hal > > > > Originally the loop had two loads, and then one of them was > > actually > > placed early in the loop block. That was done by instcombine, by > > InstCombiner::FoldPHIArgLoadIntoPHI. > > > > The reason this doesn't happen for the load we see in the reduced > > example is because after loop rotation licm hoists the start value > > of > > the PHI, so when instcombine tries to do FoldPHIArgLoadIntoPHI, the > > start value isn't placed in the direct predecessor block of the > > PHI, > > and the folding is aborted. > > > > Therefore I tried to squeeze in an additional run of instcombine > > before > > the licm run that does that hoisting, and then instcombine does the > > folding and the load in the inner loop is done early instead of > > late in > > the loop. The vectorizer then is happy and accepts to vectorize the > > loop. I have no idea if inserting another run of instcombine is a > > good > > idea though and I see some mixed results in other benchmarks. > > > > So, the question is what really to do about this... > > > > Should the vectorizer vectorize the loop anyway in this case? > > > > Should licm not hoist the initial value of the PHI (but hoisting is > > in > > general nice, so...). > > > > Should instcombine realize it can do something about this case even > > if > > the initial value of the PHI is "far" from the PHI. > > > > Something completely different? > > > > Any ideas? > > > > Thanks, > > Mikael-------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: bbi-39227-lv.ll URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200507/1f0a0ef6/attachment.ksh>
Reasonably Related Threads
- SCEV related question
- [LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
- [LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
- [LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
- SCEV related question