thr3ads.net - llvm dev - [llvm-dev] Missing vectorization of loop due to load late in the loop [May 2020]

If this information is useful, please help other people find it:
Share via:

Mikael Holmén via llvm-dev

2020-May-05 13:25 UTC

[llvm-dev] Missing vectorization of loop due to load late in the loop

Hi,

TL;DR: A loop doesn't get vectorized due to the interaction of loop-
rotate, licm and instcombine. What to do about it?


Full story:

In the benchmarks for our out-of-tree target we have a case that we
would like to get vectorized, but currently it isn't. I've done some
digging to see why and have some kind of idea what prevents it, but I
don't know what the best way to fix it would be so I thought I'd share
a reduced version of it to see if anyone here have ideas.

So, what happens can be reproduced on trunk with
 opt -O3 -S -o - bbi-39227-reduced.ll

The input program consists of two nested loops where the inner loop
loads a value and does some calculations and then the outer loop writes
the calculated value somewhere.

With
 -debug-only=loop-vectorize

I see this printout from the vectorizer for the inner loop:
 LV: Not vectorizing: Found an unidentified PHI   %h.15 = phi i32 [
%h.11, %inner.cond.preheader ], [ %h.1, %inner.body ]

When we run the vectorizer the inner loop looks like

inner.body:                                       ; preds %inner.cond.preheader,
%inner.body
  %h.15 = phi i32 [ %h.11, %inner.cond.preheader ], [ %h.1, %inner.body
]
  %h.pn4 = phi i32* [ %h, %inner.cond.preheader ], [ %hp.1, %inner.body
]
  %j.03 = phi i16 [ 0, %inner.cond.preheader ], [ %j.1, %inner.body ]
  %real.02 = phi i32 [ 0, %inner.cond.preheader ], [ %sub, %inner.body
]
  %hp.1 = getelementptr inbounds i32, i32* %h.pn4, i64 1
  %0 = shl i32 %h.15, 16
  %conv7 = ashr exact i32 %0, 16
  %add = sub i32 %real.02, %h.15
  %sub = add i32 %add, %conv7
  %j.1 = add nuw nsw i16 %j.03, 1
  %h.1 = load i32, i32* %hp.1, align 1, !tbaa !4
  %cmp3 = icmp ult i16 %j.03, 99
  br i1 %cmp3, label %inner.body, label %inner.end, !llvm.loop !8

And the vectorizer bails out since the load is placed "late" in the
loop, so
 RecurrenceDescriptor::isFirstOrderRecurrence
returns false.

If we just move the load before the definition of %0 in the vectorizer
input, then we instead get
 LV: We can vectorize this loop!

Originally the loop had two loads, and then one of them was actually
placed early in the loop block. That was done by instcombine, by 
 InstCombiner::FoldPHIArgLoadIntoPHI.

The reason this doesn't happen for the load we see in the reduced
example is because after loop rotation licm hoists the start value of
the PHI, so when instcombine tries to do FoldPHIArgLoadIntoPHI, the
start value isn't placed in the direct predecessor block of the PHI,
and the folding is aborted.

Therefore I tried to squeeze in an additional run of instcombine before
the licm run that does that hoisting, and then instcombine does the
folding and the load in the inner loop is done early instead of late in
the loop. The vectorizer then is happy and accepts to vectorize the
loop. I have no idea if inserting another run of instcombine is a good
idea though and I see some mixed results in other benchmarks.

So, the question is what really to do about this...

Should the vectorizer vectorize the loop anyway in this case?

Should licm not hoist the initial value of the PHI (but hoisting is in
general nice, so...).

Should instcombine realize it can do something about this case even if
the initial value of the PHI is "far" from the PHI.

Something completely different?

Any ideas?

Thanks,
Mikael
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: bbi-39227-reduced.ll
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200505/78af719b/attachment.ksh>

Finkel, Hal J. via llvm-dev

2020-May-06 22:35 UTC

head link

[llvm-dev] Missing vectorization of loop due to load late in the loop

________________________________
From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Mikael
Holmén via llvm-dev <llvm-dev at lists.llvm.org>
Sent: Tuesday, May 5, 2020 8:25 AM
To: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
Subject: [llvm-dev] Missing vectorization of loop due to load late in the loop

Hi,

TL;DR: A loop doesn't get vectorized due to the interaction of loop-
rotate, licm and instcombine. What to do about it?

Full story:

In the benchmarks for our out-of-tree target we have a case that we
would like to get vectorized, but currently it isn't. I've done some
digging to see why and have some kind of idea what prevents it, but I
don't know what the best way to fix it would be so I thought I'd share
a reduced version of it to see if anyone here have ideas.

So, what happens can be reproduced on trunk with
 opt -O3 -S -o - bbi-39227-reduced.ll

The input program consists of two nested loops where the inner loop
loads a value and does some calculations and then the outer loop writes
the calculated value somewhere.

With
 -debug-only=loop-vectorize

I see this printout from the vectorizer for the inner loop:
 LV: Not vectorizing: Found an unidentified PHI   %h.15 = phi i32 [
%h.11, %inner.cond.preheader ], [ %h.1, %inner.body ]

When we run the vectorizer the inner loop looks like

inner.body:                                       ; preds %inner.cond.preheader,
%inner.body
  %h.15 = phi i32 [ %h.11, %inner.cond.preheader ], [ %h.1, %inner.body
]
  %h.pn4 = phi i32* [ %h, %inner.cond.preheader ], [ %hp.1, %inner.body
]
  %j.03 = phi i16 [ 0, %inner.cond.preheader ], [ %j.1, %inner.body ]
  %real.02 = phi i32 [ 0, %inner.cond.preheader ], [ %sub, %inner.body
]
  %hp.1 = getelementptr inbounds i32, i32* %h.pn4, i64 1
  %0 = shl i32 %h.15, 16
  %conv7 = ashr exact i32 %0, 16
  %add = sub i32 %real.02, %h.15
  %sub = add i32 %add, %conv7
  %j.1 = add nuw nsw i16 %j.03, 1
  %h.1 = load i32, i32* %hp.1, align 1, !tbaa !4
  %cmp3 = icmp ult i16 %j.03, 99
  br i1 %cmp3, label %inner.body, label %inner.end, !llvm.loop !8

And the vectorizer bails out since the load is placed "late" in the
loop, so
 RecurrenceDescriptor::isFirstOrderRecurrence
returns false.

If we just move the load before the definition of %0 in the vectorizer
input, then we instead get
 LV: We can vectorize this loop!

That just seems like a bug. Unless I'm missing something, the load depends
only on %hp.1, there are no use-def or memory dependencies in between the load
and the phi. The semantics are exactly the same between the two placements of
the load instruction. The order of independent SSA calculations should not
affect whether or not the vectorizer is able to vectorize the loop.

 -Hal

Originally the loop had two loads, and then one of them was actually
placed early in the loop block. That was done by instcombine, by
 InstCombiner::FoldPHIArgLoadIntoPHI.

The reason this doesn't happen for the load we see in the reduced
example is because after loop rotation licm hoists the start value of
the PHI, so when instcombine tries to do FoldPHIArgLoadIntoPHI, the
start value isn't placed in the direct predecessor block of the PHI,
and the folding is aborted.

Therefore I tried to squeeze in an additional run of instcombine before
the licm run that does that hoisting, and then instcombine does the
folding and the load in the inner loop is done early instead of late in
the loop. The vectorizer then is happy and accepts to vectorize the
loop. I have no idea if inserting another run of instcombine is a good
idea though and I see some mixed results in other benchmarks.

So, the question is what really to do about this...

Should the vectorizer vectorize the loop anyway in this case?

Should licm not hoist the initial value of the PHI (but hoisting is in
general nice, so...).

Should instcombine realize it can do something about this case even if
the initial value of the PHI is "far" from the PHI.

Something completely different?

Any ideas?

Thanks,
Mikael
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200506/edc9dfa8/attachment.html>

Mikael Holmén via llvm-dev

2020-May-07 06:03 UTC

head link

[llvm-dev] Missing vectorization of loop due to load late in the loop

Hi,

On Wed, 2020-05-06 at 22:35 +0000, Finkel, Hal J. wrote:> > From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of
> > Mikael Holmén via llvm-dev <llvm-dev at lists.llvm.org>
> > Sent: Tuesday, May 5, 2020 8:25 AM
> > To: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
> > Subject: [llvm-dev] Missing vectorization of loop due to load late
> > in the loop
> >  
> > Hi,
> > 
> > TL;DR: A loop doesn't get vectorized due to the interaction of
> > loop-
> > rotate, licm and instcombine. What to do about it?
> > 
> > 
> > Full story:
> > 
> > In the benchmarks for our out-of-tree target we have a case that we
> > would like to get vectorized, but currently it isn't. I've
done
> > some
> > digging to see why and have some kind of idea what prevents it, but
> > I
> > don't know what the best way to fix it would be so I thought
I'd
> > share
> > a reduced version of it to see if anyone here have ideas.
> > 
> > So, what happens can be reproduced on trunk with
> >  opt -O3 -S -o - bbi-39227-reduced.ll
> > 
> > The input program consists of two nested loops where the inner loop
> > loads a value and does some calculations and then the outer loop
> > writes
> > the calculated value somewhere.
> > 
> > With
> >  -debug-only=loop-vectorize
> > 
> > I see this printout from the vectorizer for the inner loop:
> >  LV: Not vectorizing: Found an unidentified PHI   %h.15 = phi i32 [
> > %h.11, %inner.cond.preheader ], [ %h.1, %inner.body ]
> > 
> > When we run the vectorizer the inner loop looks like
> > 
> > inner.body:                                       ; preds > >
%inner.cond.preheader, %inner.body
> >   %h.15 = phi i32 [ %h.11, %inner.cond.preheader ], [ %h.1,
> > %inner.body
> > ]
> >   %h.pn4 = phi i32* [ %h, %inner.cond.preheader ], [ %hp.1,
> > %inner.body
> > ]
> >   %j.03 = phi i16 [ 0, %inner.cond.preheader ], [ %j.1, %inner.body
> > ]
> >   %real.02 = phi i32 [ 0, %inner.cond.preheader ], [ %sub,
> > %inner.body
> > ]
> >   %hp.1 = getelementptr inbounds i32, i32* %h.pn4, i64 1
> >   %0 = shl i32 %h.15, 16
> >   %conv7 = ashr exact i32 %0, 16
> >   %add = sub i32 %real.02, %h.15
> >   %sub = add i32 %add, %conv7
> >   %j.1 = add nuw nsw i16 %j.03, 1
> >   %h.1 = load i32, i32* %hp.1, align 1, !tbaa !4
> >   %cmp3 = icmp ult i16 %j.03, 99
> >   br i1 %cmp3, label %inner.body, label %inner.end, !llvm.loop !8
> > 
> > And the vectorizer bails out since the load is placed "late"
in the
> > loop, so
> >  RecurrenceDescriptor::isFirstOrderRecurrence
> > returns false.
> > 
> > If we just move the load before the definition of %0 in the
> > vectorizer
> > input, then we instead get
> >  LV: We can vectorize this loop!
> 
> That just seems like a bug. Unless I'm missing something, the load
> depends only on %hp.1, there are no use-def or memory dependencies in
> between the load and the phi. The semantics are exactly the same
> between the two placements of the load instruction. The order of
> independent SSA calculations should not affect whether or not the
> vectorizer is able to vectorize the loop.
Great! That's what I thought too, but as I don't know the vectorizer
I'm not sure what requirements are reasonable on its input.

When the load is placed early, it's this piece of code in
LoopVectorizationLegality::canVectorizeInstrs() that saves us:

    if (RecurrenceDescriptor::isFirstOrderRecurrence(Phi, TheLoop,
                                                     SinkAfter, DT)) {
      AllowedExit.insert(Phi);
      FirstOrderRecurrences.insert(Phi);
      continue;
    }

but when it's placed late it doesn't and we end up with "Found an
unidentified PHI".

The vectorizer behavior can be reproduced with
 opt -loop-vectorize -S -o - bbi-39227-lv.ll

I'll write a PR about this then. And if anyone has ideas about how the
LoopVectorizationLegality checks can be relaxed to allow this case too,
please chime in.

Thanks!
Mikael
> 
>  -Hal
> 
> 
> > Originally the loop had two loads, and then one of them was
> > actually
> > placed early in the loop block. That was done by instcombine, by 
> >  InstCombiner::FoldPHIArgLoadIntoPHI.
> > 
> > The reason this doesn't happen for the load we see in the reduced
> > example is because after loop rotation licm hoists the start value
> > of
> > the PHI, so when instcombine tries to do FoldPHIArgLoadIntoPHI, the
> > start value isn't placed in the direct predecessor block of the
> > PHI,
> > and the folding is aborted.
> > 
> > Therefore I tried to squeeze in an additional run of instcombine
> > before
> > the licm run that does that hoisting, and then instcombine does the
> > folding and the load in the inner loop is done early instead of
> > late in
> > the loop. The vectorizer then is happy and accepts to vectorize the
> > loop. I have no idea if inserting another run of instcombine is a
> > good
> > idea though and I see some mixed results in other benchmarks.
> > 
> > So, the question is what really to do about this...
> > 
> > Should the vectorizer vectorize the loop anyway in this case?
> > 
> > Should licm not hoist the initial value of the PHI (but hoisting is
> > in
> > general nice, so...).
> > 
> > Should instcombine realize it can do something about this case even
> > if
> > the initial value of the PHI is "far" from the PHI.
> > 
> > Something completely different?
> > 
> > Any ideas?
> > 
> > Thanks,
> > Mikael-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: bbi-39227-lv.ll
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200507/1f0a0ef6/attachment.ksh>

Reasonably Related Threads

Search for more possibly parallel threads

llvm dev - May 2020 - Missing vectorization of loop due to load late in the loop

[llvm-dev] Missing vectorization of loop due to load late in the loop

[llvm-dev] Missing vectorization of loop due to load late in the loop

[llvm-dev] Missing vectorization of loop due to load late in the loop

Reasonably Related Threads