thr3ads.net - llvm dev - [llvm-dev] LoopVectorizer: shufflevectors [Sep 2018]

If this information is useful, please help other people find it:
Share via:

Jonas Paulsson via llvm-dev

2018-Sep-04 12:14 UTC

[llvm-dev] LoopVectorizer: shufflevectors

Hi,

I have been discussing a bit with Sanjay on how to handle the poor 
sequences of shufflevector instructions produced by the loop vectorizer 
and he suggested we bring this up on llvm-dev.

I have run into this in the past also and it surprised me to again see 
(on SystemZ) that the vectorized loop did many seemingly unnecessary 
shuffles. In this case (see 
https://bugs.llvm.org/show_bug.cgi?id=38792), one group of interleaved 
loads got shuffled into vector operands, to then be used by an 
interleaved group of stores, which in turn did its shuffling. The loop 
vectorizer did not attempt to combine these shuffles, and unfortunately 
no later pass did so either.

This seems to be an issue which is due to keeping instcombine simple and 
fast, as well as a conservativeness to not produce any new shuffles not 
already in the input program (see comment in 
InstCombiner::visitShuffleVectorInst). For some reason a bit unclear to 
me the backend will get into trouble then.

Should improved optimization of shufflevector instructions handle all of 
them globally, or just the new ones produced by the vectorizers? At 
least in the code produced by the vectorizers, it seems fair to combine 
the shuffles to minimize them. If we want to limit this to just 
auto-vectorized code, then maybe this could be done with some common 
utility called by a vectorizer on its result? If we on the other hand 
want to optimize everything, a new separate IR pass for vector ops could 
be made to run after the vectorizers just once. Such a new pass could 
handle vector instructions in general more extensively than instcombine. 
Would it then be possible to avoid the current problems the backend is 
having?

Or does this really have to be done on the DAG by each backend? Or 
perhaps this is really just a local issue with the loop vectorizer?

Please fill in on how to best proceed with improving the loop 
vectorizers code.

Jonas

Renato Golin via llvm-dev

2018-Sep-04 12:45 UTC

head link

[llvm-dev] LoopVectorizer: shufflevectors

On Tue, 4 Sep 2018 at 13:14, Jonas Paulsson via llvm-dev
<llvm-dev at lists.llvm.org> wrote:> I have run into this in the past also and it surprised me to again see
> (on SystemZ) that the vectorized loop did many seemingly unnecessary
> shuffles.
Hi Jonas,

This is a known side-effect of vectorisers (not just LLVM) that have
to handle multiple hardware architectures. GCC has it's own set of bad
patterns, too.

> This seems to be an issue which is due to keeping instcombine simple and
> fast, as well as a conservativeness to not produce any new shuffles not
> already in the input program (see comment in
> InstCombiner::visitShuffleVectorInst). For some reason a bit unclear to
> me the backend will get into trouble then.
Specifically interleaved generation will invariably lead to
*additional* shuffles, because it's trying to create the pattern that
will, later, be selected into one or few instructions.

The middle-end relies on the back-end knowing how to select the large
patters, as well as other middle-end passes not destroying it.
Cleaning that up may very well lead to poorer code.

> Should improved optimization of shufflevector instructions handle all of
> them globally, or just the new ones produced by the vectorizers?
It's probably a lot simpler to improve the SystemZ model to "know"
have the same arch flags / cost model completeness as the other
targets.

The transformations done by the vectoriser are target-agnostic, but
they still ask the targets if certain patterns are possible and
profitable before doing so.

> Or does this really have to be done on the DAG by each backend? Or
> perhaps this is really just a local issue with the loop vectorizer?
It's a dance between what the middle-end enquires of the target info
and what the back-end can actually generate.

You may have to expose some flags (to turn certain behaviour on), then
to tune the cost model (to make them profitable on most cases), then
implement the pattern recognition in the DAGISel (also GlobalISel), so
that the generated code can be optimally selected.

If LLVM was compiling to a single target, emitting IR that conforms to
one specific pattern would be the most logical choice (don't pollute,
simplify further passes, reduce pattern complexity), so it may sound a
lot simpler on arch-specific compilers.

But in a target-agnostic compiler you need to "emulate" that using the
three-step above: target info, cost model, ISel patterns.

Hope that helps.

-- 
cheers,
--renato

Jonas Paulsson via llvm-dev

2018-Sep-04 16:34 UTC

head link

[llvm-dev] LoopVectorizer: shufflevectors

Hi Renato,

> It's probably a lot simpler to improve the SystemZ model to
"know"
> have the same arch flags / cost model completeness as the other
> targets.
I thought they were - anything particular in mind?>
> The transformations done by the vectoriser are target-agnostic, but
> they still ask the targets if certain patterns are possible and
> profitable before doing so.I have a patch for the LoopVectorizer that makes it recognize this 
particular case of a load interleave group being only used by a store 
interleave group. I then pass this flag to 
TTI.getInterleavedMemoryOpCost(), so that the target can return an 
accurate cost. During my experiments on SystemZ I added the cost of 
shuffling the vector(s) only on the load, while then the store group did 
not get that cost at all.

This then made many more cases of interleaving happen (~450 cases on 
spec IIRC). Only problem was... the SystemZ backend could not handle 
those shuffles as well in all the cases. To me that looked like 
something to be fixed on the I/R level, and after discussions with 
Sanjay I got the impression that this was the case...

To me, this looks like something the LoopVectorizer is neglecting and 
should be combining. I suppose with my patch for the Load -> Store 
groups, I could add also the handling of recomputed indices so that the 
load group produces a vector that fits the store group directly. But if 
I understand you correctly, even this is not so wise? And if so, then 
indeed improving the SystemZ DAGCombiner is the only alternative left, I 
guess...
> But in a target-agnostic compiler you need to "emulate" that
using the three-step above: target info, cost model, ISel patterns.
But having the cost functions available is not enough to drive a later 
I/R pass to optimize the generated vector code? I mean if the target 
indicated which shuffles were expensive, that could then easily be avoided.

Thanks,

Jonas

Seemingly Similar Threads

Search for more maybe matching threads

llvm dev - Sep 2018 - LoopVectorizer: shufflevectors

[llvm-dev] LoopVectorizer: shufflevectors

[llvm-dev] LoopVectorizer: shufflevectors

[llvm-dev] LoopVectorizer: shufflevectors

Seemingly Similar Threads