thr3ads.net - llvm dev - [llvm-dev] [RFC] Allow loop vectorizer to choose vector widths that generate illegal types [Jun 2016]

Nadav Rotem via llvm-dev

2016-Jun-16 06:24 UTC

[llvm-dev] [RFC] Allow loop vectorizer to choose vector widths that generate illegal types

Hi Michael, 

Thank you for working on this. The loop vectorizer tries a bunch of different
vectorization factors and stops at the widest word size mostly because of
compile time concerns. For every vectorization factor that we check we have to
scan all of the instructions in the loop and make multiple calls into TTI. If
you decide to increase the VF enumeration space then you will linearly increase
the compile time of the loop vectorizer. I think that it would be a good idea to
explore this compile-time vs performance tradeoff with numbers.
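
As a rough illustration of that scaling (a sketch, not code from LLVM): assume the candidates are the power-of-two VFs up to some maximum, and assume one cost-model query per instruction per candidate. The names `candidate_vfs` and `cost_queries` are made up for this example, and the real vectorizer makes several TTI calls per instruction, so only the proportionality is meant to carry over.

```python
def candidate_vfs(max_vf):
    """Power-of-two vectorization factors from 2 up to max_vf (inclusive)."""
    vfs = []
    vf = 2
    while vf <= max_vf:
        vfs.append(vf)
        vf *= 2
    return vfs


def cost_queries(max_vf, n_instructions):
    """Assumed model: one cost-model query per instruction per candidate VF."""
    return len(candidate_vfs(max_vf)) * n_instructions


# Every extra candidate VF adds another full scan of the loop body.
print(cost_queries(4, 100))  # candidates 2, 4
print(cost_queries(8, 100))  # candidates 2, 4, 8
```

Under these assumptions the work grows in proportion to the number of candidate VFs, which is the quantity to measure against any performance gain.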
  
The cost model is designed to be a fast approximation of SelectionDAG. We
don't want to duplicate every optimization in SelectionDAG into the cost
model because this would make the cost model (and the optimizer) difficult to
maintain. If the cost model does not represent an operation that you care about
then you should add it to the cost tables. 

I don't understand how selecting wide vectors would eliminate the need to
have loop widening.  Loop widening happens to break data dependencies and allow
more parallelism. If you have two independent arithmetic operations then they
can go into different execution units, or to pipelined execution units. Your
mixed-typed loops would cause a shuffle across registers (which we can't
model well in the cost model, for obvious reasons) that will pack multiple lanes
into a smaller vector and this would introduce a data dependency. 

Maybe you should start by increasing the enumeration space (by 2X, for example)
under a flag and see if you get any performance gains. 

-Nadav

On Jun 15, 2016, at 03:48 PM, Michael Kuperstein <mkuper at google.com>
wrote:

Hello,

Currently the loop vectorizer will, by default, not consider vectorization
factors that would make it generate types that do not fit into the target
platform's vector registers. That is, if the widest scalar type in the
scalar loop is i64, and the platform's largest vector register is 256-bit
wide, we will not consider a VF above 4.

We have a command line option (-mllvm -vectorizer-maximize-bandwidth) that will
choose VFs for consideration based on the narrowest scalar type instead of the
widest one, but I don't believe it has been widely tested. If anyone has had
an opportunity to play around with it, I'd love to hear about the results.
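
The two ways of bounding the VF described above can be sketched as follows. This is only an illustration of the arithmetic, not the vectorizer's code; the function name `max_vf` and its parameters are hypothetical.

```python
def max_vf(register_bits, scalar_type_bits, maximize_bandwidth=False):
    """Largest VF whose vectors still fit in one register.

    By default the bound comes from the widest scalar type in the loop;
    with maximize_bandwidth=True it comes from the narrowest one.
    """
    if maximize_bandwidth:
        element_bits = min(scalar_type_bits)
    else:
        element_bits = max(scalar_type_bits)
    return register_bits // element_bits


# A loop mixing i32 and i64 on a target with 256-bit vector registers.
print(max_vf(256, [32, 64]))                           # bounded by i64
print(max_vf(256, [32, 64], maximize_bandwidth=True))  # bounded by i32
```

With the narrowest-type bound, the i64 values in the same loop no longer fit in a single register, which is where illegal types and the legalizer come in.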

What I'd like to do is:
Step 1: Make -vectorizer-maximize-bandwidth the default. This should improve the
performance of loops that contain mixed-width types.
Step 2: Remove the artificial width limitation altogether, and base the
vectorization factor decision purely on the cost model. This should allow us to
get rid of the interleaving code in the loop vectorizer, and get interleaving
for "free" from the legalizer instead.

There are two potential road-blocks I see - the cost-model, and the legalizer.
To make this work, we need to:
a) Model the cost of operations on illegal types better. Right now, what we get
is sometimes completely ridiculous (e.g. see http://reviews.llvm.org/D21251).
b) Make sure the cost model actually stops us when the VF becomes too large.
This is mostly a question of correctly estimating the register pressure. In
theory, that should not be an issue - we already rely on this estimate to choose
the interleaving factor, so using the same logic to upper-bound the VF directly
shouldn't make things worse.
c) Ensure the legalizer is up to the task of emitting good code for overly wide
vectors. I've talked about this with Chandler, and his opinion (Chandler,
please correct me if I'm wrong) is that on x86, the legalizer is likely to
be able to handle this. This may not be true for other platforms. So, I'd
like to try to make this the default on a platform-by-platform basis, starting
with x86.
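
Point (b) can be pictured with a toy register-pressure model. This is a sketch under strong assumptions, not LLVM's estimate: every live value is assumed to occupy as many registers as its widened type needs, all values are assumed live at once, and the helper names and the `vf_cap` parameter are hypothetical.

```python
def registers_needed(vf, live_value_bits, register_bits):
    """Registers occupied when each live value is widened to vf lanes."""
    total = 0
    for bits in live_value_bits:
        # Round up: an illegal (too wide) vector is split across registers.
        total += (vf * bits + register_bits - 1) // register_bits
    return total


def largest_vf_within_budget(live_value_bits, register_bits, n_registers, vf_cap=64):
    """Largest power-of-two VF whose estimated pressure fits the register file."""
    best = 1
    vf = 2
    while vf <= vf_cap and registers_needed(vf, live_value_bits, register_bits) <= n_registers:
        best = vf
        vf *= 2
    return best


# Two i64 values and one i32 value live at once, 256-bit registers, 16 of them.
print(registers_needed(8, [64, 64, 32], 256))
print(largest_vf_within_budget([64, 64, 32], 256, 16))
```

In this model the register budget, rather than a fixed width limit, is what stops the VF from growing.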

What do you think? Does this seem like a step in the right direction? Anything
important I'm missing?

Thanks,
  Michael

Michael Kuperstein via llvm-dev

2016-Jun-16 07:41 UTC

Hi Nadav,
Thanks a lot for the feedback!

Of course we need to explore this with numbers. Not just in terms of the
performance vs. compile-time, but in general in terms of the performance
benefit. For now, I'm just trying to get a feel for whether people think
this sounds like a reasonable idea. As I wrote in the original email, we
already have this under a flag (it was added by Cong last year). But it
will be hard to get reliable performance numbers without first having the
cost model provide better-quality answers at the higher vectorization
factors.

I didn't mean that we should be duplicating every optimization the
SelectionDAG makes. Of course the cost model is only a rough approximation.
What I do want the (generic) cost model to do, however, is provide a
more-or-less precise approximation of legalization costs. To be concrete,
http://reviews.llvm.org/D21251 is a first step in that direction. Do you
think this is something the cost model should not be doing?

Regarding loop widening - see my email to Dibyendu for what I meant. For
mixed-type loops, it really depends. Let's say you have a mixed-type loop,
with i32 and i64, and 256-bit registers. Would the extra parallelism you
get from vectorizing by 4 and interleaving be worth the throughput loss you
suffer from not vectorizing the i32 operations by 8? It seems like this
would depend heavily on the specific loop, and the proportion of i32 and
i64 instructions. This is exactly the question I'd like to get the cost
model to answer. Do you think this is not feasible? It shouldn't (I hope
:-) ) require modeling every possible shuffle.
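
To make the i32/i64 question concrete, here is a deliberately naive throughput estimate. It is a sketch, not the LLVM cost model: it assumes every register-sized vector operation costs the same, and it ignores the cross-register shuffles and the register pressure raised earlier in the thread. The function name and the `op_counts` layout are made up for this example.

```python
def ops_per_scalar_iteration(vf, op_counts, register_bits=256):
    """Register-sized vector ops issued per original scalar iteration.

    op_counts maps an element width in bits to the number of operations
    of that width in the scalar loop body.
    """
    total = 0
    for bits, count in op_counts.items():
        # An op on a too-wide vector is split into several legal ops.
        parts = (vf * bits + register_bits - 1) // register_bits
        total += count * parts
    return total / vf


mostly_i32 = {32: 6, 64: 2}
print(ops_per_scalar_iteration(4, mostly_i32))  # i32 ops use half a register
print(ops_per_scalar_iteration(8, mostly_i32))  # i64 ops are split in two

only_i64 = {64: 4}
print(ops_per_scalar_iteration(4, only_i64))
print(ops_per_scalar_iteration(8, only_i64))
```

In this model, interleaving by 2 at VF 4 issues the same number of ops per scalar iteration as plain VF 4, so any advantage of VF 8 comes only from the i32 operations filling their registers, and it shrinks as the proportion of i64 operations grows.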

Thanks,
  Michael


Zaks, Ayal via llvm-dev

2016-Jun-16 14:15 UTC

Some thoughts:

o To determine the VF for a loop with mixed data sizes, choosing the smallest
ensures each vector register used is full, choosing the largest will minimize
the number of vector registers used. Which one’s better, or some size in
between, depends on the target’s costs for the vector operations, availability
of registers and possibly control/memory divergence and trip count. “This is a
question of cost modeling” and its associated compile-time, but in general good
vectorization of loops with mixed data sizes is expected to be important,
especially when larger scopes are vectorized. BTW, SLP followed this a year ago:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20150706/286110.html
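
The full-registers versus fewer-registers tension in the first point can be shown with a small calculation. It is only a sketch; the function name `register_usage` and the 256-bit default are assumptions for illustration.

```python
def register_usage(vf, element_bits, register_bits=256):
    """Return (registers used, fraction of their bits holding data) for one value."""
    registers = (vf * element_bits + register_bits - 1) // register_bits
    fill = (vf * element_bits) / (registers * register_bits)
    return registers, fill


# VF chosen from the largest type (i64): one register each, i32 half empty.
print(register_usage(4, 32), register_usage(4, 64))
# VF chosen from the smallest type (i32): all registers full, i64 needs two.
print(register_usage(8, 32), register_usage(8, 64))
```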

o As for increasing VF beyond maximize-bandwidth, one could argue that a
vectorizer should focus on tapping the SIMD capabilities of the target, up to
maximize-bandwidth, and that its vectorized loop should later be subject to a
separate independent unroller/interleaver pass. One suggestion, regardless, is
to use the term “unroll-and-jam”, which traditionally applies to loops
containing control-flow and nested loops but is quite clear for innermost loops
too, instead of the overloaded term “interleaving”. Admittedly loop
vectorization conceptually applies unroll-and-jam followed by packetization into
vectors, so why unroll-and-jam twice? As noted, the considerations for best
unroll factor are different from those of best VF for optimal usage of SIMD
capabilities. Indeed representing in LLVM-IR a loop with vectors longer than
maximize-bandwidth looks more appealing than replicating its ‘legal’ vectors,
easier produced by the vectorizer than by an unroll-and-jam pass. BTW, taken to
the extreme, one could vectorize to the full trip count of the loop, as in
http://impact.crhc.illinois.edu/shared/Papers/tr2014.mxpa.pdf, where memory
spatial locality is deemed more important to optimize than register usage.

Ayal.

