thr3ads.net - llvm dev - [llvm-dev] [RFC] Allow loop vectorizer to choose vector widths that generate illegal types [Jun 2016]

Michael Kuperstein via llvm-dev

2016-Jun-15 22:47 UTC

[llvm-dev] [RFC] Allow loop vectorizer to choose vector widths that generate illegal types

Hello,

Currently the loop vectorizer will, by default, not consider vectorization
factors that would make it generate types that do not fit into the target
platform's vector registers. That is, if the widest scalar type in the
scalar loop is i64, and the platform's largest vector register is 256-bit
wide, we will not consider a VF above 4.

We have a command line option (-mllvm -vectorizer-maximize-bandwidth), that
will choose VFs for consideration based on the narrowest scalar type
instead of the widest one, but I don't believe it has been widely tested.
If anyone has had an opportunity to play around with it, I'd love to hear
about the results.
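
To make the arithmetic above concrete, here is a minimal sketch in C. It is not LLVM code: the function name and parameters are made up for illustration, and it only models the "one vector must fit in one register" bound described in the two paragraphs above.

```c
#include <assert.h>

/* Hypothetical helper, not an LLVM API: the largest VF for which a vector of
   elem_bits-wide elements still fits in a single reg_bits-wide register. */
unsigned max_vf_for_type(unsigned reg_bits, unsigned elem_bits) {
  assert(elem_bits > 0 && elem_bits <= reg_bits);
  return reg_bits / elem_bits;
}
```

For a loop that mixes i8 and i64 on a target with 256-bit registers, bounding by the widest type gives max_vf_for_type(256, 64) = 4, while bounding by the narrowest type (the -vectorizer-maximize-bandwidth behaviour) gives max_vf_for_type(256, 8) = 32. At VF 32 each i64 value would span 32 * 64 / 256 = 8 registers, which is why the types become illegal.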

What I'd like to do is:
Step 1: Make -vectorizer-maximize-bandwidth the default. This should
improve the performance of loops that contain mixed-width types.
Step 2: Remove the artificial width limitation altogether, and base the
vectorization factor decision purely on the cost model. This should allow
us to get rid of the interleaving code in the loop vectorizer, and get
interleaving for "free" from the legalizer instead.
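
As an illustration of what a "mixed-width" loop looks like, here is a small hypothetical example (the function is made up, it is not taken from any benchmark). The narrowest type is 8 bits and the widest is 32 bits, so on a target with 128-bit registers the current default would consider a VF of at most 4, while choosing by the narrowest type would allow a VF of up to 16, with the 32-bit operations then working on an illegal 16-element vector that the legalizer has to split into four parts. Whether that is actually profitable is exactly what the cost model would have to decide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A mixed-width loop: 8-bit loads feed a 32-bit multiply and a 32-bit store. */
void scale_bytes(uint32_t *out, const uint8_t *in, uint32_t k, size_t n) {
  assert(n == 0 || (out != NULL && in != NULL));
  for (size_t i = 0; i < n; i++) {
    out[i] = (uint32_t)in[i] * k;   /* widen i8 -> i32, then multiply */
  }
}
```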

There are two potential road-blocks I see - the cost-model, and the
legalizer. To make this work, we need to:
a) Model the cost of operations on illegal types better. Right now, what we
get is sometimes completely ridiculous (e.g. see
http://reviews.llvm.org/D21251).
b) Make sure the cost model actually stops us when the VF becomes too
large. This is mostly a question of correctly estimating the register
pressure. In theory, that should not be an issue - we already rely on this
estimate to choose the interleaving factor, so using the same logic to
upper-bound the VF directly shouldn't make things worse.
c) Ensure the legalizer is up to the task of emitting good code for overly
wide vectors. I've talked about this with Chandler, and his opinion
(Chandler, please correct me if I'm wrong) is that on x86, the legalizer is
likely to be able to handle this. This may not be true for other platforms.
So, I'd like to try to make this the default on a platform-by-platform
basis, starting with x86.
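
A rough sketch of the kind of register-pressure bound item (b) refers to, written in C for illustration only. This is not the vectorizer's actual estimate: the function names, the assumption that every live value has the same element width, and the cap of 64 on the VF are all invented here.

```c
#include <assert.h>

/* Number of legal registers one VF-wide value occupies once the legalizer
   has split it into register-sized parts (ceiling division). */
unsigned parts_per_value(unsigned vf, unsigned elem_bits, unsigned reg_bits) {
  assert(vf > 0 && elem_bits > 0 && reg_bits > 0);
  unsigned total_bits = vf * elem_bits;
  return (total_bits + reg_bits - 1) / reg_bits;
}

/* Largest power-of-two VF (up to an arbitrary cap of 64) for which the
   estimated register usage still fits in num_regs. live_values is the
   maximum number of vector values that are live at the same time. */
unsigned max_vf_under_pressure(unsigned live_values, unsigned elem_bits,
                               unsigned reg_bits, unsigned num_regs) {
  unsigned best = 1;
  for (unsigned vf = 2; vf <= 64; vf *= 2) {
    if (live_values * parts_per_value(vf, elem_bits, reg_bits) > num_regs)
      break;            /* this VF would be expected to spill */
    best = vf;
  }
  return best;
}
```

With 3 live i32 values, 128-bit registers and 16 registers, this toy model stops at VF 16 (3 values * 4 parts = 12 registers), because VF 32 would need 24.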

What do you think? Does this seem like a step in the right direction?
Anything important I'm missing?

Thanks,
  Michael

Chandler Carruth via llvm-dev

2016-Jun-15 23:00 UTC

I know we already talked about this and so I'm more interested in others'
thoughts, but just to explicitly say it, this LGTM. I particularly think
that using extra-wide vectors to model widening-for-interleaving is a much
cleaner model in the IR.

Also, at least one other user of the IR's vector capabilities is doing
precisely this: Halide. I'm pretty happy about seeing convergence here and
both Halide and the loop vectorizer generating more similar patterns.


Xinliang David Li via llvm-dev

2016-Jun-15 23:12 UTC

Michael, thanks for driving this! My only comment is that before the
final flip, we need to engage the community for more extensive performance
testing on various architectures.

David


Michael Kuperstein via llvm-dev

2016-Jun-15 23:25 UTC

Of course.

If anyone wants to volunteer to test this on their workloads once the cost
model is less broken (so we actually try to use higher VFs instead of
rejecting them on cost grounds), that would be great.

Thanks,
  Michael


Das, Dibyendu via llvm-dev

2016-Jun-16 06:46 UTC

It's not clear how you would get ‘interleaving for free’.


Michael Kuperstein via llvm-dev

2016-Jun-16 07:20 UTC

Sorry, you're right, that really wasn't clear.
When I wrote "for free", I meant "without having code in the vectorizer
dealing specifically with interleaving".

Consider a simple loop, like:

void hot(int *a, int *b) {
#pragma clang loop vectorize_width(4) interleave_count(2)
#pragma nounroll
  for (int i = 0; i < 1000; i++) {
    a[i] += b[i];
  }
  return ;
}

We'll get a vector loop with 4-element vectors, that, when compiling for
SSE, gets lowered to:
.LBB0_3:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
movdqu -16(%rsi,%rax,4), %xmm0
movdqu (%rsi,%rax,4), %xmm1
movdqu -16(%rdi,%rax,4), %xmm2
movdqu (%rdi,%rax,4), %xmm3
paddd %xmm0, %xmm2
paddd %xmm1, %xmm3
movdqu %xmm2, -16(%rdi,%rax,4)
movdqu %xmm3, (%rdi,%rax,4)
addq $8, %rax
cmpq $1004, %rax             # imm = 0x3EC
jne .LBB0_3

If we instead have
#pragma clang loop vectorize_width(8) interleave_count(1)

We'll get an 8-wide IR vector loop, but end up with almost the same
lowering:
.LBB0_3:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
movdqu 16(%rsi,%rax,4), %xmm0
movdqu (%rsi,%rax,4), %xmm1
movdqu 16(%rdi,%rax,4), %xmm2
movdqu (%rdi,%rax,4), %xmm3
paddd %xmm1, %xmm3
paddd %xmm0, %xmm2
movdqu %xmm2, 16(%rdi,%rax,4)
movdqu %xmm3, (%rdi,%rax,4)
addq $8, %rax
cmpq $1000, %rax             # imm = 0x3E8
jne .LBB0_3

Legalization splits each 8-wide operation into two 4-wide operations,
achieving almost the same result as vectorizing by a factor of 4 and
unrolling by 2.
The question is whether the legalizer is actually up to doing this well in
general.
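
The same effect can be reproduced without the vectorizer. The sketch below is an illustration, not part of the proposal: it uses the GCC/Clang vector_size extension to write the 8-wide operation by hand, and the function name hot_wide is made up. It assumes a and b do not overlap and that n is a multiple of 8. When compiled for a target with only 128-bit registers, the compiler has to split each 32-byte operation into two 16-byte ones; the exact instructions depend on the compiler and flags.

```c
#include <assert.h>
#include <string.h>

/* 8 x int vector (32 bytes) via the GCC/Clang vector_size extension. */
typedef int v8si __attribute__((vector_size(32)));

/* Assumes a and b do not overlap and n is a multiple of 8. */
void hot_wide(int *a, const int *b, int n) {
  assert(n % 8 == 0);
  for (int i = 0; i < n; i += 8) {
    v8si va, vb;
    memcpy(&va, a + i, sizeof va);   /* unaligned 8-wide load */
    memcpy(&vb, b + i, sizeof vb);
    va += vb;                        /* a single 8-wide add at the source level */
    memcpy(a + i, &va, sizeof va);   /* unaligned 8-wide store */
  }
}
```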


Martin J. O'Riordan via llvm-dev

2016-Jun-16 08:02 UTC

Our architecture has two different sizes of vector register, with separate
register files and functional units for each, and the existing cost model
already makes optimisation for this quite difficult.  Ideally the
loop-vectoriser would be able to vectorise the vectorisable code in the loop
using both in parallel.  At the moment the architectures that are in the LLVM
trunk all use a single size for vector registers and a single register file for
them, but I expect there are other out-of-tree targets that use multiple
vector register widths.

 

Removing the width limitation altogether I think would make optimisations for
hybrid vector models such as ours less difficult, but it also means the cost
model should be able to query for the vector width and expect to get a list
instead of a single value as it does now.  Querying for the number of vector
registers should be a function of the vector type being examined.
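
One way to picture the two queries suggested above is the following C sketch. It is purely hypothetical: none of these names exist in LLVM, and the register widths and counts in the table are invented, not those of any real target.

```c
#include <assert.h>
#include <stddef.h>

/* One vector register file: its register width and how many registers it has. */
struct vec_reg_class {
  unsigned width_bits;
  unsigned num_regs;
};

/* Made-up target with a 64-bit and a 128-bit vector register file,
   sorted by increasing width. */
static const struct vec_reg_class example_target[] = {
  { 64, 32 },
  { 128, 16 },
};

/* "Query the vector width and get a list": returns the table and its length. */
const struct vec_reg_class *vector_reg_classes(size_t *count) {
  assert(count != NULL);
  *count = sizeof example_target / sizeof example_target[0];
  return example_target;
}

/* Register count as a function of the vector type: the number of registers
   in the narrowest class that can hold a type_bits-wide vector, or 0 if the
   type does not fit in any class. */
unsigned num_regs_for_type(unsigned type_bits) {
  size_t n;
  const struct vec_reg_class *rc = vector_reg_classes(&n);
  for (size_t i = 0; i < n; i++) {
    if (type_bits <= rc[i].width_bits)
      return rc[i].num_regs;
  }
  return 0;
}
```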

 

            MartinO

 


Renato Golin via llvm-dev

2016-Jun-22 15:45 UTC

On 15 June 2016 at 23:47, Michael Kuperstein via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Step 1: Make -vectorizer-maximize-bandwidth the default. This should improve
> the performance of loops that contain mixed-width types.

Hi Michael,

Per target, after investigation, I think this is perfectly fine.

> Step 2: Remove the artificial width limitation altogether, and base the
> vectorization factor decision purely on the cost model. This should allow us
> to get rid of the interleaving code in the loop vectorizer, and get
> interleaving for "free" from the legalizer instead.

I'm slightly worried about this one, though.

The legalizer is a very large mess, with many unknown (or long
forgotten) inter-dependencies and intra-dependencies (with isel,
regalloc, back-end opt passes, etc), which were all mostly annealed
into working by heuristics and hack-fixing stuff. The multiple
attempts at re-writing the instruction selection are one demonstration
of that problem...

So, while I agree with Hal that this will put good pressure on
improving the cost model (as well as the intra-dependencies), and
that's something very positive, I fear that if the jump becomes too far,
we'll either break the world or not jump at all. For example,
FastISel.

I'm not saying we shouldn't do it, but if/when we do it, it would be
*very* beneficial to provide a multi-step migration path for future
targets to move in, not just a multi-step initial migration for the
primary target.

Another thing to consider is that the SLP vectorizer can use non-SIMD
FP co-processors (VFP on ARM), which have different costs than SIMD,
but may share the same decision path, especially if we move the
decision lower down into the legalizer.

Also, there are hidden costs between the different units in sharing
the registers or moving between, and that is not mapped into the
current cost model entirely (only via heuristics). This may not be a
problem for Intel, but it certainly will be for ARM/AArch64.

I had a plan 3 years ago to look into that, but never got around to doing
it. Maybe it's about time I did... :)

Finally, if you need pre-testing and benchmarking, let me know and I
can spare some time to help you. I'll be glad to be copied on the
reviews and will do my best to help.

All in all, I don't think we'll get anything for free on this change.
There will be a cost, and it will be different on different targets,
but it may very well be a cost worth taking. I don't know enough yet
to have an opinion.

cheers,
--renato

Michael Kuperstein via llvm-dev

2016-Jun-22 18:00 UTC

Thanks, Renato!

On Wed, Jun 22, 2016 at 8:45 AM, Renato Golin <renato.golin at linaro.org>
wrote:
> On 15 June 2016 at 23:47, Michael Kuperstein via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
> > Step 1: Make -vectorizer-maximize-bandwidth the default. This should
> > improve the performance of loops that contain mixed-width types.
>
> Hi Michael,
>
> Per target, after investigation, I think this is perfectly fine.

Of course.

> > Step 2: Remove the artificial width limitation altogether, and base the
> > vectorization factor decision purely on the cost model. This should
> > allow us to get rid of the interleaving code in the loop vectorizer, and
> > get interleaving for "free" from the legalizer instead.
>
> I'm slightly worried about this one, though.

Me too. :-)

> The legalizer is a very large mess, with many unknown (or long
> forgotten) inter-dependencies and intra-dependencies (with isel,
> regalloc, back-end opt passes, etc), which were all mostly annealed
> into working by heuristics and hack-fixing stuff. The multiple
> attempts at re-writing the instruction selection are one demonstration
> of that problem...

Yes.

> So, while I agree with Hal that this will put good pressure on
> improving the cost model (as well as the intra-dependencies), and
> that's something very positive, I fear that if the jump becomes too far,
> we'll either break the world or not jump at all. For example,
> FastISel.

How common is the "LoopVectorizer + FastISel + Performance is important"
use-case?
In any case, I agree. This is precisely why I'm not jumping directly to
this, and am going through the current -vectorizer-maximize-bandwidth first.

> I'm not saying we shouldn't do it, but if/when we do it, it would be
> *very* beneficial to provide a multi-step migration path for future
> targets to move in, not just a multi-step initial migration for the
> primary target.

By "multi-step", do you mean the same two steps above, or something more?
If I understand you correctly, you're suggesting
"-vectorizer-maximize-bandwidth" and
"-vectorizer-maximize-bandwidth-harder". Then we can move the per-platform
defaults to either the "regular" or the "harder" version independently. Is
this what you meant? If so, it makes perfect sense to me.
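
For illustration, the per-platform defaults described above could be pictured as a small table. This is only a sketch in C: the enum, the table, and the function are hypothetical, "-vectorizer-maximize-bandwidth-harder" is just the placeholder name used in this message, reading "harder" as the Step 2 behaviour is an assumption, and the single table entry is a placeholder rather than a decision.

```c
#include <assert.h>
#include <string.h>

enum vf_policy {
  VF_WIDEST_TYPE,     /* today's default: bound the VF by the widest scalar type */
  VF_MAX_BANDWIDTH,   /* "regular": bound the VF by the narrowest scalar type */
  VF_COST_MODEL_ONLY  /* "harder": no width bound, the cost model decides */
};

struct target_default {
  const char *target;
  enum vf_policy policy;
};

/* Hypothetical table; each target is moved to a new policy independently. */
static const struct target_default defaults[] = {
  { "x86", VF_MAX_BANDWIDTH },
};

enum vf_policy default_policy(const char *target) {
  assert(target != NULL);
  for (size_t i = 0; i < sizeof defaults / sizeof defaults[0]; i++) {
    if (strcmp(defaults[i].target, target) == 0)
      return defaults[i].policy;
  }
  return VF_WIDEST_TYPE;  /* targets not listed keep the current behaviour */
}
```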

> Another thing to consider is that the SLP vectorizer can use non-SIMD
> FP co-processors (VFP on ARM), which have different costs than SIMD,
> but may share the same decision path, especially if we move the
> decision lower down into the legalizer.
>
> Also, there are hidden costs between the different units in sharing
> the registers or moving between, and that is not mapped into the
> current cost model entirely (only via heuristics). This may not be a
> problem for Intel, but it certainly will be for ARM/AArch64.

I agree this is a problem, but it seems like it should be orthogonal to
what I'm suggesting. I probably don't understand the background well
enough, though.

> I had a plan 3 years ago to look into that, but never got around to doing
> it. Maybe it's about time I did... :)
>
> Finally, if you need pre-testing and benchmarking, let me know and I
> can spare some time to help you. I'll be glad to be copied on the
> reviews and will do my best to help.

That would be great! I'm going to start with X86, mostly because that's the
platform I'm most familiar with. But once it works on X86 (hopefully), I'll
definitely need help with other platforms, both in terms of the cost model
and benchmarking.

> All in all, I don't think we'll get anything for free on this change.
> There will be a cost, and it will be different on different targets,
> but it may very well be a cost worth taking. I don't know enough yet
> to have an opinion.

Maybe "free" wasn't the right word to use here. :-)

> cheers,
> --renato
