thr3ads.net - llvm dev - [llvm-dev] [RFC] Make LoopVectorize Aware of SLP Operations [Feb 2018]

If this information is useful, please help other people find it:
Share via:

Florian Hahn via llvm-dev

2018-Feb-06 18:25 UTC

[llvm-dev] [RFC] Make LoopVectorize Aware of SLP Operations

Hello,

We would like to propose making LoopVectorize aware of SLP operations, 
to improve the generated code for loops operating on struct fields or 
doing complex math.

At the moment, LoopVectorize uses interleaving to vectorize loops that 
operate on values loaded/stored from consecutive addresses: vector 
loads/stores are generated to combine consecutive loads/stores and then 
shufflevector is used to de-interleave/interleave the loaded/stored 
values. At the moment however, we fail to detect cases where  the same 
operations are applied to all consecutive values and there is no need to 
interleave. To illustrate this, consider the following example loop:

struct Test {
     int x;
     int y;
};

void add(struct Test *A, struct Test *B, struct Test *C) {
    for (unsigned i = 0; i < 1024; i++) {
        C[i].x = A[i].x + B[i].y;
        C[i].y = A[i].y + B[i].y;
    }
}

On X86, we do not vectorize this loop and on AArch64, we generate the 
following code for the vector body:

vector.body:
   %index = phi i64 [ %index.next, %vector.body ], [ 0,
                      %vector.body.preheader ]
   %8 = or i64 %index, 4
   %9 = getelementptr inbounds %struct.Test, %struct.Test* %A,
                      i64 %index, i32 0
   %10 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 %8,
                       i32 0
   %11 = bitcast i32* %9 to <8 x i32>*
   %12 = bitcast i32* %10 to <8 x i32>*
   %wide.vec = load <8 x i32>, <8 x i32>* %11, align 4, !tbaa !2
   %wide.vec60 = load <8 x i32>, <8 x i32>* %12, align 4, !tbaa !2
   %13 = getelementptr inbounds %struct.Test, %struct.Test* %B,
                       i64 %index, i32 0
   %14 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 %8,
                       i32 0
   %15 = bitcast i32* %13 to <8 x i32>*
   %16 = bitcast i32* %14 to <8 x i32>*
   %wide.vec64 = load <8 x i32>, <8 x i32>* %15, align 4, !tbaa !2
   %wide.vec65 = load <8 x i32>, <8 x i32>* %16, align 4, !tbaa !2
   %17 = add nsw <8 x i32> %wide.vec64, %wide.vec
   %18 = shufflevector <8 x i32> %17, <8 x i32> undef, <4 x
i32>
                        <i32 0, i32 2, i32 4, i32 6>
   %19 = add nsw <8 x i32> %wide.vec65, %wide.vec60
   %20 = shufflevector <8 x i32> %19, <8 x i32> undef, <4 x
i32>
                       <i32 0, i32 2, i32 4, i32 6>
   %21 = add nsw <8 x i32> %wide.vec64, %wide.vec
   %22 = shufflevector <8 x i32> %21, <8 x i32> undef, <4 x
i32>
                       <i32 1, i32 3, i32 5, i32 7>
   %23 = add nsw <8 x i32> %wide.vec65, %wide.vec60
   %24 = shufflevector <8 x i32> %23, <8 x i32> undef, <4 x
i32>
                       <i32 1, i32 3, i32 5, i32 7>
   %25 = getelementptr inbounds %struct.Test, %struct.Test* %C,
                       i64 %index, i32 1
   %26 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 %8,
                       i32 1
   %27 = getelementptr i32, i32* %25, i64 -1
   %28 = bitcast i32* %27 to <8 x i32>*
   %29 = getelementptr i32, i32* %26, i64 -1
   %30 = bitcast i32* %29 to <8 x i32>*
   %interleaved.vec = shufflevector <4 x i32> %18, <4 x i32> %22,
       <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32
7>
   store <8 x i32> %interleaved.vec, <8 x i32>* %28, align 4, !tbaa
!2
   %interleaved.vec70 = shufflevector <4 x i32> %20, <4 x i32> %24,
       <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32
7>
   store <8 x i32> %interleaved.vec70, <8 x i32>* %30, align 4,
!tbaa !2
   %index.next = add i64 %index, 8
   %31 = icmp eq i64 %index.next, 1024
   br i1 %31, label %for.cond.cleanup, label %vector.body, !llvm.loop !6

Note the use of shufflevector to interleave and deinterleave the vector 
elements. On AArch64, we emit additional uzp1, uzp2 and st2 instructions 
for those.

In the case above however, there is no need to de-interleave the 
loaded/stored data and instead we could mix the consecutive operands 
together in a single vector register and apply the operations to vectors 
with compounded operands. As for real-world use cases, complex number 
arithmetic for example could also benefit from not interleaving.

In what follows, I propose an extension to LoopVectorize to make it 
aware of SLP opportunities and detect operations on "compound values" 
(for the lack of a better term) as an extension to interleaved access 
handling. This idea is described in [1].



Structure
========
To improve vectorization for loops operating on compound values, we 
propose to extend LoopVectorize to

   1. Detect loops containing SLP opportunities (operations on compound
      values)
   2. Extend the cost model to choose between interleaving or using
      compound values
   3. Add support for vectorizing compound operations to VPlan

Please note that the example in the previous section is intentionally 
kept simple and one could argue it could be handled by doing loop 
rerolling as a canonicalization step before LoopVectorize (currently 
LoopReroll does not handle this case, but it could be naturally extended 
to do so). However, the main advantage of making LoopVectorize aware of 
SLP opportunities is that it will allow us to deal with more complex 
cases like

  * loops where some vectors need to be transformed, for example where
    different operations are performed on different (groups of) lanes,
    like A[i].x + B[i].x, A[i].y - B[i].y which could be transformed to
    A[i].x + B[i].x, A[i].y + (-B[i].y), or where one compound group
    needs to be reordered, like A[i].x + B[i].y, A[i].y + B[i].x

  * loops where only parts are applicable to SLP-style vectorization

  * loops with complex operations that can be mapped to specialized HW
    instructions, like complex multiply and accumulate using FCMLA (Arm
    v8.3-a)

We do not necessarily have to do 3. in LoopVectorize. If we detect SLP 
opportunities, instead of choosing to interleave, we could decide to 
just unroll the loop so we can make optimal use of the vector registers 
with compound operations and let the SLP vectorizer generate code. But 
this would be problematic for cases where the right unroll factor is not 
known at compile time (e.g. with Arm's scalable vectors). Another 
advantage of a SLP aware LoopVectorizer is that we would benefit from 
loop versioning like with regular vectorization. Also Vplan should make 
it relatively straight-forward to add codegen support.


Detect Loops Containing SLP Opportunities
------------------------------------------
This is the key part. LoopVectorize needs to be able to detect SLP 
opportunities, in order to decide if interleaving or SLP style 
vectorization is better. The basic idea to detect SLP opportunities is 
to combine instructions that produce values that can be put in different 
lanes of a single vector register to "compound values" and try to
build
operation trees using them. Such trees can, for example, be constructed 
using a bottom-up process rooted at interleaved stores [1].

As a first step, we propose to add analysis capable to handle only cases 
in which the same operation is applied to all lanes of a compound value 
and incrementally improve the analysis to handle a mixture of operations 
on compound values (e.g. addition on some lanes, subtraction on others).

The SLP vectorizer in LLVM already implements a similar analysis. In the 
long term, it would probably make sense to share the analysis between 
LoopVectorize and the SLP vectorizer. But initially we think it would be 
safer to start with a separate and initially more limited analysis in 
LoopVectorize.


Extend to cost model to choose between interleaving or using "compound 
operations"
-----------------------------------------------------------------------
For load/store instructions which result in compound values, we choose 
SLP-vectorization over interleave/scalarization, if it is beneficial. 
The cost of loading/storing an interleave-group for SLP vectorization is 
the same as interleaving, but without the costs to rearrange the values.

For regular instructions operating on compound values, we count the cost 
of the operation in lane 0 of the resulting compound value considering 
the vectorization factor and ignore the instructions in other lanes, as 
we can do them for "free". In case we mix regular vectorization and 
SLP-style vectorization, we have to also account for additional 
instructions to switch between data layouts.

One limitation here is that we commit to either interleaving/compound 
vectorization when calculating the cost for the loads. Depending on the 
other instructions in the loop, interleaving could be beneficial 
overall, even though we could use compound vectorization for some 
operations.  Initially we could only consider SLP-style vectorization if 
it can be used for all instructions.


Add support for SLP style vectorization to Vplan
------------------------------------------------------------------------
Introduce 2 new recipes VPSLPMemoryRecipe and VPSLPInstructionRecipe. 
Both generate code to produce a single compound value and we add those 
recipes for the instruction in lane 0 of the compound value. We do not 
add any recipes for instructions in other lanes.

The SLP vectorizer in LLVM also implements code generation for similar 
cases. In the long term it  might make sense to share the Vplan based 
infrastructure between LoopVectorize and the SLP vectorizer.



Alternative approaches
=====================We do not necessarily have to deal with that case in the
vectorizer to
improve the generate code in simple cases. For example, we could add 
patterns to instcombine to undo/remove the interleaving done by the 
vectorizer for certain common patterns. But this seems quite fragile.

Another possible solution to make LoopReroll aware of operations on 
compound values and let it deal with de-interleaving the loop 
iterations. This would require no changes to LoopVectorize, but only 
allows us to handle simple cases.

Also, as a follow-up, we want to build on the infrastructure described 
in this RFC to enable vectorization using new HW instructions that 
operate on compund values, like FCADD and FCMLA in Armv8.3-a.


[1] "Loop-Aware SLP in GCC" by Ira Rosen, Dorit Nuzman, Ayal Zaks.


Thanks for bearing with me! I am looking forward to any thoughts & 
feedback!

Cheers,
Florian

Caballero, Diego via llvm-dev

2018-Feb-08 04:22 UTC

head link

[llvm-dev] [RFC] Make LoopVectorize Aware of SLP Operations

Hi Florian!

This proposal sounds pretty exciting! Integrating SLP-aware loop vectorization
(or the other way around) and SLP into the VPlan framework is definitely aligned
with the long term vision and we would prefer this approach to the LoopReroll
and InstCombine alternatives that you mentioned. We prefer a generic
implementation that can handle complicated cases to something ad-hoc for some
simple ones. Because of this, we would have some comments regarding the design
that you propose:
>   1. Detect loops containing SLP opportunities (operations on compound
>      values)
>   2. Extend the cost model to choose between interleaving or using
>      compound values
>   3. Add support for vectorizing compound operations to VPlan
Currently, VPlan is not fully integrated in all the stages of the inner loop
vectorizer pipeline. For that reason, part of your implementation (#1 and #2)
would happen outside of VPlan and another part (#3) would be VPlan-based. As you
know, we are currently leveraging a new vectorization path where 1) VPlan is
built upfront in the pipeline
(http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html) and 2) all
the vectorization stages will be implemented on top of the VPlan representation.
We think that your proposal is a really good candidate to be implemented in this
new “VPlan-native” vectorization path. In this way, we would avoid the porting
effort of #1 and #2 to the final VPlan-base infrastructure and would give you
the opportunity of getting involved in the design of VPlan. It would also avoid
introducing the complexity of SLP into the existing cost model and code
generation, which is also a concern to consider. We should definitely talk in
depth about the requirements to implement this in the new vectorization path but
we truly believe it's the best approach.

>  * loops where some vectors need to be transformed, for example where
>    different operations are performed on different (groups of) lanes,
>    like A[i].x + B[i].x, A[i].y - B[i].y which could be transformed to
>    A[i].x + B[i].x, A[i].y + (-B[i].y), or where one compound group
>    needs to be reordered, like A[i].x + B[i].y, A[i].y + B[i].x
This kind of transformation/reordering is an example of what could be a
preparatory VPlan-to-VPlan transformation that could precede and simplify the
core analysis of the SLP-aware analysis.

> The SLP vectorizer in LLVM already implements a similar analysis. In the
long term, it would probably make sense to share the analysis between
LoopVectorize and the SLP vectorizer. But initially we think it would be safer
to start with a separate and initially more limited analysis in LoopVectorize.
As a first step, this sounds reasonable to me. If it's implemented on top of
VPlan it could be reused in a future VPlan-base SLP. We should have this in mind
for the design from the very beginning so that we maximize the reuse of the
common ground of "standalone" SLP and SLP-aware loop vectorization.

> One limitation here is that we commit to either interleaving/compound
vectorization when calculating the cost for the loads. Depending on the other
instructions in the loop, interleaving could be beneficial overall, even though
we could use compound vectorization for some operations.  Initially we could
only consider SLP-style vectorization if it can be used for all instructions.
VPlan will allow the independent evaluation of multiple vectorization scenarios
in the future. This seems to fit into that category.

> Add support for SLP style vectorization to Vplan
> ------------------------------------------------------------------------
> Introduce 2 new recipes VPSLPMemoryRecipe and VPSLPInstructionRecipe. 
We introduced VPInstructions to model masking in patch D38676. They are
necessary to properly model the def-use/use-def chains in the VPlan
representation, and we believe you will need to represent def-use/use-def chains
for newly inserted operations, shuffles, and inserts/extracts. As such,
VPInstructions would be a more proper representation than the Recipes per the
long term vision. In fact, we plan to replace some of the current existing
recipes with VPInstructions in the near future. For this reason, we think that
the output of the SLP vectorization should be modelled using VPInstruction and
not with new Recipes.

Thanks,
Diego Caballero & Vectorizer Team.
Intel Compiler and Languages.


-----Original Message-----
From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Florian
Hahn via llvm-dev
Sent: Tuesday, February 6, 2018 10:25 AM
To: Renato Golin <renato.golin at linaro.org>; Saito, Hideki
<hideki.saito at intel.com>; Zaks, Ayal <ayal.zaks at intel.com>;
mkuper at google.com; llvm-dev <llvm-dev at lists.llvm.org>
Cc: nd <nd at arm.com>
Subject: [llvm-dev] [RFC] Make LoopVectorize Aware of SLP Operations

Hello,

We would like to propose making LoopVectorize aware of SLP operations, to
improve the generated code for loops operating on struct fields or doing complex
math.

At the moment, LoopVectorize uses interleaving to vectorize loops that operate
on values loaded/stored from consecutive addresses: vector loads/stores are
generated to combine consecutive loads/stores and then shufflevector is used to
de-interleave/interleave the loaded/stored values. At the moment however, we
fail to detect cases where  the same operations are applied to all consecutive
values and there is no need to interleave. To illustrate this, consider the
following example loop:

struct Test {
     int x;
     int y;
};

void add(struct Test *A, struct Test *B, struct Test *C) {
    for (unsigned i = 0; i < 1024; i++) {
        C[i].x = A[i].x + B[i].y;
        C[i].y = A[i].y + B[i].y;
    }
}

On X86, we do not vectorize this loop and on AArch64, we generate the following
code for the vector body:

vector.body:
   %index = phi i64 [ %index.next, %vector.body ], [ 0,
                      %vector.body.preheader ]
   %8 = or i64 %index, 4
   %9 = getelementptr inbounds %struct.Test, %struct.Test* %A,
                      i64 %index, i32 0
   %10 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 %8,
                       i32 0
   %11 = bitcast i32* %9 to <8 x i32>*
   %12 = bitcast i32* %10 to <8 x i32>*
   %wide.vec = load <8 x i32>, <8 x i32>* %11, align 4, !tbaa !2
   %wide.vec60 = load <8 x i32>, <8 x i32>* %12, align 4, !tbaa !2
   %13 = getelementptr inbounds %struct.Test, %struct.Test* %B,
                       i64 %index, i32 0
   %14 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 %8,
                       i32 0
   %15 = bitcast i32* %13 to <8 x i32>*
   %16 = bitcast i32* %14 to <8 x i32>*
   %wide.vec64 = load <8 x i32>, <8 x i32>* %15, align 4, !tbaa !2
   %wide.vec65 = load <8 x i32>, <8 x i32>* %16, align 4, !tbaa !2
   %17 = add nsw <8 x i32> %wide.vec64, %wide.vec
   %18 = shufflevector <8 x i32> %17, <8 x i32> undef, <4 x
i32>
                        <i32 0, i32 2, i32 4, i32 6>
   %19 = add nsw <8 x i32> %wide.vec65, %wide.vec60
   %20 = shufflevector <8 x i32> %19, <8 x i32> undef, <4 x
i32>
                       <i32 0, i32 2, i32 4, i32 6>
   %21 = add nsw <8 x i32> %wide.vec64, %wide.vec
   %22 = shufflevector <8 x i32> %21, <8 x i32> undef, <4 x
i32>
                       <i32 1, i32 3, i32 5, i32 7>
   %23 = add nsw <8 x i32> %wide.vec65, %wide.vec60
   %24 = shufflevector <8 x i32> %23, <8 x i32> undef, <4 x
i32>
                       <i32 1, i32 3, i32 5, i32 7>
   %25 = getelementptr inbounds %struct.Test, %struct.Test* %C,
                       i64 %index, i32 1
   %26 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 %8,
                       i32 1
   %27 = getelementptr i32, i32* %25, i64 -1
   %28 = bitcast i32* %27 to <8 x i32>*
   %29 = getelementptr i32, i32* %26, i64 -1
   %30 = bitcast i32* %29 to <8 x i32>*
   %interleaved.vec = shufflevector <4 x i32> %18, <4 x i32> %22,
       <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32
7>
   store <8 x i32> %interleaved.vec, <8 x i32>* %28, align 4, !tbaa
!2
   %interleaved.vec70 = shufflevector <4 x i32> %20, <4 x i32> %24,
       <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32
7>
   store <8 x i32> %interleaved.vec70, <8 x i32>* %30, align 4,
!tbaa !2
   %index.next = add i64 %index, 8
   %31 = icmp eq i64 %index.next, 1024
   br i1 %31, label %for.cond.cleanup, label %vector.body, !llvm.loop !6

Note the use of shufflevector to interleave and deinterleave the vector
elements. On AArch64, we emit additional uzp1, uzp2 and st2 instructions for
those.

In the case above however, there is no need to de-interleave the loaded/stored
data and instead we could mix the consecutive operands together in a single
vector register and apply the operations to vectors with compounded operands. As
for real-world use cases, complex number arithmetic for example could also
benefit from not interleaving.

In what follows, I propose an extension to LoopVectorize to make it aware of SLP
opportunities and detect operations on "compound values"
(for the lack of a better term) as an extension to interleaved access handling.
This idea is described in [1].



Structure
========
To improve vectorization for loops operating on compound values, we propose to
extend LoopVectorize to

   1. Detect loops containing SLP opportunities (operations on compound
      values)
   2. Extend the cost model to choose between interleaving or using
      compound values
   3. Add support for vectorizing compound operations to VPlan

Please note that the example in the previous section is intentionally kept
simple and one could argue it could be handled by doing loop rerolling as a
canonicalization step before LoopVectorize (currently LoopReroll does not handle
this case, but it could be naturally extended to do so). However, the main
advantage of making LoopVectorize aware of SLP opportunities is that it will
allow us to deal with more complex cases like

  * loops where some vectors need to be transformed, for example where
    different operations are performed on different (groups of) lanes,
    like A[i].x + B[i].x, A[i].y - B[i].y which could be transformed to
    A[i].x + B[i].x, A[i].y + (-B[i].y), or where one compound group
    needs to be reordered, like A[i].x + B[i].y, A[i].y + B[i].x

  * loops where only parts are applicable to SLP-style vectorization

  * loops with complex operations that can be mapped to specialized HW
    instructions, like complex multiply and accumulate using FCMLA (Arm
    v8.3-a)

We do not necessarily have to do 3. in LoopVectorize. If we detect SLP
opportunities, instead of choosing to interleave, we could decide to just unroll
the loop so we can make optimal use of the vector registers with compound
operations and let the SLP vectorizer generate code. But this would be
problematic for cases where the right unroll factor is not known at compile time
(e.g. with Arm's scalable vectors). Another advantage of a SLP aware
LoopVectorizer is that we would benefit from loop versioning like with regular
vectorization. Also Vplan should make it relatively straight-forward to add
codegen support.


Detect Loops Containing SLP Opportunities
------------------------------------------
This is the key part. LoopVectorize needs to be able to detect SLP
opportunities, in order to decide if interleaving or SLP style vectorization is
better. The basic idea to detect SLP opportunities is to combine instructions
that produce values that can be put in different lanes of a single vector
register to "compound values" and try to build operation trees using
them. Such trees can, for example, be constructed using a bottom-up process
rooted at interleaved stores [1].

As a first step, we propose to add analysis capable to handle only cases in
which the same operation is applied to all lanes of a compound value and
incrementally improve the analysis to handle a mixture of operations on compound
values (e.g. addition on some lanes, subtraction on others).

The SLP vectorizer in LLVM already implements a similar analysis. In the long
term, it would probably make sense to share the analysis between LoopVectorize
and the SLP vectorizer. But initially we think it would be safer to start with a
separate and initially more limited analysis in LoopVectorize.


Extend to cost model to choose between interleaving or using "compound
operations"
-----------------------------------------------------------------------
For load/store instructions which result in compound values, we choose
SLP-vectorization over interleave/scalarization, if it is beneficial.
The cost of loading/storing an interleave-group for SLP vectorization is the
same as interleaving, but without the costs to rearrange the values.

For regular instructions operating on compound values, we count the cost of the
operation in lane 0 of the resulting compound value considering the
vectorization factor and ignore the instructions in other lanes, as we can do
them for "free". In case we mix regular vectorization and SLP-style
vectorization, we have to also account for additional instructions to switch
between data layouts.

One limitation here is that we commit to either interleaving/compound
vectorization when calculating the cost for the loads. Depending on the other
instructions in the loop, interleaving could be beneficial overall, even though
we could use compound vectorization for some operations.  Initially we could
only consider SLP-style vectorization if it can be used for all instructions.


Add support for SLP style vectorization to Vplan
------------------------------------------------------------------------
Introduce 2 new recipes VPSLPMemoryRecipe and VPSLPInstructionRecipe. 
Both generate code to produce a single compound value and we add those recipes
for the instruction in lane 0 of the compound value. We do not add any recipes
for instructions in other lanes.

The SLP vectorizer in LLVM also implements code generation for similar cases. In
the long term it  might make sense to share the Vplan based infrastructure
between LoopVectorize and the SLP vectorizer.



Alternative approaches
=====================We do not necessarily have to deal with that case in the
vectorizer to improve the generate code in simple cases. For example, we could
add patterns to instcombine to undo/remove the interleaving done by the
vectorizer for certain common patterns. But this seems quite fragile.

Another possible solution to make LoopReroll aware of operations on compound
values and let it deal with de-interleaving the loop iterations. This would
require no changes to LoopVectorize, but only allows us to handle simple cases.

Also, as a follow-up, we want to build on the infrastructure described in this
RFC to enable vectorization using new HW instructions that operate on compund
values, like FCADD and FCMLA in Armv8.3-a.


[1] "Loop-Aware SLP in GCC" by Ira Rosen, Dorit Nuzman, Ayal Zaks.


Thanks for bearing with me! I am looking forward to any thoughts & feedback!

Cheers,
Florian
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Florian Hahn via llvm-dev

2018-Feb-08 15:30 UTC

head link

[llvm-dev] [RFC] Make LoopVectorize Aware of SLP Operations

Hi,

On 08/02/2018 04:22, Caballero, Diego wrote:> Hi Florian!
> 
> This proposal sounds pretty exciting! Integrating SLP-aware loop
vectorization (or the other way around) and SLP into the VPlan framework is
definitely aligned with the long term vision and we would prefer this approach
to the LoopReroll and InstCombine alternatives that you mentioned. We prefer a
generic implementation that can handle complicated cases to something ad-hoc for
some simple ones. Because of this, we would have some comments regarding the
design that you propose:
> 
>>    1. Detect loops containing SLP opportunities (operations on compound
>>       values)
>>    2. Extend the cost model to choose between interleaving or using
>>       compound values
>>    3. Add support for vectorizing compound operations to VPlan
> 
> Currently, VPlan is not fully integrated in all the stages of the inner
loop vectorizer pipeline. For that reason, part of your implementation (#1 and
#2) would happen outside of VPlan and another part (#3) would be VPlan-based. As
you know, we are currently leveraging a new vectorization path where 1) VPlan is
built upfront in the pipeline
(http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html) and 2) all
the vectorization stages will be implemented on top of the VPlan representation.
We think that your proposal is a really good candidate to be implemented in this
new “VPlan-native” vectorization path. In this way, we would avoid the porting
effort of #1 and #2 to the final VPlan-base infrastructure and would give you
the opportunity of getting involved in the design of VPlan. It would also avoid
introducing the complexity of SLP into the existing cost model and code
generation, which is also a concern to consider. We should definitely talk in
depth about the requirements to implement this in the new vectorization path but
we truly believe it's the best approach.
> 
Thank you very much for your detailed response!

I also think that this proposal would fit very well in the 
"VPlan-native" vectorization path, especially building and evaluating 
multiple plans for different strategies (interleaved and SLP style for 
example) should make the cost-modelling easier in more complex 
scenarios. I suppose this proposal could be a good candidate for an 
initial user of the new VPlan model and I would also be happy with 
helping out with related VPlan infrastructure work!
> 
>>   * loops where some vectors need to be transformed, for example where
>>     different operations are performed on different (groups of) lanes,
>>     like A[i].x + B[i].x, A[i].y - B[i].y which could be transformed to
>>     A[i].x + B[i].x, A[i].y + (-B[i].y), or where one compound group
>>     needs to be reordered, like A[i].x + B[i].y, A[i].y + B[i].x
> 
> This kind of transformation/reordering is an example of what could be a
preparatory VPlan-to-VPlan transformation that could precede and simplify the
core analysis of the SLP-aware analysis.
> 
Yes in those cases I think it would be very beneficial to not commit to 
a single strategy too early.
> 
>> The SLP vectorizer in LLVM already implements a similar analysis. In
the long term, it would probably make sense to share the analysis between
LoopVectorize and the SLP vectorizer. But initially we think it would be safer
to start with a separate and initially more limited analysis in LoopVectorize.
> 
> As a first step, this sounds reasonable to me. If it's implemented on
top of VPlan it could be reused in a future VPlan-base SLP. We should have this
in mind for the design from the very beginning so that we maximize the reuse of
the common ground of "standalone" SLP and SLP-aware loop
vectorization.
Agreed. The infrastructure we add here should make it at least easier to 
re-use parts for standalone SLP.
> 
> 
>> One limitation here is that we commit to either interleaving/compound
vectorization when calculating the cost for the loads. Depending on the other
instructions in the loop, interleaving could be beneficial overall, even though
we could use compound vectorization for some operations.  Initially we could
only consider SLP-style vectorization if it can be used for all instructions.
> 
> VPlan will allow the independent evaluation of multiple vectorization
scenarios in the future. This seems to fit into that category.
Yes, I think this would fit very well in the Vplan-native model and also 
allow for a modular implementation of cost-modelling.
> 
> 
>> Add support for SLP style vectorization to Vplan
>>
------------------------------------------------------------------------
>> Introduce 2 new recipes VPSLPMemoryRecipe and VPSLPInstructionRecipe.
> 
> We introduced VPInstructions to model masking in patch D38676. They are
necessary to properly model the def-use/use-def chains in the VPlan
representation, and we believe you will need to represent def-use/use-def chains
for newly inserted operations, shuffles, and inserts/extracts. As such,
VPInstructions would be a more proper representation than the Recipes per the
long term vision. In fact, we plan to replace some of the current existing
recipes with VPInstructions in the near future. For this reason, we think that
the output of the SLP vectorization should be modelled using VPInstruction and
not with new Recipes.
Interesting, thanks for the pointer. I will have a closer look, but I 
think VPInstructions should be good fit here.



Thanks,
Florian

Maybe Matching Threads

Search for more maybe matching threads

llvm dev - Feb 2018 - [RFC] Make LoopVectorize Aware of SLP Operations

[llvm-dev] [RFC] Make LoopVectorize Aware of SLP Operations

[llvm-dev] [RFC] Make LoopVectorize Aware of SLP Operations

[llvm-dev] [RFC] Make LoopVectorize Aware of SLP Operations

Maybe Matching Threads