Saito, Hideki via llvm-dev
2016-Jun-18 00:30 UTC
[llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
>Vectorizer's output should be as clean as vector code can be so that analyses and optimizers downstream can
>do a great job optimizing.

Guess I should clarify this philosophical position of mine. In terms of vector code optimization that complicates the output of the vectorizer:

If the vectorizer is the best place to perform the optimization, it should do so. This includes cases like:
  - If the vectorizer doesn't do it, we'll lose the opportunity.
  - If the vectorizer doesn't do it, the complexity of performing the optimization goes up significantly.
  - The optimization changes the way vectorization happens (one may consider this as part of "lose the opportunity").

These should be distinguished from:
  - If the vectorizer doesn't do it, it'll add complexity to the vectorizer's cost model.

Compiler cost modeling in general is already complex enough, since there are many things happening downstream; adding one more optimization downstream doesn't change the big picture.

There is certainly a tradeoff against "if the vectorizer does this, we'll lose some other optimization downstream" even if the vectorizer is the best place to perform a certain optimization ---- but this isn't a new problem.

As Ashutosh wrote, the strided-memref optimization in this RFC can be done later in compilation, and doing so should not be much more complex than doing it in the vectorizer. That's why I recommend doing it outside of the vectorizer (e.g., in CG prepare), and I'm glad to hear that Ashutosh is considering that.

Thanks for reading.
Hideki

-----Original Message-----
From: Saito, Hideki
Sent: Friday, June 17, 2016 4:40 PM
To: 'Nema, Ashutosh' <Ashutosh.Nema at amd.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization

>I agree this can be done with the Gather/Scatter intrinsics as well; once we enable those we need the right costing in place. During costing we have to estimate the cost of the load[s], store[s] and shuffle[s], and in CG prepare we have to lower them. In the proposed approach we have introduced a SkipFactor, which helps reduce the number of load[s] and store[s]; i.e., for stride 3 and VF 4 we generate only 2 loads (vs. 3) to model the load operation.

I'm all for properly accounting the cost. I'm just against leaking unnecessary complexity into the IL.

>I did not understand this completely; the vectorizer already identifies the maximum target HW physical vector width, chooses the VF based on it, and then widens the instructions.

Well, the selection of the VF should be based on the cost modeling. If you haven't, please follow the thread started by Michael Kuperstein:
[llvm-dev] [RFC] Allow loop vectorizer to choose vector widths that generate illegal types

In general, a "logical vector" operation can be one full vector on the target, less than one full vector (such as half), or more than one full vector (e.g., two or four vectors). LLVM IR instructions can support logical vectors (and have vector type legalization support). I don't see why the vectorizer should be restricted to target-legalized vectors only.

Now, looking at the stride-2 load and VF=4 case, where the target HW can accommodate 4 elements in a vector:
  A * B * C * D
We could try breaking this up into two loads as
  A * B * + C * D *
or
  A * B * + * C * D
The second one is fine if we can assume that any memory between A and D can be accessed. However, the first one is bad, since the "*" after D may cause a SEGV.
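To make the two splits concrete, here is a minimal IR sketch of the in-bounds variant, written in the style of the RFC's own examples (this is an illustration, not prototype output; %base and the value names are assumed). The second wide load starts at offset 3, so its last lane is D itself:

  ; Valid stride-2 elements sit at offsets 0, 2, 4, 6 from %base (A, B, C, D above).
  %p0 = bitcast i32* %base to <4 x i32>*
  %v0 = load <4 x i32>, <4 x i32>* %p0, align 4            ; covers offsets 0..3: A * B *
  %q  = getelementptr i32, i32* %base, i64 3
  %p1 = bitcast i32* %q to <4 x i32>*
  %v1 = load <4 x i32>, <4 x i32>* %p1, align 4            ; covers offsets 3..6: * C * D
  %strided = shufflevector <4 x i32> %v0, <4 x i32> %v1, <4 x i32> <i32 0, i32 2, i32 5, i32 7>
  ; The rejected split would start the second load at offset 4 (covering 4..7), so its
  ; last lane would read one element past D and may fault.

Note that, by the RFC's SkipFactor formula, stride 2 with VF 4 gives SkipFactor = (2 - (4 % 2)) % 2 = 0, which would place the second load at offset 4 rather than 3; that appears to be exactly the out-of-bounds situation described above.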
What if we want to use VF=8? Under the proposal in consideration, some operations (such as ADD) are consuming the result of the load as an 8-way vector (say, <8 x float>), but the load is somehow split up into 2x2x4. That's not good for optimizers. A load of an 8-way vector should look like a load of 8 elements --- not a convoluted sequence of smaller loads and a bunch of shuffles that analyses/optimizations downstream (or even a human) will have a hard time recognizing as a load of 8 elements.

>But I accept this can be done with a hypothetical "stride load/store" intrinsic or the gather/scatter intrinsics: generate these intrinsics in the vectorizer and later, during CodeGen/CG prepare, generate the corresponding instructions.

Thanks. That's what I'd recommend ---- unless someone can clearly show that adding this much complexity to the vectorizer's output enables some other optimizations (and without hurting optimizations) downstream.

Vectorizer's output should be as clean as vector code can be so that analyses and optimizers downstream can do a great job optimizing.

Thanks,
Hideki

-----Original Message-----
From: Nema, Ashutosh [mailto:Ashutosh.Nema at amd.com]
Sent: Thursday, June 16, 2016 11:05 PM
To: Saito, Hideki <hideki.saito at intel.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization

Thanks Hideki for going through this proposal.

I agree this can be done with the Gather/Scatter intrinsics as well; once we enable those we need the right costing in place. During costing we have to estimate the cost of the load[s], store[s] and shuffle[s], and in CG prepare we have to lower them. In the proposed approach we have introduced a SkipFactor, which helps reduce the number of load[s] and store[s]; i.e., for stride 3 and VF 4 we generate only 2 loads (vs. 3) to model the load operation.

> 4) Unfortunately, 3) most likely implies that the wide load used here needs to be based on the target HW physical
>    vector width. There are reasons why such type legalization in the vectorizer should be avoided.

I did not understand this completely; the vectorizer already identifies the maximum target HW physical vector width, chooses the VF based on it, and then widens the instructions.

> If I remember correctly, Elena Demikhovsky (my colleague) attempted to introduce a stride load/store intrinsic
> some time ago and got a push back. I think it's best to address the following questions.
> a) What can this enable over a gather/scatter intrinsic w/ constant index vector?
> b) What can this enable over a hypothetical "stride load/store" intrinsic?

In this proposal we are not generating any intrinsics; rather, we directly generate the required instructions based on the vector factor. Here the instruction widths are purely based on the vector factor. We have also introduced the SkipFactor concept to reduce the number of load[s] and store[s].

But I accept this can be done with a hypothetical "stride load/store" intrinsic or the gather/scatter intrinsics: generate these intrinsics in the vectorizer and later, during CodeGen/CG prepare, generate the corresponding instructions.

> c) Any chance of introducing stride load/store as a form of wide load/store?

We already widen the instructions, but purely based on the vector factor; i.e., for an i32 load with VF 4 the generated wide load will be <4 x i32>. We intentionally do not generate wide loads the way the interleave vectorizer does, because we transform everything in the vectorizer, and the wide load would not help us minimize the required load[s] and store[s]: for VF 4 and stride 3, the interleave vectorizer will generate a <12 x i32> wide load (which is actually 3 <4 x i32>), but this approach generates only 2 <4 x i32> loads.

Regards,
Ashutosh
> -----Original Message-----
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Saito, Hideki via llvm-dev
> Sent: Thursday, June 16, 2016 5:10 AM
> To: llvm-dev at lists.llvm.org
> Subject: Re: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
>
> Ashutosh,
>
> First, I'm all for enabling general stride load/store support for all targets --- it should be just a matter of
> proper cost modeling. For that matter, we should enable general gather/scatter support for all targets.
>
> About the specific approach taken by this RFC:
> 1) It's best to first state that this is for small constant strides where a single wide load/store can cover two
>    or more valid stride load/store elements. If a wide load/store contains just one valid element, there isn't
>    any need to do a wide load/store.
> 2) Optimal code generation for strided memory accesses is target dependent. Breaking one gather/scatter w/
>    constant index vector into several IR instructions like below might hurt such optimization, especially when
>    the vectorization factor is greater than the target legal vector width ---- this can happen in mixed data type
>    code (int mixed with double, etc.).
> 3) Make sure that the last element of an unmasked load is a valid stride load element ---- otherwise, an
>    out-of-bounds SEGV can happen. Use of the masked load intrinsic most likely degrades any benefit over the
>    gather intrinsic w/ constant index.
> 4) Unfortunately, 3) most likely implies that the wide load used here needs to be based on the target HW physical
>    vector width. There are reasons why such type legalization in the vectorizer should be avoided.
> 5) In general, I think leaking this much complexity into the IR hurts (rather than enables) optimization. It's
>    best to keep the vectorizer's output as simple as vector code can be, such that other optimizers can do their
>    best in optimizing vector code.
>
> If I remember correctly, Elena Demikhovsky (my colleague) attempted to introduce a stride load/store intrinsic
> some time ago and got a push back. I think it's best to address the following questions.
> a) What can this enable over a gather/scatter intrinsic w/ constant index vector?
> b) What can this enable over a hypothetical "stride load/store" intrinsic?
> c) Any chance of introducing stride load/store as a form of wide load/store? I understand that there is a big
>    hurdle here, but the reward should also be pretty good.
>
> One might claim that the cost modeling can be easier, but that is also the downside, since the generated cost may
> not match the cost of optimal code.
>
> All these things considered, my suggestion is to start from enabling optimal code generation (or optimal lowering
> in CG prepare) from the gather/scatter intrinsics and reflect that in the vectorizer's cost model.
>
> As a vectorizer-centric person, I'd like to eventually see gather/scatter/stride load/store supported by IR
> instructions (i.e., not intrinsics), but I'll leave that discussion to another time so that I won't pollute
> Ashutosh's RFC.
>
> Thanks,
> Hideki Saito
> Technical Lead of Vectorizer Development, Intel Compiler and Languages
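As an illustrative aside on point 3): the out-of-bounds last lane of a tail load could in principle be avoided by masking the lanes that are not valid stride elements, at the efficiency cost noted above. A minimal sketch (not prototype output), using the masked load intrinsic in the same circa-2016 form the RFC already uses for masked stores; %p1 is an assumed pointer to the second 4-wide block of the stride-2 example earlier in the thread:

  ; The second block of the unsafe stride-2 split covers offsets 4..7, but only lanes 0 and 2
  ; (offsets 4 and 6) are real stride elements. Masking lanes 1 and 3 keeps the access from
  ; touching offset 7, one element past the last valid stride element.
  %v1 = call <4 x i32> @llvm.masked.load.v4i32(<4 x i32>* %p1, i32 4, <4 x i1> <i1 true, i1 false, i1 true, i1 false>, <4 x i32> undef)

  declare <4 x i32> @llvm.masked.load.v4i32(<4 x i32>*, i32, <4 x i1>, <4 x i32>)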
> -----------------------------------
> Message: 7
> Date: Wed, 15 Jun 2016 06:16:58 +0000
> From: "Nema, Ashutosh via llvm-dev" <llvm-dev at lists.llvm.org>
> To: llvm-dev <llvm-dev at lists.llvm.org>
> Subject: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
> Message-ID: <BLUPR12MB0644468A76C335E9B7431853FB550 at BLUPR12MB0644.namprd12.prod.outlook.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi,
>
> This is a proposal about implementing strided memory access vectorization.
>
> Currently LLVM does not support strided access vectorization completely; some support is available via interleave
> vectorization.
>
> There are two main overheads with strided vectorization:
> * The overhead of consolidating data into an operable vector.
> * The overhead of distributing the data elements after the operations.
>
> Because of these overheads LLVM finds strided memory vectorization unprofitable and generates a serial code
> sequence instead of vectorized code. GATHER & SCATTER instruction support is available on a few (not all) targets
> for the consolidation & distribution operations.
>
> In this approach we are trying to handle cases like this:
>   for (int i = 0; i < len; i++)
>     a[i*3] = b[i*2] + c[i*3];
>
> We model strided memory loads & stores using shuffle & load / masked-store operations.
> * A load is modeled as loads followed by a shuffle.
> * A store is modeled as a shuffle followed by a masked store.
> * To minimize the load and store operations we introduce a 'SkipFactor'.
>
> 'SkipFactor':
> * Multiple load operations are required to consolidate data into an operable vector.
> * Between the various loads we skip a few offsets to consolidate effectively.
> * SkipFactor is the number of additional offsets required to move from the previous vector load's memory-end
>   address to the next vector load.
> * SkipFactor allows all memory loads to be considered identical (from a valid-element perspective).
> * SkipFactor = (Stride - (VF % Stride)) % Stride
>
> For example, for stride 3 and VF 4, SkipFactor = (3 - (4 % 3)) % 3 = 2.
>
> How LOAD is modeled:
> Load with stride 3 (i.e., the load for b[3*i]):
>   %5 = getelementptr inbounds i32, i32* %b, i64 %.lhs
>   %6 = bitcast i32* %5 to <4 x i32>*
>   %stride.load27 = load <4 x i32>, <4 x i32>* %6, align 1, !tbaa !1
>   %7 = getelementptr i32, i32* %5, i64 6
>   %8 = bitcast i32* %7 to <4 x i32>*
>   %stride.load28 = load <4 x i32>, <4 x i32>* %8, align 1, !tbaa !1
>   %strided.vec29 = shufflevector <4 x i32> %stride.load27, <4 x i32> %stride.load28, <4 x i32> <i32 0, i32 3, i32 4, i32 7>
>
> How STORE is modeled:
> Store with stride 3 (i.e., the store to c[3*i]):
>   %10 = getelementptr inbounds i32, i32* %c, i64 %.lhs
>   %11 = bitcast i32* %10 to <4 x i32>*
>   %interleaved.vec = shufflevector <4 x i32> %StoreResult, <4 x i32> undef, <4 x i32> <i32 0, i32 undef, i32 undef, i32 1>
>   call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec, <4 x i32>* %11, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
>   %12 = getelementptr i32, i32* %10, i64 6
>   %13 = bitcast i32* %12 to <4 x i32>*
>   %interleaved.vec30 = shufflevector <4 x i32> %StoreResult, <4 x i32> undef, <4 x i32> <i32 2, i32 undef, i32 undef, i32 3>
>   call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec30, <4 x i32>* %13, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
>
> NOTE: For stride 3 the SkipFactor is 2, which is why the second GEP is offset by 6 (4+2).
>
> To enable this feature the following LLVM changes are required:
> 1) Identify strided memory accesses (this already exists; the interleave vectorizer does it).
> 2) Costing changes:
>    a. Identify the number of load[s]/store[s] & shuffle[s] required to model a load/store operation, taking the
>       SkipFactor into account.
>    b. Return the total cost by adding the load[s]/store[s] and shuffle[s] costs.
> 3) Transform:
>    a. Generate shuffle[s] followed by masked-store[s] instructions to model a store operation.
>    b. Generate load[s] followed by shuffle[s] instructions to model a load operation.
>
> We implemented a working prototype to demonstrate this feature; it is available at the review thread below:
> http://reviews.llvm.org/D21363
>
> Use the options below to enable this feature:
> "-mllvm -enable-interleaved-mem-accesses -mllvm -enable-strided-vectorization"
>
> Gains observed with the prototype:
> TSVC kernel S111   1.15x
> TSVC kernel S1111  1.42x
>
> If the community is interested in the above work, I have the plan below
> (I'm not sure whether to keep this feature separate or make it part of the current interleave vectorizer):
>
> 1) Enable the interleave vectorizer on X86 (if this feature is going in as an enhancement to the interleave
>    vectorizer).
> 2) If it's going with the interleave vectorizer, modify its infrastructure to support this.
> 3) Enable strided vectorization by submitting the costing & transform changes.
>
> Please share your valuable thoughts.
>
> Thanks,
> Ashutosh
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
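For contrast with the two-load-plus-shuffle sequence modeled in the RFC above, the same stride-3 load could be kept as the single "gather w/ constant index vector" form that the review recommends lowering later (e.g., in CG prepare). This is a hedged sketch rather than prototype output; %b.i (the per-iteration base pointer), the alignment, and the all-true mask are assumptions, and the intrinsic is written with its circa-LLVM-3.8 signature:

  ; b[3*i], b[3*(i+1)], b[3*(i+2)], b[3*(i+3)] expressed as one logical operation:
  ; a vector GEP with the constant index vector <0, 3, 6, 9> feeding a masked gather.
  %ptrs = getelementptr inbounds i32, i32* %b.i, <4 x i64> <i64 0, i64 3, i64 6, i64 9>
  %vals = call <4 x i32> @llvm.masked.gather.v4i32(<4 x i32*> %ptrs, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef)

  declare <4 x i32> @llvm.masked.gather.v4i32(<4 x i32*>, i32, <4 x i1>, <4 x i32>)

In this form the access still reads as one logical load of four elements; the split-load-plus-shuffle shape would then be produced only where the target's cost model and lowering prefer it.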
Nema, Ashutosh via llvm-dev
2016-Jun-30 04:50 UTC
[llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
One common concern was raised for cases where the Loop Vectorizer generates bigger types than the target supports: currently, based on the VF, we check the cost and generate the expected set of instruction[s] for the bigger type. This has two challenges: for bigger types the cost is not always correct, and code generation may not produce efficient instruction[s].

This can probably depend on the support provided by the RFC below from Michael:
  "Allow loop vectorizer to choose vector widths that generate illegal types"

In that case the Loop Vectorizer will consider the cost of the bigger types and generate a gather/scatter of the same type. Later, code generation will produce the desired instruction[s] by breaking the bigger types down into what the target supports.

Regards,
Ashutosh
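To illustrate what this would mean in IR (a hedged sketch under the assumption that Michael's RFC lands; the value names and the 4-wide target are made up): a stride-3 load vectorized at VF=8 could stay as a single <8 x i32> gather in the vectorizer's output even though that type is illegal for the target, with type legalization or CG prepare later splitting it into target-sized pieces.

  ; One logical stride-3 load at VF=8, wider than a hypothetical 4-wide target.
  %ptrs = getelementptr inbounds i32, i32* %c.i, <8 x i64> <i64 0, i64 3, i64 6, i64 9, i64 12, i64 15, i64 18, i64 21>
  %vals = call <8 x i32> @llvm.masked.gather.v8i32(<8 x i32*> %ptrs, i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x i32> undef)
  ; Splitting this into two 4-wide operations (or whatever the target supports) would then
  ; happen during legalization / lowering, not in the vectorizer.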
Saito, Hideki via llvm-dev
2016-Jun-30 17:41 UTC
[llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
As a strong advocate of logical vector representation, I'm counting on the community liking Michael's RFC so that it proceeds sooner rather than later. I plan to pitch in (e.g., perf experiments).

> This can probably depend on the support provided by the RFC below from Michael:
>   "Allow loop vectorizer to choose vector widths that generate illegal types"
> In that case the Loop Vectorizer will consider the cost of the bigger types and generate a gather/scatter of the
> same type. Later, code generation will produce the desired instruction[s] by breaking the bigger types down into
> what the target supports.

Hopefully you can come up with a short-term workaround if the progress of Michael's work is behind yours.

Thanks,
Hideki