Saito, Hideki via llvm-dev
2016-Jun-30 17:41 UTC
[llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
As a strong advocate of logical vector representation, I'm counting on the community liking Michael's RFC so that it proceeds sooner rather than later. I plan to pitch in (e.g., perf experiments).

>Probably can depend on the support provided by below RFC by Michael:
> "Allow loop vectorizer to choose vector widths that generate illegal types"
>In that case Loop Vectorizer will consider the cost of bigger types and generate gather/scatter of the same type. Later, code generation will generate the desired instruction[s] by breaking the bigger types into target-supported ones.

Hopefully, you can come up with a short-term workaround if the progress of Michael's work is behind yours.

Thanks,
Hideki

-----Original Message-----
From: Nema, Ashutosh [mailto:Ashutosh.Nema at amd.com]
Sent: Wednesday, June 29, 2016 9:50 PM
To: Saito, Hideki <hideki.saito at intel.com>; Demikhovsky, Elena <elena.demikhovsky at intel.com>; silviu.baranga at gmail.com; Zaks, Ayal <ayal.zaks at intel.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>; asbirlea at google.com; renato.golin at linaro.org; mssimpso at codeaurora.org; kv.bhat at samsung.com; Shahid, Asghar-ahmad <Asghar-ahmad.Shahid at amd.com>; sanjoy at playingwithpointers.com; mzolotukhin at apple.com; Michael Kuperstein <mkuper at google.com>
Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization

One common concern was raised for cases where the Loop Vectorizer generates bigger types than the target supports:

Based on the VF, we currently check the cost and generate the expected set of instruction[s] for the bigger type. This has two challenges: for bigger types the cost is not always correct, and code generation may not generate efficient instruction[s].

We can probably depend on the support provided by the below RFC by Michael:
"Allow loop vectorizer to choose vector widths that generate illegal types"
In that case the Loop Vectorizer will consider the cost of bigger types and generate gather/scatter of the same type. Later, code generation will generate the desired instruction[s] by breaking the bigger types into target-supported ones.

Regards,
Ashutosh

> -----Original Message-----
> From: Saito, Hideki [mailto:hideki.saito at intel.com]
> Sent: Saturday, June 18, 2016 6:00 AM
> To: Nema, Ashutosh <Ashutosh.Nema at amd.com>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>
> Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
>
> >Vectorizer's output should be as clean as vector code can be so that analyses and optimizers downstream can
> >do a great job optimizing.
>
> I guess I should clarify this philosophical position of mine, in terms of vector code optimizations that complicate the output of the vectorizer:
>
> If the vectorizer is the best place to perform the optimization, it should do so. This includes cases like:
>   * If the vectorizer doesn't do it, we'll lose the opportunity.
>   * If the vectorizer doesn't do it, the complexity of performing the optimization goes up significantly.
>   * The optimization changes the way vectorization happens (which may be considered part of "losing the opportunity").
> These should be kept distinct from:
>   * If the vectorizer doesn't do it, it'll add complexity to the vectorizer's cost model.
> Compilers' cost modeling in general is already complex enough, since there are many things happening downstream. Adding one more optimization downstream doesn't change the big picture.
>
> There is certainly a tradeoff against "if the vectorizer does this, we'll lose some other optimization downstream", even if the vectorizer is the best place to perform a certain optimization ---- but this isn't a new problem.
>
> As Ashutosh wrote, the strided memref optimization in this RFC can be done later in compilation, and doing so should not be much more complex than doing it in the vectorizer. That's why I recommend doing it outside of the vectorizer (like CG prepare), and I'm glad to hear that Ashutosh is considering that.
>
> Thanks for reading.
> Hideki
>
> -----Original Message-----
> From: Saito, Hideki
> Sent: Friday, June 17, 2016 4:40 PM
> To: 'Nema, Ashutosh' <Ashutosh.Nema at amd.com>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>
> Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
>
> >I agree this can be done with the Gather/Scatter intrinsics as well; once we enable these, we need to put the right costing in place. During costing we have to estimate the cost of the load[s], store[s] and shuffle[s], and in CG prepare we have to lower them. In the proposed approach we have introduced the SkipFactor, which helps to reduce the number of load[s] & store[s]; i.e., for Stride 3 & VF 4 we only generate 2 loads (vs. 3) to model the load operation.
>
> I'm all for properly accounting the cost. I'm just against leaking unnecessary complexity into the IL.
>
> >I did not understand this completely; the vectorizer already identifies the maximum target HW physical vector width, based on it chooses the VF, and then widens the instructions.
>
> Well, the selection of VF should be based on cost modeling. If you haven't, please follow the thread started by Michael Kuperstein:
> [llvm-dev] [RFC] Allow loop vectorizer to choose vector widths that generate illegal types
> In general, a "logical vector" operation can be one full vector on the target, less than one full vector (such as half), or more than one full vector (e.g., two or four vectors). LLVM IR instructions can support logical vectors (and have vector type legalization support). I don't see why the vectorizer should be restricted to using target-legalized vectors only.
>
> Now, looking at the Stride-2 Load and VF=4 case, where the target HW can accommodate 4 elements in a vector:
>     A * B * C * D
> We could try breaking it up into two loads, as
>     A * B * + C * D *
> or
>     A * B * + * C * D
> The second one is fine if we can assume that any memory between A and D can be accessed.
> However, the first one is bad since the "*" after D may cause a SEGV.
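For concreteness, here is a minimal C sketch of the two split choices above (illustrative only; the buffer layout and names are assumptions, not code from the RFC's patch):

    /* Stride-2, VF=4 load of A,B,C,D from indices 0,2,4,6 of a 7-element
     * buffer; index 7 does not exist. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        int mem[7] = {1, -1, 2, -1, 3, -1, 4};  /* A * B * C * D */
        int lo[4], hi[4], vec[4];

        /* Option 1 would load mem[0..3] and mem[4..7]; reading mem[7] is the
         * "*" after D and can fault when D ends the allocation. */

        /* Option 2: load mem[0..3] and mem[3..6]; the second load ends exactly
         * at D, so nothing past D is touched. The overlapping read is fine as
         * long as all memory between A and D is accessible. */
        memcpy(lo, mem + 0, sizeof lo);
        memcpy(hi, mem + 3, sizeof hi);

        /* "Shuffle": take lanes {0,2} of the first load, {1,3} of the second. */
        vec[0] = lo[0]; vec[1] = lo[2]; vec[2] = hi[1]; vec[3] = hi[3];
        printf("%d %d %d %d\n", vec[0], vec[1], vec[2], vec[3]);  /* 1 2 3 4 */
        return 0;
    }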
> What if we want to use VF=8? Under the proposal in consideration, some operations (such as ADD) consume the result of the load as an 8-way vector (say, <8 x float>), but the load is somehow split up into 2x2x4. That's not good for optimizers. A load of an 8-way vector should look like a load of 8 elements --- not a convoluted sequence of smaller loads and a bunch of shuffles such that analyses/optimizations downstream (or even a human) will have a hard time understanding it as a load of 8 elements.
>
> >But I accept this can be done with a hypothetical "stride load/store" intrinsic or the gather/scatter intrinsics. The vectorizer generates these intrinsics, and later CodeGen/CG-Prepare generates the mapping instructions.
>
> Thanks. That's what I'd recommend ---- unless someone can clearly show that adding this much complexity to the vectorizer's output enables some other optimizations (and without hurting optimizations) downstream.
>
> Vectorizer's output should be as clean as vector code can be so that analyses and optimizers downstream can do a great job optimizing.
>
> Thanks,
> Hideki
>
> -----Original Message-----
> From: Nema, Ashutosh [mailto:Ashutosh.Nema at amd.com]
> Sent: Thursday, June 16, 2016 11:05 PM
> To: Saito, Hideki <hideki.saito at intel.com>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>
> Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
>
> Thanks Hideki for going through this proposal.
>
> I agree this can be done with the Gather/Scatter intrinsics as well; once we enable these, we need to put the right costing in place. During costing we have to estimate the cost of the load[s], store[s] and shuffle[s], and in CG prepare we have to lower them. In the proposed approach we have introduced the SkipFactor, which helps to reduce the number of load[s] & store[s]; i.e., for Stride 3 & VF 4 we only generate 2 loads (vs. 3) to model the load operation.
>
> > 4) Unfortunately, 3) most likely implies that the wide load used here needs to be based on the target HW physical
> >    vector width. There are reasons why such type legalization at the vectorizer should be avoided.
>
> I did not understand this completely; the vectorizer already identifies the maximum target HW physical vector width, based on it chooses the VF, and then widens the instructions.
>
> > If I remember correctly, Elena Demikhovsky (my colleague) attempted to introduce a stride load/store intrinsic some time ago and got push back. I think it's best to address the following questions.
> > a) What can this enable over a gather/scatter intrinsic w/ a constant index vector?
> > b) What can this enable over a hypothetical "stride load/store" intrinsic?
>
> In this proposal we are not generating any intrinsics; rather, we directly generate the required instructions based on the vector factor. Here the instruction widths are purely based on the vector factor. Also, we have introduced the SkipFactor concept to reduce the number of load[s] & store[s].
>
> But I accept this can be done with a hypothetical "stride load/store" intrinsic or the gather/scatter intrinsics. The vectorizer generates these intrinsics, and later CodeGen/CG-Prepare generates the mapping instructions.
>
> > c) Any chance of introducing stride load/store as a form of wide load/store?
>
> We already widen the instructions, but purely based on the vector factor; i.e., for an i32 load with VF 4, the generated wide load will be <4 x i32>. We are not intentionally generating wide loads like the interleave vectorizer, because we transform everything in the vectorizer and a wide load will not help us minimize the required load[s] & store[s]; i.e., for VF 4 and stride 3, the interleave vectorizer will generate a <12 x i32> wide load (which is actually 3 <4 x i32>), but this approach generates only 2 <4 x i32> loads.
>
> Regards,
> Ashutosh
>
> > -----Original Message-----
> > From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Saito, Hideki via llvm-dev
> > Sent: Thursday, June 16, 2016 5:10 AM
> > To: llvm-dev at lists.llvm.org
> > Subject: Re: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
> >
> > Ashutosh,
> >
> > First, I'm all for enabling general stride load/store support for all targets --- it should be just a matter of proper cost modeling. For that matter, we should enable general gather/scatter support for all targets.
> >
> > About the specific approach taken by this RFC:
> > 1) It's best to first state that this is for small constant strides where a single wide load/store can cover two or more
> >    valid stride load/store elements. If a wide load/store contains just one valid element, there is no need
> >    to do a wide load/store.
> > 2) Optimal code generation for strided memory accesses is target dependent.
> >    Breaking one gather/scatter w/ a constant index vector into several IR instructions like below might hurt such optimization,
> >    especially when the vectorization factor is greater than the target legal vector width ---- this can happen
> >    in mixed data type code (int mixed with double, etc.).
> > 3) Make sure that the last element of an unmasked load is a valid stride load element ---- otherwise,
> >    an out-of-bounds SEGV can happen. Use of the masked load intrinsic most likely degrades any benefits over
> >    the gather intrinsic w/ a constant index.
> > 4) Unfortunately, 3) most likely implies that the wide load used here needs to be based on the target HW physical
> >    vector width. There are reasons why such type legalization at the vectorizer should be avoided.
> > 5) In general, I think leaking this much complexity into the IR hurts (rather than enables) optimization.
> >    It's best to keep the vectorizer's output as simple as vector code can be, such that other optimizers can do their best
> >    in optimizing vector code.
> >
> > If I remember correctly, Elena Demikhovsky (my colleague) attempted to introduce a stride load/store intrinsic some time ago and got push back. I think it's best to address the following questions.
> > a) What can this enable over a gather/scatter intrinsic w/ a constant index vector?
> > b) What can this enable over a hypothetical "stride load/store" intrinsic?
> > c) Any chance of introducing stride load/store as a form of wide load/store?
> >    I understand that there is a big hurdle here, but the reward should also be pretty good here.
> >
> > One might claim that the cost modeling can be easier, but that is also the downside, since the generated cost may not match the cost of optimal code.
> >
> > All these things considered, my suggestion is to start from enabling optimal code generation (or optimal lowering in CG prepare) from the gather/scatter intrinsics and reflect that in the vectorizer's cost model.
> >
> > As a vectorizer-centric person, I'd like to eventually see gather/scatter/stride-load/store supported by IR instructions (i.e., not intrinsics), but I'll leave that discussion to another time so that I won't pollute Ashutosh's RFC.
> >
> > Thanks,
> > Hideki Saito
> > Technical Lead of Vectorizer Development
> > Intel Compiler and Languages
> >
> > -----------------------------------
> > Message: 7
> > Date: Wed, 15 Jun 2016 06:16:58 +0000
> > From: "Nema, Ashutosh via llvm-dev" <llvm-dev at lists.llvm.org>
> > To: llvm-dev <llvm-dev at lists.llvm.org>
> > Subject: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
> > Message-ID: <BLUPR12MB0644468A76C335E9B7431853FB550 at BLUPR12MB0644.namprd12.prod.outlook.com>
> > Content-Type: text/plain; charset="utf-8"
> >
> > Hi,
> >
> > This is a proposal about implementing strided memory access vectorization.
> >
> > Currently LLVM does not support strided access vectorization completely; some support is available via interleave vectorization.
> >
> > There are two main overheads with strided vectorization:
> > * The overhead of consolidating data into an operable vector.
> > * The overhead of distributing the data elements after the operations.
> >
> > Because of these overheads, LLVM finds strided memory vectorization unprofitable and generates a serial code sequence instead of vectorized code. GATHER & SCATTER instruction support is available on a few (not all) targets for the consolidation & distribution operations.
> >
> > In this approach we are trying to handle cases like this:
> >   for (int i = 0; i < len; i++)
> >     a[i*3] = b[i*2] + c[i*3];
> >
> > We model strided memory loads & stores using shuffle & load/mask-store operations:
> > * A load is modeled as loads followed by a shuffle.
> > * A store is modeled as a shuffle followed by a mask store.
> > * To minimize the load and store operations, we introduce the 'SkipFactor'.
> >
> > 'SkipFactor':
> > * Multiple load operations are required to consolidate data into an operable vector.
> > * Between the loads we skip a few offsets to consolidate effectively.
> > * SkipFactor is the number of additional offsets required to move from the previous vector load's memory-end address to the next vector load.
> > * SkipFactor allows all memory loads to be considered identical (from a valid-element perspective).
> > * SkipFactor = (Stride - (VF % Stride)) % Stride
> >
> > For example, with Stride 3 and VF 4: SkipFactor = (3 - (4 % 3)) % 3 = 2, so each 4-wide load is followed by a skip of 2 elements before the next load.
> >
> > How a LOAD is modeled:
> > Load with stride 3 (i.e. load for b[3*i]):
> >   %5 = getelementptr inbounds i32, i32* %b, i64 %.lhs
> >   %6 = bitcast i32* %5 to <4 x i32>*
> >   %stride.load27 = load <4 x i32>, <4 x i32>* %6, align 1, !tbaa !1
> >   %7 = getelementptr i32, i32* %5, i64 6
> >   %8 = bitcast i32* %7 to <4 x i32>*
> >   %stride.load28 = load <4 x i32>, <4 x i32>* %8, align 1, !tbaa !1
> >   %strided.vec29 = shufflevector <4 x i32> %stride.load27, <4 x i32> %stride.load28, <4 x i32> <i32 0, i32 3, i32 4, i32 7>
> >
> > How a STORE is modeled:
> > Store with stride 3 (i.e. store to c[3*i]):
> >   %10 = getelementptr inbounds i32, i32* %c, i64 %.lhs
> >   %11 = bitcast i32* %10 to <4 x i32>*
> >   %interleaved.vec = shufflevector <4 x i32> %StoreResult, <4 x i32> undef, <4 x i32> <i32 0, i32 undef, i32 undef, i32 1>
> >   call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec, <4 x i32>* %11, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
> >   %12 = getelementptr i32, i32* %10, i64 6
> >   %13 = bitcast i32* %12 to <4 x i32>*
> >   %interleaved.vec30 = shufflevector <4 x i32> %StoreResult, <4 x i32> undef, <4 x i32> <i32 2, i32 undef, i32 undef, i32 3>
> >   call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec30, <4 x i32>* %13, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
> >
> > NOTE: For Stride 3 the SkipFactor is 2; that's why the second GEP offsets by 6 (4+2).
> >
> > To enable this feature, the below LLVM changes are required:
> > 1) Identify strided memory accesses (this already exists, i.e., the interleave vectorizer does this).
> > 2) Costing changes:
> >    a. Identify the number of load[s]/store[s] & shuffle[s] required to model a load/store operation, considering the SkipFactor.
> >    b. Return the total cost by adding the load[s]/store[s] and shuffle[s] costs.
> > 3) Transform:
> >    a. Generate shuffle[s] followed by mask-store[s] instructions to model a store operation.
> >    b. Generate load[s] followed by shuffle[s] instructions to model a load operation.
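The SkipFactor arithmetic and the resulting load offsets can be checked with a small C sketch (illustrative only; the prototype in D21363 may organize this differently):

    #include <stdio.h>

    int main(void) {
        unsigned Stride = 3, VF = 4;

        /* SkipFactor = (Stride - (VF % Stride)) % Stride */
        unsigned SkipFactor = (Stride - (VF % Stride)) % Stride;

        /* Each VF-wide load starts SkipFactor elements past the end of the
         * previous one, so consecutive load bases differ by VF + SkipFactor.
         * The last valid strided element sits at (VF - 1) * Stride. */
        unsigned NumLoads = 0;
        for (unsigned Base = 0; Base <= (VF - 1) * Stride; Base += VF + SkipFactor) {
            printf("load <%u x i32> at element offset %u\n", VF, Base);
            ++NumLoads;
        }
        /* Stride 3, VF 4: SkipFactor = 2 and the loads land at offsets 0 and 6,
         * matching the GEP offset of 6 in the IR above; 2 loads instead of 3. */
        printf("SkipFactor = %u, NumLoads = %u\n", SkipFactor, NumLoads);
        return 0;
    }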
> >
> > I have implemented a working prototype to demonstrate this feature; it's available at the review thread below:
> > http://reviews.llvm.org/D21363
> >
> > Use the below options to enable this feature:
> > "-mllvm -enable-interleaved-mem-accesses -mllvm -enable-strided-vectorization"
> >
> > Gains observed with the prototype:
> > TSVC kernel S111   1.15x
> > TSVC kernel S1111  1.42x
> >
> > If the community is interested in the above work, I have the below plan:
> >
> > I'm not sure whether to keep this feature separate or make it part of the current interleave vectorizer.
> >
> > 1) Enable the interleave vectorizer on X86 (if this feature is going in as an enhancement to the interleave vectorizer).
> > 2) If it's going with the interleave vectorizer, modify its infra to support this.
> > 3) Enable strided vectorization by submitting the costing & transform changes.
> >
> > Please share your valuable thoughts.
> >
> > Thanks,
> > Ashutosh
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Nema, Ashutosh via llvm-dev
2016-Jul-01 10:06 UTC
[llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
> >Probably can depend on the support provided by below RFC by Michael:
> > "Allow loop vectorizer to choose vector widths that generate illegal types"
> >In that case Loop Vectorizer will consider the cost of bigger types and generate gather/scatter of the same type. Later, code generation will generate the desired instruction[s] by breaking the bigger types into target-supported ones.
>
> Hopefully, you can come up with a short-term workaround if the progress of Michael's work is behind yours.

We can skip those cases for now: whenever the vector factor for a type goes beyond what the target supports, we will disable strided memory vectorization. Once Michael's work is accepted, we will enable it to consider these cases.
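A minimal sketch of that short-term guard (hypothetical names; the actual check in the prototype may differ):

    #include <stdbool.h>

    /* Hypothetical short-term guard: skip strided vectorization whenever the
     * chosen VF needs a vector wider than the target's widest legal register.
     * MaxLegalBits stands in for whatever the target hook reports. */
    bool shouldUseStridedVectorization(unsigned VF, unsigned ElemBits,
                                       unsigned MaxLegalBits) {
        return VF * ElemBits <= MaxLegalBits;  /* e.g., 8 x i32 on 128-bit SSE: 256 > 128 -> skip */
    }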
Regards,
Ashutosh