Saito, Hideki via llvm-dev
2016-Jun-30 17:41 UTC
[llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
As a strong advocate of logical vector representation, I'm counting on the community liking Michael's RFC so that it proceeds sooner rather than later. I plan to pitch in (e.g., perf experiments).

>Probably can depend on the support provided by below RFC by Michael:
> "Allow loop vectorizer to choose vector widths that generate illegal types"
>In that case Loop Vectorizer will consider the cost of bigger types and generate gather/scatter of the same type. Later, code generation will generate the desired instruction[s] by breaking the bigger types into target-supported ones.

Hopefully, you can come up with a short-term workaround if the progress of Michael's work is behind yours.

Thanks,
Hideki

-----Original Message-----
From: Nema, Ashutosh [mailto:Ashutosh.Nema at amd.com]
Sent: Wednesday, June 29, 2016 9:50 PM
To: Saito, Hideki <hideki.saito at intel.com>; Demikhovsky, Elena <elena.demikhovsky at intel.com>; silviu.baranga at gmail.com; Zaks, Ayal <ayal.zaks at intel.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>; asbirlea at google.com; renato.golin at linaro.org; mssimpso at codeaurora.org; kv.bhat at samsung.com; Shahid, Asghar-ahmad <Asghar-ahmad.Shahid at amd.com>; sanjoy at playingwithpointers.com; mzolotukhin at apple.com; Michael Kuperstein <mkuper at google.com>
Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization

One common concern was raised for cases where the Loop Vectorizer generates bigger types than the target supports:

Based on the VF, we currently check the cost and generate the expected set of instruction[s] for the bigger type. This has two challenges: for bigger types the cost is not always correct, and code generation may not generate efficient instruction[s].

We can probably depend on the support provided by the below RFC by Michael:
"Allow loop vectorizer to choose vector widths that generate illegal types"
In that case the Loop Vectorizer will consider the cost of bigger types and generate gather/scatter of the same type. Later, code generation will generate the desired instruction[s] by breaking the bigger types into target-supported ones.

Regards,
Ashutosh

> -----Original Message-----
> From: Saito, Hideki [mailto:hideki.saito at intel.com]
> Sent: Saturday, June 18, 2016 6:00 AM
> To: Nema, Ashutosh <Ashutosh.Nema at amd.com>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>
> Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
>
> >Vectorizer's output should be as clean as vector code can be so that analyses and optimizers downstream can
> >do a great job optimizing.
>
> I guess I should clarify this philosophical position of mine, in terms of vector code optimizations that complicate the output of the vectorizer:
>
> If the vectorizer is the best place to perform the optimization, it should do so. This includes cases like:
>   * If the vectorizer doesn't do it, we'll lose the opportunity.
>   * If the vectorizer doesn't do it, the complexity of performing the optimization goes up significantly.
>   * The optimization changes the way vectorization happens (which may be considered part of "losing the opportunity").
> These should be kept distinct from:
>   * If the vectorizer doesn't do it, it'll add complexity to the vectorizer's cost model.
> Compilers' cost modeling in general is already complex enough, since there are many things happening downstream. Adding one more optimization downstream doesn't change the big picture.
>
> There is certainly a tradeoff against "if the vectorizer does this, we'll lose some other optimization downstream", even if the vectorizer is the best place to perform a certain optimization ---- but this isn't a new problem.
>
> As Ashutosh wrote, the strided memref optimization in this RFC can be done later in compilation, and doing so should not be much more complex than doing it in the vectorizer. That's why I recommend doing it outside of the vectorizer (like CG prepare), and I'm glad to hear that Ashutosh is considering that.
>
> Thanks for reading.
> Hideki
>
> -----Original Message-----
> From: Saito, Hideki
> Sent: Friday, June 17, 2016 4:40 PM
> To: 'Nema, Ashutosh' <Ashutosh.Nema at amd.com>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>
> Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
>
> >I agree this can be done with the Gather/Scatter intrinsics as well; once we enable these, we need to put the right costing in place. During costing we have to estimate the cost of the load[s], store[s] and shuffle[s], and in CG prepare we have to lower them. In the proposed approach we have introduced the SkipFactor, which helps to reduce the number of load[s] & store[s]; i.e., for Stride 3 & VF 4 we only generate 2 loads (vs. 3) to model the load operation.
>
> I'm all for properly accounting the cost. I'm just against leaking unnecessary complexity into the IL.
>
> >I did not understand this completely; the vectorizer already identifies the maximum target HW physical vector width, based on it chooses the VF, and then widens the instructions.
>
> Well, the selection of VF should be based on cost modeling. If you haven't, please follow the thread started by Michael Kuperstein:
> [llvm-dev] [RFC] Allow loop vectorizer to choose vector widths that generate illegal types
> In general, a "logical vector" operation can be one full vector on the target, less than one full vector (such as half), or more than one full vector (e.g., two or four vectors). LLVM IR instructions can support logical vectors (and have vector type legalization support). I don't see why the vectorizer should be restricted to using target-legalized vectors only.
>
> Now, looking at the Stride-2 Load and VF=4 case, where the target HW can accommodate 4 elements in a vector:
>     A * B * C * D
> We could try breaking it up into two loads, as
>     A * B * + C * D *
> or
>     A * B * + * C * D
> The second one is fine if we can assume that any memory between A and D can be accessed.
> However, the first one is bad since the "*" after D may cause a SEGV.
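For concreteness, here is a minimal C sketch of the two split choices above (illustrative only; the buffer layout and names are assumptions, not code from the RFC's patch):

    /* Stride-2, VF=4 load of A,B,C,D from indices 0,2,4,6 of a 7-element
     * buffer; index 7 does not exist. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        int mem[7] = {1, -1, 2, -1, 3, -1, 4};  /* A * B * C * D */
        int lo[4], hi[4], vec[4];

        /* Option 1 would load mem[0..3] and mem[4..7]; reading mem[7] is the
         * "*" after D and can fault when D ends the allocation. */

        /* Option 2: load mem[0..3] and mem[3..6]; the second load ends exactly
         * at D, so nothing past D is touched. The overlapping read is fine as
         * long as all memory between A and D is accessible. */
        memcpy(lo, mem + 0, sizeof lo);
        memcpy(hi, mem + 3, sizeof hi);

        /* "Shuffle": take lanes {0,2} of the first load, {1,3} of the second. */
        vec[0] = lo[0]; vec[1] = lo[2]; vec[2] = hi[1]; vec[3] = hi[3];
        printf("%d %d %d %d\n", vec[0], vec[1], vec[2], vec[3]);  /* 1 2 3 4 */
        return 0;
    }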
> What if we want to use VF=8? Under the proposal in consideration, some operations (such as ADD) consume the result of the load as an 8-way vector (say, <8 x float>), but the load is somehow split up into 2x2x4. That's not good for optimizers. A load of an 8-way vector should look like a load of 8 elements --- not a convoluted sequence of smaller loads and a bunch of shuffles such that analyses/optimizations downstream (or even a human) will have a hard time understanding it as a load of 8 elements.
>
> >But I accept this can be done with a hypothetical "stride load/store" intrinsic or the gather/scatter intrinsics. The vectorizer generates these intrinsics, and later CodeGen/CG-Prepare generates the mapping instructions.
>
> Thanks. That's what I'd recommend ---- unless someone can clearly show that adding this much complexity to the vectorizer's output enables some other optimizations (and without hurting optimizations) downstream.
>
> Vectorizer's output should be as clean as vector code can be so that analyses and optimizers downstream can do a great job optimizing.
>
> Thanks,
> Hideki
>
> -----Original Message-----
> From: Nema, Ashutosh [mailto:Ashutosh.Nema at amd.com]
> Sent: Thursday, June 16, 2016 11:05 PM
> To: Saito, Hideki <hideki.saito at intel.com>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>
> Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
>
> Thanks Hideki for going through this proposal.
>
> I agree this can be done with the Gather/Scatter intrinsics as well; once we enable these, we need to put the right costing in place. During costing we have to estimate the cost of the load[s], store[s] and shuffle[s], and in CG prepare we have to lower them. In the proposed approach we have introduced the SkipFactor, which helps to reduce the number of load[s] & store[s]; i.e., for Stride 3 & VF 4 we only generate 2 loads (vs. 3) to model the load operation.
>
> > 4) Unfortunately, 3) most likely implies that the wide load used here needs to be based on the target HW physical
> >    vector width. There are reasons why such type legalization at the vectorizer should be avoided.
>
> I did not understand this completely; the vectorizer already identifies the maximum target HW physical vector width, based on it chooses the VF, and then widens the instructions.
>
> > If I remember correctly, Elena Demikhovsky (my colleague) attempted to introduce a stride load/store intrinsic some time ago and got push back. I think it's best to address the following questions.
> > a) What can this enable over a gather/scatter intrinsic w/ a constant index vector?
> > b) What can this enable over a hypothetical "stride load/store" intrinsic?
>
> In this proposal we are not generating any intrinsics; rather, we directly generate the required instructions based on the vector factor. Here the instruction widths are purely based on the vector factor. Also, we have introduced the SkipFactor concept to reduce the number of load[s] & store[s].
>
> But I accept this can be done with a hypothetical "stride load/store" intrinsic or the gather/scatter intrinsics. The vectorizer generates these intrinsics, and later CodeGen/CG-Prepare generates the mapping instructions.
>
> > c) Any chance of introducing stride load/store as a form of wide load/store?
>
> We already widen the instructions, but purely based on the vector factor; i.e., for an i32 load with VF 4, the generated wide load will be <4 x i32>. We are not intentionally generating wide loads like the interleave vectorizer, because we transform everything in the vectorizer and a wide load will not help us minimize the required load[s] & store[s]; i.e., for VF 4 and stride 3, the interleave vectorizer will generate a <12 x i32> wide load (which is actually 3 <4 x i32>), but this approach generates only 2 <4 x i32> loads.
>
> Regards,
> Ashutosh
>
> > -----Original Message-----
> > From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Saito, Hideki via llvm-dev
> > Sent: Thursday, June 16, 2016 5:10 AM
> > To: llvm-dev at lists.llvm.org
> > Subject: Re: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
> >
> > Ashutosh,
> >
> > First, I'm all for enabling general stride load/store support for all targets --- it should be just a matter of proper cost modeling. For that matter, we should enable general gather/scatter support for all targets.
> >
> > About the specific approach taken by this RFC:
> > 1) It's best to first state that this is for small constant strides where a single wide load/store can cover two or more
> >    valid stride load/store elements. If a wide load/store contains just one valid element, there is no need
> >    to do a wide load/store.
> > 2) Optimal code generation for strided memory accesses is target dependent.
> >    Breaking one gather/scatter w/ a constant index vector into several IR instructions like below might hurt such optimization,
> >    especially when the vectorization factor is greater than the target legal vector width ---- this can happen
> >    in mixed data type code (int mixed with double, etc.).
> > 3) Make sure that the last element of an unmasked load is a valid stride load element ---- otherwise,
> >    an out-of-bounds SEGV can happen. Use of the masked load intrinsic most likely degrades any benefits over
> >    the gather intrinsic w/ a constant index.
> > 4) Unfortunately, 3) most likely implies that the wide load used here needs to be based on the target HW physical
> >    vector width. There are reasons why such type legalization at the vectorizer should be avoided.
> > 5) In general, I think leaking this much complexity into the IR hurts (rather than enables) optimization.
> >    It's best to keep the vectorizer's output as simple as vector code can be, such that other optimizers can do their best
> >    in optimizing vector code.
> >
> > If I remember correctly, Elena Demikhovsky (my colleague) attempted to introduce a stride load/store intrinsic some time ago and got push back. I think it's best to address the following questions.
> > a) What can this enable over a gather/scatter intrinsic w/ a constant index vector?
> > b) What can this enable over a hypothetical "stride load/store" intrinsic?
> > c) Any chance of introducing stride load/store as a form of wide load/store?
> >    I understand that there is a big hurdle here, but the reward should also be pretty good here.
> >
> > One might claim that the cost modeling can be easier, but that is also the downside, since the generated cost may not match the cost of optimal code.
> >
> > All these things considered, my suggestion is to start from enabling optimal code generation (or optimal lowering in CG prepare) from the gather/scatter intrinsics and reflect that in the vectorizer's cost model.
> >
> > As a vectorizer-centric person, I'd like to eventually see gather/scatter/stride-load/store supported by IR instructions (i.e., not intrinsics), but I'll leave that discussion to another time so that I won't pollute Ashutosh's RFC.
> >
> > Thanks,
> > Hideki Saito
> > Technical Lead of Vectorizer Development
> > Intel Compiler and Languages
> >
> > -----------------------------------
> > Message: 7
> > Date: Wed, 15 Jun 2016 06:16:58 +0000
> > From: "Nema, Ashutosh via llvm-dev" <llvm-dev at lists.llvm.org>
> > To: llvm-dev <llvm-dev at lists.llvm.org>
> > Subject: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
> > Message-ID: <BLUPR12MB0644468A76C335E9B7431853FB550 at BLUPR12MB0644.namprd12.prod.outlook.com>
> > Content-Type: text/plain; charset="utf-8"
> >
> > Hi,
> >
> > This is a proposal about implementing strided memory access vectorization.
> >
> > Currently LLVM does not support strided access vectorization completely; some support is available via interleave vectorization.
> >
> > There are two main overheads with strided vectorization:
> > * The overhead of consolidating data into an operable vector.
> > * The overhead of distributing the data elements after the operations.
> >
> > Because of these overheads, LLVM finds strided memory vectorization unprofitable and generates a serial code sequence instead of vectorized code. GATHER & SCATTER instruction support is available on a few (not all) targets for the consolidation & distribution operations.
> >
> > In this approach we are trying to handle cases like this:
> >   for (int i = 0; i < len; i++)
> >     a[i*3] = b[i*2] + c[i*3];
> >
> > We model strided memory loads & stores using shuffle & load/mask-store operations:
> > * A load is modeled as loads followed by a shuffle.
> > * A store is modeled as a shuffle followed by a mask store.
> > * To minimize the load and store operations, we introduce the 'SkipFactor'.
> >
> > 'SkipFactor':
> > * Multiple load operations are required to consolidate data into an operable vector.
> > * Between the loads we skip a few offsets to consolidate effectively.
> > * SkipFactor is the number of additional offsets required to move from the previous vector load's memory-end address to the next vector load.
> > * SkipFactor allows all memory loads to be considered identical (from a valid-element perspective).
> > * SkipFactor = (Stride - (VF % Stride)) % Stride
> >
> > For example, with Stride 3 and VF 4: SkipFactor = (3 - (4 % 3)) % 3 = 2, so each 4-wide load is followed by a skip of 2 elements before the next load.
> >
> > How a LOAD is modeled:
> > Load with stride 3 (i.e. load for b[3*i]):
> >   %5 = getelementptr inbounds i32, i32* %b, i64 %.lhs
> >   %6 = bitcast i32* %5 to <4 x i32>*
> >   %stride.load27 = load <4 x i32>, <4 x i32>* %6, align 1, !tbaa !1
> >   %7 = getelementptr i32, i32* %5, i64 6
> >   %8 = bitcast i32* %7 to <4 x i32>*
> >   %stride.load28 = load <4 x i32>, <4 x i32>* %8, align 1, !tbaa !1
> >   %strided.vec29 = shufflevector <4 x i32> %stride.load27, <4 x i32> %stride.load28, <4 x i32> <i32 0, i32 3, i32 4, i32 7>
> >
> > How a STORE is modeled:
> > Store with stride 3 (i.e. store to c[3*i]):
> >   %10 = getelementptr inbounds i32, i32* %c, i64 %.lhs
> >   %11 = bitcast i32* %10 to <4 x i32>*
> >   %interleaved.vec = shufflevector <4 x i32> %StoreResult, <4 x i32> undef, <4 x i32> <i32 0, i32 undef, i32 undef, i32 1>
> >   call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec, <4 x i32>* %11, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
> >   %12 = getelementptr i32, i32* %10, i64 6
> >   %13 = bitcast i32* %12 to <4 x i32>*
> >   %interleaved.vec30 = shufflevector <4 x i32> %StoreResult, <4 x i32> undef, <4 x i32> <i32 2, i32 undef, i32 undef, i32 3>
> >   call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec30, <4 x i32>* %13, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
> >
> > NOTE: For Stride 3 the SkipFactor is 2; that's why the second GEP offsets by 6 (4+2).
> >
> > To enable this feature, the below LLVM changes are required:
> > 1) Identify strided memory accesses (this already exists, i.e., the interleave vectorizer does this).
> > 2) Costing changes:
> >    a. Identify the number of load[s]/store[s] & shuffle[s] required to model a load/store operation, considering the SkipFactor.
> >    b. Return the total cost by adding the load[s]/store[s] and shuffle[s] costs.
> > 3) Transform:
> >    a. Generate shuffle[s] followed by mask-store[s] instructions to model a store operation.
> >    b. Generate load[s] followed by shuffle[s] instructions to model a load operation.
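The SkipFactor arithmetic and the resulting load offsets can be checked with a small C sketch (illustrative only; the prototype in D21363 may organize this differently):

    #include <stdio.h>

    int main(void) {
        unsigned Stride = 3, VF = 4;

        /* SkipFactor = (Stride - (VF % Stride)) % Stride */
        unsigned SkipFactor = (Stride - (VF % Stride)) % Stride;

        /* Each VF-wide load starts SkipFactor elements past the end of the
         * previous one, so consecutive load bases differ by VF + SkipFactor.
         * The last valid strided element sits at (VF - 1) * Stride. */
        unsigned NumLoads = 0;
        for (unsigned Base = 0; Base <= (VF - 1) * Stride; Base += VF + SkipFactor) {
            printf("load <%u x i32> at element offset %u\n", VF, Base);
            ++NumLoads;
        }
        /* Stride 3, VF 4: SkipFactor = 2 and the loads land at offsets 0 and 6,
         * matching the GEP offset of 6 in the IR above; 2 loads instead of 3. */
        printf("SkipFactor = %u, NumLoads = %u\n", SkipFactor, NumLoads);
        return 0;
    }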
> >
> > I have implemented a working prototype to demonstrate this feature; it's available at the review thread below:
> > http://reviews.llvm.org/D21363
> >
> > Use the below options to enable this feature:
> > "-mllvm -enable-interleaved-mem-accesses -mllvm -enable-strided-vectorization"
> >
> > Gains observed with the prototype:
> > TSVC kernel S111   1.15x
> > TSVC kernel S1111  1.42x
> >
> > If the community is interested in the above work, I have the below plan:
> >
> > I'm not sure whether to keep this feature separate or make it part of the current interleave vectorizer.
> >
> > 1) Enable the interleave vectorizer on X86 (if this feature is going in as an enhancement to the interleave vectorizer).
> > 2) If it's going with the interleave vectorizer, modify its infra to support this.
> > 3) Enable strided vectorization by submitting the costing & transform changes.
> >
> > Please share your valuable thoughts.
> >
> > Thanks,
> > Ashutosh
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Nema, Ashutosh via llvm-dev
2016-Jul-01 10:06 UTC
[llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
> >Probably can depend on the support provided by below RFC by Michael:
> > "Allow loop vectorizer to choose vector widths that generate illegal types"
> >In that case Loop Vectorizer will consider the cost of bigger types and generate gather/scatter of the same type. Later, code generation will generate the desired instruction[s] by breaking the bigger types into target-supported ones.
>
> Hopefully, you can come up with a short-term workaround if the progress of Michael's work is behind yours.

We can skip those cases for now: whenever the vector factor for a type goes beyond what the target supports, we will disable strided memory vectorization. Once Michael's work is accepted, we will enable it to consider these cases.
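A minimal sketch of that short-term guard (hypothetical names; the actual check in the prototype may differ):

    #include <stdbool.h>

    /* Hypothetical short-term guard: skip strided vectorization whenever the
     * chosen VF needs a vector wider than the target's widest legal register.
     * MaxLegalBits stands in for whatever the target hook reports. */
    bool shouldUseStridedVectorization(unsigned VF, unsigned ElemBits,
                                       unsigned MaxLegalBits) {
        return VF * ElemBits <= MaxLegalBits;  /* e.g., 8 x i32 on 128-bit SSE: 256 > 128 -> skip */
    }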
Regards,
Ashutosh