Hi Roger,

That's a good example; it shows most of the moving parts involved here. In a nutshell, the difference, and what we would like to make explicit, is the vector loop trip count versus the scalar loop trip count. In your IR example, the loads/stores are predicated on a mask that is calculated from a splat induction variable, which is compared with the vector trip count. Illustrated with your example simplified, and with some pseudo-code: if we tail-fold and vectorize this scalar loop:

  for i = 0 to 10
    a[i] = b[i] + c[i];

the vector loop trip count is rounded up to 12, the next multiple of 4, and lanes are predicated on i < 10:

  for i = 0 to 12
    a[i:4] = b[i:4] + c[i:4], if i < 10;

what we would like to generate is a vector loop with implicit predication, which works by setting up the number of elements processed by the loop:

  hwloop 10
    a[i:4] = b[i:4] + c[i:4]

This is implicit since instructions don't produce/consume a mask; it is generated and used under the hood by the "hwloop" construct. Your observation that the information in the IR is mostly there is correct, but rather than pattern matching and reconstructing this in the backend, we would like to make this explicit. In this example, the scalar iteration count 10 is the number of elements processed by this loop, which is what we want to pass on from the vectoriser to backend passes.

Hope this helps.
Cheers,
Sjoerd.

________________________________
From: Roger Ferrer Ibáñez <rofirrim at gmail.com>
Sent: 04 May 2020 21:22
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Eli Friedman <efriedma at quicinc.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sam Parker <Sam.Parker at arm.com>
Subject: Re: [llvm-dev] LV: predication

Hi Sjoerd,

That would be an excellent way of doing it, and it would also map very well to MVE, where we have a VCTP intrinsic/instruction that creates the mask/predicate (Vector Create Tail-Predicate). So I will go for this approach.
Such an intrinsic was actually also proposed in Sam's original RFC (see https://lists.llvm.org/pipermail/llvm-dev/2019-May/132512.html), but we hadn't implemented it yet. This intrinsic will probably look something like this:

  <N x i1> @llvm.loop.get.active.mask(AnyInt, AnyInt)

It produces a <N x i1> predicate based on its two arguments, the number of elements and the vector trip count, and it will be used by the masked load/store instructions in the vector body. I will start drafting an implementation of this and continue with it in D79100.

I'm curious about this, because this looks very similar to the code that -prefer-predicate-over-epilog is already emitting for the "outer mask" of a tail-folded loop. The following code

  void foo(int N, int *restrict c, int *restrict a, int *restrict b) {
  #pragma clang loop vectorize(enable) interleave(disable)
    for (int i = 0; i < N; i++) {
      a[i] = b[i] + c[i];
    }
  }

compiled with clang --target=x86_64 -mavx512f -mllvm -prefer-predicate-over-epilog -emit-llvm -O2 emits the following IR:

  vector.body:                        ; preds = %vector.body, %for.body.preheader.new
    %index = phi i64 [ 0, %for.body.preheader.new ], [ %index.next.1, %vector.body ]
    %niter = phi i64 [ %unroll_iter, %for.body.preheader.new ], [ %niter.nsub.1, %vector.body ]
    %broadcast.splatinsert12 = insertelement <16 x i64> undef, i64 %index, i32 0
    %broadcast.splat13 = shufflevector <16 x i64> %broadcast.splatinsert12, <16 x i64> undef, <16 x i32> zeroinitializer
    %induction = or <16 x i64> %broadcast.splat13, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7, i64 8, i64 9, i64 10, i64 11, i64 12, i64 13, i64 14, i64 15>
    %4 = getelementptr inbounds i32, i32* %b, i64 %index
    %5 = icmp ule <16 x i64> %induction, %broadcast.splat
    ...
    %wide.masked.load = call <16 x i32> @llvm.masked.load.v16i32.p0v16i32(<16 x i32>* %6, i32 4, <16 x i1> %5, <16 x i32> undef), !tbaa !2

I understand %5 is not the same as what your proposed llvm.loop.get.active.mask would compute, is that correct?
Can you elaborate on the difference here?

Thanks a lot,
Roger

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200504/49cfe347/attachment.html>
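As a concrete model of the predication being discussed, here is a small Python sketch (all names are mine, and VF = 4 follows Sjoerd's pseudo-code rather than the <16 x i64> IR above). Lane j at base index i is active iff i + j < n; for n >= 1 this is the same condition as the IR's `icmp ule %induction, %broadcast.splat`, assuming %broadcast.splat is a splat of the back-edge-taken count n - 1.

```python
VF = 4  # vector factor; Sjoerd's pseudo-code uses 4 lanes

def active_lane_mask(i, n):
    # Lane j is active iff i + j < n -- equivalently (i + j) ule (n - 1),
    # the splat-induction compare in the IR above.
    return [i + j < n for j in range(VF)]

# Tail-folded loop over n = 10 elements; the vector trip count is
# rounded up to 12, so three masks are produced.
n = 10
masks = [active_lane_mask(i, n) for i in range(0, 12, VF)]
```

The first two iterations get an all-true mask; the last is [True, True, False, False], switching off the two tail lanes, so exactly 10 lanes execute in total.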
Hi Sjoerd,

Thanks a lot for the clarification. Makes sense.

Kind regards,
Roger Ferrer Ibáñez

On Tue, 5 May 2020 at 0:06, Sjoerd Meijer <Sjoerd.Meijer at arm.com> wrote:
> [...]
On 5/5/20 12:07 AM, Sjoerd Meijer via llvm-dev wrote:

> what we would like to generate is a vector loop with implicit predication,
> which works by setting up the number of elements processed by the loop:
>
>   hwloop 10
>     a[i:4] = b[i:4] + c[i:4]

Why couldn't you use VP intrinsics and scalable types for this?

  %bval = call <vscale x 4 x double> @llvm.vp.load(..., /* %evl */ 10)
  %cval = call <vscale x 4 x double> @llvm.vp.load(..., /* %evl */ 10)
  %sum = fadd <vscale x 4 x double> %bval, %cval
  store [..]

I see three issues with the llvm.set.loop.elements approach:

1) It is conceptually broken: as others have pointed out, optimizations can move the intrinsic around, since the intrinsic doesn't have any dependencies that would naturally keep it in place.

2) The whole proposed set of intrinsics is vendor specific: this causes fragmentation, and I don't see why we would want to emit vendor-specific intrinsics in a generic optimization pass. Soon, we would see reports à la "your optimization caused regressions for MVE - add a check that the transformation must not touch llvm.set.loop.* or llvm.active.mask intrinsics when compiling for MVE..". I doubt that you would tolerate it if that intrinsic were somehow removed in performance-critical code, which would then remain scalar as a result.. so, I do not see the "beauty of the approach".

3) We need a reliable solution to properly support vector ISAs such as the RISC-V V extension and SX-Aurora, and also MVE.. I don't see that reliability in this proposal.
If, for whatever reason, the above does not work and seems too far away from your proposal, here is another idea to make more explicit hwloops work with the VP intrinsics - in a way that does not break with optimizations:

  vector.preheader:
    %evl = call i32 @llvm.hwloop.set.elements(%n)

  vector.body:
    %lastevl = phi i32 [%evl, %vector.preheader], [%next.evl, %vector.body]
    %aval = call @llvm.vp.load(Aptr, .., %lastevl)
    call @llvm.vp.store(Bptr, %aval, ..., %lastevl)
    %next.evl = call i32 @llvm.hwloop.decrement(%lastevl)

Note that the way VP intrinsics are designed, it is not possible to break this code by hoisting the VP calls out of the loop: passing "%evl >= the operation's vector size" constitutes UB (see https://llvm.org/docs/LangRef.html#vector-predication-intrinsics). We can use attributes to do the same for sinking (e.g. don't move VP across hwloop.decrement).

- Simon
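For comparison, here is a small Python model of the explicit-vector-length scheme Simon describes (helper names are mine; in real IR the vp.* intrinsics carry %evl as an operand): instead of a per-lane mask, each iteration is told how many elements it may touch, so the tail needs no predicate at all.

```python
VF = 4  # vector factor

def vp_add(a, b, c, base, evl):
    # Models vp.load / fadd / vp.store with an explicit vector length:
    # only lanes 0..evl-1 are read and written.
    for j in range(min(evl, VF)):
        a[base + j] = b[base + j] + c[base + j]

def hwloop_add(a, b, c, n):
    base, remaining = 0, n
    while remaining > 0:
        evl = min(remaining, VF)  # models hwloop set/decrement of elements
        vp_add(a, b, c, base, evl)
        base += VF
        remaining -= evl

# Run the n = 10 example from the thread: last iteration gets evl = 2.
n = 10
b = list(range(n)); c = [2 * x for x in b]; a = [0] * n
hwloop_add(a, b, c, n)
```

Passing an evl larger than the remaining elements would be UB for the real intrinsics, which is what makes hoisting the calls out of the loop invalid.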
Hi,

I abandoned that approach and followed Eli's suggestion, see somewhere earlier in this thread, and emit an intrinsic that represents/calculates the active mask. I've just uploaded a new revision for D79100 that implements this.

Cheers.

________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 18 May 2020 13:32
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma at quicinc.com>; listmail at philipreames.com <listmail at philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com <hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

> [...]
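Tying the thread together, here is a Python sketch of the approach Sjoerd says he implemented for D79100: an active-mask intrinsic feeding masked loads/stores. The helper names are mine, not the final intrinsic names; the point is only that a tail-folded masked vector loop computes exactly what the scalar loop does.

```python
VF = 4  # vector factor

def get_active_lane_mask(base, n):
    # Lane j is active iff base + j < n.
    return [base + j < n for j in range(VF)]

def masked_load(src, base, mask, passthru=0):
    # Inactive lanes must not touch memory; they yield a pass-through value.
    return [src[base + j] if mask[j] else passthru for j in range(VF)]

def masked_store(dst, base, vals, mask):
    for j in range(VF):
        if mask[j]:
            dst[base + j] = vals[j]

def tail_folded_add(a, b, c, n):
    base = 0
    while base < n:
        mask = get_active_lane_mask(base, n)
        bv = masked_load(b, base, mask)
        cv = masked_load(c, base, mask)
        masked_store(a, base, [x + y for x, y in zip(bv, cv)], mask)
        base += VF

# The n = 10 example from the thread: the final mask disables two lanes.
n = 10
b = list(range(n)); c = [1] * n; a = [0] * n
tail_folded_add(a, b, c, n)
```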