Hi,

I abandoned that approach and followed Eli's suggestion (see earlier in this thread): I now emit an intrinsic that represents/calculates the active mask. I've just uploaded a new revision of D79100 that implements this.

Cheers.

________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 18 May 2020 13:32
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma at quicinc.com>; listmail at philipreames.com <listmail at philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com <hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

On 5/5/20 12:07 AM, Sjoerd Meijer via llvm-dev wrote:
> what we would like to generate is a vector loop with implicit predication, which works by setting up the number of elements processed by the loop:
>
>   hwloop 10
>     [i:4] = b[i:4] + c[i:4]

Why couldn't you use VP intrinsics and scalable types for this?

    %bval = <4 x vscale x double> call @vp.load(..., /* %evl */ 10)
    %cval = <4 x vscale x double> call @vp.load(..., /* %evl */ 10)
    %sum = <4 x vscale x double> fadd %bval, %cval
    store [..]

I see three issues with the llvm.set.loop.elements approach:

1) It is conceptually broken: as others have pointed out, optimization can move the intrinsic around, since the intrinsic doesn't have any dependencies that would naturally keep it in place.

2) The whole proposed set of intrinsics is vendor specific: this causes fragmentation, and I don't see why we would want to emit vendor-specific intrinsics in a generic optimization pass. Soon, we would see reports a la "your optimization caused regressions for MVE - add a check that the transformation must not touch llvm.set.loop.* or llvm.active.mask intrinsics when compiling for MVE..". I doubt that you would tolerate it if that intrinsic were somehow removed in performance-critical code, which would then remain scalar as a result. So I do not see the "beauty of the approach".

3) We need a reliable solution to properly support vector ISAs such as the RISC-V V extension and SX-Aurora, and also MVE. I don't see that reliability in this proposal.

If, for whatever reason, the above does not work and seems too far away from your proposal, here is another idea to make more explicit hwloops work with the VP intrinsics, in a way that does not break with optimizations:

    vector.preheader:
      %evl = i32 llvm.hwloop.set.elements(%n)

    vector.body:
      %lastevl = phi i32 [%evl, %vector.preheader], [%next.evl, %vector.body]
      %aval = call @llvm.vp.load(Aptr, .., %evl)
      call @llvm.vp.store(Bptr, %aval, ..., %evl)
      %next.evl = call i32 @llvm.hwloop.decrement(%evl)

Note that, the way VP intrinsics are designed, it is not possible to break this code by hoisting the VP calls out of the loop: passing "%evl >= the operation's vector size" constitutes UB (see https://llvm.org/docs/LangRef.html#vector-predication-intrinsics). We can use attributes to do the same for sinking (e.g. don't move VP across hwloop.decrement).

- Simon
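For readers following the thread, the "hwloop 10" example quoted above can be modelled in plain Python. This is only a sketch of the intended semantics, not of any actual intrinsic or of MVE code generation: the function name `tail_predicated_add`, the `vf` parameter and the returned per-iteration lane counts are all made up for illustration. It assumes a vector width of 4 and that each iteration processes `min(remaining, vf)` lanes, which is what implicit tail predication amounts to.

```python
def tail_predicated_add(b, c, vf=4):
    """Compute b[i] + c[i] in chunks of vf lanes.

    Hypothetical model of a tail-predicated vector loop: the element
    count is set up once before the loop, and every iteration only
    touches the lanes that are still within that count.
    """
    n = len(b)
    out = [None] * n
    remaining = n            # models "hwloop n": elements left to process
    base = 0                 # index of lane 0 in the current iteration
    lanes_per_iter = []
    while remaining > 0:
        active = min(remaining, vf)      # lanes enabled in this iteration
        for lane in range(active):       # inactive lanes are never touched
            out[base + lane] = b[base + lane] + c[base + lane]
        lanes_per_iter.append(active)
        base += vf
        remaining -= vf
    return out, lanes_per_iter


if __name__ == "__main__":
    result, lanes = tail_predicated_add(list(range(10)), [1] * 10)
    print(lanes)
    print(result)
```

With 10 elements and 4 lanes, the last iteration runs with only two active lanes, so no scalar epilogue loop is needed.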
On 5/18/20 2:53 PM, Sjoerd Meijer wrote:
> Hi,
>
> I abandoned that approach and followed Eli's suggestion, see somewhere earlier in this thread, and emit an intrinsic that represents/calculates the active mask. I've just uploaded a new revision for D79100 that implements this.
>
> Cheers.

You have similar problems with https://reviews.llvm.org/D79100

Since there are no masked operations except for load/store, how are LLVM optimizations supposed to know that they must not hoist/sink operations with side-effects out of the hwloop? These operations have an implicit dependence on the iteration variable.

What will you do if there are no masked intrinsics in the hwloop body? This can happen once you generate vector code beyond trivial loops or have a vector IR generator other than LV.

And I am curious: why couldn't you use the %evl parameter of VP intrinsics to get the tail predication you are interested in?

- Simon
On 18 May 2020 14:11, Simon Moll wrote:
> You have similar problems with https://reviews.llvm.org/D79100

The new revision of D79100 <https://reviews.llvm.org/D79100> solves your comment 1), and I don't think your comments 2) and 3) apply, as there are no vendor-specific intrinsics involved at all here. Just to quickly discuss the optimisation pipeline: D79100 is a small extension for the vectoriser, and nothing here is related to hardware loops or target-specific constructs. The vectoriser tail-folds the loop and creates masked loads/stores; this is existing functionality, and nothing has changed here. The generic hardware loop codegen pass inserts hardware loop intrinsics. Very late in the pipeline, e.g. in the PPC and ARM backends, this is picked up and turned into an actual hardware loop, in our case possibly predicated, or it is reverted.

> What will you do if there are no masked intrinsics in the hwloop body?

Nothing. I.e., it can become a hardware loop, but not one with implicit predication.

> And i am curious why couldn't you use the %evl parameter of VP intrinsics to get the tail predication you are interested in?

In D79100, the intrinsic get.active.mask makes the backedge-taken count of the scalar loop explicit. I will look again, but I don't think the VP intrinsics were able to provide this. But to be honest, I have no preference at all as to what this intrinsic is; that is not relevant, as long as we can make this explicit.

Cheers.
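As a rough illustration of what "making the backedge-taken count explicit" buys, here is a small Python model of an active-mask computation. It is a sketch under assumptions, not the definition used in D79100: the names `active_mask` and `masks_for_loop` are hypothetical, the vector width is assumed to be 4, and a lane is assumed to be active when its scalar iteration index is less than or equal to the backedge-taken count (trip count minus one). The exact operands and comparison in the patch may differ.

```python
def active_mask(base, btc, vf=4):
    """Return one boolean per lane for the vector iteration starting at base.

    Assumption for this sketch: lane i is active iff base + i <= btc,
    where btc is the backedge-taken count of the original scalar loop.
    """
    return [base + lane <= btc for lane in range(vf)]


def masks_for_loop(trip_count, vf=4):
    """Masks for every vector iteration of a tail-folded loop (trip_count >= 1)."""
    btc = trip_count - 1                 # backedge-taken count of the scalar loop
    return [active_mask(base, btc, vf) for base in range(0, trip_count, vf)]


if __name__ == "__main__":
    for mask in masks_for_loop(10):
        print(mask)
```

Because the mask is derived from a single scalar value, a backend can in principle recover the element count from it late in the pipeline, which is the property the message above is after.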