Invitation accepted, I am happy to help out with reviews, like I did with the previous VP patches.

And of course agreed that things should be well defined, and that we shouldn't paint ourselves into a corner, but I don't think that this is the case. And it's not that I am in a rush, but I don't think this change needs to be predicated on a big change landing first, like the LV switching to VP intrinsics.

> The difference is that in the VP version there is an explicit dependence of every vector operation in the loop to the set.num.elements intrinsic. This dependence is obscured in the hwloop proposals (more on that below).

This discussion is getting complicated, because I think we are now discussing 3 topics at the same time: predication, hardware loops, and a new set of intrinsics, the VP intrinsics. For the change that kicked off this thread, i.e. 1 new intrinsic to get the active lanes, I think we can eliminate the hardware loops from this story. For us, that is just the context of this, and so I think we can just focus on predication. And if we only talk about predication, I think this new intrinsic can nicely coexist with the VP intrinsics.

And please note again I am not proposing a set.num.elements intrinsic. Well, I first kind of did, but again, abandoned that approach after push back. Correct me if I am wrong, but there's no difference in your example whether all instructions consume some predicate or only masked loads/stores:

  vector.preheader:
    %init.evl = i32 llvm.hwloop.set.elements(%n)

  vector.body:
    %evl = phi i32 [%init.evl, %vector.preheader], [%next.evl, %vector.body]
    %aval = call @llvm.vp.load(Aptr, .., %evl)
    call @llvm.vp.store(Bptr, %aval, ..., %evl)
    %next.evl = call i32 @llvm.hwloop.decrement(%evl)

No difference in that the problem remains that we have a random intrinsic sitting in the preheader describing a loop property that needs to be maintained.

So, eliminating hardware loops and the intrinsic that defines the number of elements produced, I am proposing

  vector.body:
    %mask = llvm.get.active.lane.mask(%IV, %BTC)
    .. = @llvm.masked.load(.., %mask)

where IV is the induction step, and BTC the backedge taken count. This completely piggybacks on everything that is already there in the vectoriser, and nothing is fundamentally changed here. Now, this seems very generic, and doesn't seem to bite the VP intrinsics.

Cheers,
Sjoerd.

________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 19 May 2020 15:07
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma at quicinc.com>; listmail at philipreames.com <listmail at philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com <hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

On 5/19/20 12:38 PM, Sjoerd Meijer wrote:
> Hi Simon,
>
> Thanks for reposting the example, and looking at it more carefully, I think it is very similar to my first proposal. This was met with some resistance here because it dumps loop information in the vector preheader. Doing it this early (we want to emit this in the vectoriser) puts a restriction on (future) optimisations that transform vector loops to honour/update/support this intrinsic and loop information. In D79100, it is an integral part of the vector body and has some semantics (I will update it today), and thus doesn't have these disadvantages.

The difference is that in the VP version there is an explicit dependence of every vector operation in the loop to the set.num.elements intrinsic. This dependence is obscured in the hwloop proposals (more on that below). I understand that you are looking to get hwloops working quickly somehow - but any proposal should be designed in a forward-looking way or we could get stuck in a place it's hard to get out of.
I am looking forward to seeing the semantics for this spelled out.

> Also, the vectoriser isn't using the VP intrinsics yet, so using them is a bridge too far for me at this point. But we should definitely re-evaluate at some point if we should use or transition to them in our backend passes.

I'd very much like to see LV use VP intrinsics. I invite everybody to collaborate on VP to make it functional and useful quickly! Specifically, i am hoping we can collaborate on masked reduction intrinsics and implement them in the VP namespace. There is also the VP expansion pass on Phabricator right now (D78203 - it says 'work-in-progress' in the summary, which probably was a mistake: this is the real thing).

>> Are all vector instructions in the hwloop implicitly predicated or only the masked load/store ops?
>
> In a nutshell, when a vector loop with (explicitly) predicated masked loads/stores hits the backend, we translate the generic intrinsic get.active.mask to a target specific one. All predication remains explicit, and this remains the case. Only at the end, we use this intrinsic to instruction select a specific variant of the hardware loop with some implicit predication.

I do not see an answer to my question here. If the vectorized loop, prepared for hwloop, looks like this:

  %m = get.active.mask(..)
  %v = masked.load ... %m
  %r = sdiv %x, %y

Will the `sdiv` execute with implicit hwloop predication? It makes no difference to the semantics of the intrinsic at which point you lower it but how.

- Simon

> Cheers,
> Sjoerd.
________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 19 May 2020 09:56
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma at quicinc.com>; listmail at philipreames.com <listmail at philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com <hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

Hi Sjoerd,

On 5/18/20 3:43 PM, Sjoerd Meijer wrote:
>> You have similar problems with https://reviews.llvm.org/D79100
>
> The new revision D79100 <https://reviews.llvm.org/D79100> solves your comment 1), and I don't think your comments 2) and 3) apply as there are no vendor specific intrinsics involved at all here.
>
> Just to quickly discuss the optimisation pipeline, D79100 <https://reviews.llvm.org/D79100> is a small extension for the vectoriser, and nothing here is related to hardware-loops or target specific constructs. The vectoriser tail-folds the loop, and creates masked load/stores; so existing functionality, and nothing has changed here. The generic hardware loop codegen pass inserts hardware loop intrinsics. Very late in the pipeline, e.g. in the PPC and ARM backends, this is picked up and turned into an actual hardware loop, in our case possibly predicated, or it is reverted.

Thanks for explaining it (possibly once again). I wasn't aware that this will also be used for PPC. Point 3) still stands.

>> What will you do if there are no masked intrinsics in the hwloop body?
>
> Nothing. I.e., it can become a hardware loop, but not one with implicit predication.

Are all vector instructions in the hwloop implicitly predicated or only the masked load/store ops? If not, then the issue is that the predicate parameter of masked load/store basically affects the semantics of all other vector ops in the loop that do not have an explicit mask parameter:

  %v = masked.load ... %m   ; explicit predication - okay
  %r = sdiv %x, %y          ; implicit predication by %m for hwloops - unpredicated otherwise

>> And i am curious why couldn't you use the %evl parameter of VP intrinsics to get the tail predication you are interested in?
>
> In D79100 <https://reviews.llvm.org/D79100>, intrinsic get.active.mask makes the backedge taken count of the scalar loop explicit. I will look again, but I don't think the VP intrinsics were able to provide this. But to be honest, I have no preference at all what this intrinsic is, it is not relevant, as long as we can make this explicit.

VP intrinsics explicitly make every vector instruction in the loop dependent on the '%evl'. You would have:

  %v = vp.load ... %evl
  %r = vp.sdiv %x, %y, %evl   ; explicitly predicated by the scalar loop trip count

My previous mail had an example on how %evl could be tied to the scalar trip count. Re-posting that here:

  vector.preheader:
    %init.evl = i32 llvm.hwloop.set.elements(%n)

  vector.body:
    %evl = phi i32 [%init.evl, %vector.preheader], [%next.evl, %vector.body]
    %aval = call @llvm.vp.load(Aptr, .., %evl)
    call @llvm.vp.store(Bptr, %aval, ..., %evl)
    %next.evl = call i32 @llvm.hwloop.decrement(%evl)

- Simon

> Cheers.
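The `sdiv` question above can be made concrete with a small executable model. This is only a sketch: the vector width of 4, the helper names, the zero passthru value for inactive lanes, and the reading of get.active.mask as "lane i is active when IV + i <= BTC" are all assumptions made for illustration, and Python's ZeroDivisionError merely stands in for the undefined behaviour of a real division by zero.

```python
VF = 4  # assumed vector width, for illustration only


def active_lane_mask(iv, btc, vf=VF):
    """Assumed reading of get.active.mask: lane i is active when iv + i <= btc."""
    return [iv + i <= btc for i in range(vf)]


def masked_load(mem, base, mask, passthru=0):
    """Load only the active lanes; inactive lanes receive the passthru value."""
    return [mem[base + i] if m else passthru for i, m in enumerate(mask)]


def sdiv_unpredicated(xs, ys):
    """Plain vector division: every lane executes, including tail lanes."""
    return [int(x / y) for x, y in zip(xs, ys)]


def sdiv_predicated(xs, ys, mask, passthru=0):
    """Division that only executes in lanes enabled by the mask."""
    return [int(x / y) if m else passthru for x, y, m in zip(xs, ys, mask)]


x_mem = [8, 9, 10, 12, 14, 15]
y_mem = [2, 3, 5, 4, 7, 5]
btc = len(x_mem) - 1                # backedge taken count of the scalar loop

mask = active_lane_mask(4, btc)     # second vector iteration, two tail lanes off
xs = masked_load(x_mem, 4, mask)
ys = masked_load(y_mem, 4, mask)

predicated = sdiv_predicated(xs, ys, mask)

try:
    sdiv_unpredicated(xs, ys)       # tail lanes divide passthru zeros
    tail_faulted = False
except ZeroDivisionError:
    tail_faulted = True
```

In this model the unpredicated division misbehaves in exactly the lanes that the masked loads switched off, which is why it matters whether the mask applies to the loads/stores only or to every vector operation in the loop.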
________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 18 May 2020 14:11
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma at quicinc.com>; listmail at philipreames.com <listmail at philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com <hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

On 5/18/20 2:53 PM, Sjoerd Meijer wrote:
> Hi,
>
> I abandoned that approach and followed Eli's suggestion, see somewhere earlier in this thread, and emit an intrinsic that represents/calculates the active mask. I've just uploaded a new revision for D79100 that implements this.
>
> Cheers.

You have similar problems with https://reviews.llvm.org/D79100

Since there are no masked operations, except for load/store.. how are LLVM optimizations supposed to know that they must not hoist/sink operations with side-effects out of the hwloop? These operations have an implicit dependence on the iteration variable.

What will you do if there are no masked intrinsics in the hwloop body? This can happen once you generate vector code beyond trivial loops or have a vector IR generator other than LV.

And i am curious why couldn't you use the %evl parameter of VP intrinsics to get the tail predication you are interested in?
- Simon

________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 18 May 2020 13:32
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma at quicinc.com>; listmail at philipreames.com <listmail at philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com <hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

On 5/5/20 12:07 AM, Sjoerd Meijer via llvm-dev wrote:
> what we would like to generate is a vector loop with implicit predication, which works by setting up the number of elements processed by the loop:
>
>   hwloop 10
>     [i:4] = b[i:4] + c[i:4]

Why couldn't you use VP intrinsics and scalable types for this?

  %bval = <4 x vscale x double> call @vp.load(..., /* %evl */ 10)
  %cval = <4 x vscale x double> call @vp.load(..., /* %evl */ 10)
  %sum = <4 x vscale x double> fadd %bval, %cval
  store [..]

I see three issues with the llvm.set.loop.elements approach:

1) It is conceptually broken: as others have pointed out, optimization can move the intrinsic around since the intrinsic doesn't have any dependencies that would naturally keep it in place.

2) The whole proposed set of intrinsics is vendor specific: this causes fragmentation and i don't see why we would want to emit vendor-specific intrinsics in a generic optimization pass. Soon, we would see reports a la "your optimization caused regressions for MVE - add a check that the transformation must not touch llvm.set.loop.* or llvm.active.mask intrinsics when compiling for MVE..". I doubt that you would tolerate it if that intrinsic were somehow removed in performance-critical code that would then remain scalar as a result.. so, i do not see the "beauty of the approach".

3) We need a reliable solution to properly support vector ISAs such as the RISC-V V extension and SX-Aurora and also MVE.. i don't see that reliability in this proposal.

If, for whatever reason, the above does not work and seems too far away from your proposal, here is another idea to make more explicit hwloops work with the VP intrinsics - in a way that does not break with optimizations:

  vector.preheader:
    %evl = i32 llvm.hwloop.set.elements(%n)

  vector.body:
    %lastevl = phi i32 [%evl, %vector.preheader], [%next.evl, %vector.body]
    %aval = call @llvm.vp.load(Aptr, .., %evl)
    call @llvm.vp.store(Bptr, %aval, ..., %evl)
    %next.evl = call i32 @llvm.hwloop.decrement(%evl)

Note that the way VP intrinsics are designed, it is not possible to break this code by hoisting the VP calls out of the loop: passing "%evl >= the operation's vector size" constitutes UB (see https://llvm.org/docs/LangRef.html#vector-predication-intrinsics). We can use attributes to do the same for sinking (eg don't move VP across hwloop.decrement).

- Simon

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200519/30a978f8/attachment.html>
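As a rough illustration of the %evl-threaded loop quoted in the message above, the following toy model copies an array the way the vector.preheader/vector.body sketch does. The thread does not define the hwloop.* intrinsics, so several things here are assumptions: that hwloop.set.elements simply yields the element count, that hwloop.decrement subtracts the vector width, and that each vp operation acts on min(evl, VF) lanes. The helper names and the width of 4 are hypothetical as well.

```python
VF = 4  # assumed vector width, for illustration only


def vp_load(mem, base, evl, vf=VF):
    """Load the first min(evl, vf) lanes starting at base (assumed clamping)."""
    lanes = min(evl, vf)
    return [mem[base + i] for i in range(lanes)]


def vp_store(mem, base, values, evl, vf=VF):
    """Store the first min(evl, vf) lanes starting at base (assumed clamping)."""
    lanes = min(evl, vf)
    for i in range(lanes):
        mem[base + i] = values[i]


def copy_with_evl(a, n):
    """Copy n elements of a, threading the remaining count through every op."""
    b = [None] * len(a)
    evl = n                           # models %init.evl = hwloop.set.elements(%n)
    base = 0
    lanes_per_iteration = []
    while evl > 0:
        lanes_per_iteration.append(min(evl, VF))
        aval = vp_load(a, base, evl)  # every vector op consumes evl explicitly
        vp_store(b, base, aval, evl)
        evl = max(evl - VF, 0)        # models %next.evl = hwloop.decrement(%evl)
        base += VF
    return b, lanes_per_iteration


a = list(range(100, 110))
b, lanes_per_iteration = copy_with_evl(a, 10)
```

The point of the model is the data flow: both the load and the store take `evl` as an operand, so the tail handling is visible at every vector operation rather than implied by a separate marker.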
On 5/19/20 5:22 PM, Sjoerd Meijer wrote:
> Invitation accepted, I am happy to help out with reviews, like I did with the previous VP patches.

That's great!

> And of course agreed that things should be well defined, and that we shouldn't paint ourselves into a corner, but I don't think that this is the case. And it's not that I am in a rush, but I don't think this change needs to be predicated on a big change landing first, like the LV switching to VP intrinsics.
>
>> The difference is that in the VP version there is an explicit dependence of every vector operation in the loop to the set.num.elements intrinsic. This dependence is obscured in the hwloop proposals (more on that below).
>
> This discussion is getting complicated, because I think we are now discussing 3 topics at the same time: predication, hardware loops, and a new set of intrinsics, the VP intrinsics.

Ok. My question (the example at the end) was asking whether hwloops imply predication (and by that i mean logically - if the hwloop implies that a SIMD instruction may not execute for all lanes in the tail then that is predication as well).

> For the change that kicked off this thread, i.e. 1 new intrinsic to get the active lanes, I think we can eliminate the hardware loops from this story. For us, that is just the context of this, and so I think we can just focus on predication. And if we only talk about predication, I think this new intrinsic can nicely coexist with the VP intrinsics.
>
> And please note again I am not proposing a set.num.elements intrinsic. Well, I first kind of did, but again, abandoned that approach after push back. Correct me if I am wrong, but there's no difference in your example whether all instructions consume some predicate or only masked loads/stores:

Yes, and that is the point: it's about making the SIMD instructions dependent on the mask .. and all of them.

>   vector.preheader:
>     %init.evl = i32 llvm.hwloop.set.elements(%n)
>
>   vector.body:
>     %evl = phi i32 [%init.evl, %vector.preheader], [%next.evl, %vector.body]
>     %aval = call @llvm.vp.load(Aptr, .., %evl)
>     call @llvm.vp.store(Bptr, %aval, ..., %evl)
>     %next.evl = call i32 @llvm.hwloop.decrement(%evl)
>
> No difference in that the problem remains that we have a random intrinsic sitting in the preheader describing a loop property that needs to be maintained.

The difference is that the intrinsic is connected to every SIMD instruction in the vector loop through data flow. It does not just sit there.. in fact it does not matter where it is placed as long as those def-use edges are visible to the hwloop transformation.

> So, eliminating hardware loops and the intrinsic that defines the number of elements produced, I am proposing
>
>   vector.body:
>     %mask = llvm.get.active.lane.mask(%IV, %BTC)
>     .. = @llvm.masked.load(.., %mask)
>
> where IV is the induction step, and BTC the backedge taken count. This completely piggybacks on everything that is already there in the vectoriser, and nothing is fundamentally changed here. Now, this seems very generic, and doesn't seem to bite the VP intrinsics.

I see it the other way round: Right now you seem to have an implicit dependence from syntactically unmasked SIMD instructions (eg a regular SIMD sdiv) to the predicate of nearby masked intrinsics (masked.load) - that's on shaky grounds semantically. VP intrinsics already define a clean semantics for tail predication - so why not piggyback on that? You should define the hwloop support in a way that will not just peacefully coexist with VP but leverage it eventually. I'll continue in that direction in the review.

One specific request (since i got your attention now ;-) ): we need a (generic) IR primitive to express %lane_id < %n for scalable vector types to expand VP intrinsics for targets with SVE support but no tail predication.

> Cheers,
> Sjoerd.
- Simon

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200520/26c5a34d/attachment.html>
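The lane-mask intrinsic discussed in the message above, and the requested "%lane_id < %n" primitive, can be compared with a short model. This is a sketch under an assumed reading of the proposal, namely that lane i is active when IV + i <= BTC; the thread itself only says that the operands are the induction step and the backedge taken count. The function names are hypothetical, and Python integers do not wrap, so overflow behaviour of a real IR type is not modelled.

```python
def get_active_lane_mask(iv, btc, vf):
    """Assumed semantics: lane i is active when iv + i <= btc."""
    return [iv + lane <= btc for lane in range(vf)]


def lane_id_less_than(n, vf):
    """The requested primitive: lane i is active when i < n."""
    return [lane < n for lane in range(vf)]


def masks_for_loop(trip_count, vf):
    """Masks seen by each vector iteration of a tail-folded loop."""
    btc = trip_count - 1  # backedge taken count of the scalar loop
    return [get_active_lane_mask(iv, btc, vf) for iv in range(0, trip_count, vf)]


masks = masks_for_loop(10, 4)

# In the final iteration (iv = 8) two elements remain, so both forms agree.
tail_from_iv_btc = get_active_lane_mask(8, 9, 4)
tail_from_lane_id = lane_id_less_than(10 - 8, 4)
```

Under this reading the two formulations describe the same predicate, with n equal to the number of elements remaining at the current induction value.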
Hello,

About this, I am essentially just echoing what others said on the list:

> The difference is that the intrinsic is connected to every SIMD instruction in the vector loop through data flow. It does not just sit there.. in fact it does not matter where it is placed as long as those def-use edges are visible to the hwloop transformation.

Yes, it is well connected with use-def chains, but the intrinsic defines a loop property. If we had a transformation that, for example, peels off one vector iteration from that loop/vector body, the loop no longer processes %N elements but, for example, %N - 4 data elements. With hwloop.set.elements(%N) still sitting in the preheader, it could communicate the wrong information to other passes or the backend. Thus, this creates a maintenance burden to support that intrinsic, which is not what we want. The feedback was that we need to communicate this information in a different way; there are different ways to do this.

Now, returning to hardware loops.

> Ok. My questions (the example at the end) was asking whether hwloops imply predication (and by that i mean logically - if the hwloop implies that a SIMD instruction may not execute for all lanes in the tail then that is predication as well).

We should probably define what we mean by hardware loops, i.e., where in the pipeline. In the target independent CodeGen pass HardwareLoops, hardware loops are supported with a few intrinsics to mark a loop as a hardware loop. This does not imply any predication. That is, these hardware loop intrinsics do not influence predication or any masking of lanes in any way, thus they do not imply certain forms of hwloops with or without predication. But there can be masked loads/stores inside these hardware loop bodies; they are generated by the vectoriser. Please note that I am not trying to be pedantic here, but am just describing the current situation to get clarity on what we are discussing, because what the problem is was becoming a bit unclear to me.

Now, things do change in the ARM backend, because in MVE we have 2 forms of hardware loops, let's say a normal one, and one with implicit predication. And to support this, we transform explicit predication into implicit predication, but of course only when it is okay to do this. With this in mind, returning to the example:

> I do not see an answer to my question here. If the vectorized loop, prepared for hwloop, looks like this:
>
>   %m = get.active.mask(..)
>   %v = masked.load ... %m
>   %r = sdiv %x, %y
>
> Will the `sdiv` execute with implicit hwloop predication?

The short answer is "no". There are no hardware loops here at this point, and thus we also don't distinguish between different hwloop forms. Here, we use, let's say, the vectoriser way of masking/predication: only the loads/stores are masked.

Your previous remark, also quoted below, is that VP intrinsics provide clean semantics, and I fully agree with that.

> I see it the other way round: Right now you seem to have an implicit dependence from syntactically unmasked SIMD instructions (eg a regular SIMD sdiv) to the predicate of nearby masked intrinsics (masked.load) - that's on shaky grounds semantically. VP intrinsics already define a clean semantics for tail predication - so why not piggyback on that?

The @llvm.get.active.lane.mask intrinsic is unrelated, but works exactly the same as the @num.elements intrinsic, i.e. it is well connected, as you said, with def-use chains, feeding the relevant instructions, in this case the masked loads/stores. You're unhappy that currently the vector instructions don't have explicit masks/predication, but that is the current state of the art. Again, agreed that VP intrinsics are semantically clean, and we will definitely use them when we can.

Cheers,
Sjoerd.
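The peeling argument in the message above can be illustrated with a small model. It is a sketch only, under assumptions that are not spelled out in the thread: a vector width of 4, a hypothetical peeling transformation that advances the starting induction value but leaves a preheader element count untouched, and the reading of the in-body mask as "lane i is active when IV + i <= BTC".

```python
VF = 4  # assumed vector width, for illustration only


def lanes_from_preheader_count(count, vf=VF):
    """Active lanes per iteration if a consumer trusts a preheader element count."""
    lanes = []
    while count > 0:
        lanes.append(min(count, vf))
        count -= vf
    return lanes


def lanes_from_body_mask(start_iv, btc, vf=VF):
    """Active lanes per iteration when the mask is recomputed in the loop body."""
    lanes = []
    for iv in range(start_iv, btc + 1, vf):
        lanes.append(sum(iv + i <= btc for i in range(vf)))
    return lanes


n = 10
peeled = VF  # one vector iteration peeled off in front of the loop

# Count left as %N in the preheader: describes one iteration too many.
stale = lanes_from_preheader_count(n)

# What the remaining loop really executes after peeling.
actual = lanes_from_preheader_count(n - peeled)

# Mask derived from IV and BTC inside the body: follows the peeled IV start.
from_body = lanes_from_body_mask(peeled, n - 1)
```

In this model the in-body mask stays consistent after peeling without any extra bookkeeping, while the preheader count is only correct if the transformation remembers to update it.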
________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 20 May 2020 09:52
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma at quicinc.com>; listmail at philipreames.com <listmail at philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com <hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

On 5/19/20 5:22 PM, Sjoerd Meijer wrote:

Invitation accepted, I am happy to help out with reviews, like I did with the previous VP patches.

That's great!

And of course agreed that things should be well defined, and that we shouldn't paint ourselves in a corner, but I don't think that this is the case. And it's not that I am in a rush, but I don't think this change needs to be predicated on a big change landing first like the LV switching to VP intrinsics.

> The difference is that in the VP version there is an explicit dependence of every vector operation in the loop to the set.num.elements intrinsic. This dependence is obscured in the hwloop proposals (more on that below).

This discussion is getting complicated, because I think we are discussing 3 topics at the same time now: predication, hardware loops, and a new set of intrinsics, the VP intrinsics.

Ok. My questions (the example at the end) was asking whether hwloops imply predication (and by that i mean logically - if the hwloop implies that a SIMD instruction may not execute for all lanes in the tail then that is predication as well).

For the change that kicked off this thread, i.e. 1 new intrinsic to get the active lanes, I think we can eliminate the hardware loops from this story. For us, that is just the context of this, and so I think we can just focus on predication. And if we only talk about predication, I think this new intrinsic can nicely coexist with the VP intrinsics.
And please note again I am not proposing a set.num.elements intrinsic. Well, I first kind of did, but again, abandoned that approach after push back. Correct me if I am wrong, but there's no difference in your example whether all instructions consume some predicate or only masked loads/stores:

Yes, and that is the point: it's about making the SIMD instructions dependent on the mask .. and all of them.

  vector.preheader:
    %init.evl = i32 llvm.hwloop.set.elements(%n)
  vector.body:
    %evl = phi i32 [%init.evl, %vector.preheader], [%next.evl, %vector.body]
    %aval = call @llvm.vp.load(Aptr, .., %evl)
    call @llvm.vp.store(Bptr, %aval, ..., %evl)
    %next.evl = call i32 @llvm.hwloop.decrement(%evl)

No difference in that the problem remains that we have a random intrinsic sitting in the preheader describing a loop property that needs to be maintained.

The difference is that the intrinsic is connected to every SIMD instruction in the vector loop through data flow. It does not just sit there.. in fact it does not matter where it is placed as long as those def-use edges are visible to the hwloop transformation.

So, eliminating hardware loops and the intrinsic that defines the number of elements produced, I am proposing

  vector.body:
    %mask = llvm.get.active.lane.mask(%IV, %BTC)
    .. = @llvm.masked.load(.., %mask)

where IV is the induction step, and BTC the backedge taken count. This completely piggybacks on everything that is already there in the vectoriser, and nothing is fundamentally changed here. Now, this seems very generic, and doesn't seem to bite the VP intrinsics.

I see it the other way round: Right now you seem to have an implicit dependence from syntactically unmasked SIMD instructions (eg a regular SIMD sdiv) to the predicate of nearby masked intrinsics (masked.load) - that's on shaky grounds semantically. VP intrinsics already define a clean semantics for tail predication - so why not piggyback on that?
You should define the hwloop support in a way that will not just peacefully coexist with VP but leverage it eventually. I'll continue in that direction in the review.

One specific request (since I got your attention now ;-) ): we need a (generic) IR primitive to express %lane_id < %n for scalable vector types to expand VP intrinsics for targets with SVE support but no tail predication.

Cheers,
Sjoerd.

- Simon

________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 19 May 2020 15:07
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma at quicinc.com>; listmail at philipreames.com <listmail at philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com <hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

On 5/19/20 12:38 PM, Sjoerd Meijer wrote:

Hi Simon,

Thanks for reposting the example, and looking at it more carefully, I think it is very similar to my first proposal. This was met with some resistance here because it dumps loop information in the vector preheader. Doing it this early (we want to emit this in the vectoriser) puts a restriction on (future) optimisations that transform vector loops to honour/update/support this intrinsic and loop information. In D79100, it is an integral part of the vector body and has some semantics (I will update it today), and thus doesn't have these disadvantages.

The difference is that in the VP version there is an explicit dependence of every vector operation in the loop to the set.num.elements intrinsic.
This dependence is obscured in the hwloop proposals (more on that below). I understand that you are looking to get hwloops working quickly somehow - but any proposal should be designed in a forward-looking way or we could get stuck in a place it's hard to get out of. I am looking forward to seeing the semantics for this spelled out.

Also, the vectoriser isn't using the VP intrinsics yet, so using them is a bridge too far for me at this point. But we should definitely re-evaluate at some point if we should use or transition to them in our backend passes.

I'd very much like to see LV use VP intrinsics. I invite everybody to collaborate on VP to make it functional and useful quickly! Specifically, i am hoping we can collaborate on masked reduction intrinsics and implement them in the VP namespace. There is also the VP expansion pass on Phabricator right now (D78203 - it says 'work-in-progress' in the summary, which probably was a mistake: this is the real thing).

> Are all vector instructions in the hwloop implicitly predicated or only the masked load/store ops?

In a nutshell, when a vector loop with (explicitly) predicated masked loads/stores hits the backend, we translate the generic intrinsic get.active.mask to a target-specific one. All predication remains explicit, and this remains the case. Only at the end, we use this intrinsic to instruction select a specific variant of the hardware loop with some implicit predication.

I do not see an answer to my question here. If the vectorized loop, prepared for hwloop, looks like this:

  %m = get.active.mask(..)
  %v = masked.load ... %m
  %r = sdiv %x, %y

Will the `sdiv` execute with implicit hwloop predication? It makes no difference to the semantics of the intrinsic at which point you lower it but how.

- Simon

Cheers,
Sjoerd.
________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 19 May 2020 09:56
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma at quicinc.com>; listmail at philipreames.com <listmail at philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com <hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

Hi Sjoerd,

On 5/18/20 3:43 PM, Sjoerd Meijer wrote:

> You have similar problems with https://reviews.llvm.org/D79100

The new revision D79100 <https://reviews.llvm.org/D79100> solves your comment 1), and I don't think your comments 2) and 3) apply as there are no vendor-specific intrinsics involved at all here. Just to quickly discuss the optimisation pipeline, D79100 is a small extension for the vectoriser, and nothing here is related to hardware loops or target-specific constructs. The vectoriser tail-folds the loop, and creates masked loads/stores; so existing functionality, and nothing has changed here. The generic hardware loop codegen pass inserts hardware loop intrinsics. Very late in the pipeline, e.g. in the PPC and ARM backends, this is picked up and turned into an actual hardware loop, in our case possibly predicated, or it is reverted.

Thanks for explaining it (possibly once again). I wasn't aware that this will also be used for PPC. Point 3) still stands.

> What will you do if there are no masked intrinsics in the hwloop body?

Nothing. I.e., it can become a hardware loop, but not one with implicit predication.
Are all vector instructions in the hwloop implicitly predicated or only the masked load/store ops? If not, then the issue is that the predicate parameter of masked load/store basically affects the semantics of all other vector ops in the loop that do not have an explicit mask parameter:

  %v = masked.load ... %m ; explicit predication - okay
  %r = sdiv %x, %y        ; implicit predication by %m for hwloops - unpredicated otherwise

> And i am curious why couldn't you use the %evl parameter of VP intrinsics to get the tail predication you are interested in?

In D79100 <https://reviews.llvm.org/D79100>, intrinsic get.active.mask makes the backedge taken count of the scalar loop explicit. I will look again, but I don't think the VP intrinsics were able to provide this. But to be honest, I have no preference at all what this intrinsic is, it is not relevant, as long as we can make this explicit.

VP intrinsics explicitly make every vector instruction in the loop dependent on the '%evl'. You would have:

  %v = vp.load ... %evl
  %r = vp.sdiv %x, %y, %evl ; explicitly predicated by the scalar loop trip count

My previous mail had an example on how %evl could be tied to the scalar trip count. Re-posting that here:

  vector.preheader:
    %init.evl = i32 llvm.hwloop.set.elements(%n)
  vector.body:
    %evl = phi i32 [%init.evl, %vector.preheader], [%next.evl, %vector.body]
    %aval = call @llvm.vp.load(Aptr, .., %evl)
    call @llvm.vp.store(Bptr, %aval, ..., %evl)
    %next.evl = call i32 @llvm.hwloop.decrement(%evl)

- Simon

Cheers.
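To make the %evl variant concrete, here is a small Python model of the re-posted loop. It is only a sketch under stated assumptions: the pseudo-IR does not say how the remaining element count is clamped to the vector width or by how much hwloop.decrement steps, so the model assumes an effective length of min(remaining, VF) per iteration and a decrement by that same amount. The names `evl_sequence` and `vp_copy` are hypothetical.

```python
def evl_sequence(n, vf):
    """Effective vector lengths seen by successive iterations (assumed clamping)."""
    lengths = []
    remaining = n  # stands in for hwloop.set.elements(%n)
    while remaining > 0:
        evl = min(remaining, vf)  # assumption: clamp to the vector width
        lengths.append(evl)
        remaining -= evl  # stands in for hwloop.decrement
    return lengths


def vp_copy(src, vf):
    """Copy loop where every vector operation consumes the same evl."""
    dst = [None] * len(src)
    base = 0
    for evl in evl_sequence(len(src), vf):
        for lane in range(evl):  # vp.load / vp.store only touch lanes [0, evl)
            dst[base + lane] = src[base + lane]
        base += vf
    return dst
```

In this model every operation in the body, not just the loads and stores, would take `evl` as an operand, which is the explicit dependence being argued for here.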
________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 18 May 2020 14:11
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma at quicinc.com>; listmail at philipreames.com <listmail at philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com <hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

On 5/18/20 2:53 PM, Sjoerd Meijer wrote:

Hi,

I abandoned that approach and followed Eli's suggestion, see somewhere earlier in this thread, and emit an intrinsic that represents/calculates the active mask. I've just uploaded a new revision for D79100 that implements this.

Cheers.

You have similar problems with https://reviews.llvm.org/D79100

Since there are no masked operations, except for load/store.. how are LLVM optimizations supposed to know that they must not hoist/sink operations with side-effects out of the hwloop? These operations have an implicit dependence on the iteration variable.

What will you do if there are no masked intrinsics in the hwloop body? This can happen once you generate vector code beyond trivial loops or have a vector IR generator other than LV.

And i am curious why couldn't you use the %evl parameter of VP intrinsics to get the tail predication you are interested in?
- Simon

________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 18 May 2020 13:32
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Eli Friedman <efriedma at quicinc.com>; listmail at philipreames.com <listmail at philipreames.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sander De Smalen <Sander.DeSmalen at arm.com>; hanna.kruppe at gmail.com <hanna.kruppe at gmail.com>
Subject: Re: [llvm-dev] LV: predication

On 5/5/20 12:07 AM, Sjoerd Meijer via llvm-dev wrote:

what we would like to generate is a vector loop with implicit predication, which works by setting up the number of elements processed by the loop:

  hwloop 10
    [i:4] = b[i:4] + c[i:4]

Why couldn't you use VP intrinsics and scalable types for this?

  %bval = <4 x vscale x double> call @vp.load(..., /* %evl */ 10)
  %cval = <4 x vscale x double> call @vp.load(..., /* %evl */ 10)
  %sum = <4 x vscale x double> fadd %bval, %cval
  store [..]

I see three issues with the llvm.set.loop.elements approach:

1) It is conceptually broken: as others have pointed out, optimization can move the intrinsic around since the intrinsic doesn't have any dependencies that would naturally keep it in place.

2) The whole proposed set of intrinsics is vendor specific: this causes fragmentation and i don't see why we would want to emit vendor-specific intrinsics in a generic optimization pass. Soon, we would see reports a la "your optimization caused regressions for MVE - add a check that the transformation must not touch llvm.set.loop.* or llvm.active.mask intrinsics when compiling for MVE..".
I doubt that you would tolerate it if that intrinsic were somehow removed in performance-critical code that would then remain scalar as a result.. so, i do not see the "beauty of the approach".

3) We need a reliable solution to properly support vector ISAs such as the RISC-V V extension and SX-Aurora and also MVE.. i don't see that reliability in this proposal.

If for whatever reason, the above does not work and seems too far away from your proposal, here is another idea to make more explicit hwloops work with the VP intrinsics - in a way that does not break with optimizations:

  vector.preheader:
    %evl = i32 llvm.hwloop.set.elements(%n)
  vector.body:
    %lastevl = phi i32 [%evl, %vector.preheader], [%next.evl, %vector.body]
    %aval = call @llvm.vp.load(Aptr, .., %evl)
    call @llvm.vp.store(Bptr, %aval, ..., %evl)
    %next.evl = call i32 @llvm.hwloop.decrement(%evl)

Note that the way VP intrinsics are designed, it is not possible to break this code by hoisting the VP calls out of the loop: passing "%evl >= the operation's vector size" constitutes UB (see https://llvm.org/docs/LangRef.html#vector-predication-intrinsics). We can use attributes to do the same for sinking (eg don't move VP across hwloop.decrement).

- Simon