Hi Eli,

> The problem with your proposal, as written, is that the vectorizer is producing the intrinsic. Because we don't impose any ordering on optimizations before codegen, every optimization pass in LLVM would have to be taught to preserve any @llvm.set.loop.elements.i32 whenever it makes any change. This is completely impractical because the intrinsic isn't related to anything optimizations would normally look for: it's a random intrinsic in the middle of nowhere.

I do see that point. But is that not also the beauty of it? It just sits in the preheader; if it gets removed, then so be it. And if it is not recognised, then there is also no harm done?

> Probably the simplest path to get this working is to derive the number of elements in the backend (in HardwareLoops, or your tail predication pass). You should be able to figure it from the masks used in the llvm.masked.load/store instructions in the loop.

This is what we are currently doing, and it works very well for simpler cases. For the more complicated cases that we now want to handle as well, the pattern matching just becomes a bit too horrible, and it is fragile too. All we need is the information that the vectoriser already has, passed on somehow.

As I am really keen to simplify our backend pass, would there be another way to pass this information on? If emitting an intrinsic is a blocker, could this be done with a loop annotation?

Cheers,
Sjoerd.

________________________________
From: Eli Friedman <efriedma at quicinc.com>
Sent: 01 May 2020 19:30
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>; llvm-dev <llvm-dev at lists.llvm.org>
Subject: RE: [llvm-dev] LV: predication

The problem with your proposal, as written, is that the vectorizer is producing the intrinsic. Because we don't impose any ordering on optimizations before codegen, every optimization pass in LLVM would have to be taught to preserve any @llvm.set.loop.elements.i32 whenever it makes any change. This is completely impractical because the intrinsic isn't related to anything optimizations would normally look for: it's a random intrinsic in the middle of nowhere.

Probably the simplest path to get this working is to derive the number of elements in the backend (in HardwareLoops, or your tail predication pass). You should be able to figure it from the masks used in the llvm.masked.load/store instructions in the loop.

-Eli

From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Sjoerd Meijer via llvm-dev
Sent: Friday, May 1, 2020 3:50 AM
To: llvm-dev at lists.llvm.org
Subject: [EXT] [llvm-dev] LV: predication

Hello,

We are working on predication for our vector extension (MVE). Since quite a few people are working on predication and different forms of it (e.g. SVE, RISC-V, NEC), I thought I would share what we would like to add to the loop vectoriser. Hopefully it's just a minor and non-intrusive addition, but it could be interesting and useful for others, and feedback on this is welcome of course.

TL;DR: We would like the loop vectoriser to emit a new IR intrinsic for certain loops:

  void @llvm.set.loop.elements.i32(i32 )

This represents the number of data elements processed by a vector loop, and will be emitted in the preheader block of the vector loop after querying TTI that the backend understands this intrinsic and that it should be emitted for that loop. The vectoriser patch is available in D79100, and we pick this intrinsic up in the ARM backend in D79175.

Context: We are working on a form of predication that we call tail-predication: a vector hardware loop has an implicit form of predication that sets active/inactive lanes for the last iteration of the vector loop. Thus, the scalar epilogue loop (if there is one) is tail-folded and tail-predicated in the main vector body. To support this, we need to know the number of data elements processed by the loop, which is used in the setup of a tail-predicated vector loop.
This new intrinsic communicates this information from the vectoriser to the codegen passes where we further lower these loops. In our case, we essentially let @llvm.set.loop.elements.i32 emit the trip count of the scalar loop, which represents the number of data elements processed. Thus, we let the vectoriser emit both the scalar and vector loop trip count.

Although at a different stage in the optimisation pipeline, this is exactly what the generic HardwareLoop pass is doing to communicate its information to target-specific codegen passes; it emits a few intrinsics to mark a hardware loop. To illustrate this and also the new intrinsic, this is the flow and life of a tail-predicated vector loop using some heavily edited/reduced examples.

First, the vectoriser emits the number of elements processed, and the loads/stores are masked because tail-folding is applied:

  vector.ph:
    call void @llvm.set.loop.elements.i32(i32 %N)
    br label %vector.body
  vector.body:
    call <4 x i32> @llvm.masked.load
    call <4 x i32> @llvm.masked.load
    call void @llvm.masked.store
    br i1 %12, label %.*, label %vector.body

The HardwareLoop pass then transforms this into the following, adding the hardware loop intrinsics:

  vector.ph:
    call void @llvm.set.loop.elements.i32(i32 %N)
    call void @llvm.set.loop.iterations.i32(i32 %5)
    br label %vector.body
  vector.body:
    call <4 x i32> @llvm.masked.load
    call <4 x i32> @llvm.masked.load
    call void @llvm.masked.store
    call i32 @llvm.loop.decrement.reg
    br i1 %12, label %.*, label %vector.body

We then pick this up in our tail-predication pass, remove the @llvm.set.loop.elements intrinsic, and add @vctp, which is our intrinsic that generates the mask of active/inactive lanes:

  vector.ph:
    call void @llvm.set.loop.iterations.i32(i32 %5)
    br label %vector.body
  vector.body:
    call <4 x i1> @llvm.arm.mve.vctp32
    call <4 x i32> @llvm.masked.load
    call <4 x i32> @llvm.masked.load
    call void @llvm.masked.store
    call i32 @llvm.loop.decrement.reg
    br i1 %12, label %.*, label %vector.body

And this is then further lowered to a tail-predicated loop, or reverted to a 'normal' vector loop if some restrictions are not met.

Cheers,
Sjoerd.
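The relationship between the element count, the vector trip count and the tail predicate described above can be modelled in a few lines. This is only a scalar sketch in Python, not the MVE lowering: `vctp` is a hypothetical stand-in for the mask that the VCTP intrinsic produces (lane i active while i is below the number of remaining elements), and the width of 4 is taken from the `<4 x i32>` examples.

```python
VF = 4  # vector width, matching the <4 x i32> examples


def vctp(remaining, vf=VF):
    """Hypothetical model of a tail predicate: lane i is active while i < remaining."""
    return [lane < remaining for lane in range(vf)]


def tail_predicated_add(a, b):
    """Model c[i] = a[i] + b[i] as a tail-folded vector loop with no scalar epilogue."""
    n = len(a)                                # element count (scalar trip count)
    c = [0] * n
    vector_trip_count = (n + VF - 1) // VF    # ceil(n / VF)
    remaining = n
    for it in range(vector_trip_count):
        mask = vctp(remaining)
        base = it * VF
        for lane in range(VF):
            if mask[lane]:                    # masked load, add, masked store
                c[base + lane] = a[base + lane] + b[base + lane]
        remaining -= VF
    return c, vector_trip_count
```

With 10 elements the loop runs 3 vector iterations, and only the last one has inactive lanes; that is why both the scalar and the vector trip count are needed to set the loop up.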
From: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Sent: Friday, May 1, 2020 11:54 AM
To: Eli Friedman <efriedma at quicinc.com>; llvm-dev <llvm-dev at lists.llvm.org>
Subject: [EXT] Re: [llvm-dev] LV: predication

> Hi Eli,
>
>> The problem with your proposal, as written, is that the vectorizer is producing the intrinsic. Because we don't impose any ordering on optimizations before codegen, every optimization pass in LLVM would have to be taught to preserve any @llvm.set.loop.elements.i32 whenever it makes any change. This is completely impractical because the intrinsic isn't related to anything optimizations would normally look for: it's a random intrinsic in the middle of nowhere.
>
> I do see that point. But is that not also the beauty of it? It just sits in the preheader; if it gets removed, then so be it. And if it is not recognised, then there is also no harm done?

The harm comes if the intrinsic ends up with the wrong value, or attached to the wrong loop.

>> Probably the simplest path to get this working is to derive the number of elements in the backend (in HardwareLoops, or your tail predication pass). You should be able to figure it from the masks used in the llvm.masked.load/store instructions in the loop.
>
> This is what we are currently doing, and it works very well for simpler cases. For the more complicated cases that we now want to handle as well, the pattern matching just becomes a bit too horrible, and it is fragile too. All we need is the information that the vectoriser already has, passed on somehow.
>
> As I am really keen to simplify our backend pass, would there be another way to pass this information on? If emitting an intrinsic is a blocker, could this be done with a loop annotation?

If the problem is specifically figuring out the underlying element count given a predicate, maybe we could attack it from that angle? For example, introduce a special intrinsic for deriving the mask (sort of like the SVE whilelo).

-Eli
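To make the "wrong value" hazard concrete, here is a small Python model. It is a sketch under stated assumptions: the peeling transform is hypothetical (the thread does not name a specific pass), and `touched_indices` is an illustrative helper that lists which elements a tail-predicated loop would access if it derived its mask from a recorded element count.

```python
VF = 4  # vector width, as in the <4 x i32> examples


def touched_indices(start, recorded_elements, vector_trips, vf=VF):
    """Indices accessed by a tail-predicated loop that begins at `start` and
    derives its lane mask from `recorded_elements`."""
    touched = []
    remaining = recorded_elements
    for it in range(vector_trips):
        for lane in range(vf):
            if lane < remaining:              # lane is active under the tail predicate
                touched.append(start + it * vf + lane)
        remaining -= vf
    return touched


N = 10
# Untouched loop: 3 vector iterations cover elements 0..9.
original = touched_indices(0, N, 3)
# Hypothetical transform peels the first vector iteration (4 elements), so the
# remaining loop should cover 6 elements in 2 iterations.
updated = touched_indices(4, N - VF, 2)
# Same loop, but the recorded element count was left at N.
stale = touched_indices(4, N, 2)
out_of_bounds = [i for i in stale if i >= N]
```

In this model the stale count keeps all lanes active in the final iteration, so the loop touches elements 10 and 11 of a 10-element array. A removed intrinsic only loses an optimisation, but a stale one changes behaviour.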
> The harm comes if the intrinsic ends up with the wrong value, or attached to the wrong loop.

The intrinsic is marked as IntrNoDuplicate, so I wasn't worried about it ending up somewhere else. Also, it is a property of a specific loop, a tail-folded vector loop, that I think holds even after the loop is transformed. I.e. unrolling a vector loop is probably not what you want, but even if you do, the element count would remain the same. But yes, I agree that a future whacky optimisation on vector loops could invalidate this; you could then skip the intrinsic, but then you lose out on it....

So, I really like this:

> If the problem is specifically figuring out the underlying element count given a predicate, maybe we could attack it from that angle? For example, introduce a special intrinsic for deriving the mask (sort of like the SVE whilelo).

That would be an excellent way of doing it, and it would also map very well to MVE, where we have a VCTP intrinsic/instruction (Vector Create Tail-Predicate) that creates the mask/predicate. So I will go for this approach. Such an intrinsic was actually also proposed in Sam's original RFC (see https://lists.llvm.org/pipermail/llvm-dev/2019-May/132512.html), but we hadn't implemented it yet.

This intrinsic will probably look something like this:

  <N x i1> @llvm.loop.get.active.mask(AnyInt, AnyInt)

It produces an <N x i1> predicate based on its two arguments, the number of elements and the vector trip count, and it will be used by the masked load/store instructions in the vector body.

I will start drafting an implementation for this and continue with this in D79100.

Thanks,
Sjoerd.
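For readers unfamiliar with the SVE whilelo that Eli refers to, a whilelo-style mask can be sketched as follows. This is a Python model only, and it is not the final semantics of the proposed intrinsic: the function names and the default of 4 lanes are assumptions made for illustration.

```python
def whilelo(base, limit, lanes=4):
    """Lane i is active while base + i < limit (an unsigned compare in hardware)."""
    return [base + i < limit for i in range(lanes)]


def loop_masks(num_elements, lanes=4):
    """Masks seen by each iteration of a tail-folded loop over num_elements."""
    vector_trip_count = (num_elements + lanes - 1) // lanes   # ceil division
    return [whilelo(it * lanes, num_elements, lanes)
            for it in range(vector_trip_count)]
```

For 6 elements this gives one all-true mask followed by a mask with two active lanes. The point of the design is that the element count is an explicit operand of the mask-producing operation, so a backend can read it off directly instead of reconstructing it from a compare.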
I realize this discussion and D79100 have progressed, sorry, but could we revisit the "simplest path" of deriving the desired number?

> This is what we are currently doing and works excellent for simpler cases. For the more complicated cases that we now what to handle as well, the pattern matching just becomes a bit too horrible, and it is fragile too.

Could you elaborate on these more complicated cases and the difficulty they entail? Presumably a vector compare of a "Vector Induction Variable" (VIV) with a broadcasted invariant value is sought, to be RAUW'd by a hardware-configured mask. Is it the recognition of VIVs that's becoming horrible and fragile?

It may be generally useful to have a robust utility and/or analysis that identifies such VIVs, effectively extending SCEV to reason about vector values, rather than complicating any backend pass. Middle-end passes may find this information useful too, operating after LV, or on vector IR produced elsewhere. This is somewhat analogous to the argument about relying on a canonical induction variable versus employing SCEV to derive it, http://lists.llvm.org/pipermail/llvm-dev/2020-April/140572.html.

A dedicated intrinsic that freezes the compare instruction, for no apparent reason, may potentially cripple subsequent passes from further optimizing the vectorized loop.
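The compare described here, a vector induction variable checked against a broadcast invariant, can be modelled as below. This is a Python sketch with assumed names (`splat`, `vector_iv`, `icmp_ule`, `masks_for_loop` are all illustrative); it treats the VIV as a recurrence that starts at <0,1,2,3> and steps by the vector width, and uses the backedge-taken count (element count minus one) as the invariant.

```python
VF = 4  # vector width, as in the <4 x i32> examples


def splat(x, vf=VF):
    """Broadcast a scalar to all lanes."""
    return [x] * vf


def vector_iv(iteration, vf=VF):
    """VIV value in a given vector iteration: start <0,1,..,vf-1>, step splat(vf)."""
    return [lane + iteration * vf for lane in range(vf)]


def icmp_ule(lhs, rhs):
    """Lane-wise less-or-equal on non-negative values."""
    return [l <= r for l, r in zip(lhs, rhs)]


def masks_for_loop(btc, vf=VF):
    """Per-iteration masks of a tail-folded loop with backedge-taken count btc."""
    trips = (btc + 1 + vf - 1) // vf          # ceil((btc + 1) / vf)
    return [icmp_ule(vector_iv(it, vf), splat(btc, vf)) for it in range(trips)]
```

A utility that recognised the VIV as this kind of recurrence would be the vector analogue of what SCEV already does for scalar induction variables.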
Hi Ayal,

Let me start with commenting on this:

> A dedicated intrinsic that freezes the compare instruction, for no apparent reason, may potentially cripple subsequent passes from further optimizing the vectorized loop.

The point is that we have a very good reason, which is that it passes the right information on to the backend, enabling optimisations as opposed to crippling them. The compare that we are talking about is the one that compares the induction step and the backedge-taken count, and this feeds the masked loads/stores. Thus, for example, we are not talking about the compare controlling the backedge, and it is not affecting loop control. While it is undoubtedly true that there could be optimisations that can't handle this particular icmp instruction, it is difficult for me to imagine at this point that being unable to analyse this icmp would cripple things.

> Could you elaborate on these more complicated cases and the difficulty they entail?

The problem that we are solving is that we need the scalar loop backedge-taken count (BTC), or just the iteration count, of the original scalar loop for a given vector loop. Just to be clear, we do not only need the vector iteration count, but also the scalar loop Iteration Count (IC). We need this for a certain form of predication. This information, the scalar loop IC, is produced by the vectoriser, and is materialised in the form of the instructions that generate the predicates for the masked loads/stores: this icmp with the induction step and the scalar IC.

Our current approach works for simple cases, because we pattern match the IR and look for the scalar IC in these icmps that feed masked loads/stores. To make sure we don't accidentally pattern match a random icmp, we compare this with SCEV information. Thus, we have to match up a SCEV expression with pattern-matched IR. I could give IR examples, but hopefully it's easy to imagine that this pattern matching and matching up with SCEV info is becoming a bit horrible for doubly nested loops or reductions. This icmp materialised as @llvm.get.active.lanes.mask(%IV, %BTC) avoids all of this, as we can just pick up %BTC in the backend.

As we are looking for the scalar loop iteration count, not the VIV, I don't think SCEV for vector loops is going to be helpful.

Please let me know if I can elaborate further, or if things are not clear.

Cheers,
Sjoerd.
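The difference between pattern matching the icmp and reading an intrinsic operand can be illustrated with a toy matcher. This is a Python sketch only: the nested-tuple encoding of expressions and the node names (`icmp_ule`, `add`, `splat`, `const_vec`, `active_lane_mask`) are hypothetical and are not LLVM's IR, PatternMatch or SCEV APIs.

```python
def is_splat(node):
    """True for ("splat", name) nodes."""
    return (isinstance(node, tuple) and len(node) == 2
            and node[0] == "splat" and isinstance(node[1], str))


def is_step_vector(node):
    """True for ("const_vec", (0, 1, ..., n-1)) nodes."""
    return (isinstance(node, tuple) and len(node) == 2
            and node[0] == "const_vec" and isinstance(node[1], tuple)
            and node[1] == tuple(range(len(node[1]))))


def match_icmp_mask(expr):
    """Recover the bound from icmp_ule(add(splat(iv), <0,1,..>), splat(bound));
    returns None if the expression has any other shape."""
    if not (isinstance(expr, tuple) and len(expr) == 3 and expr[0] == "icmp_ule"):
        return None
    _, lhs, rhs = expr
    if not (isinstance(lhs, tuple) and len(lhs) == 3 and lhs[0] == "add"
            and is_splat(lhs[1]) and is_step_vector(lhs[2])):
        return None
    return rhs[1] if is_splat(rhs) else None


def match_intrinsic_mask(expr):
    """Read the bound directly from active_lane_mask(iv, bound)."""
    if isinstance(expr, tuple) and len(expr) == 3 and expr[0] == "active_lane_mask":
        return expr[2]
    return None


canonical = ("icmp_ule",
             ("add", ("splat", "iv"), ("const_vec", (0, 1, 2, 3))),
             ("splat", "btc"))
# Same computation with the add operands swapped.
commuted = ("icmp_ule",
            ("add", ("const_vec", (0, 1, 2, 3)), ("splat", "iv")),
            ("splat", "btc"))
intrinsic = ("active_lane_mask", "iv", "btc")
```

The icmp matcher finds the bound in `canonical` but returns None for `commuted`, even though both compute the same mask, whereas the intrinsic form exposes the bound as an operand regardless of how the surrounding code was rewritten.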
Sjoerd Meijer via llvm-dev <llvm-dev at lists.llvm.org> writes:

> This is what we are currently doing and works excellent for simpler
> cases. For the more complicated cases that we now want to handle as
> well, the pattern matching just becomes a bit too horrible, and it is
> fragile too. All we need is the information that the vectoriser
> already has, and pass this on somehow.
>
> As I am really keen to simplify our backend pass, would there be another
> way to pass this information on? If emitting an intrinsic is a
> blocker, could this be done with a loop annotation?

I have had to communicate information exactly like this from the optimizer to late codegen. It is painful and would be a lot easier if we had metadata support in machine instructions. Perhaps that is an avenue to pursue, as it would be more general and applicable to lots of things.

-David