We're looking at how to handle masked vector operations in architectures
like Knights Corner.  In our case, we have to translate from a fully
vectorized IR that has mask support to LLVM IR, which does not have mask
support.

For non-trapping instructions this is fairly straightforward:

  ; Input
  t1 = add t2, t3, mask

  ; LLVM IR -- assuming we want zeros in the false positions, which is
  ; not always the case
  tt = add t2, t3
  t1 = select mask, tt, 0

It's easy enough to write a TableGen pattern to match add+select and
emit a masked instruction.  Alternatively, we can always resort to
manual lowering (ugh).

For trapping operations this is problematic.  Take a load.  Here's the
same attempt with a load:

  tt = load [addr]
  t1 = select mask, tt, 0

The problem is that the load is unconditional.  Originally it was
masked, presumably because the original scalar load was under a
condition, probably protecting against traps.  However, since LLVM has
no concept of a masked load, to stay within the IR we need to use an
unconditional load.  We can get the same result vector after the
select, but that's too late: the load has already executed, and the
LLVM passes will assume that the load cannot fault (otherwise it's
undefined behavior).  The LLVM IR does not convey the same semantics as
the fully predicated IR.

It seems the only solution is to create an intrinsic:

  llvm_int_load_masked mask, [addr]

But this unnecessarily shuts down optimization.

Similar problems exist with any trapping instruction (div, mod, etc.).
It gets even worse when you consider that any floating point operation
can trap on a signalling NaN input.

The gpuocelot project is essentially trying to do the same thing, but I
haven't dived deep enough into their notes and implementation to see
how they handle this issue.  Perhaps because current GPUs don't trap
it's a non-issue, but that will likely change in the future.

So are there any ideas out there for how to efficiently handle this?
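The lane semantics above can be sketched in plain Python (a simulation
only; the function names are made up for illustration and are not any
LLVM interface).  It shows why the add+select lowering is fine while
the load+select lowering is not: the unconditional load touches memory
in every lane, so a masked-off lane can still fault.

```python
def masked_add(t2, t3, mask):
    # Non-trapping case: compute all lanes unconditionally, then select.
    tt = [a + b for a, b in zip(t2, t3)]
    return [t if m else 0 for t, m in zip(tt, mask)]

def unmasked_gather(mem, idxs):
    # The problematic lowering: every lane loads, even masked-off ones,
    # so a bad address in a false lane still "faults" (raises here).
    return [mem[i] for i in idxs]

def masked_gather(mem, idxs, mask):
    # What a masked hardware load actually does: lanes with a false
    # mask bit never touch memory, so they can never fault.
    return [mem[i] if m else 0 for i, m in zip(idxs, mask)]
```

Running `masked_gather` with an invalid address in a false lane
succeeds, while `unmasked_gather` on the same inputs raises — the
select comes too late to undo that.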
We've talked about LLVM and masks before, and it's clear that there is
strong resistance to adding masks to the IR.  Perhaps an alternative is
to mark an instruction as "may trap" so that LLVM will not perform
certain transformations on it.  Of course, that involves teaching all
of the passes about a new "may trap" attribute, or whatever mechanism
is devised.

I would very much appreciate thoughts and ideas on this.  As it is, it
doesn't seem like it's possible to generate efficient LLVM IR for fully
predicated instruction sets.

                           -David
Hi David,

> It seems the only solution is to create an intrinsic:
>
>   llvm_int_load_masked mask, [addr]
>
> But this unnecessarily shuts down optimization.

I think that using intrinsics is the right solution.  I imagine that
most interesting load/store optimizations happen before vectorization,
so I am not sure how much we can gain by optimizing masked load/stores.

> Similar problems exist with any trapping instruction (div, mod, etc.).
> It gets even worse when you consider that any floating point operation
> can trap on a signalling NaN input.

For DIV/MOD you can blend the inputs BEFORE the operation.  You can
place ones or zeros depending on the operation.

> So are there any ideas out there for how to efficiently handle this?
> We've talked about LLVM and masks before and it's clear that there is
> strong resistance to adding masks to the IR.

Yes.  I think that the consensus is that we don't need to predicate the
IR itself to support MIC-like processors.

Thanks,
Nadav
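The blend-before-the-operation trick Nadav describes can be sketched in
Python (an illustration only, not LLVM syntax): masked-off lanes get a
harmless divisor of 1 before the unconditional divide, so no lane can
trap, and the final select throws the dummy results away.

```python
def masked_div(v1, v2, mask):
    # Blend: put a safe 1 into the divisor's false lanes first.
    safe_v2 = [b if m else 1 for b, m in zip(v2, mask)]
    # The divide is now unconditional but cannot divide by zero.
    tt = [a // b for a, b in zip(v1, safe_v2)]
    # Final select: zeros in the false lanes.
    return [t if m else 0 for t, m in zip(tt, mask)]
```

The cost is the extra blend on the input, which is the inefficiency
David objects to below.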
Nadav Rotem <nrotem at apple.com> writes:

>> It seems the only solution is to create an intrinsic:
>>
>>   llvm_int_load_masked mask, [addr]
>>
>> But this unnecessarily shuts down optimization.
>
> I think that using intrinsics is the right solution.  I imagine that
> most interesting load/store optimizations happen before vectorization,
> so I am not sure how much we can gain by optimizing masked
> load/stores.

Perhaps that is true.  If this is the only intrinsic we need (well, a
store too), maybe it's not too bad.

>> Similar problems exist with any trapping instruction (div, mod, etc.).
>> It gets even worse when you consider that any floating point
>> operation can trap on a signalling NaN input.
>
> For DIV/MOD you can blend the inputs BEFORE the operation.  You can
> place ones or zeros depending on the operation.

That's true, but it's inefficient.  I suppose we can write patterns to
match the input selects as well and just drop them, opting for the
masked operation.  But this all requires that these
select/select/op/select sequences stay intact throughout LLVM so isel
can match them.  I'm not totally confident that's possible.

>> So are there any ideas out there for how to efficiently handle this?
>> We've talked about LLVM and masks before and it's clear that there is
>> strong resistance to adding masks to the IR.
>
> Yes.  I think that the consensus is that we don't need to predicate
> the IR itself to support MIC-like processors.

Perhaps not, but I think we need a little more than we have right now.
I'll ponder this some more, but in the meantime, please continue to add
thoughts and ideas.

                           -David
Nadav Rotem <nrotem at apple.com> writes:

> For DIV/MOD you can blend the inputs BEFORE the operation.  You can
> place ones or zeros depending on the operation.

Quick follow-up on this.  What about using "undef" as the input for
false items:

  tv1 = select mask, v1, undef
  tv2 = select mask, v2, undef
  tv3 = div tv1, tv2
  v3  = select mask, tv3, undef

I'm always confused about the semantics of undef.  Is the above safe
code?  It would simplify things a bit not to have to track which input
values are safe based on the context of an operation.

                           -David
Hi David,

On 02/05/13 17:57, dag at cray.com wrote:
> We're looking at how to handle masked vector operations in
> architectures like Knights Corner.  In our case, we have to translate
> from a fully vectorized IR that has mask support to LLVM IR which does
> not have mask support.
>
> For non-trapping instructions this is fairly straightforward:
>
>   ; Input
>   t1 = add t2, t3, mask
>
>   ; LLVM IR -- assuming we want zeros in the false positions, which is
>   ; not always the case
>   tt = add t2, t3
>   t1 = select mask, tt, 0

there seems to be a plan to get rid of the select instruction and just
use branches and phi nodes instead.  Amongst other things this requires
boosting the power of codegen so that branches+phi nodes can be turned
into cmov or whatever when appropriate.

> It's easy enough to write a TableGen pattern to match add+select and
> emit a masked instruction.  Alternatively, we can always resort to
> manual lowering (ugh).
>
> For trapping operations this is problematic.  Take a load.  Here's the
> same attempt with a load:
>
>   tt = load [addr]
>   t1 = select mask, tt, 0

This would not be problematic at the IR level if it was done by
branching to one of two basic blocks based on the condition, and doing
the load in the appropriate basic block.  Codegen would however need to
become powerful enough to turn this construct into your target's
predicated load.

Ciao,

Duncan.
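The branchy form Duncan describes can be rendered per lane in Python (a
sketch of the control flow only; the names are illustrative, and real
IR would of course use basic blocks and a phi rather than functions).
The key property is that the load only executes on the taken path:

```python
def lane_load(mem, addr, m):
    if m:
        # Block containing the (potentially trapping) load: only
        # entered when this lane's mask bit is true.
        tt = mem[addr]
    else:
        # Block that never touches memory; produces the default value.
        tt = 0
    # The "phi": merge of the two incoming values.
    return tt

def branchy_masked_load(mem, addrs, mask):
    return [lane_load(mem, a, m) for a, m in zip(addrs, mask)]
```

Codegen would then have to recognize this per-lane diamond and collapse
it back into a single predicated vector load, which is the part that
does not exist today.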
On May 3, 2013, at 1:10 AM, Duncan Sands <baldrick at free.fr> wrote:

> there seems to be a plan to get rid of the select instruction and just
> use branches and phi nodes instead.  Amongst other things this
> requires boosting the power of codegen so that branches+phi nodes can
> be turned into cmov or whatever when appropriate.

Hi Duncan,

Thanks for commenting on this.  I completely disagree with the proposal
to remove selects from the IR, and David's email is a perfect example
of how useful selects are.  I'd be happy to comment on the proposal
when it is presented to the community.

Thanks,
Nadav
Duncan Sands <baldrick at free.fr> writes:

> there seems to be a plan to get rid of the select instruction and just
> use branches and phi nodes instead.  Amongst other things this
> requires boosting the power of codegen so that branches+phi nodes can
> be turned into cmov or whatever when appropriate.

This is a very BAD idea.  Are you telling me that every predicated
instruction will need to be in its own basic block so it can be
represented with a phi?  Certainly this will be true for any
instructions that do not share masks.

PLEASE do not do this!  This would be a huge step backward in vector
support.

>> It's easy enough to write a TableGen pattern to match add+select and
>> emit a masked instruction.  Alternatively, we can always resort to
>> manual lowering (ugh).
>>
>> For trapping operations this is problematic.  Take a load.  Here's
>> the same attempt with a load:
>>
>>   tt = load [addr]
>>   t1 = select mask, tt, 0
>
> This would not be problematic at the IR level if it was done by
> branching to one of two basic blocks based on the condition, and doing
> the load in the appropriate basic block.  Codegen would however need
> to become powerful enough to turn this construct into your target's
> predicated load.

How will that ever happen?  isel has never known much about control
flow at all.

Please do NOT remove select until we have a solid replacement in place,
something that's tested and known to work.  I cannot object strongly
enough.  I've bitten my tongue at a few IR changes, but not this one.

Who proposed this change?  Why has it not been discussed on the list?

                           -David
On 2 May 2013 16:57, <dag at cray.com> wrote:

> We're looking at how to handle masked vector operations in
> architectures like Knights Corner.  In our case, we have to translate
> from a fully vectorized IR that has mask support to LLVM IR which does
> not have mask support.

Has anyone done a comparison between the "fully vectorized IR" and
"LLVM IR"?  If someone has already invented a "fully vectorized IR", it
might be beneficial not to re-invent it for LLVM.

For example, if you are optimizing a loop and splitting it into three
loops, one of which can then be fully vectorized, it would be useful to
represent that optimization/translation at the IR level.  Adding mask
support to LLVM IR would therefore seem a sensible course to me.  It
might be a short-term pain, but it would possibly benefit the long-term
optimization goals of LLVM.
James Courtier-Dutton <james.dutton at gmail.com> writes:

> On 2 May 2013 16:57, <dag at cray.com> wrote:
>> We're looking at how to handle masked vector operations in
>> architectures like Knights Corner.  In our case, we have to translate
>> from a fully vectorized IR that has mask support to LLVM IR which
>> does not have mask support.
>
> Has anyone done a comparison between the "fully vectorized IR" and
> "LLVM IR"?  If someone has already invented a "fully vectorized IR",
> it might be beneficial not to re-invent it for LLVM.
>
> For example, if you are optimizing a loop and splitting it into three
> loops, one of which can then be fully vectorized, it would be useful
> to represent that optimization/translation at the IR level.  Adding
> mask support to LLVM IR would therefore seem a sensible course to me.
> It might be a short-term pain, but it would possibly benefit the
> long-term optimization goals of LLVM.

The vectorized IR we are translating from has explicit masking at the
leaf nodes and implied masking at the inner nodes.  For example:

         ___MERGE___
        /           \
       +             -
      / \           / \
     /   \         /   \
  [a#m1] [b#m1] [a#m2] [b#m2]

So the add is assumed to operate under #m1 and the subtract is assumed
to operate under #m2.  Then there is an explicit merge operation to
form the final vector.  In this case #m2 = ~#m1.

I believe we can represent this in LLVM IR with selects as long as we
have predication at the leaves.  The trick is to have isel match all of
these selects and generate an efficient predicated operation.  I'm
working on that experiment to see if it will suffice.

So I don't know that a fully predicated IR would be any better than
selects + predication at the leaves.  That's why I'm doing the
experiment. :)

                           -David
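The merge tree above can be sketched per lane in Python (illustration
only; names are mine and none of this is LLVM or the source IR's
syntax).  Since #m2 = ~#m1, the MERGE collapses into a single select on
#m1: add lanes where it is true, subtract lanes where it is false.

```python
def merge_example(a, b, m1):
    # Implied masks on the inner nodes: the add operates under #m1,
    # the subtract under #m2 = ~#m1.
    add = [x + y for x, y in zip(a, b)]
    sub = [x - y for x, y in zip(a, b)]
    # The explicit MERGE, expressed as one select on m1.
    return [s if m else t for s, t, m in zip(add, sub, m1)]
```

This is the select-based representation David proposes; the open
question in the thread is whether isel can reliably match such select
trees back into predicated instructions.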