I'm trying to understand how predicated/masked instructions can be generated in llvm, specifically an instruction where a set bit in the mask will write the new result into the corresponding vector lane in the destination and a clear bit will cause the lane in the destination to remain what it was before the instruction executed. I've seen a few places that suggest 'select' is the proper way to implement predication. I believe the predicated form cannot be explicitly expressed in LLVM asm, because it is SSA. It can be done implicitly: %sum = add <16 x i32> %x, %y %newvalue = select <16 x i1> %mask, <16 x i32> %sum, <16 x i32> %oldvalue The issue becomes how to match the instruction form above in a TableGen pattern. In order for this to emit a masked instruction, %newvalue and %oldvalue must be assigned to same physical register (I'm assuming an instruction like 'add %r0{%m0} %r1 %r2') However, I don't think there is even a notion of physical registers at the point that instruction selection is performed and the virtual registers will be different because everything is still in SSA form. I suspect I'm missing something fundamental. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130506/37bfb706/attachment.html>
Jeff Bush <jeffbush001 at gmail.com> writes:> I'm trying to understand how predicated/masked instructions can be > generated in llvm, specifically an instruction where a set bit in the > mask will write the new result into the corresponding vector lane in > the destination and a clear bit will cause the lane in the destination > to remain what it was before the instruction executed. > > I've seen a few places that suggest 'select' is the proper way to > implement predication. I believe the predicated form cannot be > explicitly expressed in LLVM asm, because it is SSA. It can be done > implicitly: > > %sum = add <16 x i32> %x, %y > %newvalue = select <16 x i1> %mask, <16 x i32> %sum, <16 x i32> > %oldvalueThis is not necessarily sufficient in general. For any operation that can trap (pretty much any fp operation) you also need to protect the inputs with safe values to adhere to the semantics of the LLVM IR. For an fadd, for example: %tx = select %mask, %x, <0.0, 0.0, 0.0 ...> %ty = select %mask, %y, <0.0, 0.0, 0.0 ...> %sum = fadd %tx, %ty %newvalue = select %mask, %sum, %oldvalue Then this entire pattern of selects and fadd will have to be matched in isel, which would throw away the safe value protection overhead (because the hardware masking takes care of the safety). In fact the problem is more general. Even in the integer add case, you have to ensure that llvm won't move the definition of %sum before the definition of %mask. Otherwise you're going to end up generating a predicated add with a bad mask value. Selects on the inputs to the integer add also solve this problem.> The issue becomes how to match the instruction form above in a > TableGen pattern. In order for this to emit a masked instruction, > %newvalue and %oldvalue must be assigned to same physical register (I'm > assuming an instruction like 'add %r0{%m0} %r1 %r2') However, I don't > think there is even a notion of physical registers at the point that > instruction selection is performed and the virtual registers will be > different because everything is still in SSA form.Potentially you could use the "$src = $dst" constraint as in the two-address x86 forms. I don't know that TableGen has been generalized enough to do this, though. I think it's pretty highly specialized to specific x86 two-address forms at the moment. Another problem is how to express %oldvalue in LLVM IR. Presumably %newvalue will be consumed, possibly by another arithmetic operation. Presumably %oldvalue can similarly come from a previous arithmetic operation feeding into the add. If that's true, then %oldvalue is either %x or %y. Otherwise it is some other thing highly context-dependent. The gpuocelot project ran into the problem and they talk about it here: http://code.google.com/p/gpuocelot/source/browse/wiki/LLVM.wiki?r=272 The bottom line is that it is probably easier to set this up before LLVM IR goes into SSA form. There is a lot of interest in predication and a lot of recent discussions about handling it in LLVM. Personally I think that long-term we will need some IR changes. It might be as simple as adding an IR-level predicated load and predicated store, I'm not sure. We're pretty heavily into trying to get predicated vector code working well in LLVM here. Expect to see many questions, code examples and hopefully upstream code submissions over the next couple of months. -David
On May 8, 2013, at 11:07 AM, dag at cray.com wrote:> It might be as simple as adding > an IR-level predicated load and predicated store, I'm not sure.I think that selects on the inputs+outputs of instructions is a good abstraction, and I don't think that we need to add a mask operand to every LLVM IR instruction. However, we do need support for masked load/stores, and I think that we should implement them as target independent intrinsics. I don't want to change the generic LLVM IR Load/Store instructions because I don't want to modify all of the existing optimizations and I also don't want other users of the compiler to pay for this complexity. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130508/ad04dea8/attachment.html>
On Wed, May 8, 2013 at 11:07 AM, <dag at cray.com> wrote:>> The issue becomes how to match the instruction form above in a >> TableGen pattern. In order for this to emit a masked instruction, >> %newvalue and %oldvalue must be assigned to same physical register (I'm >> assuming an instruction like 'add %r0{%m0} %r1 %r2') However, I don't >> think there is even a notion of physical registers at the point that >> instruction selection is performed and the virtual registers will be >> different because everything is still in SSA form. > > Potentially you could use the "$src = $dst" constraint as in the > two-address x86 forms. I don't know that TableGen has been generalized > enough to do this, though. I think it's pretty highly specialized to > specific x86 two-address forms at the moment.I'm not familiar with the two-address constraint, but it seems like this would be challenging to express generally. Consider the previous example: %tx = select %mask, %x, <0.0, 0.0, 0.0 ...> %ty = select %mask, %y, <0.0, 0.0, 0.0 ...> %sum = fadd %tx, %ty %newvalue = select %mask, %sum, %oldvalue I believe the generated instructions depend on whether %oldvalue is still live after the last instruction. If it is, you need to generate two instructions: a copy into a new physical register then predicated write to it. If it is not used, then it is just a predicated write to the same register. move r1, r0 fadd r1{m0}, r2, r3 (r0 is now %oldvalue and r1 is %newvalue) vs. fadd r0{m0}, r2, r3 (r0 was %oldvalue and is now %newvalue)> The bottom line is that it is probably easier to set this up before LLVM > IR goes into SSA form.That makes sense, but it's unclear to me how you would preserve that information after going into SSA form.> There is a lot of interest in predication and a lot of recent > discussions about handling it in LLVM. Personally I think that > long-term we will need some IR changes. It might be as simple as adding > an IR-level predicated load and predicated store, I'm not sure.It seems to me that these are not really LLVM issues as much as the fact that SSA doesn't cleanly map to predicated instructions. For example, if predicates were hypothetically added universally to the IR (which I don't think anyone wants to do), it's not clear to me how that would even work. How would you specify what value the result would be received for non-enabled lanes? Perhaps another parameter: %newvalue = fadd %x, %y, %mask, %previousvalue Yuck.
On Thu, May 9, 2013 at 8:10 AM, <dag at cray.com> wrote:> Jeff Bush <jeffbush001 at gmail.com> writes: > >> %tx = select %mask, %x, <0.0, 0.0, 0.0 ...> >> %ty = select %mask, %y, <0.0, 0.0, 0.0 ...> >> %sum = fadd %tx, %ty >> %newvalue = select %mask, %sum, %oldvalue >> >> I believe the generated instructions depend on whether %oldvalue is >> still live after the last instruction. If it is, you need to generate >> two instructions: a copy into a new physical register then predicated >> write to it. If it is not used, then it is just a predicated write to >> the same register. >> >> move r1, r0 >> fadd r1{m0}, r2, r3 >> >> (r0 is now %oldvalue and r1 is %newvalue) >> >> vs. >> >> fadd r0{m0}, r2, r3 >> >> (r0 was %oldvalue and is now %newvalue) > > I'm assuming some parts of %oldvalue are still used. The masked fadd > could preserve them for false values of the mask, depending on how > masking was defined. Therefore, there's no need for a register copy. > If the masked operation does not preserve the old values in r0, then we > do need a register copy. > > Preserving old values does complicate things for SSA, as you note. > >>> The bottom line is that it is probably easier to set this up before LLVM >>> IR goes into SSA form. >> >> That makes sense, but it's unclear to me how you would preserve that >> information after going into SSA form. > > I should think the semantics of select would handle that. After a > select all vector elements of the result are defined. There is no > preservation of old values. There cannot be, by definition of SSA. > >> It seems to me that these are not really LLVM issues as much as the >> fact that SSA doesn't cleanly map to predicated instructions. > > It entirely depends on how the predication is defined to work.Good point. I was thinking of it narrowly as preserving the old value in the register. I guess I'd amend my previous statement to say that it actually does map just fine to SSA, but instruction selection becomes more complex. It sounds like the current LLVM instruction selection algorithm can't really handle the use case I described cleanly (generating predicated arithmetic instructions that preserve the old register value). Is that a fair statement?>> For example, if predicates were hypothetically added universally to >> the IR (which I don't think anyone wants to do), it's not clear to me >> how that would even work. How would you specify what value the result >> would be received for non-enabled lanes? Perhaps another parameter: >> >> %newvalue = fadd %x, %y, %mask, %previousvalue > > It depends on how you define the mask operation. > > On the Cray X1, result elements mapped to false mask values are > undefined after the operation.I assume the only point of the mask is to avoid traps/exceptions in this case (otherwise it doesn't really do anything, right?).> This is convenient for the hardware from > a register renaming perspective. For the same reason, tt also maps well > to SSA because there is no reuse of old values. It is really no > different than a shufflevector with some undef inputs from an SSA > viewpoint. > > On Intel's Knights Corner, result elements mapped to false mask elements > are preserved. This is tricky for SSA but I think we can model it with > select instructions and proper selection of input values. The input > related to %oldvalue will be very context-sensitive and I'm not sure > there's a way to express it generally for a single instruction outside a > code sequence. > > My assumption has always been that general predication in the IR would > have similar semantics to select except that traps would be disabled for > false mask elements. However I haven't fully thought through the > implications of that. > > But I don't think we need general predication throughout the IR, so I > don't think we need to worry about it. What we need is predication at > the expression tree leaves, which is why we're talking about loads and > stores. For loads, there is no %oldvalue so we don't have to worry > about it. > > %ra = ... > if (%mask) { > %rb = [mem1] > } > else { > %rb = [mem2] > } > %rc = %ra + %rb > > Would be translated to: > > %ra = ... > > %rt = load [mem1], %mask > %rf = load [mem2], ~%mask > > %rb = select %mask, %rt, %rf > %rc = fadd %ra, %rb > > Since the input select defined all vector elements we don't need selects > on the fadd. It's an unmasked operation. > > Slightly more complicated: > > %ra = ... > if (%mask) { > %rb = [mem1] > %rc = %ra + %rb > } > else { > %rb = [mem2] > %rc = %ra - %rb > } > > Would be translated to: > > %ra = ... > > %rt = load [mem1], %mask > %ru = select %mask, %rt, 0.0 ; Fill with safe values > %rv = fadd %ra, %ru > > %rf = load [mem2], ~%mask > %rg = select ~%mask, %rf, 0.0 ; Fill with safe values > %rh = fadd %ra, %rg > > %rc = select %mask, %rv, %rh > > The reason we need predication on loads and stores is so they don't trap > for false values of the mask. Currently there is no way to express this > in the IR. > > -David
On May 9, 2013, at 3:05 PM, Jeff Bush <jeffbush001 at gmail.com> wrote:> On Thu, May 9, 2013 at 8:10 AM, <dag at cray.com> wrote: >> Jeff Bush <jeffbush001 at gmail.com> writes: >> >>> %tx = select %mask, %x, <0.0, 0.0, 0.0 ...> >>> %ty = select %mask, %y, <0.0, 0.0, 0.0 ...> >>> %sum = fadd %tx, %ty >>> %newvalue = select %mask, %sum, %oldvalue >>> >>> I believe the generated instructions depend on whether %oldvalue is >>> still live after the last instruction. If it is, you need to generate >>> two instructions: a copy into a new physical register then predicated >>> write to it. If it is not used, then it is just a predicated write to >>> the same register. >>> >>> move r1, r0 >>> fadd r1{m0}, r2, r3 >>> >>> (r0 is now %oldvalue and r1 is %newvalue) >>> >>> vs. >>> >>> fadd r0{m0}, r2, r3 >>> >>> (r0 was %oldvalue and is now %newvalue) >> >> I'm assuming some parts of %oldvalue are still used. The masked fadd >> could preserve them for false values of the mask, depending on how >> masking was defined. Therefore, there's no need for a register copy. >> If the masked operation does not preserve the old values in r0, then we >> do need a register copy. >> >> Preserving old values does complicate things for SSA, as you note. >> >>>> The bottom line is that it is probably easier to set this up before LLVM >>>> IR goes into SSA form. >>> >>> That makes sense, but it's unclear to me how you would preserve that >>> information after going into SSA form. >> >> I should think the semantics of select would handle that. After a >> select all vector elements of the result are defined. There is no >> preservation of old values. There cannot be, by definition of SSA. >> >>> It seems to me that these are not really LLVM issues as much as the >>> fact that SSA doesn't cleanly map to predicated instructions. >> >> It entirely depends on how the predication is defined to work. > > Good point. I was thinking of it narrowly as preserving the old value > in the register. I guess I'd amend my previous statement to say that > it actually does map just fine to SSA, but instruction selection > becomes more complex. > > It sounds like the current LLVM instruction selection algorithm can't > really handle the use case I described cleanly (generating predicated > arithmetic instructions that preserve the old register value). Is > that a fair statement?I don’t think this is a fair statement. Tied register operands should handle this use case just fine. This problem is similar to that of two-address constraints. Two address instructions work as follows. When we match an instruction we “tie” input and output registers. Say you had an LLVM-IR add: x = add i32 y, z for x86 we generate the following machine ir instruction during ISel: vr0<def, tied1> = ADD32rr vr1<use, tied0>, vr2<use> Once we go out of SSA during CodeGen we have to replace the two address constraint by copies: vr0 = vr1 vr0 = ADD32rr vr0, vr2 Coalescing and allocation will then take care of removing unnecessary copies. I think that predicate instructions would be handled similar (for the sake of making the example shorted I replaced your sequence of IR instruction by one “virtual” IR instruction): x = predicated_add %mask, %x, %y, %oldvalue This (actually, your sequence of selects, and add) would be matched during ISel to: vr0<def, tied1> = PRED_ADD mask_vr, vr1<use>, vr2<use>, vr3<use, tied0>>From here on the machinery is the same, the two-address pass would translate such instructions to:vr0 = vr3 vr0 = PRED_ADD mask_vr, vr1, vr2, vr0 If vr3 is not live after PRED_ADD coalescing will coalesce vr0 and vr3. I don’t think there is a fundamental difficulty handling predictated instructions this way - at least wrt. to register constraints.
Jeff Bush <jeffbush001 at gmail.com> writes:>>> It seems to me that these are not really LLVM issues as much as the >>> fact that SSA doesn't cleanly map to predicated instructions. >> >> It entirely depends on how the predication is defined to work. > > Good point. I was thinking of it narrowly as preserving the old value > in the register. I guess I'd amend my previous statement to say that > it actually does map just fine to SSA, but instruction selection > becomes more complex.Yes, if you want to take advantage of the hardware preserving old values. There's nothing that requires you to do that, however. You can always write to a new hardware register and effectively get the "undefined for false mask" behavior. Maybe a later pass could even clean things up and opportunistically use the old value preserving behavior by rewriting register names and combining instructions. Just thinking off the top of my head.> It sounds like the current LLVM instruction selection algorithm can't > really handle the use case I described cleanly (generating predicated > arithmetic instructions that preserve the old register value). Is > that a fair statement?I actually don't know yet. I'll let you know in a few weeks once I've tested some things. :)>>> For example, if predicates were hypothetically added universally to >>> the IR (which I don't think anyone wants to do), it's not clear to me >>> how that would even work. How would you specify what value the result >>> would be received for non-enabled lanes? Perhaps another parameter: >>> >>> %newvalue = fadd %x, %y, %mask, %previousvalue >> >> It depends on how you define the mask operation. >> >> On the Cray X1, result elements mapped to false mask values are >> undefined after the operation. > > I assume the only point of the mask is to avoid traps/exceptions in > this case (otherwise it doesn't really do anything, right?).On the X1 it also potentially increases performance as masked elements can be skipped and the operation can finish early. On a machine like Knights Corner, the mask is the only way to have a dynamic vector length. Traditional vector machines have a vector length register that controls how many elements an operation should affect. This is used to avoid the need for a remainder loop after the main vectorized loop. A mask can be used for the same purpose. So a vector mask is used for more than just traditional if-conversion. -David