thr3ads.net - llvm dev - [LLVMdev] Predicated Vector Operations [May 2013]

If this information is useful, please help other people find it:
Share via:

Jeff Bush

2013-May-07 01:45 UTC

[LLVMdev] Predicated Vector Operations

I'm trying to understand how predicated/masked instructions can be
generated in llvm, specifically an instruction where a set bit in the mask
will write the new result into the corresponding vector lane in the
destination and a clear bit will cause the lane in the destination to
remain what it was before the instruction executed.

I've seen a few places that suggest 'select' is the proper way to
implement
predication.  I believe the predicated form cannot be explicitly expressed
in LLVM asm, because it is SSA.  It can be done implicitly:

  %sum = add <16 x i32> %x, %y
%newvalue = select <16 x i1> %mask, <16 x i32> %sum, <16 x
i32> %oldvalue
The issue becomes how to match the instruction form above in a TableGen
pattern.  In order for this to emit a masked instruction, %newvalue and
%oldvalue must be assigned to same physical register (I'm assuming an
instruction like 'add %r0{%m0} %r1 %r2') However, I don't think
there is
even a notion of physical registers at the point that instruction selection
is performed and the virtual registers will be different because everything
is still in SSA form.

I suspect I'm missing something fundamental.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130506/37bfb706/attachment.html>

dag at cray.com

2013-May-08 18:07 UTC

head link

[LLVMdev] Predicated Vector Operations

Jeff Bush <jeffbush001 at gmail.com> writes:
> I'm trying to understand how predicated/masked instructions can be
> generated in llvm, specifically an instruction where a set bit in the
> mask will write the new result into the corresponding vector lane in
> the destination and a clear bit will cause the lane in the destination
> to remain what it was before the instruction executed.
>
> I've seen a few places that suggest 'select' is the proper way
to
> implement predication. I believe the predicated form cannot be
> explicitly expressed in LLVM asm, because it is SSA. It can be done
> implicitly:
>
> %sum = add <16 x i32> %x, %y
> %newvalue = select <16 x i1> %mask, <16 x i32> %sum, <16 x
i32>
> %oldvalue
This is not necessarily sufficient in general.  For any operation that
can trap (pretty much any fp operation) you also need to protect the
inputs with safe values to adhere to the semantics of the LLVM IR.  For
an fadd, for example:

%tx = select %mask, %x, <0.0, 0.0, 0.0 ...>
%ty = select %mask, %y, <0.0, 0.0, 0.0 ...>
%sum = fadd %tx, %ty
%newvalue = select %mask, %sum, %oldvalue

Then this entire pattern of selects and fadd will have to be matched in
isel, which would throw away the safe value protection overhead (because
the hardware masking takes care of the safety).

In fact the problem is more general.  Even in the integer add case, you
have to ensure that llvm won't move the definition of %sum before the
definition of %mask.  Otherwise you're going to end up generating a
predicated add with a bad mask value.  Selects on the inputs to the
integer add also solve this problem.
> The issue becomes how to match the instruction form above in a
> TableGen pattern. In order for this to emit a masked instruction,
> %newvalue and %oldvalue must be assigned to same physical register (I'm
> assuming an instruction like 'add %r0{%m0} %r1 %r2') However, I
don't
> think there is even a notion of physical registers at the point that
> instruction selection is performed and the virtual registers will be
> different because everything is still in SSA form.
Potentially you could use the "$src = $dst" constraint as in the
two-address x86 forms.  I don't know that TableGen has been generalized
enough to do this, though.  I think it's pretty highly specialized to
specific x86 two-address forms at the moment.

Another problem is how to express %oldvalue in LLVM IR.  Presumably
%newvalue will be consumed, possibly by another arithmetic operation.
Presumably %oldvalue can similarly come from a previous arithmetic
operation feeding into the add.  If that's true, then %oldvalue is
either %x or %y.  Otherwise it is some other thing highly
context-dependent.

The gpuocelot project ran into the problem and they talk about it here:

http://code.google.com/p/gpuocelot/source/browse/wiki/LLVM.wiki?r=272

The bottom line is that it is probably easier to set this up before LLVM
IR goes into SSA form.

There is a lot of interest in predication and a lot of recent
discussions about handling it in LLVM.  Personally I think that
long-term we will need some IR changes.  It might be as simple as adding
an IR-level predicated load and predicated store, I'm not sure.

We're pretty heavily into trying to get predicated vector code working
well in LLVM here.  Expect to see many questions, code examples and
hopefully upstream code submissions over the next couple of months.

                          -David

Nadav Rotem

2013-May-08 18:32 UTC

head link

[LLVMdev] Predicated Vector Operations

On May 8, 2013, at 11:07 AM, dag at cray.com wrote:
>   It might be as simple as adding
> an IR-level predicated load and predicated store, I'm not sure.
I think that selects on the inputs+outputs of instructions is a good
abstraction, and I don't think that we need to add a mask operand to every
LLVM IR instruction. However, we do need support for masked load/stores, and I
think that we should implement them as target independent intrinsics. I
don't want to change the generic LLVM IR Load/Store instructions because I
don't want to modify all of the existing optimizations and I also don't
want other users of the compiler to pay for this complexity.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130508/ad04dea8/attachment.html>

Jeff Bush

2013-May-09 02:38 UTC

head link

[LLVMdev] Predicated Vector Operations

On Wed, May 8, 2013 at 11:07 AM,  <dag at cray.com>
wrote:>> The issue becomes how to match the instruction form above in a
>> TableGen pattern. In order for this to emit a masked instruction,
>> %newvalue and %oldvalue must be assigned to same physical register
(I'm
>> assuming an instruction like 'add %r0{%m0} %r1 %r2') However, I
don't
>> think there is even a notion of physical registers at the point that
>> instruction selection is performed and the virtual registers will be
>> different because everything is still in SSA form.
>
> Potentially you could use the "$src = $dst" constraint as in the
> two-address x86 forms.  I don't know that TableGen has been generalized
> enough to do this, though.  I think it's pretty highly specialized to
> specific x86 two-address forms at the moment.
I'm not familiar with the two-address constraint, but it seems like
this would be challenging to express generally.  Consider the previous
example:

%tx = select %mask, %x, <0.0, 0.0, 0.0 ...>
%ty = select %mask, %y, <0.0, 0.0, 0.0 ...>
%sum = fadd %tx, %ty
%newvalue = select %mask, %sum, %oldvalue

I believe the generated instructions depend on whether %oldvalue is
still live after the last instruction. If it is, you need to generate
two instructions: a copy into a new physical register then predicated
write to it.  If it is not used, then it is just a predicated write to
the same register.

  move r1, r0
  fadd r1{m0}, r2, r3

(r0 is now %oldvalue and r1 is %newvalue)

vs.

  fadd r0{m0}, r2, r3

(r0 was %oldvalue and is now %newvalue)
> The bottom line is that it is probably easier to set this up before LLVM
> IR goes into SSA form.
That makes sense, but it's unclear to me how you would preserve that
information after going into SSA form.
> There is a lot of interest in predication and a lot of recent
> discussions about handling it in LLVM.  Personally I think that
> long-term we will need some IR changes.  It might be as simple as adding
> an IR-level predicated load and predicated store, I'm not sure.
It seems to me that these are not really LLVM issues as much as the
fact that SSA doesn't cleanly map to predicated instructions. For
example, if predicates were hypothetically added universally to the IR
(which I don't think anyone wants to do), it's not clear to me how
that would even work.  How would you specify what value the result
would be received for non-enabled lanes?  Perhaps another parameter:

  %newvalue = fadd %x, %y, %mask, %previousvalue

Yuck.

Jeff Bush

2013-May-09 20:05 UTC

head link

[LLVMdev] Predicated Vector Operations

On Thu, May 9, 2013 at 8:10 AM,  <dag at cray.com>
wrote:> Jeff Bush <jeffbush001 at gmail.com> writes:
>
>> %tx = select %mask, %x, <0.0, 0.0, 0.0 ...>
>> %ty = select %mask, %y, <0.0, 0.0, 0.0 ...>
>> %sum = fadd %tx, %ty
>> %newvalue = select %mask, %sum, %oldvalue
>>
>> I believe the generated instructions depend on whether %oldvalue is
>> still live after the last instruction. If it is, you need to generate
>> two instructions: a copy into a new physical register then predicated
>> write to it.  If it is not used, then it is just a predicated write to
>> the same register.
>>
>>   move r1, r0
>>   fadd r1{m0}, r2, r3
>>
>> (r0 is now %oldvalue and r1 is %newvalue)
>>
>> vs.
>>
>>   fadd r0{m0}, r2, r3
>>
>> (r0 was %oldvalue and is now %newvalue)
>
> I'm assuming some parts of %oldvalue are still used.  The masked fadd
> could preserve them for false values of the mask, depending on how
> masking was defined.  Therefore, there's no need for a register copy.
> If the masked operation does not preserve the old values in r0, then we
> do need a register copy.
>
> Preserving old values does complicate things for SSA, as you note.
>
>>> The bottom line is that it is probably easier to set this up before
LLVM
>>> IR goes into SSA form.
>>
>> That makes sense, but it's unclear to me how you would preserve
that
>> information after going into SSA form.
>
> I should think the semantics of select would handle that.  After a
> select all vector elements of the result are defined.  There is no
> preservation of old values.  There cannot be, by definition of SSA.
>
>> It seems to me that these are not really LLVM issues as much as the
>> fact that SSA doesn't cleanly map to predicated instructions.
>
> It entirely depends on how the predication is defined to work.
Good point.  I was thinking of it narrowly as preserving the old value
in the register.  I guess I'd amend my previous statement to say that
it actually does map just fine to SSA, but instruction selection
becomes more complex.

It sounds like the current LLVM instruction selection algorithm can't
really handle the use case I described cleanly (generating predicated
arithmetic instructions that preserve the old register value).  Is
that a fair statement?
>> For example, if predicates were hypothetically added universally to
>> the IR (which I don't think anyone wants to do), it's not clear
to me
>> how that would even work.  How would you specify what value the result
>> would be received for non-enabled lanes?  Perhaps another parameter:
>>
>>   %newvalue = fadd %x, %y, %mask, %previousvalue
>
> It depends on how you define the mask operation.
>
> On the Cray X1, result elements mapped to false mask values are
> undefined after the operation.
I assume the only point of the mask is to avoid traps/exceptions in
this case (otherwise it doesn't really do anything, right?).
> This is convenient for the hardware from
> a register renaming perspective.  For the same reason, tt also maps well
> to SSA because there is no reuse of old values.  It is really no
> different than a shufflevector with some undef inputs from an SSA
> viewpoint.
>
> On Intel's Knights Corner, result elements mapped to false mask
elements
> are preserved.  This is tricky for SSA but I think we can model it with
> select instructions and proper selection of input values.  The input
> related to %oldvalue will be very context-sensitive and I'm not sure
> there's a way to express it generally for a single instruction outside
a
> code sequence.
>
> My assumption has always been that general predication in the IR would
> have similar semantics to select except that traps would be disabled for
> false mask elements.  However I haven't fully thought through the
> implications of that.
>
> But I don't think we need general predication throughout the IR, so I
> don't think we need to worry about it.  What we need is predication at
> the expression tree leaves, which is why we're talking about loads and
> stores.  For loads, there is no %oldvalue so we don't have to worry
> about it.
>
> %ra = ...
> if (%mask) {
>   %rb = [mem1]
> }
> else {
>   %rb = [mem2]
> }
> %rc = %ra + %rb
>
> Would be translated to:
>
> %ra = ...
>
> %rt = load [mem1], %mask
> %rf = load [mem2], ~%mask
>
> %rb = select %mask, %rt, %rf
> %rc = fadd %ra, %rb
>
> Since the input select defined all vector elements we don't need
selects
> on the fadd.  It's an unmasked operation.
>
> Slightly more complicated:
>
> %ra = ...
> if (%mask) {
>   %rb = [mem1]
>   %rc = %ra + %rb
> }
> else {
>   %rb = [mem2]
>   %rc = %ra - %rb
> }
>
> Would be translated to:
>
> %ra = ...
>
> %rt = load [mem1], %mask
> %ru = select %mask, %rt, 0.0 ; Fill with safe values
> %rv = fadd %ra, %ru
>
> %rf = load [mem2], ~%mask
> %rg = select ~%mask, %rf, 0.0 ; Fill with safe values
> %rh = fadd %ra, %rg
>
> %rc = select %mask, %rv, %rh
>
> The reason we need predication on loads and stores is so they don't
trap
> for false values of the mask.  Currently there is no way to express this
> in the IR.
>
>                                    -David

Arnold Schwaighofer

2013-May-09 21:15 UTC

head link

[LLVMdev] Predicated Vector Operations

On May 9, 2013, at 3:05 PM, Jeff Bush <jeffbush001 at gmail.com> wrote:
> On Thu, May 9, 2013 at 8:10 AM,  <dag at cray.com> wrote:
>> Jeff Bush <jeffbush001 at gmail.com> writes:
>> 
>>> %tx = select %mask, %x, <0.0, 0.0, 0.0 ...>
>>> %ty = select %mask, %y, <0.0, 0.0, 0.0 ...>
>>> %sum = fadd %tx, %ty
>>> %newvalue = select %mask, %sum, %oldvalue
>>> 
>>> I believe the generated instructions depend on whether %oldvalue is
>>> still live after the last instruction. If it is, you need to
generate
>>> two instructions: a copy into a new physical register then
predicated
>>> write to it.  If it is not used, then it is just a predicated write
to
>>> the same register.
>>> 
>>>  move r1, r0
>>>  fadd r1{m0}, r2, r3
>>> 
>>> (r0 is now %oldvalue and r1 is %newvalue)
>>> 
>>> vs.
>>> 
>>>  fadd r0{m0}, r2, r3
>>> 
>>> (r0 was %oldvalue and is now %newvalue)
>> 
>> I'm assuming some parts of %oldvalue are still used.  The masked
fadd
>> could preserve them for false values of the mask, depending on how
>> masking was defined.  Therefore, there's no need for a register
copy.
>> If the masked operation does not preserve the old values in r0, then we
>> do need a register copy.
>> 
>> Preserving old values does complicate things for SSA, as you note.
>> 
>>>> The bottom line is that it is probably easier to set this up
before LLVM
>>>> IR goes into SSA form.
>>> 
>>> That makes sense, but it's unclear to me how you would preserve
that
>>> information after going into SSA form.
>> 
>> I should think the semantics of select would handle that.  After a
>> select all vector elements of the result are defined.  There is no
>> preservation of old values.  There cannot be, by definition of SSA.
>> 
>>> It seems to me that these are not really LLVM issues as much as the
>>> fact that SSA doesn't cleanly map to predicated instructions.
>> 
>> It entirely depends on how the predication is defined to work.
> 
> Good point.  I was thinking of it narrowly as preserving the old value
> in the register.  I guess I'd amend my previous statement to say that
> it actually does map just fine to SSA, but instruction selection
> becomes more complex.
> 
> It sounds like the current LLVM instruction selection algorithm can't
> really handle the use case I described cleanly (generating predicated
> arithmetic instructions that preserve the old register value).  Is
> that a fair statement?

I don’t think this is a fair statement. Tied register operands should handle
this use case just fine. This problem is similar to that of two-address
constraints. Two address instructions work as follows. When we match an
instruction we “tie” input and output registers.

Say you had an LLVM-IR add:

x = add i32 y, z

for x86 we generate the following machine ir instruction during ISel:

vr0<def, tied1> = ADD32rr vr1<use, tied0>, vr2<use>

Once we go out of SSA during CodeGen we have to replace the two address
constraint by copies:

vr0 = vr1
vr0 = ADD32rr vr0, vr2

Coalescing and allocation will then take care of removing unnecessary copies. I
think that predicate instructions would be handled similar (for the sake of
making the example shorted I replaced your sequence of IR instruction by one
“virtual” IR instruction):

x = predicated_add %mask, %x, %y, %oldvalue

This (actually, your sequence of selects, and add) would be matched during ISel
to:

vr0<def, tied1> = PRED_ADD mask_vr, vr1<use>, vr2<use>,
vr3<use, tied0>
>From here on the machinery is the same, the two-address pass would translate
such instructions to:
vr0 = vr3
vr0 = PRED_ADD mask_vr, vr1, vr2, vr0

If vr3 is not live after PRED_ADD coalescing will coalesce vr0 and vr3. I don’t
think there is a fundamental difficulty handling predictated instructions this
way - at least wrt. to register constraints.

dag at cray.com

2013-May-10 16:50 UTC

head link

[LLVMdev] Predicated Vector Operations

Jeff Bush <jeffbush001 at gmail.com> writes:
>>> It seems to me that these are not really LLVM issues as much as the
>>> fact that SSA doesn't cleanly map to predicated instructions.
>>
>> It entirely depends on how the predication is defined to work.
>
> Good point.  I was thinking of it narrowly as preserving the old value
> in the register.  I guess I'd amend my previous statement to say that
> it actually does map just fine to SSA, but instruction selection
> becomes more complex.
Yes, if you want to take advantage of the hardware preserving old
values.  There's nothing that requires you to do that, however.  You can
always write to a new hardware register and effectively get the
"undefined for false mask" behavior.

Maybe a later pass could even clean things up and opportunistically use
the old value preserving behavior by rewriting register names and
combining instructions.  Just thinking off the top of my head.
> It sounds like the current LLVM instruction selection algorithm can't
> really handle the use case I described cleanly (generating predicated
> arithmetic instructions that preserve the old register value).  Is
> that a fair statement?
I actually don't know yet.  I'll let you know in a few weeks once
I've
tested some things.  :)
>>> For example, if predicates were hypothetically added universally to
>>> the IR (which I don't think anyone wants to do), it's not
clear to me
>>> how that would even work.  How would you specify what value the
result
>>> would be received for non-enabled lanes?  Perhaps another
parameter:
>>>
>>>   %newvalue = fadd %x, %y, %mask, %previousvalue
>>
>> It depends on how you define the mask operation.
>>
>> On the Cray X1, result elements mapped to false mask values are
>> undefined after the operation.
>
> I assume the only point of the mask is to avoid traps/exceptions in
> this case (otherwise it doesn't really do anything, right?).
On the X1 it also potentially increases performance as masked elements
can be skipped and the operation can finish early.

On a machine like Knights Corner, the mask is the only way to have a
dynamic vector length.  Traditional vector machines have a vector length
register that controls how many elements an operation should affect.
This is used to avoid the need for a remainder loop after the main
vectorized loop.  A mask can be used for the same purpose.  So a vector
mask is used for more than just traditional if-conversion.

                           -David

Seemingly Similar Threads

Search for more seemingly similar threads

llvm dev - May 2013 - [LLVMdev] Predicated Vector Operations

[LLVMdev] Predicated Vector Operations

[LLVMdev] Predicated Vector Operations

[LLVMdev] Predicated Vector Operations

[LLVMdev] Predicated Vector Operations

[LLVMdev] Predicated Vector Operations

[LLVMdev] Predicated Vector Operations

[LLVMdev] Predicated Vector Operations

Seemingly Similar Threads