We're looking at how to handle masked vector operations in architectures
like Knights Corner.  In our case, we have to translate from a fully
vectorized IR that has mask support to LLVM IR, which does not have mask
support.

For non-trapping instructions this is fairly straightforward:

  ; Input
  t1 = add t2, t3, mask

  ; LLVM IR -- assuming we want zeros in the false positions, which is
  ; not always the case
  tt = add t2, t3
  t1 = select mask, tt, 0

It's easy enough to write a TableGen pattern to match add+select and
emit a masked instruction.  Alternatively, we can always resort to
manual lowering (ugh).

For trapping operations this is problematic.  Take a load.  Here's the
same attempt with a load:

  tt = load [addr]
  t1 = select mask, tt, 0

The problem is that the load is unconditional.  Originally it was
masked, presumably because the original scalar load was under a
condition, probably protecting against traps.  However, since LLVM has
no concept of a masked load, to stay within the IR we need to use an
unconditional load.  We can get the same result vector after the
select, but that's too late: the load has already executed, and the
LLVM passes will assume that the load cannot fault (otherwise it's
undefined behavior).  The LLVM IR does not convey the same semantics as
the fully predicated IR.

It seems the only solution is to create an intrinsic:

  llvm_int_load_masked mask, [addr]

But this unnecessarily shuts down optimization.

Similar problems exist with any trapping instruction (div, mod, etc.).
It gets even worse when you consider that any floating point operation
can trap on a signalling NaN input.

The gpuocelot project is essentially trying to do the same thing, but I
haven't dived deep enough into their notes and implementation to see
how they handle this issue.  Perhaps because current GPUs don't trap
it's a non-issue, but that will likely change in the future.

So are there any ideas out there for how to efficiently handle this?
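The lane semantics above can be sketched in plain Python (a simulation
only; the function names are made up for illustration and are not any
LLVM interface).  It shows why the add+select lowering is fine while
the load+select lowering is not: the unconditional load touches memory
in every lane, so a masked-off lane can still fault.

```python
def masked_add(t2, t3, mask):
    # Non-trapping case: compute all lanes unconditionally, then select.
    tt = [a + b for a, b in zip(t2, t3)]
    return [t if m else 0 for t, m in zip(tt, mask)]

def unmasked_gather(mem, idxs):
    # The problematic lowering: every lane loads, even masked-off ones,
    # so a bad address in a false lane still "faults" (raises here).
    return [mem[i] for i in idxs]

def masked_gather(mem, idxs, mask):
    # What a masked hardware load actually does: lanes with a false
    # mask bit never touch memory, so they can never fault.
    return [mem[i] if m else 0 for i, m in zip(idxs, mask)]
```

Running `masked_gather` with an invalid address in a false lane
succeeds, while `unmasked_gather` on the same inputs raises — the
select comes too late to undo that.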
We've talked about LLVM and masks before, and it's clear that there is
strong resistance to adding masks to the IR.  Perhaps an alternative is
to mark an instruction as "may trap" so that LLVM will not perform
certain transformations on it.  Of course, that involves teaching all
of the passes about a new "may trap" attribute, or whatever mechanism
is devised.

I would very much appreciate thoughts and ideas on this.  As it is, it
doesn't seem like it's possible to generate efficient LLVM IR for fully
predicated instruction sets.

                           -David
Hi David,

> It seems the only solution is to create an intrinsic:
>
>   llvm_int_load_masked mask, [addr]
>
> But this unnecessarily shuts down optimization.

I think that using intrinsics is the right solution.  I imagine that
most interesting load/store optimizations happen before vectorization,
so I am not sure how much we can gain by optimizing masked load/stores.

> Similar problems exist with any trapping instruction (div, mod, etc.).
> It gets even worse when you consider that any floating point operation
> can trap on a signalling NaN input.

For DIV/MOD you can blend the inputs BEFORE the operation.  You can
place ones or zeros depending on the operation.

> So are there any ideas out there for how to efficiently handle this?
> We've talked about LLVM and masks before and it's clear that there is
> strong resistance to adding masks to the IR.

Yes.  I think that the consensus is that we don't need to predicate the
IR itself to support MIC-like processors.

Thanks,
Nadav
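The blend-before-the-operation trick Nadav describes can be sketched in
Python (an illustration only, not LLVM syntax): masked-off lanes get a
harmless divisor of 1 before the unconditional divide, so no lane can
trap, and the final select throws the dummy results away.

```python
def masked_div(v1, v2, mask):
    # Blend: put a safe 1 into the divisor's false lanes first.
    safe_v2 = [b if m else 1 for b, m in zip(v2, mask)]
    # The divide is now unconditional but cannot divide by zero.
    tt = [a // b for a, b in zip(v1, safe_v2)]
    # Final select: zeros in the false lanes.
    return [t if m else 0 for t, m in zip(tt, mask)]
```

The cost is the extra blend on the input, which is the inefficiency
David objects to below.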
Nadav Rotem <nrotem at apple.com> writes:

>> It seems the only solution is to create an intrinsic:
>>
>>   llvm_int_load_masked mask, [addr]
>>
>> But this unnecessarily shuts down optimization.
>
> I think that using intrinsics is the right solution.  I imagine that
> most interesting load/store optimizations happen before vectorization,
> so I am not sure how much we can gain by optimizing masked
> load/stores.

Perhaps that is true.  If this is the only intrinsic we need (well, a
store too), maybe it's not too bad.

>> Similar problems exist with any trapping instruction (div, mod, etc.).
>> It gets even worse when you consider that any floating point
>> operation can trap on a signalling NaN input.
>
> For DIV/MOD you can blend the inputs BEFORE the operation.  You can
> place ones or zeros depending on the operation.

That's true, but it's inefficient.  I suppose we can write patterns to
match the input selects as well and just drop them, opting for the
masked operation.  But this all requires that these
select/select/op/select sequences stay intact throughout LLVM so isel
can match them.  I'm not totally confident that's possible.

>> So are there any ideas out there for how to efficiently handle this?
>> We've talked about LLVM and masks before and it's clear that there is
>> strong resistance to adding masks to the IR.
>
> Yes.  I think that the consensus is that we don't need to predicate
> the IR itself to support MIC-like processors.

Perhaps not, but I think we need a little more than we have right now.
I'll ponder this some more, but in the meantime, please continue to add
thoughts and ideas.

                           -David
Nadav Rotem <nrotem at apple.com> writes:

> For DIV/MOD you can blend the inputs BEFORE the operation.  You can
> place ones or zeros depending on the operation.

Quick follow-up on this.  What about using "undef" as the input for
false items:

  tv1 = select mask, v1, undef
  tv2 = select mask, v2, undef
  tv3 = div tv1, tv2
  v3  = select mask, tv3, undef

I'm always confused about the semantics of undef.  Is the above safe
code?  It would simplify things a bit not to have to track which input
values are safe based on the context of an operation.

                           -David
Hi David,

On 02/05/13 17:57, dag at cray.com wrote:
> We're looking at how to handle masked vector operations in
> architectures like Knights Corner.  In our case, we have to translate
> from a fully vectorized IR that has mask support to LLVM IR which does
> not have mask support.
>
> For non-trapping instructions this is fairly straightforward:
>
>   ; Input
>   t1 = add t2, t3, mask
>
>   ; LLVM IR -- assuming we want zeros in the false positions, which is
>   ; not always the case
>   tt = add t2, t3
>   t1 = select mask, tt, 0

there seems to be a plan to get rid of the select instruction and just
use branches and phi nodes instead.  Amongst other things this requires
boosting the power of codegen so that branches+phi nodes can be turned
into cmov or whatever when appropriate.

> It's easy enough to write a TableGen pattern to match add+select and
> emit a masked instruction.  Alternatively, we can always resort to
> manual lowering (ugh).
>
> For trapping operations this is problematic.  Take a load.  Here's the
> same attempt with a load:
>
>   tt = load [addr]
>   t1 = select mask, tt, 0

This would not be problematic at the IR level if it was done by
branching to one of two basic blocks based on the condition, and doing
the load in the appropriate basic block.  Codegen would however need to
become powerful enough to turn this construct into your target's
predicated load.

Ciao,

Duncan.
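The branchy form Duncan describes can be rendered per lane in Python (a
sketch of the control flow only; the names are illustrative, and real
IR would of course use basic blocks and a phi rather than functions).
The key property is that the load only executes on the taken path:

```python
def lane_load(mem, addr, m):
    if m:
        # Block containing the (potentially trapping) load: only
        # entered when this lane's mask bit is true.
        tt = mem[addr]
    else:
        # Block that never touches memory; produces the default value.
        tt = 0
    # The "phi": merge of the two incoming values.
    return tt

def branchy_masked_load(mem, addrs, mask):
    return [lane_load(mem, a, m) for a, m in zip(addrs, mask)]
```

Codegen would then have to recognize this per-lane diamond and collapse
it back into a single predicated vector load, which is the part that
does not exist today.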
On May 3, 2013, at 1:10 AM, Duncan Sands <baldrick at free.fr> wrote:

> there seems to be a plan to get rid of the select instruction and just
> use branches and phi nodes instead.  Amongst other things this
> requires boosting the power of codegen so that branches+phi nodes can
> be turned into cmov or whatever when appropriate.

Hi Duncan,

Thanks for commenting on this.  I completely disagree with the proposal
to remove selects from the IR, and David's email is a perfect example
of how useful selects are.  I'd be happy to comment on the proposal
when it is presented to the community.

Thanks,
Nadav
Duncan Sands <baldrick at free.fr> writes:

> there seems to be a plan to get rid of the select instruction and just
> use branches and phi nodes instead.  Amongst other things this
> requires boosting the power of codegen so that branches+phi nodes can
> be turned into cmov or whatever when appropriate.

This is a very BAD idea.  Are you telling me that every predicated
instruction will need to be in its own basic block so it can be
represented with a phi?  Certainly this will be true for any
instructions that do not share masks.

PLEASE do not do this!  This would be a huge step backward in vector
support.

>> It's easy enough to write a TableGen pattern to match add+select and
>> emit a masked instruction.  Alternatively, we can always resort to
>> manual lowering (ugh).
>>
>> For trapping operations this is problematic.  Take a load.  Here's
>> the same attempt with a load:
>>
>>   tt = load [addr]
>>   t1 = select mask, tt, 0
>
> This would not be problematic at the IR level if it was done by
> branching to one of two basic blocks based on the condition, and doing
> the load in the appropriate basic block.  Codegen would however need
> to become powerful enough to turn this construct into your target's
> predicated load.

How will that ever happen?  isel has never known much about control
flow at all.

Please do NOT remove select until we have a solid replacement in place,
something that's tested and known to work.  I cannot object strongly
enough.  I've bitten my tongue at a few IR changes, but not this one.

Who proposed this change?  Why has it not been discussed on the list?

                           -David
On 2 May 2013 16:57, <dag at cray.com> wrote:

> We're looking at how to handle masked vector operations in
> architectures like Knights Corner.  In our case, we have to translate
> from a fully vectorized IR that has mask support to LLVM IR which does
> not have mask support.

Has anyone done a comparison between the "fully vectorized IR" and
"LLVM IR"?  If someone has already invented a "fully vectorized IR", it
might be beneficial not to re-invent it for LLVM.

For example, if you are optimizing a loop and splitting it into three
loops, one of which can then be fully vectorized, it would be useful to
represent that optimization/translation at the IR level.  Adding mask
support to LLVM IR would therefore seem a sensible course to me.  It
might be a short-term pain, but it would possibly benefit the long-term
optimization goals of LLVM.
James Courtier-Dutton <james.dutton at gmail.com> writes:

> On 2 May 2013 16:57, <dag at cray.com> wrote:
>> We're looking at how to handle masked vector operations in
>> architectures like Knights Corner.  In our case, we have to translate
>> from a fully vectorized IR that has mask support to LLVM IR which
>> does not have mask support.
>
> Has anyone done a comparison between the "fully vectorized IR" and
> "LLVM IR"?  If someone has already invented a "fully vectorized IR",
> it might be beneficial not to re-invent it for LLVM.
>
> For example, if you are optimizing a loop and splitting it into three
> loops, one of which can then be fully vectorized, it would be useful
> to represent that optimization/translation at the IR level.  Adding
> mask support to LLVM IR would therefore seem a sensible course to me.
> It might be a short-term pain, but it would possibly benefit the
> long-term optimization goals of LLVM.

The vectorized IR we are translating from has explicit masking at the
leaf nodes and implied masking at the inner nodes.  For example:

         ___MERGE___
        /           \
       +             -
      / \           / \
     /   \         /   \
  [a#m1] [b#m1] [a#m2] [b#m2]

So the add is assumed to operate under #m1 and the subtract is assumed
to operate under #m2.  Then there is an explicit merge operation to
form the final vector.  In this case #m2 = ~#m1.

I believe we can represent this in LLVM IR with selects as long as we
have predication at the leaves.  The trick is to have isel match all of
these selects and generate an efficient predicated operation.  I'm
working on that experiment to see if it will suffice.

So I don't know that a fully predicated IR would be any better than
selects + predication at the leaves.  That's why I'm doing the
experiment. :)

                           -David
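The merge tree above can be sketched per lane in Python (illustration
only; names are mine and none of this is LLVM or the source IR's
syntax).  Since #m2 = ~#m1, the MERGE collapses into a single select on
#m1: add lanes where it is true, subtract lanes where it is false.

```python
def merge_example(a, b, m1):
    # Implied masks on the inner nodes: the add operates under #m1,
    # the subtract under #m2 = ~#m1.
    add = [x + y for x, y in zip(a, b)]
    sub = [x - y for x, y in zip(a, b)]
    # The explicit MERGE, expressed as one select on m1.
    return [s if m else t for s, t, m in zip(add, sub, m1)]
```

This is the select-based representation David proposes; the open
question in the thread is whether isel can reliably match such select
trees back into predicated instructions.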