Dan Gohman
2008-Aug-05 17:39 UTC
[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR
On Tue, August 5, 2008 8:32 am, David Greene wrote:
> On Monday 04 August 2008 17:56, Dan Gohman wrote:
>> The applymask approach leverages use-def information rather than
>> what can be thought of as duplicating a subset of it, making the IR
>
> I don't understand what you mean by "duplicating" here.

If you look just at the case where every instruction in a given use-def
sub-dag uses the same mask, adding that mask as an operand to all of them
is largely just duplicating the information about them all being
connected. This is the common case that applymask is aimed at.

In the case where multiple masks are used, applymask can still cope, but
the neat thing is that in this case it serves to mark the dataflow edges
where masks change.

> You need some kind of use-def information for the masks themselves
> because at some point they need to be register-allocated.

What I'm talking about here is just in LLVM IR. I agree that we want mask
registers as operands during register allocation, and probably also
instruction selection.

>> less cluttered. And, it makes it trivially straightforward to write
>> passes that work correctly on both masked and unmasked code.
>
> I had a thought on this, actually. Let's say the mask is the very last
> operand on masked instructions. Most passes don't care about the mask
> at all. They can just ignore it. Since they don't look at the extra
> operand right now, there shouldn't be many changes necessary (some
> asserts may need to be fixed, etc.).
>
> Think about instcombine. It's matching patterns. If the matcher doesn't
> look at masks, that may be ok most of the time (mod corner cases which
> I fully appreciate can be a real pain to track down). If we want fancy
> instcombine tricks that understand masks, we can add those later.

If masks are operands, instcombine will need to check if all the relevant
masks match before many of the transformations it does, and it'll need to
take care to put the mask operand in the instructions it creates. With
applymask, I believe instcombine wouldn't require any modifications,
except things like "case ApplyMaskInst: break;" in a few places.
Applymask makes masks in the IR so easy to reason about, most passes
won't need to do any special reasoning.

>>> Not all dependencies are readily expressed in the instructions. How
>>> would one express TableGen patterns for such things?
>>
>> The syntax above is an idea for LLVM IR. SelectionDAG doesn't
>> necessarily have to use the same approach.
>
> What do you mean by "ideal for LLVM IR?" This looks very much _not_
> ideal to me from a debugging standpoint. It's difficult to understand.
> It took me reading through the proposal a few times to grok what you
> are talking about.

I said "idea", not "ideal" :-). But I just meant that LLVM IR and
SelectionDAG don't have to do the same thing.

>> I think we all recognize the need, and in the absence of better
>> alternatives are willing to accept the mask operand approach. It would
>> have a significant impact on everyone, even those that don't use masks.
>
> How do you define "significant impact?" Compile time? Development
> effort? Transition pain? All of the above? More?

With mask operands, many passes will need to explicitly check for masks
even if they don't care and just want to be conservatively correct.
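To make the contrast concrete, here's a rough pseudo-IR sketch of the
common single-mask case (the exact applymask spelling is illustrative
only, not settled syntax). With a mask operand, every instruction in the
sub-dag repeats the same mask:

  %m  = ...
  %t1 = mul %a, %b, %m
  %t2 = add %t1, %c, %m
  %t3 = sub %t2, %d, %m

With applymask, the mask is applied once to the incoming values, and the
use-def edges carry it through the sub-dag:

  %m  = ...
  %a1 = applymask %a, %m
  %b1 = applymask %b, %m
  %c1 = applymask %c, %m
  %d1 = applymask %d, %m
  %t1 = mul %a1, %b1
  %t2 = add %t1, %c1
  %t3 = sub %t2, %d1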
With applymask, passes will often be able to operate on masked IR just as
aggressively as non-masked IR.

>> I don't want to stand in the way of progress, but this alternative
>> approach seems promising enough to be worth consideration.
>
> Alternatives are always welcome and worth considering. I'm looking at the
> kind of things the LLVM community is going to want to support and I'm
> pretty sure masks are going to be a very big part of architectures in the
> future. We're done with clock speed improvements, so we need to rely on
> architecture more. Vectorization is a well-known technique to improve
> single thread performance and masks are critical to producing efficient
> vector code.
>
> If y'all agree with this premise, it seems to me that we want to support
> such architectures in as straightforward a way as possible so as to
> minimize future pain when we're all writing complex and beautiful vector
> hacks. :)

I think we basically agree here :-). For me, that applymask simplifies
the reasoning that optimizers must do for masked instructions is a large
part of what motivates it for consideration.

> What can we learn from the IA64 and ARM backends? How do they handle
> their masks (scalar predication)? Is all the if-conversion done in
> target-specific passes?

It's in lib/CodeGen/IfConversion.cpp, but it wouldn't be usable for
vectors. If-conversion for vectors must be done as part of the
vectorization (whether that's the user/front-end or the optimizer).

Dan
David Greene
2008-Aug-05 18:27 UTC
[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR
On Tuesday 05 August 2008 12:39, Dan Gohman wrote:
> On Tue, August 5, 2008 8:32 am, David Greene wrote:
>> On Monday 04 August 2008 17:56, Dan Gohman wrote:
>>> The applymask approach leverages use-def information rather than
>>> what can be thought of as duplicating a subset of it, making the IR
>>
>> I don't understand what you mean by "duplicating" here.
>
> If you look just at the case where every instruction in a
> given use-def sub-dag uses the same mask, adding that mask as
> an operand to all of them is largely just duplicating the
> information about them all being connected. This is the common
> case that applymask is aimed at.

But why couldn't one argue the same thing for other operands and make
LLVM IR a stack machine? It just seems strange to me to treat a
fundamental part of the IR differently simply because it came along later.

> In the case where multiple masks are used, applymask can still
> cope, but the neat thing is that in this case it serves to
> mark the dataflow edges where masks change.

What advantage does that give over explicit masks in SSA form? PHI nodes
do the same thing.

>> You need some kind of use-def information for the masks themselves
>> because at some point they need to be register-allocated.
>
> What I'm talking about here is just in LLVM IR. I agree that we want
> mask registers as operands during register allocation, and probably
> also instruction selection.

So how does isel go from an implicit-masked LLVM IR to an explicit-masked
MachineInstruction? And then wouldn't all Machine passes need to
understand mask operands?

>> Think about instcombine. It's matching patterns. If the matcher doesn't
>> look at masks, that may be ok most of the time (mod corner cases which
>> I fully appreciate can be a real pain to track down). If we want fancy
>> instcombine tricks that understand masks, we can add those later.
>
> If masks are operands, instcombine will need to check if all the
> relevant masks match before many of the transformations it does,

I'm not sure about "many." Many of instcombine's tricks are based on
expression reorderings which are valid with or without masks. We might
need to do some mask manipulation (ANDing or ORing them, for example)
but we'd have to do that anyway with applymask if the incoming masks to
expressions being combined don't match. We'd have to introduce a new
applymask that takes the result of the logical mask operation. So, it
seems to me that instcombine needs to check and manipulate masks either
way. We'd have to be careful about not introducing traps, but again
instcombine will have to worry about that with applymask as well in the
general case.

It strikes me that you're indicating that with applymask, instcombine can
just ignore the implicit mask operand on every instruction (because it
isn't explicit and instcombine doesn't look at it right now). I'm in fact
saying the exact same thing, except masks would be explicit and
instcombine would just not look at those operands. Neither solution
eliminates the need for instcombine to be careful and consult masks from
time to time.

Perhaps I'm totally missing something. Concrete examples would be helpful.

> and it'll need to take care to put the mask operand in the
> instructions it creates.

Or create a new applymask if it needs to generate new masks as the result
of combining.

> With applymask, I believe instcombine wouldn't require any
> modifications, except things like "case ApplyMaskInst: break;" in
> a few places.
> Applymask makes masks in the IR so easy to reason about, most passes
> won't need to do any special reasoning.

I don't think this is quite right. See the comment above.

>>> The syntax above is an idea for LLVM IR. SelectionDAG doesn't
>>> necessarily have to use the same approach.
>>
>> What do you mean by "ideal for LLVM IR?" This looks very much _not_
>> ideal to me from a debugging standpoint. It's difficult to understand.
>> It took me reading through the proposal a few times to grok what you
>> are talking about.
>
> I said "idea", not "ideal" :-). But I just meant that LLVM IR
> and SelectionDAG don't have to do the same thing.

Oops, sorry about the misread. :) I don't think SelectionDAG is really
the issue. The issue is how to reason about vector code and debug
compilers. In my experience, explicit is usually the better way to go
for maintainability.

>> How do you define "significant impact?" Compile time? Development
>> effort? Transition pain? All of the above? More?
>
> With mask operands, many passes will need to explicitly check for
> masks even if they don't care and just want to be conservatively
> correct.

If passes need to worry about masks for correctness, that doesn't change
with the representation of masks. That's a program dataflow issue and
applies no matter how it's represented.

> With applymask, passes will often be able to operate on masked IR
> just as aggressively as non-masked IR.

And they should be able to do so with any other representation as well.
The compiler needs to maintain semantics, not syntax.

>> If y'all agree with this premise, it seems to me that we want to
>> support such architectures in as straightforward a way as possible so
>> as to minimize future pain when we're all writing complex and
>> beautiful vector hacks. :)
>
> I think we basically agree here :-). For me, that applymask
> simplifies the reasoning that optimizers must do for masked
> instructions is a large part of what motivates it for
> consideration.

I'm not convinced that there is a simplification of reasoning but I'm
open to changing my mind. Again, concrete examples would be really
helpful. I do agree that in some cases code rewrites might be avoided
because existing Instructions will retain the same number of operands
they have now. But that's not a "reasoning" issue. That's a mechanics
issue (how you implement the algorithm). Avoiding the one-time pain of
converting this kind of code doesn't seem worth the cost to
maintainability to me.

The key with all of this is going to be introducing it incrementally.
The first step is to support vector of i1 all the way through isel. Next
is vector select. Then we can start selectively adding masks to
Instructions. As we do this, if a vector operation can trap and a mask
isn't available for that Instruction yet, then we just can't vectorize
that code yet. There's no loss from what we have right now, which is
total inability to vectorize conditional code in a reasonable way, save
for a few tricks for corner cases. If instcombine isn't ready to handle
masks on some Instructions, then we can't vectorize in those cases. We
don't *need* to convert instcombine or any other pass all at once. We
can do it gradually and introduce more vector capability as we go.
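To sketch the kind of IR I have in mind (pseudo-IR only: the vector
compare, the vector-of-i1 mask type, and the trailing mask operand are
exactly the pieces proposed above, none of it is current syntax), a mask
is just another SSA value:

  entry:
    %m1 = fcmp olt %a, %b        ; vector compare yields a vector of i1
    br i1 %p, label %then, label %join

  then:
    %m2 = and %m1, %mx           ; masks combine with ordinary logic ops
    br label %join

  join:
    %m3 = phi [ %m1, %entry ], [ %m2, %then ]
    %r1 = add %x, %y, %m3        ; mask as the last operand

Where the mask changes, the and/phi already records it in ordinary
use-def form, which is the same information applymask would mark.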
>> What can we learn from the IA64 and ARM backends? How do they handle
>> their masks (scalar predication)? Is all the if-conversion done in
>> target-specific passes?
>
> It's in lib/CodeGen/IfConversion.cpp, but it wouldn't be usable for
> vectors. If-conversion for vectors must be done as part of the
> vectorization (whether that's the user/front-end or the optimizer).

I meant using them as examples of representation. I know that scalar
if-conversion is different than vector if-conversion. I'm curious about
how IfConversion.cpp represents the predicates it creates. Since it's in
CodeGen I guess the MachineInstruction representation just includes them
as extra operands.

-Dave
David Greene
2008-Aug-07 19:13 UTC
[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR
On Tuesday 05 August 2008 13:27, David Greene wrote:
> Neither solution eliminates the need for instcombine to be careful and
> consult masks from time to time.
>
> Perhaps I'm totally missing something. Concrete examples would be helpful.

Ok, so I took my own advice and thought about CSE and instcombine a bit.
I wrote the code by hand in a sort of pseudo-llvm language, so don't
crucify me for mistakes. :)

CSE is an optimization that needs to pay attention to masks. You can't
(easily) CSE an expression with different masks. Using a mask-per-operation
setup, code might look like this:

CSE Mask Example

Mask-per-operation:

  %m1 = ...
  %m2 = ...
  %r1 = add %r2, %r3, %m1
  %r4 = add %r2, %r3, %m2

We can only CSE the adds if %m1 == %m2. I'd think it would be sufficient
to just check whether the mask registers are the same, though you could
imagine walking back through the use-def chains to do more complex
analysis. Essentially, the mask operand becomes part of the expression
you're trying to match. CSE definitely needs to be changed to understand
masks.

With applymask, things look something like this:

applymask:

  %m1 = ...
  %m2 = ...
  %r5 = applymask %r2, %m1
  %r6 = applymask %r3, %m1
  %r7 = applymask %r2, %m2
  %r8 = applymask %r3, %m2
  %r1 = add %r5, %r6
  %r4 = add %r7, %r8

This is interesting because CSE will not do the optimization since the
operands of the adds are different. CSE does not have to change to
understand masks. Value numbering will have to understand applymask to
make sure it doesn't do something bad because it doesn't care about the
names of operands, but the renaming of registers required by applymask is
somewhat attractive.

Now let's look at an instcombine example:

Instcombine example: -A + -B --> -(A + B)

Mask-per-operation:

  %m1 = ...
  %m2 = ...
  %m3 = ...
  %r1 = neg %r2, %m1
  %r1 = select %r1, %r2, %m1    # Define all elements
  %r3 = neg %r4, %m2
  %r3 = select %r3, %r4, %m2    # Define all elements
  %r5 = add %r1, %r3, %m3

Since the masks on the negs are different we have to be careful in how we
do this. In the easiest case, we simply disallow combining here, which
requires instcombine to examine the masks and, as in the CSE example
above, potentially just look at the mask register names to determine
equality. One could imagine doing the combining this way:

  %m1 = ...
  %m2 = ...
  %m3 = ...

  # Do the combining where legal
  %m4 = and %m1, %m2
  %r6 = add %r2, %r4, %m4
  %r7 = neg %r6, %m4
  %r7 = select %r7, %r6, %m4    # Define all elements for final merge

  # Take care of the rest
  %m6 = xor %m1, %m4
  %m7 = xor %m2, %m4
  %m8 = xor %m3, %m4
  %r1 = neg %r2, %m6
  %r1 = select %r1, %r2, %m6    # Define all elements for add
  %r3 = neg %r4, %m7
  %r3 = select %r3, %r4, %m7    # Define all elements for add
  %r5 = add %r1, %r3, %m8
  %r5 = select %r5, %r1, %m7    # Define all elements for final merge

  # Do the final merge
  %r5 = select %r7, %r5, %m4

But this is obviously a loss. We may be able to get away with eliminating
some merges (I'm being conservative about undefined elements here) but it
would still be a loss.

applymask:

  %m1 = ...
  %m2 = ...
  %m3 = ...
  %r6 = applymask %r2, %m1
  %r7 = applymask %r4, %m2
  %r1 = neg %r6
  %r3 = neg %r7
  %r8 = applymask %r1, %m3
  %r9 = applymask %r3, %m3
  %r5 = add %r8, %r9

This is really no different than the mask-per-operation case. Instcombine
still has to consider masks when evaluating the legality and profitability
of combining.
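For contrast, when the masks do match, the combine is straightforward in
either form (again just pseudo-llvm sketched along the same lines as the
examples above):

Mask-per-operation, same mask everywhere:

  %r1 = neg %r2, %m
  %r3 = neg %r4, %m
  %r5 = add %r1, %r3, %m
    -->
  %r6 = add %r2, %r4, %m
  %r5 = neg %r6, %m

applymask, same mask everywhere:

  %a2 = applymask %r2, %m
  %a4 = applymask %r4, %m
  %r1 = neg %a2
  %r3 = neg %a4
  %r5 = add %r1, %r3
    -->
  %r6 = add %a2, %a4
  %r5 = neg %r6

Either way the check is the same: make sure the masks feeding the
expression agree before reassociating.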
In the applymask version with differing masks, instcombine would have to
walk back through the def-use lists on the operands of the neg and add to
find if they are masked and then determine the mask value (or mask
register name for simpler equality testing). Again, doing the combining
in this case would be detrimental. My guess is that most combining turns
out worse code when operations are masked and the masks are not equal.

Given these two examples, I see some benefit to applymask for
transformations like CSE and other pattern-matching algorithms, where the
renaming of registers that applymask requires converts previously
matching patterns into patterns that don't match. For something like
instcombine or value numbering that is based on value flow and/or pure
semantics, I don't see any benefit.

Of course, these are just two examples. One could repeat the exercise for
any transformation we're interested in. I hope this is helpful in moving
us along in the discussion. Feedback and comments are welcome!

-Dave