David Greene
2008-Aug-07 19:13 UTC
[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR
On Tuesday 05 August 2008 13:27, David Greene wrote:

> Neither solution eliminates the need for instcombine to be careful and
> consult masks from time to time.
>
> Perhaps I'm totally missing something. Concrete examples would be helpful.

Ok, so I took my own advice and thought about CSE and instcombine a bit. I wrote the code by hand in a sort of pseudo-llvm language, so don't crucify me for mistakes. :)

CSE is an optimization that needs to pay attention to masks. You can't (easily) CSE an expression with different masks.

Using a mask-per-operation setup, code might look like this:

CSE Mask Example

Mask-per-operation:

    %m1 = ...
    %m2 = ...
    %r1 = add %r2, %r3, %m1
    %r4 = add %r2, %r3, %m2

We can only CSE the adds if %m1 == %m2. I'd think it would be sufficient to just check whether the mask registers are the same, though you could imagine walking back through the use-def chains to do more complex analysis. Essentially, the mask operand becomes part of the expression you're trying to match. CSE definitely needs to be changed to understand masks.

With applymask, things look something like this:

applymask:

    %m1 = ...
    %m2 = ...
    %r5 = applymask %r2, %m1
    %r6 = applymask %r3, %m1
    %r7 = applymask %r2, %m2
    %r8 = applymask %r3, %m2
    %r1 = add %r5, %r6
    %r4 = add %r7, %r8

This is interesting because CSE will not do the optimization, since the operands of the adds are different. CSE does not have to change to understand masks. Value numbering will still have to understand applymask so that it doesn't do something bad, because it doesn't care about the names of operands; even so, the renaming of registers that applymask requires is somewhat attractive.

Now let's look at an instcombine example:

Instcombine example: -A + -B --> -(A + B)

Mask-per-operation:

    %m1 = ...
    %m2 = ...
    %m3 = ...
    %r1 = neg %r2, %m1
    %r1 = select %r1, %r2, %m1      # Define all elements
    %r3 = neg %r4, %m2
    %r3 = select %r3, %r4, %m2      # Define all elements
    %r5 = add %r1, %r3, %m3

Since the masks on the negs are different, we have to be careful in how we do this. In the easiest case, we simply disallow combining here, which requires instcombine to examine the masks and, as in the CSE example above, potentially just look at the mask register names to determine equality.

One could imagine doing the combining this way:

    %m1 = ...
    %m2 = ...
    %m3 = ...
    # Do the combining where legal
    %m4 = and %m1, %m2
    %r6 = add %r2, %r4, %m4
    %r7 = neg %r6, %m4
    %r7 = select %r7, %r6, %m4      # Define all elements for final merge
    # Take care of the rest
    %m6 = xor %m1, %m4
    %m7 = xor %m2, %m4
    %m8 = xor %m3, %m4
    %r1 = neg %r2, %m6
    %r1 = select %r1, %r2, %m6      # Define all elements for add
    %r3 = neg %r4, %m7
    %r3 = select %r3, %r4, %m7      # Define all elements for add
    %r5 = add %r1, %r3, %m8
    %r5 = select %r5, %r1, %m7      # Define all elements for final merge
    # Do the final merge
    %r5 = select %r7, %r5, %m4

But this is obviously a loss. We may be able to get away with eliminating some merges (I'm being conservative about undefined elements here) but it would still be a loss.

applymask:

    %m1 = ...
    %m2 = ...
    %m3 = ...
    %r6 = applymask %r2, %m1
    %r7 = applymask %r4, %m2
    %r1 = neg %r6
    %r3 = neg %r7
    %r8 = applymask %r1, %m3
    %r9 = applymask %r3, %m3
    %r5 = add %r8, %r9

This is really no different from the mask-per-operation case. Instcombine still has to consider masks when evaluating the legality and profitability of combining. Here instcombine would have to walk back through the def-use lists on the operands of the neg and add to find out whether they are masked, and then determine the mask value (or mask register name, for simpler equality testing).

Again, doing the combining in this case would be detrimental. My guess is that most combining turns out worse code when operations are masked and the masks are not equal.

Given these two examples, I see some benefit to applymask for transformations like CSE and other pattern-matching algorithms, where the renaming of registers that applymask requires converts previously matching patterns into patterns that don't match. For something like instcombine or value numbering that is based on value flow and/or pure semantics, I don't see any benefit.

Of course, these are just two examples. One could repeat the exercise for any transformation we're interested in.

I hope this is helpful in moving us along in the discussion. Feedback and comments are welcome!

                              -Dave
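[For contrast, a minimal sketch in the same pseudo-IR, not taken from the original mail, of the case where both adds sit under the same mask: because applymask leaves the operands textually identical, an unmodified CSE can remove the second add.]

    %m1 = ...
    %r5 = applymask %r2, %m1
    %r6 = applymask %r3, %m1
    %r1 = add %r5, %r6
    %r4 = add %r5, %r6      ; same operands as %r1, so plain CSE folds this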
Tim Foley
2008-Aug-08 01:13 UTC
[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR
This thread is already churning, so I apologize if by being late to the party I have missed important information.

The "apply_mask" approach is very familiar to me, in that I spent a lot of time thinking about how masking could be added pervasively to LLVM without disrupting the currently nice property that most of the standard arithmetic instructions are "overloaded" to work on all types.

The space of options that I saw available (and please don't judge these options just yet, I am well aware that some of them are abhorrent):

1 - Make masking explicit and change every potentially side-effecting operation to include an i1 mask in the scalar case and an i1 vector mask in the vector case.

2 - Add specialized masked versions of these operations (distinct from the unmasked versions), either as intrinsics or as new instructions.

3 - Add an implicit mask that all vector operations occur "under", along with operations to set/get (or push/pop) this mask.

4 - Like 3, but add some notion of a "vector branch" - that is, a conditional branch on a vector of i1 values instead of a single i1, which would implicitly do "the right thing" for masking.

5 - Add a new type for "partial" vectors that combines a vector and a same-sized mask. Operations overloaded for normal vectors would also work on partial vectors (and not operate on "missing" elements), and would produce partial results.

1 and 2 are the more-or-less straightforward approaches being considered, and involve either plumbing through new operations or new semantics for old operations. There is nontrivial (I expect) but straightforward (again, I expect) work involved in either. The rest of the options try to avoid some or all of that work and/or provide niceties for end users at the expense of being "clever".

Options 3 and 4 are almost certainly outside of the realm of what LLVM should consider. One of the best things about SSA is how explicit it makes dependencies, so adding any kind of implicit mask that is carried behind people's backs is really scary.

Option 5 is (to my understanding) equivalent to this whole "apply_mask" approach being discussed, although the latter tries to avoid introducing a new type. Even though "apply_mask" doesn't explicitly introduce a new type, there are all kinds of special rules about when and where a value that has been "apply_mask"'d can be used. Adding a type to represent partiality allows these rules to be made explicit as part of type-checking the IR (a rough sketch of this follows below).

Such partial vectors are still fairly non-orthogonal, so it is unclear whether the end-user experience is worth whatever complexity it adds to the IR.

Anyway, that's my breakdown of the situation. I welcome any feedback from those with a better understanding of the consequences of any of these approaches...

- Tim
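[Not part of Tim's mail: a rough pseudo-IR sketch of what option 5 could look like, using hypothetical makepartial/mergepartial instructions purely for illustration. A partial vector pairs a value with a same-sized mask at the type level, and the existing overloaded operations extend to it.]

    %m  = ...                     ; <4 x i1> mask
    %p1 = makepartial %v1, %m     ; hypothetical: <4 x float> + mask -> partial vector
    %p2 = makepartial %v2, %m
    %p3 = add %p1, %p2            ; add overloaded on partial vectors; "missing"
                                  ;   elements stay missing in the result
    %r  = mergepartial %p3, %v1   ; hypothetical: fill missing elements from %v1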
Mark Leone
2008-Aug-08 01:20 UTC
[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR
Good contribution, thanks. I've read the thread but don't have much insight. Dan's suggestion sounds plausible, but not especially elegant.

- Mark

On Aug 7, 2008, at 6:13 PM, "Tim Foley" <tim.foley.is at gmail.com> wrote:

> This thread is already churning, so I apologize if by being late to
> the party I have missed important information.
> [...]
Dan Gohman
2008-Aug-08 02:48 UTC
[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR
On Aug 7, 2008, at 12:13 PM, David Greene wrote:

> Ok, so I took my own advice and thought about CSE and instcombine a bit.
> I wrote the code by hand in a sort of pseudo-llvm language, so don't
> crucify me for mistakes. :)
> [...]
> This is interesting because CSE will not do the optimization, since the
> operands of the adds are different. CSE does not have to change to
> understand masks. Value numbering will still have to understand applymask
> so that it doesn't do something bad, because it doesn't care about the
> names of operands; even so, the renaming of registers that applymask
> requires is somewhat attractive.

Right. One way to think about it is that applymask can conservatively be treated as a plain binary operator, for register-oriented passes at least.

> This is really no different from the mask-per-operation case.
> Instcombine still has to consider masks when evaluating the legality and
> profitability of combining.
> [...]
> Given these two examples, I see some benefit to applymask for
> transformations like CSE and other pattern-matching algorithms, where the
> renaming of registers that applymask requires converts previously
> matching patterns into patterns that don't match. For something like
> instcombine or value numbering that is based on value flow and/or pure
> semantics, I don't see any benefit.

I agree about value numbering. I think instcombine is a little different, but only with the help of a pretty non-trivial assumption.

One of the cornerstone assumptions of the applymask approach is that code like what's in your examples is relatively uncommon, and that the common case is larger groups of instructions sharing mask values. If this assumption isn't true, many of the advantages of applymask no longer hold. I should encourage people to question this assumption first, because pretty much everything related to applymask depends on it :-).

But if it is true, then we can say that instcombine's job is made easier by applymask. Instcombine primarily does very localized analysis, so it can look at something like this:

    %r0 = neg %r888
    %r1 = neg %r999
    %r2 = add %r0, %r1

and, without knowing anything about %r888 or %r999 and whether or not there were applymasks involved in producing their values, it can safely do the transformation.

(A key rule in my initial email is that the operands of any operator must be under the same mask. The Verifier can enforce this constraint. One problem, though, is that this places an interesting burden on front-ends and vectorizers. I'll try to write more about this soon.)

There are things instcombine can do when the masks are different. I think you mentioned earlier it could do clever tricks ANDing masks and such. However, based on the assumption above, this wouldn't be used very often, so a conservative implementation of instcombine that doesn't do any of this is mostly good enough.

> Of course, these are just two examples. One could repeat the exercise
> for any transformation we're interested in.
>
> I hope this is helpful in moving us along in the discussion. Feedback
> and comments are welcome!

Thanks for the examples! I think they have been helpful. I'll try to add a few more here.

First, dead store elimination. It clearly needs to know if a store is predicated. Mask operands would make this pretty obvious. Applymask would make it less obvious, but still doable, and would probably want memoization to be efficient. This is a case where mask operands are favored.

Here's one more example, or at least a skeleton of several examples. A loop to be vectorized:

    for (...) {
      A
      if (...) {
        B
      } else {
        C
      }
      D
    }

Assuming there's a bunch of code in B and a bunch in C, then we have four bunches of code and three mask conditions - A and D are unmasked, B is masked, and C is masked with the inverse mask value. This code could easily have more if branches, nested loops, early exits, and so on, but the main idea is that there are blocks of instructions grouped together under the same mask, that cross-block optimization opportunities exist but are limited, and that this is assumed to be basically representative.

There are some interesting cross-block cases here. If you can prove that something in B and/or C can be run unmasked, it could be CSE'd/LICM'd/PRE'd/etc. If D has a PHI merging a value from B and C, it might be nice to do a vector merge. It's interesting to look at how these kinds of cases work under both mask operands and applymask.

Dan
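[For concreteness, a minimal sketch of how the four blocks above might come out under the applymask proposal, written in the same loose pseudo-IR used earlier in this thread. The register names, the compare, and the "not" are invented for illustration and are not part of any proposal here; the mask-per-operation alternative would instead carry %m or %mn as an extra operand on every instruction in B and C.]

    ; A: unmasked
    %a   = add %v1, %v2
    %m   = icmp slt %a, %v3      ; vector-of-i1 condition for the if
    %mn  = not %m                ; inverse mask for the else branch
    ; B: everything under mask %m
    %a.b = applymask %a, %m
    %b   = mul %a.b, %a.b
    ; C: everything under the inverse mask %mn
    %a.c = applymask %a, %mn
    %c   = sub %a.c, %v2
    ; D: unmasked again; merge B's and C's results (the PHI's vector merge)
    %d   = select %b, %c, %m     ; select-with-mask notation from earlier in the thread
    %out = add %d, %v4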
David Greene
2008-Aug-12 21:47 UTC
[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR
On Thursday 07 August 2008 21:48, Dan Gohman wrote:

> (A key rule in my initial email is that the operands of any
> operator must be under the same mask. The Verifier can enforce
> this constraint. One problem, though, is that this places an
> interesting burden on front-ends and vectorizers. I'll try to
> write more about this soon.)

I totally missed that restriction. I think this is error-prone. I certainly wouldn't have caught the problem in my code even if I had known about the restriction. I suppose the fix in this case is to set the mask to all 1's before the add. This could get rather complicated in interesting cases.

> There are things instcombine can do when the masks are different.
> I think you mentioned earlier it could do clever tricks ANDing
> masks and such. However, based on the assumption above, this
> wouldn't be used very often, so a conservative implementation of
> instcombine that doesn't do any of this is mostly good enough.

Since we don't support ANY vectorization of conditional code right now, any addition is a win. That's true for whatever implementation of masks we go with.

> Here's one more example, or at least a skeleton of several
> examples. A loop to be vectorized:
>
>   for (...) {
>     A
>     if (...) {
>       B
>     } else {
>       C
>     }
>     D
>   }
>
> Assuming there's a bunch of code in B and a bunch in C, then
> we have four bunches of code and three mask conditions -
> A and D are unmasked, B is masked, and C is masked with the
> inverse mask value. This code could easily have more if
> branches, nested loops, early exits, and so on, but the main
> idea is that there are blocks of instructions grouped together
> under the same mask, cross-block optimization opportunities
> exist but are limited, and that this is assumed to be
> basically representative.

I think that's right.

> There are some interesting cross-block cases here.
> If you can prove that something in B and/or C can be run
> unmasked, it could be CSE'd/LICM'd/PRE'd/etc.
> If D has a PHI merging a value from B and C, it might be
> nice to do a vector merge. It's interesting to look at
> how these kinds of cases work under both
> mask operands and applymask.

Yes. Unfortunately, I'm out of spare cycles at the moment.

I was thinking about this more yesterday and came up with what I think is a useful question to ask: if LLVM had had masks way back when it started, what would it look like? I realize that there are constraints on what kind of transition pain we want to go through, but this gets back to my earlier point that core IR infrastructure should be first-class no matter when it is added. And "what it would have looked like from the start" is a reasonable definition of "first-class."

                       -Dave
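[A minimal sketch of the fix David describes, in the thread's pseudo-IR and not part of the original mail; the all-ones mask value is assumed. Re-applying a full mask to both neg results puts the add's operands under the same mask, as Dan's rule requires.]

    %ones = ...                      ; all-ones mask: every element enabled
    %r1.f = applymask %r1, %ones     ; bring both neg results back under a
    %r3.f = applymask %r3, %ones     ;   common (full) mask
    %r5   = add %r1.f, %r3.f         ; operands now share the same mask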
Daniel Berlin
2008-Aug-12 21:55 UTC
[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR
On Thu, Aug 7, 2008 at 3:13 PM, David Greene <dag at cray.com> wrote:

> Ok, so I took my own advice and thought about CSE and instcombine a bit.
> I wrote the code by hand in a sort of pseudo-llvm language, so don't
> crucify me for mistakes. :)
>
> CSE is an optimization that needs to pay attention to masks. You can't
> (easily) CSE an expression with different masks.

If you are using a value-based CSE, you can easily CSE expressions with different masks. I'd agree that a lexical, expression-based CSE can't, but that's a pretty weak form of CSE.

> Using a mask-per-operation setup, code might look like this:
> [...]
> With applymask, things look something like this:
>
> applymask:
>   %m1 = ...
>   %m2 = ...
>   %r5 = applymask %r2, %m1
>   %r6 = applymask %r3, %m1
>   %r7 = applymask %r2, %m2
>   %r8 = applymask %r3, %m2
>   %r1 = add %r5, %r6
>   %r4 = add %r7, %r8
>
> This is interesting because CSE will not do the optimization since the
> operands of the adds are different.

LLVM's main CSE these days is really GVN-based. GVN should happily compute these to be the same value, and if it doesn't, you have much larger problems.

GCC's value numberer already combines shifts and masks during value numbering over multiple binary operations. It is not hard to make LLVM's do the same to the small degree it doesn't (its value numbering does not value-number phi nodes/cycles iteratively, so it misses identities between induction variables and things based on them in some cases; SCEV saves it in others). You don't need to do it to the level instcombine does to get good results.

This is all with my "person who implemented SCC-based value numbering and value-based PRE in GCC" hat on :)

--Dan
David Greene
2008-Aug-13 15:30 UTC
[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR
On Tuesday 12 August 2008 16:55, Daniel Berlin wrote:

> > With applymask, things look something like this:
> >
> > applymask:
> >   %m1 = ...
> >   %m2 = ...
> >   %r5 = applymask %r2, %m1
> >   %r6 = applymask %r3, %m1
> >   %r7 = applymask %r2, %m2
> >   %r8 = applymask %r3, %m2
> >   %r1 = add %r5, %r6
> >   %r4 = add %r7, %r8
> >
> > This is interesting because CSE will not do the optimization since the
> > operands of the adds are different.
>
> LLVM's main CSE these days is really GVN based.
> GVN should happily compute these to be the same value, and if it
> doesn't, you have much larger problems.

Eh? How can these compute the same thing if the masks are different?

                          -Dave