David Greene
2008-Aug-07 19:13 UTC
[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR
On Tuesday 05 August 2008 13:27, David Greene wrote:
> Neither solution eliminates the need for instcombine to be careful and
> consult masks from time to time.
>
> Perhaps I'm totally missing something. Concrete examples would be helpful.

Ok, so I took my own advice and thought about CSE and instcombine a bit.
I wrote the code by hand in a sort of pseudo-llvm language, so don't
crucify me for mistakes. :)

CSE is an optimization that needs to pay attention to masks. You can't
(easily) CSE an expression with different masks.

Using a mask-per-operation setup, code might look like this:

CSE Mask Example

Mask-per-operation:

%m1 = ...
%m2 = ...
%r1 = add %r2, %r3, %m1
%r4 = add %r2, %r3, %m2

We can only CSE the adds if %m1 == %m2. I'd think it would be sufficient
to just check whether the mask registers are the same, though you could
imagine walking back through the use-def chains to do more complex
analysis. Essentially, the mask operand becomes part of the expression
you're trying to match. CSE definitely needs to be changed to understand
masks.

With applymask, things look something like this:

applymask:

%m1 = ...
%m2 = ...
%r5 = applymask %r2, %m1
%r6 = applymask %r3, %m1
%r7 = applymask %r2, %m2
%r8 = applymask %r3, %m2
%r1 = add %r5, %r6
%r4 = add %r7, %r8

This is interesting because CSE will not do the optimization since the
operands of the adds are different. CSE does not have to change to
understand masks. Value numbering will have to understand applymask so
that it doesn't do something bad, because it doesn't care about the names
of operands, but the renaming of registers required by applymask is
somewhat attractive.

Now let's look at an instcombine example:

Instcombine example: -A + -B --> -(A + B)

Mask-per-operation:

%m1 = ...
%m2 = ...
%m3 = ...
%r1 = neg %r2, %m1
%r1 = select %r1, %r2, %m1   # Define all elements
%r3 = neg %r4, %m2
%r3 = select %r3, %r4, %m2   # Define all elements
%r5 = add %r1, %r3, %m3

Since the masks on the negs are different, we have to be careful in how
we do this. In the easiest case, we simply disallow combining here, which
requires instcombine to examine the masks and, as in the CSE example
above, potentially just look at the mask register names to determine
equality.

One could imagine doing the combining this way:

%m1 = ...
%m2 = ...
%m3 = ...
# Do the combining where legal
%m4 = and %m1, %m2
%r6 = add %r2, %r4, %m4
%r7 = neg %r6, %m4
%r7 = select %r7, %r6, %m4   # Define all elements for final merge
# Take care of the rest
%m6 = xor %m1, %m4
%m7 = xor %m2, %m4
%m8 = xor %m3, %m4
%r1 = neg %r2, %m6
%r1 = select %r1, %r2, %m6   # Define all elements for add
%r3 = neg %r4, %m7
%r3 = select %r3, %r4, %m7   # Define all elements for add
%r5 = add %r1, %r3, %m8
%r5 = select %r5, %r1, %m7   # Define all elements for final merge
# Do the final merge
%r5 = select %r7, %r5, %m4

But this is obviously a loss. We may be able to get away with eliminating
some merges (I'm being conservative about undefined elements here) but it
would still be a loss.

applymask:

%m1 = ...
%m2 = ...
%m3 = ...
%r6 = applymask %r2, %m1
%r7 = applymask %r4, %m2
%r1 = neg %r6
%r3 = neg %r7
%r8 = applymask %r1, %m3
%r9 = applymask %r3, %m3
%r5 = add %r8, %r9

This is really no different from the mask-per-operation case. Instcombine
still has to consider masks when evaluating the legality and
profitability of combining. Here instcombine would have to walk back
through the use-def chains on the operands of the neg and add to find
whether they are masked, and then determine the mask value (or mask
register name for simpler equality testing).

Again, doing the combining in this case would be detrimental. My guess is
that most combining turns out worse code when operations are masked and
the masks are not equal.

Given these two examples, I see some benefit to applymask for
transformations like CSE and other pattern-matching algorithms, where the
renaming of registers that applymask requires converts previously
matching patterns into patterns that don't match. For something like
instcombine or value numbering that is based on value flow and/or pure
semantics, I don't see any benefit.

Of course, these are just two examples. One could repeat the exercise for
any transformation we're interested in.

I hope this is helpful in moving us along in the discussion. Feedback and
comments are welcome!

-Dave
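To see the legality condition above stated executably, here is a small
Python model (not IR). The lane semantics are assumptions chosen to match
the "define all elements" selects in the mask-per-operation code: a
masked neg leaves inactive lanes equal to its source, and a masked add
leaves inactive lanes undefined. Under those assumptions, -A + -B matches
-(A + B) only on lanes where %m1, %m2 and %m3 are all set, which is why
combining under unequal masks needs the extra merge code.

# Executable model of the masked instcombine example; lane semantics are
# illustrative assumptions, not the proposed IR.

def masked_neg(src, mask):
    # Inactive lanes keep the source value (the "define all elements" select).
    return [-x if m else x for x, m in zip(src, mask)]

def masked_add(a, b, mask):
    # Inactive lanes are undefined in the IR; model them as None here.
    return [x + y if m else None for x, y, m in zip(a, b, mask)]

a  = [1, 2, 3, 4]
b  = [5, 6, 7, 8]
m1 = [True, True, False, True]    # mask on the first neg
m2 = [True, False, True, True]    # mask on the second neg
m3 = [True, True, True, False]    # mask on the add

original = masked_add(masked_neg(a, m1), masked_neg(b, m2), m3)
combined = [-(x + y) if m else None for x, y, m in zip(a, b, m3)]

for i in range(len(a)):
    if m1[i] and m2[i] and m3[i]:
        assert original[i] == combined[i]   # identity holds on fully active lanes
    print(i, original[i], combined[i])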
Tim Foley
2008-Aug-08 01:13 UTC
[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR
This thread is already churning, so I apologize if by being late to the
party I have missed important information.

The "apply_mask" approach is very familiar to me, in that I spent a lot
of time thinking about how masking could be added pervasively to LLVM
without disrupting the currently nice property that most of the standard
arithmetic instructions are "overloaded" to work on all types.

The space of options that I saw available (and please don't judge these
options just yet, I am well aware that some of them are abhorrent):

1 - Make masking explicit and change every potentially side-effecting
    operation to include an i1 mask in the scalar case and an i1 vector
    mask in the vector case.
2 - Add specialized masked versions of these operations (distinct from
    the unmasked versions), either as intrinsics or new instructions.
3 - Add an implicit mask that all vector operations occur "under", along
    with operations to set/get (or push/pop) this mask.
4 - Like 3, but add some notion of a "vector branch" - that is, a
    conditional branch on a vector of i1 values instead of a single i1,
    which would implicitly do "the right thing" for masking.
5 - Add a new type for "partial" vectors that combines a vector and a
    same-sized mask. Operations overloaded for normal vectors would also
    work on partial vectors (and not operate on "missing" elements), and
    would produce partial results.

1 and 2 are the more-or-less straightforward approaches being considered,
and involve either plumbing through new operations or new semantics for
old operations. There is nontrivial (I expect) but straightforward
(again, I expect) work involved in either. The rest of the options try to
avoid some or all of that work and/or provide niceties for end users at
the expense of being "clever".

Options 3 and 4 are almost certainly outside of the realm of what LLVM
should consider. One of the best things about SSA is how explicit it
makes dependencies, so adding any kind of implicit mask that is carried
behind people's backs is really scary.

Option 5 is (to my understanding) equivalent to this whole "apply_mask"
approach being discussed, although the latter tries to avoid introducing
a new type. Even though "apply_mask" doesn't explicitly introduce a new
type, there are all kinds of special rules about when and where a value
that has been "apply_mask"'d can be used. Adding a type to represent
partiality allows these rules to be made explicit as part of
type-checking the IR.

Such partial vectors are still fairly non-orthogonal, so it is unclear
whether the end-user experience is worth whatever complexity it adds to
the IR.

Anyway, that's my breakdown of the situation. I welcome any feedback from
those with a better understanding of the consequences of any of these
approaches...

- Tim
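One way to visualize option 5 is a small sketch (in Python rather than
IR) of a partial-vector value as an explicit (values, mask) pair. The
intersection-of-masks behavior for binary operations is an assumption
made for the sketch; the option as described above doesn't pin that down,
and a verifier could just as well require matching masks instead.

# Hypothetical model of a "partial vector": element values plus a
# same-sized mask. Binary operations produce elements only where both
# inputs are present, yielding a partial result.

class PartialVector:
    def __init__(self, values, mask):
        assert len(values) == len(mask)
        self.values, self.mask = values, mask

    def __add__(self, other):
        mask = [a and b for a, b in zip(self.mask, other.mask)]
        vals = [x + y if m else None      # "missing" elements stay missing
                for x, y, m in zip(self.values, other.values, mask)]
        return PartialVector(vals, mask)

    def __repr__(self):
        return "PartialVector(%r, %r)" % (self.values, self.mask)

a = PartialVector([1, 2, 3, 4], [True, True, False, True])
b = PartialVector([5, 6, 7, 8], [True, False, True, True])
print(a + b)   # elements defined only where both masks are set

Making partiality a type, as in this sketch, is what would let the IR
type-checker reject ill-formed mixes of partial and full vectors.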
Mark Leone
2008-Aug-08 01:20 UTC
[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR
Good contribution, thanks. I've read the thread but don't have much
insight. Dan's suggestion sounds plausible, but not especially elegant.

- Mark
Dan Gohman
2008-Aug-08 02:48 UTC
[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR
On Aug 7, 2008, at 12:13 PM, David Greene wrote:

> Ok, so I took my own advice and thought about CSE and instcombine a bit.
> [...]
> With applymask, things look something like this:
>
> applymask:
> %m1 = ...
> %m2 = ...
> %r5 = applymask %r2, %m1
> %r6 = applymask %r3, %m1
> %r7 = applymask %r2, %m2
> %r8 = applymask %r3, %m2
> %r1 = add %r5, %r6
> %r4 = add %r7, %r8
>
> This is interesting because CSE will not do the optimization since the
> operands of the adds are different. CSE does not have to change to
> understand masks. Value numbering will have to understand applymask to
> make sure it doesn't do something bad because it doesn't care about the
> names of operands, but the renaming of registers required by applymask
> is somewhat attractive.

Right. One way to think about it is that applymask can conservatively be
treated as a plain binary operator, for register-oriented passes at
least.

> [...]
> Given these two examples, I see some benefit to applymask for
> transformations like CSE and other pattern-matching algorithms where the
> renaming of registers that applymask requires converts previously
> matching patterns into patterns that don't match. For something like
> instcombine or value numbering that is based on value flow and/or pure
> semantics, I don't see any benefit.

I agree about value numbering.

I think instcombine is a little different. But only with the help of a
pretty non-trivial assumption. One of the cornerstone assumptions of the
applymask approach is that code like what's in your examples is
relatively uncommon, and that the common case is larger groups of
instructions sharing mask values. If this assumption isn't true, many of
the advantages of applymask no longer hold. I should encourage people to
question this assumption first, because pretty much everything related to
applymask depends on it :-).

But if it is true, then we can say that instcombine's job is made easier
by applymask. Instcombine primarily does very localized analysis, so it
can look at something like this:

%r0 = neg %r888
%r1 = neg %r999
%r2 = add %r0, %r1

and, without knowing anything about r888 or r999 and whether or not there
were applymasks involved in producing their values, it can safely do the
transformation. A key rule in my initial email is that the operands of
any operator must be under the same mask. The Verifier can enforce this
constraint. (One problem though is that this places an interesting burden
on front-ends and vectorizers. I'll try to write more about this soon.)

There are things instcombine can do when the masks are different. I think
you mentioned earlier it could do clever tricks ANDing masks and such.
However, based on the assumption above, this wouldn't be used very often,
so a conservative implementation of instcombine that doesn't do any of
this is mostly good enough.

> Of course, these are just two examples. One could repeat the exercise
> for any transformation we're interested in.
>
> I hope this is helpful in moving us along in the discussion. Feedback
> and comments are welcome!

Thanks for the examples! I think they have been helpful. I'll try to add
a few more here.

First, dead store elimination. It clearly needs to know if a store is
predicated. Mask operands would make this pretty obvious. Applymask would
make it less obvious, but still doable, and would probably want
memoization to be efficient. This is a case where mask operands are
favored.

Here's one more example, or at least a skeleton of several examples. A
loop to be vectorized:

for (...) {
  A
  if (...) {
    B
  } else {
    C
  }
  D
}

Assuming there's a bunch of code in B and a bunch in C, then we have four
bunches of code and three mask conditions - A and D are unmasked, B is
masked, and C is masked with the inverse mask value. This code could
easily have more if branches, nested loops, early exits, and so on, but
the main idea is that there are blocks of instructions grouped together
under the same mask, cross-block optimization opportunities exist but are
limited, and that this is assumed to be basically representative.

There are some interesting cross-block cases here. If you can prove that
something in B and/or C can be run unmasked, it could be
CSE'd/LICM'd/PRE'd/etc. If D has a PHI merging a value from B and C, it
might be nice to do a vector merge. It's interesting to look at how these
kinds of cases work under both mask operands and applymask.

Dan
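To make the skeleton concrete, here is a toy Python model of how those
four blocks could execute once the loop is vectorized across iterations.
The condition and block bodies are invented purely for illustration and
are not part of either proposal; the point is only that B runs under the
mask, C under its inverse, and D sees the per-lane merge.

# Toy model of vectorizing the A / if / B / else / C / D skeleton.
# Lanes correspond to loop iterations executed together.

x = [1, -2, 3, -4]                        # one value per lane/iteration

a_out = [v * 2 for v in x]                # A: unmasked
mask  = [v > 0 for v in a_out]            # if (...): condition becomes a mask

b_out = [v + 10 if m else None for v, m in zip(a_out, mask)]        # B, masked
c_out = [v - 10 if not m else None for v, m in zip(a_out, mask)]    # C, inverse mask

# D: unmasked, but its input is the PHI-style merge of B and C per lane.
merged = [b if m else c for b, c, m in zip(b_out, c_out, mask)]
d_out  = [v * v for v in merged]

print(mask)     # [True, False, True, False]
print(merged)   # [12, -14, 16, -18]
print(d_out)    # [144, 196, 256, 324]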
David Greene
2008-Aug-12 21:47 UTC
[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR
On Thursday 07 August 2008 21:48, Dan Gohman wrote:
> A key rule in my initial email is that the operands of any
> operator must be under the same mask. The Verifier can enforce
> this constraint. (One problem though is that this places an
> interesting burden on front-ends and vectorizers. I'll try to
> write more about this soon.)

I totally missed that restriction. I think this is error-prone. I
certainly wouldn't have caught the problem in my code even if I had known
about the restriction. I suppose the fix in this case is to set the mask
to all 1's before the add. This could get rather complicated in
interesting cases.

> There are things instcombine can do when the masks are different.
> I think you mentioned earlier it could do clever tricks ANDing
> masks and such. However, based on the assumption above, this
> wouldn't be used very often, so a conservative implementation of
> instcombine that doesn't do any of this is mostly good enough.

Since we don't support ANY vectorization of conditional code right now,
any addition is a win. That's true for whatever implementation of masks
we go with.

> Here's one more example, or at least a skeleton of several
> examples. A loop to be vectorized:
>
> for (...) {
>   A
>   if (...) {
>     B
>   } else {
>     C
>   }
>   D
> }
>
> Assuming there's a bunch of code in B and a bunch in C, then
> we have four bunches of code and three mask conditions -
> A and D are unmasked, B is masked, and C is masked with the
> inverse mask value. This code could easily have more if
> branches, nested loops, early exits, and so on, but the main
> idea is that there are blocks of instructions grouped together
> under the same mask, cross-block optimization opportunities
> exist but are limited, and that this is assumed to be
> basically representative.

I think that's right.

> There are some interesting cross-block cases here.
> If you can prove that something in B and/or C can be run
> unmasked, it could be CSE'd/LICM'd/PRE'd/etc.
> If D has a PHI merging a value from B and C, it might be
> nice to do a vector merge. It's interesting to look at
> how these kinds of cases work under both
> mask operands and applymask.

Yes. Unfortunately, I'm out of spare cycles at the moment.

I was thinking about this more yesterday and came up with what I think is
a useful question to ask: if LLVM had had masks way back when it started,
what would it look like?

I realize that there are constraints on what kind of transition pain we
want to go through, but it gets back to my earlier point that core IR
infrastructure should be first-class no matter when it is added. And
"what it would have looked like from the start" is a reasonable
definition of "first-class".

-Dave
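For what it's worth, the "same mask" rule and the all-1's fix can be
sketched in the same executable-model style as the earlier examples. The
wrapper class and its behavior below are assumptions made for
illustration only; they are not the semantics of the actual applymask
proposal, and in particular the question of what re-masking an
already-masked value means is sidestepped by re-wrapping the raw values.

# Toy model: a value carries the mask it was applied under, and a binary
# operation insists both operands carry the same mask (a verifier-style
# check). Re-masking with an all-1's mask is the fix mentioned above.

class Masked:
    def __init__(self, values, mask):
        self.values, self.mask = values, mask

def applymask(values, mask):
    return Masked(values, mask)

def masked_add(a, b):
    assert a.mask == b.mask, "operands must be under the same mask"
    return Masked([x + y for x, y in zip(a.values, b.values)], a.mask)

m1, m2 = [True, False], [False, True]
x = applymask([1, 2], m1)
y = applymask([3, 4], m2)

# masked_add(x, y) would trip the assertion; re-mask both with all 1's:
ones = [True, True]
z = masked_add(applymask(x.values, ones), applymask(y.values, ones))
print(z.values, z.mask)   # [4, 6] [True, True]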
Daniel Berlin
2008-Aug-12 21:55 UTC
[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR
On Thu, Aug 7, 2008 at 3:13 PM, David Greene <dag at cray.com> wrote:
> On Tuesday 05 August 2008 13:27, David Greene wrote:
>
>> Neither solution eliminates the need for instcombine to be careful and
>> consult masks from time to time.
>>
>> Perhaps I'm totally missing something. Concrete examples would be helpful.
>
> Ok, so I took my own advice and thought about CSE and instcombine a bit.
> I wrote the code by hand in a sort of pseudo-llvm language, so don't
> crucify me for mistakes. :)
>
> CSE is an optimization that needs to pay attention to masks. You can't
> (easily) CSE an expression with different masks.

If you are using a value based CSE, you can easily CSE expressions with
different masks. I'd agree that with a lexical, expression based CSE you
can't, but that's a pretty weak form of CSE.

> Using a mask-per-operation setup, code might look like this:
>
> [...]
>
> With applymask, things look something like this:
>
> applymask:
> %m1 = ...
> %m2 = ...
> %r5 = applymask %r2, %m1
> %r6 = applymask %r3, %m1
> %r7 = applymask %r2, %m2
> %r8 = applymask %r3, %m2
> %r1 = add %r5, %r6
> %r4 = add %r7, %r8
>
> This is interesting because CSE will not do the optimization since the
> operands of the adds are different.

LLVM's main CSE these days is really GVN based. GVN should happily
compute these to be the same value, and if it doesn't, you have much
larger problems.

GCC's value numberer already combines shifts and masks during value
numbering over multiple binary operations. It is not hard to make LLVM's
do the same to the small degree it doesn't (the value numbering does not
value number phi nodes/cycles iteratively, so it misses identities
between induction variables and things based on them in some cases; SCEV
saves it in others). You don't need to do it to the level instcombine
does to get good results.

This is all with my "person who implemented SCC based value numbering and
value based PRE in GCC" hat on :)

--Dan
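For readers less familiar with value-based CSE, here is a bare-bones
Python sketch of hash-based value numbering over SSA-style tuples. It
only illustrates the mechanism being referred to (numbering by operands'
values rather than by lexical names); it deliberately says nothing about
how masks or applymask should enter the key, which is exactly the open
question in this thread, and the instruction encoding is invented for the
sketch.

# Minimal hash-based value numbering: each instruction is keyed by
# (opcode, operand value numbers); instructions with the same key get the
# same value number, so equivalent expressions are found by value.

def value_number(instructions):
    vn_of = {}       # SSA name -> value number
    key_to_vn = {}   # (opcode, operand VNs) -> value number
    next_vn = [0]

    def fresh():
        next_vn[0] += 1
        return next_vn[0]

    for name, opcode, operands in instructions:
        if not operands:                 # leaves get their own number
            vn_of[name] = fresh()
            continue
        # Sorting operand numbers assumes a commutative opcode like add.
        key = (opcode,) + tuple(sorted(vn_of[o] for o in operands))
        if key not in key_to_vn:
            key_to_vn[key] = fresh()
        vn_of[name] = key_to_vn[key]
    return vn_of

prog = [
    ("a",  "leaf", []),
    ("b",  "leaf", []),
    ("r1", "add", ["a", "b"]),
    ("r2", "add", ["b", "a"]),   # same value number as r1 despite the names
]
print(value_number(prog))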
David Greene
2008-Aug-13 15:30 UTC
[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR
On Tuesday 12 August 2008 16:55, Daniel Berlin wrote:
>> With applymask, things look something like this:
>>
>> applymask:
>> %m1 = ...
>> %m2 = ...
>> %r5 = applymask %r2, %m1
>> %r6 = applymask %r3, %m1
>> %r7 = applymask %r2, %m2
>> %r8 = applymask %r3, %m2
>> %r1 = add %r5, %r6
>> %r4 = add %r7, %r8
>>
>> This is interesting because CSE will not do the optimization since the
>> operands of the adds are different.
>
> LLVM's main CSE these days is really GVN based.
> GVN should happily compute these to be the same value, and if it
> doesn't, you have much larger problems.

Eh? How can these compute the same thing if the masks are different?

-Dave