Hi,

We would like to extend the vector operations in llvm a bit. We're hoping to get some feedback on the right way to go, or some starting points. I had previously had some discussion on this list about a subset of the changes we have in mind.

All of these changes are intended to make target-independent IR (i.e. IR without machine-specific intrinsics) generate better code or be easier to generate from a frontend with vector support (whether from manual or autovectorization).

If you have any insight into how best to get started with any of these changes, and whether they are feasible and sensible, please let me know. We're mostly interested in x86 as a target in the short term, but obviously want these to apply to other LLVM targets as well. We're prepared to implement these changes, but would like to hear any suggestions and objections you might have.

Below are the specific additions we have in mind.

==

1) Vector shl, lshr, ashr

I think these are no-brainers. We would like to extend the semantics of the shifting instructions to apply naturally to vectors as well. One issue is that these operations often only support a single shift amount for an entire vector. I assume it should be fairly straightforward to select on this pattern, and scalarize the general case as necessary.

2) Vector trunc, sext, zext, fptrunc and fpext

Again, I think these are hopefully straightforward. Please let me know if you expect any issues with vector operations that change element sizes from the RHS to the LHS, e.g. around legalization.

3) Vector intrinsics for floor, ceil, round, frac/modf

These are operations that are not trivially specified in terms of simpler operations. It would be nice to have these as overloaded, target-independent intrinsics, in the same way as llvm.cos etc. are supported now.

4) Vector select

We consider a vector select extremely important for a number of operations. This would be an extension of select to support an <N x i1> vector mask to select between elements of <N x T> vectors for some basic type T. Vector min, max, sign, etc. can be built on top of this operation.

5) Vector comparisons that return <N x i1>

This is maybe not a must-have, and perhaps more a question of preference. I understand the current vfcmp/vicmp semantics, returning a vector of iK (where K matches the bit width of the operands being compared) with the high bit set or not, are there for pragmatic reasons, and that these instructions exist to aid code that uses machine-specific intrinsics.

For code that does not use machine intrinsics, I believe it would be cleaner, simpler, and potentially more efficient to have a vector compare that returns <N x i1> instead. For example, in conjunction with the above-mentioned vector select, this would allow a max to be expressed simply as a sequence of compare and select.

Vector bit shifts would actually help with the amount of code generated for something like a vectorized max, but they make the patterns for recognizing these a lot longer.

I realize this is probably the most controversial change amongst these. I gather there is some concern about representing "variable width" i1s, but I would contend that that's the case even for i1s which are not vectors.

==

In addition to the above suggestions, I'd also like to hear what others think about handling vector operations whose sizes aren't powers of two, e.g. <3 x float> operations. I gather the status quo is that only POT sizes are expected to work (although we've found some bugs for things like <2 x float> that we're submitting). Ideally, operands like <3 x float> would usually be rounded up to a size the machine supports directly. We can try to do this in the frontend, but it would of course be ideal if these just worked. I'm curious whether anyone else out there has dealt with this already and has some suggestions.
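[Editor's note: to make the compare-and-select max idiom from points (4) and (5) concrete, here is a sketch using GCC/clang vector extensions rather than LLVM IR. The comparison plays the role of the proposed <N x i1>-returning compare, and the bitwise blend stands in for the proposed vector select; the `v4f`/`v4i` typedefs and the `vmax` name are local to this illustration.]

```c
typedef float v4f __attribute__((vector_size(16)));
typedef int   v4i __attribute__((vector_size(16)));

/* Elementwise max expressed as compare-then-select. The comparison
   yields an all-ones or all-zeros mask per element; the blend then
   copies bits from a where the mask is set and from b elsewhere. */
v4f vmax(v4f a, v4f b) {
    v4i mask = (v4i)(a > b);   /* -1 where a > b, 0 elsewhere */
    return (v4f)(((v4i)a & mask) | ((v4i)b & ~mask));
}
```

With the proposed IR, the same computation would be just a vector fcmp followed by a vector select, which a backend could match to a single maxps-style instruction.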
Please let me know what you think,

Stefanus

--
Stefanus Du Toit <stefanus.dutoit at rapidmind.com>
RapidMind Inc.
phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463
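[Editor's note: the general vector-by-vector shift from point (1) can likewise be sketched with GCC/clang vector extensions; a target that only supports a uniform shift amount per vector would scalarize or split this during selection. The `v4i` typedef and `vshl` name are illustrative only.]

```c
typedef int v4i __attribute__((vector_size(16)));

/* Proposed vector shl with an independent shift amount per element.
   Vector extensions already give << the elementwise semantics the
   proposal asks for at the IR level. */
v4i vshl(v4i x, v4i amt) {
    return x << amt;
}
```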
On Mon, Jul 21, 2008 at 1:21 PM, Stefanus Du Toit <stefanus.dutoit at rapidmind.com> wrote:

> 1) Vector shl, lshr, ashr
>
> I think these are no-brainers. We would like to extend the semantics of the shifting instructions to naturally apply to vectors as well. One issue is that these operations often only support a single shift amount for an entire vector. I assume it should be fairly straightforward to select on this pattern, and scalarize the general case as necessary.

I have a rough draft of a patch for this that works reasonably well for simple cases... I don't think I really have the time to finish it properly, but I'll clean it up a bit and send it to you as a starting point.

> 2) Vector trunc, sext, zext, fptrunc and fpext
>
> Again, I think these are hopefully straightforward. Please let me know if you expect any issues with vector operations that change element sizes from the RHS to the LHS, e.g. around legalization.

These are tricky to generate efficient code for because they change the size of the vector and most likely introduce illegal vectors. It'll be a lot of work to get these working well for non-trivial cases.

> 5) Vector comparisons that return <N x i1>
>
> This is maybe not a must-have, and perhaps more a question of preference. I understand the current vfcmp/vicmp semantics, returning a vector of iK where K matches the bitwidth of the operands being compared with the high bit set or not, are there for pragmatic reasons, and that these functions exist to aid with code emitted that uses machine-specific intrinsics.
>
> For code that does not use machine intrinsics, I believe it would be cleaner, simpler, and potentially more efficient to have a vector compare that returns <N x i1> instead. For example, in conjunction with the above-mentioned vector select, this would allow a max to be expressed simply as a sequence of compare and select.
>
> I realize this is probably the most controversial change amongst these. I gather there is some concern about representing "variable width" i1s, but I would contend that that's the case even for i1s which are not vectors.

It makes sense; the difficulty is making CodeGen generate efficient code for these constructs. Legalization has to transform illegal types into legal types, and as far as I know, none of the current transformations varies the type transformation based on what is using the value. I don't know exactly how hard it would be, though... maybe someone who's more familiar with the legalization infrastructure could say more?

-Eli
On Jul 21, 2008, at 1:21 PM, Stefanus Du Toit wrote:

> Hi,
>
> We would like to extend the vector operations in llvm a bit. We're hoping to get some feedback on the right way to go, or some starting points. I had previously had some discussion on this list about a subset of the changes we have in mind.
>
> All of these changes are intended to make target-independent IR (i.e. IR without machine-specific intrinsics) generate better code or be easier to generate from a frontend with vector support (whether from manual or autovectorization).
>
> If you have any insight into how best to get started with any of these changes, and whether they are feasible and sensible, please let me know. We're mostly interested in x86 as a target in the short term, but obviously want these to apply to other LLVM targets as well. We're prepared to implement these changes, but would like to hear any suggestions and objections you might have.
>
> Below are the specific additions we have in mind.
>
> ==
> 1) Vector shl, lshr, ashr
>
> I think these are no-brainers. We would like to extend the semantics of the shifting instructions to naturally apply to vectors as well. One issue is that these operations often only support a single shift amount for an entire vector. I assume it should be fairly straightforward to select on this pattern, and scalarize the general case as necessary.

That seems reasonable.

> 2) Vector trunc, sext, zext, fptrunc and fpext
>
> Again, I think these are hopefully straightforward. Please let me know if you expect any issues with vector operations that change element sizes from the RHS to the LHS, e.g. around legalization.

Is the proposed semantics here that the number of elements stays the same, and the overall vector width changes?

> 3) Vector intrinsics for floor, ceil, round, frac/modf
>
> These are operations that are not trivially specified in terms of simpler operations. It would be nice to have these as overloaded, target-independent intrinsics, in the same way as llvm.cos etc. are supported now.

It seems like these could be handled through intrinsics in the LLVM IR, and could use general improvement in the selection dag.

> 4) Vector select
>
> We consider a vector select extremely important for a number of operations. This would be an extension of select to support an <N x i1> vector mask to select between elements of <N x T> vectors for some basic type T. Vector min, max, sign, etc. can be built on top of this operation.

How is this anything other than AND/ANDN/OR on any integer vector type? I don't see what adding this to the IR gets you for vectors, since "vector of i1" doesn't mean "vector of bits" necessarily.

> 5) Vector comparisons that return <N x i1>
>
> This is maybe not a must-have, and perhaps more a question of preference. I understand the current vfcmp/vicmp semantics, returning a vector of iK where K matches the bitwidth of the operands being compared with the high bit set or not, are there for pragmatic reasons, and that these functions exist to aid with code emitted that uses machine-specific intrinsics.

I totally disagree with this approach; a vector of i1 doesn't actually match what you want to do with the hardware, unless you had, say, 128 x i1 for SSE, and it's strange when you have to spill and reload it. The current VICMP and VFCMP instructions do not exist for use with machine intrinsics; they exist to allow code written using C-style comparison operators to generate efficient code on a wide range of both scalar and vector hardware.

> For code that does not use machine intrinsics, I believe it would be cleaner, simpler, and potentially more efficient to have a vector compare that returns <N x i1> instead. For example, in conjunction with the above-mentioned vector select, this would allow a max to be expressed simply as a sequence of compare and select.

Having gone down this path, I'd have to disagree with you.

> In addition to the above suggestions, I'd also like to hear what others think about handling vector operations that aren't powers of two in size, e.g. <3 x float> operations. I gather the status quo is that only POT sizes are expected to work (although we've found some bugs for things like <2 x float> that we're submitting). Ideally things like <3 x float> operands would usually be rounded up to the size supported by the machine directly. We can try to do this in the frontend, but it would of course be ideal if these just worked. I'm curious if anyone else out there has dealt with this already and has some suggestions.

Handling NPOT vectors in the code generator ideally would be great; I know some people are working on widening the operations to a wider legal vector type, and scalarizing is always a possibility as well. The main problem here is what to do with address calculations, and alignment.

Nate
On Jul 21, 2008, at 4:33 PM, Eli Friedman wrote:

>> 2) Vector trunc, sext, zext, fptrunc and fpext
>>
>> Again, I think these are hopefully straightforward. Please let me know if you expect any issues with vector operations that change element sizes from the RHS to the LHS, e.g. around legalization.
>
> These are tricky to generate efficient code for because they change the size of the vector and most likely introduce illegal vectors. It'll be a lot of work to get these working well for non-trivial cases.

Consider something like <4 x i32> -> <4 x double>. On a target with 128-bit vectors, this could be custom expanded/split into a convert-high and convert-low operation. This is real work to support, but I don't think this is too crazy.

-Chris
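[Editor's note: the hi/lo split Chris describes can be mimicked in plain C; the type names and the function name below are hypothetical, and a real lowering would use target shuffle and conversion instructions rather than scalar extracts.]

```c
typedef int    v4i32 __attribute__((vector_size(16)));
typedef double v2f64 __attribute__((vector_size(16)));

/* <4 x i32> -> <4 x double> on a 128-bit target: the result no longer
   fits in one register, so split the source and convert each half. */
void cvt_i32x4_to_f64x4(v4i32 src, v2f64 *lo, v2f64 *hi) {
    *lo = (v2f64){ (double)src[0], (double)src[1] };
    *hi = (v2f64){ (double)src[2], (double)src[3] };
}
```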
On Jul 21, 2008, at 4:46 PM, Nate Begeman wrote:

> I totally disagree with this approach; a vector of i1 doesn't actually match what you want to do with the hardware, unless you had, say, 128 x i1 for SSE, and it's strange when you have to spill and reload it. The current VICMP and VFCMP instructions do not exist for use with machine intrinsics; they exist to allow code written using C-style comparison operators to generate efficient code on a wide range of both scalar and vector hardware.

One interesting point here is that clang follows the same semantics that LLVM IR does for comparisons. This means you can use "V1 < V2" and the result is defined to be a vector of integers of the same width with only the sign bits defined.

-Chris
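[Editor's note: for reference, GCC and clang vector extensions make each lane of a vector comparison all-ones or all-zeros, so the sign bit Chris mentions is always set for true lanes. A small sketch, with illustrative typedefs:]

```c
typedef float v4f __attribute__((vector_size(16)));
typedef int   v4i __attribute__((vector_size(16)));

/* "a < b" on vectors yields an integer vector of the same width,
   with each element -1 (true) or 0 (false). */
v4i vless(v4f a, v4f b) {
    return (v4i)(a < b);
}
```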
On 21-Jul-08, at 7:33 PM, Eli Friedman wrote:

> On Mon, Jul 21, 2008 at 1:21 PM, Stefanus Du Toit <stefanus.dutoit at rapidmind.com> wrote:
>> 1) Vector shl, lshr, ashr
>
> I have a rough draft of a patch for this that works reasonably well for simple cases... I don't think I really have the time to finish it properly, but I'll clean it up a bit and send it to you as a starting point.

That's great. Please do send it over, it'd be great to have a starting point like this.

>> 2) Vector trunc, sext, zext, fptrunc and fpext
>
> These are tricky to generate efficient code for because they change the size of the vector and most likely introduce illegal vectors. It'll be a lot of work to get these working well for non-trivial cases.

Right, though it seems they can generally be split up as necessary (as Chris pointed out as well). I just don't know how well this fits the existing infrastructure. More pointers would be welcome here!

>> 5) Vector comparisons that return <N x i1>
>
> It makes sense; the difficulty is making CodeGen generate efficient code for these constructs. Legalization has to transform illegal types into legal types, and as far as I know, none of the current transformations varies the type transformation based on what is using the value. I don't know exactly how hard it would be, though... maybe someone who's more familiar with the legalization infrastructure could say more?

Right, the type that needs to be used at the birth point of the value will generally be determined by the operation being performed to define it. At uses, that representation will have to be converted. There are some operations which may not provide a unique representation at a definition. Phi statements are the most obvious ones: if a phi statement joins two boolean vectors of different representations, the optimal choice of the result may depend on where that result is used. Conversion code that may need to be inserted can also probably benefit from optimizations like CSE.

--
Stefanus Du Toit <stefanus.dutoit at rapidmind.com>
RapidMind Inc.
phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463
Hi Nate,

On 21-Jul-08, at 7:46 PM, Nate Begeman wrote:

> On Jul 21, 2008, at 1:21 PM, Stefanus Du Toit wrote:
>> 1) Vector shl, lshr, ashr
>
> That seems reasonable.

Thanks.

>> 2) Vector trunc, sext, zext, fptrunc and fpext
>>
>> Again, I think these are hopefully straightforward. Please let me know if you expect any issues with vector operations that change element sizes from the RHS to the LHS, e.g. around legalization.
>
> Is the proposed semantics here that the number of elements stays the same, and the overall vector width changes?

Yes.

>> 3) Vector intrinsics for floor, ceil, round, frac/modf
>>
>> These are operations that are not trivially specified in terms of simpler operations. It would be nice to have these as overloaded, target-independent intrinsics, in the same way as llvm.cos etc. are supported now.
>
> It seems like these could be handled through intrinsics in the LLVM IR, and could use general improvement in the selection dag.

Right, that's what we were thinking too. Glad to hear that makes sense!

>> 4) Vector select
>>
>> We consider a vector select extremely important for a number of operations. This would be an extension of select to support an <N x i1> vector mask to select between elements of <N x T> vectors for some basic type T. Vector min, max, sign, etc. can be built on top of this operation.
>
> How is this anything other than AND/ANDN/OR on any integer vector type? I don't see what adding this to the IR gets you for vectors, since "vector of i1" doesn't mean "vector of bits" necessarily.

Note that I don't mean a "select bits" operation, but rather a "select components" operation; in other words, a straightforward vectorization of the existing "select" IR operation. You can implement the existing LLVM select instruction with bitwise operations, but it's still convenient to have. For one, as I mentioned, it provides an obvious idiomatic way to express operations like min and max.

Vector selection is a common operation amongst languages that support vectors directly, and vector code will often avoid branches by performing some form of predication instead. I'm really not that concerned about how it's expressed, but there needs to be a well-understood way to lower something that looks like "vector ?:", vector max, etc. in a frontend to something that will actually generate good code in the backend. If the idiom for vector float max is "vfcmp, ashr, bitcast, and, xor, and, or, bitcast" and that generates a single maxps from the x86 backend, great. If the idiom is "vfcmp, select", even better.

>> 5) Vector comparisons that return <N x i1>
>>
>> This is maybe not a must-have, and perhaps more a question of preference. I understand the current vfcmp/vicmp semantics, returning a vector of iK where K matches the bitwidth of the operands being compared with the high bit set or not, are there for pragmatic reasons, and that these functions exist to aid with code emitted that uses machine-specific intrinsics.
>
> I totally disagree with this approach; a vector of i1 doesn't actually match what you want to do with the hardware, unless you had, say, 128 x i1 for SSE, and it's strange when you have to spill and reload it.

I definitely am not thinking of <128 x i1>. The intent is really just to express "here is a vector of values where only one bit matters in each element" and have codegen map that to a representation appropriate to the machine being targeted. Of course these eventually need to be widened to an appropriately sized integer vector, and that vector may be a mask or a 0/1 value, or whatever.

The responsibility for doing so can be placed at pretty much any level of the stack, all the way up to making the user worry about it. For us, making the user worry about it isn't an option; we have first-class bools in our frontend, support vectors of these, and naturally define comparisons, selection, etc. to be consistent with this. So that leaves it to either the part generating LLVM IR, or the LLVM backend/mid-end. I'm of the tendency that this job is best suited to an SSA representation, and best done with a specific machine in mind. If this is completely infeasible, we'll have to worry about it during generation of LLVM IR, which is fine. I'd just rather see it done in LLVM where it can hopefully benefit others with the same issues.

To me this isn't much different whether we're talking about vectors or scalars; it's about the utility of an "i1" type generally, and about whose responsibility it is to map such a type to hardware.

> The current VICMP and VFCMP instructions do not exist for use with machine intrinsics; they exist to allow code written using C-style comparison operators to generate efficient code on a wide range of both scalar and vector hardware.

OK. My understanding of this was based on an email from Chris to llvm-dev. I had asked how these were used today, especially given the lack of vector shifts. Here's his response:

>> They can be used with target-specific intrinsics. For example, SSE provides a broad range of intrinsics to support instructions that LLVM IR can't express well. See llvm/include/llvm/IntrinsicsX86.td for more details.

If you have examples of how these are expected to be used today without machine intrinsics, that would really help, e.g. for expressing something like a vector max.

>> For code that does not use machine intrinsics, I believe it would be cleaner, simpler, and potentially more efficient to have a vector compare that returns <N x i1> instead. For example, in conjunction with the above-mentioned vector select, this would allow a max to be expressed simply as a sequence of compare and select.
>
> Having gone down this path, I'd have to disagree with you.

OK. Hopefully my comments above will make my reasoning clearer; if not, please let me know. You have a lot more experience with this in LLVM than me, so pardon my ignorance :). I'm not suggesting that these semantics are absolutely the way to go, but we do need some way to address these issues.

>> In addition to the above suggestions, I'd also like to hear what others think about handling vector operations that aren't powers of two in size, e.g. <3 x float> operations. I gather the status quo is that only POT sizes are expected to work (although we've found some bugs for things like <2 x float> that we're submitting). Ideally things like <3 x float> operands would usually be rounded up to the size supported by the machine directly. We can try to do this in the frontend, but it would of course be ideal if these just worked. I'm curious if anyone else out there has dealt with this already and has some suggestions.
>
> Handling NPOT vectors in the code generator ideally would be great; I know some people are working on widening the operations to a wider legal vector type, and scalarizing is always a possibility as well. The main problem here is what to do with address calculations, and alignment.

Right, we've dealt with this in the past by effectively scalarizing loads and stores (since simply extending a load might go beyond a page boundary, etc.), but extending all register operations. I think this addresses the address calculation and alignment issues?

Thanks for the feedback, I hope to hear more.

Stefanus

--
Stefanus Du Toit <stefanus.dutoit at rapidmind.com>
RapidMind Inc.
phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463
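[Editor's note: the scheme Stefanus describes for NPOT vectors, scalar memory accesses combined with widened register operations, looks roughly like this in C. The typedef and function name are just for the sketch.]

```c
typedef float v4f __attribute__((vector_size(16)));

/* A <3 x float> add handled as described: scalar loads (a wide load
   of the fourth lane could cross a page boundary), a widened 4-wide
   register operation, then scalar stores. */
void add3(const float *a, const float *b, float *out) {
    v4f va = { a[0], a[1], a[2], 0.0f };
    v4f vb = { b[0], b[1], b[2], 0.0f };
    v4f r  = va + vb;   /* register op is full width; lane 3 is dead */
    out[0] = r[0]; out[1] = r[1]; out[2] = r[2];
}
```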
On Monday 21 July 2008 15:21, Stefanus Du Toit wrote:

> We would like to extend the vector operations in llvm a bit. We're hoping to get some feedback on the right way to go, or some starting points. I had previously had some discussion on this list about a subset of the changes we have in mind.

Woohoo! We've been interested in talking about this for some time.

> All of these changes are intended to make target-independent IR (i.e. IR without machine-specific intrinsics) generate better code or be easier to generate from a frontend with vector support (whether from manual or autovectorization).

Very excellent.

> ==
> 1) Vector shl, lshr, ashr
>
> I think these are no-brainers. We would like to extend the semantics of the shifting instructions to naturally apply to vectors as well. One issue is that these operations often only support a single shift amount for an entire vector. I assume it should be fairly

So you're assuming a shift of a vector by a scalar? What about the general vector-by-vector version?

> straightforward to select on this pattern, and scalarize the general case as necessary.

Yep.

> 2) Vector trunc, sext, zext, fptrunc and fpext
>
> Again, I think these are hopefully straightforward. Please let me know if you expect any issues with vector operations that change element sizes from the RHS to the LHS, e.g. around legalization.

Is the assumption that all elements are changed in the same way?

> 4) Vector select
>
> We consider a vector select extremely important for a number of operations. This would be an extension of select to support an <N x i1> vector mask to select between elements of <N x T> vectors for some basic type T. Vector min, max, sign, etc. can be built on top of this operation.

Yes, merge/blend is a very important operation. Also, it would be nice to think about generalizing this to apply masks to all vector operations, particularly loads and stores.

> 5) Vector comparisons that return <N x i1>
>
> This is maybe not a must-have, and perhaps more a question of preference. I understand the current vfcmp/vicmp semantics, returning a vector of iK where K matches the bitwidth of the operands being compared with the high bit set or not, are there for pragmatic reasons, and that these functions exist to aid with code emitted that uses machine-specific intrinsics.

As long as we have some way of representing the result of a vector compare, I'm not too worried. As you say, target-specific code can convert these to masks or whatever form the target architecture has. I don't have any experience trying to do such a conversion, though, so there may be gotchas we're not aware of.

> For code that does not use machine intrinsics, I believe it would be cleaner, simpler, and potentially more efficient to have a vector compare that returns <N x i1> instead. For example, in conjunction with the above-mentioned vector select, this would allow a max to be expressed simply as a sequence of compare and select.

True enough. It also leads naturally to generalized masked vector operations, which are _extremely_ handy.

> Vector bitshifts would actually help with the amount of code generated for something like a vectorized max, but makes the patterns for recognizing these a lot longer.

Right.

> I realize this is probably the most controversial change amongst these. I gather there is some concern about representing "variable width" i1s, but I would contend that that's the case even for i1s which are not vectors.

What do you mean by "variable width?"

> In addition to the above suggestions, I'd also like to hear what others think about handling vector operations that aren't powers of two in size, e.g. <3 x float> operations. I gather the status quo is

It's goodness.

> that only POT sizes are expected to work (although we've found some bugs for things like <2 x float> that we're submitting). Ideally things like <3 x float> operands would usually be rounded up to the size supported by the machine directly. We can try to do this in the

You might need mask support as well, especially if the operation can trap.

> frontend, but it would of course be ideal if these just worked. I'm curious if anyone else out there has dealt with this already and has some suggestions.

It's something we've thought about but not really delved into yet.

> Please let me know what you think,

Let's connect at the dev meeting along with others interested in this stuff and start thinking about how to proceed.

-Dave
On Monday 21 July 2008 18:33, Eli Friedman wrote:

>> 2) Vector trunc, sext, zext, fptrunc and fpext
>>
>> Again, I think these are hopefully straightforward. Please let me know if you expect any issues with vector operations that change element sizes from the RHS to the LHS, e.g. around legalization.
>
> These are tricky to generate efficient code for because they change the size of the vector and most likely introduce illegal vectors. It'll be a lot of work to get these working well for non-trivial cases.

It depends on the architecture. For something like SSE, I can see that this would be a real pain (the vector length changes). For a traditional Cray vector machine (as an example), it's trivial.

>> 5) Vector comparisons that return <N x i1>
>
> It makes sense; the difficulty is making CodeGen generate efficient code for these constructs. Legalization has to transform illegal types into legal types, and as far as I know, none of the current transformations varies the type transformation based on what is using the value. I don't know exactly how hard it would be, though... maybe someone who's more familiar with the legalization infrastructure could say more?

Can you elaborate with a specific example? I'm having a hard time imagining what the transformations need to do. When I think of a vector mask, the length of the mask is the same as the vector length of the operands. So a compare of two vectors with four elements results in a four-bit mask. If the input types change to a different vector length, then the mask type changes as well.

-Dave
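[Editor's note: the dense "four elements, four-bit mask" representation Dave describes can be sketched as follows, packing one bit per lane out of a wide comparison result. The typedef and function name are illustrative.]

```c
typedef int v4i __attribute__((vector_size(16)));

/* Pack a <4 x i1>-style comparison result into a 4-bit scalar mask,
   one bit per element, from the wide -1/0 lanes the compare produces. */
unsigned mask4(v4i a, v4i b) {
    v4i m = a < b;                 /* -1 where a < b, 0 elsewhere */
    unsigned r = 0;
    for (int i = 0; i < 4; i++)
        r |= (unsigned)(m[i] & 1) << i;
    return r;
}
```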
On Monday 21 July 2008 18:46, Nate Begeman wrote:

>> 4) Vector select
>>
>> We consider a vector select extremely important for a number of operations. This would be an extension of select to support an <N x i1> vector mask to select between elements of <N x T> vectors for some basic type T. Vector min, max, sign, etc. can be built on top of this operation.
>
> How is this anything other than AND/ANDN/OR on any integer vector type? I don't see what adding this to the IR gets you for vectors, since "vector of i1" doesn't mean "vector of bits" necessarily.

Yes, you can implement the operation with bit twiddling, but it's less convenient. Since we have a scalar select, it seems only natural to provide a vector version. Some vector hardware has direct support for this operation; AVX is just one example.

>> 5) Vector comparisons that return <N x i1>
>>
>> This is maybe not a must-have, and perhaps more a question of preference. I understand the current vfcmp/vicmp semantics, returning a vector of iK where K matches the bitwidth of the operands being compared with the high bit set or not, are there for pragmatic reasons, and that these functions exist to aid with code emitted that uses machine-specific intrinsics.
>
> I totally disagree with this approach; a vector of i1 doesn't actually match what you want to do with the hardware, unless you had, say, 128 x i1 for SSE, and it's strange when you have to spill and reload it.

I don't follow. "What you want to do with the hardware" depends on what hardware you have. Don't be restricted to thinking SSE-style only. And for spill-reload on SSE, why couldn't you do a bitcast to some SSE-supported type and just dump it to memory?

> The current VICMP and VFCMP instructions do not exist for use with machine intrinsics; they exist to allow code written using C-style comparison operators to generate efficient code on a wide range of both scalar and vector hardware.

Well, either you convert the current model to a bitvector for some hardware, or you convert a bitvector to the current model for other hardware. Perhaps this can all be done by isel without too much trouble; I honestly don't know.

>> For code that does not use machine intrinsics, I believe it would be cleaner, simpler, and potentially more efficient to have a vector compare that returns <N x i1> instead. For example, in conjunction with the above-mentioned vector select, this would allow a max to be expressed simply as a sequence of compare and select.
>
> Having gone down this path, I'd have to disagree with you.

Can you elaborate? What were your experiences? What architecture were you targeting? Real-world experience is valuable.

> Handling NPOT vectors in the code generator ideally would be great; I know some people are working on widening the operations to a wider legal vector type, and scalarizing is always a possibility as well. The main problem here is what to do with address calculations, and alignment.

With a vector select, one could potentially insert "safe" values into the unaffected vector elements on fixed-VL architectures like SSE. This would allow us to do NPOT operations on vector hardware that requires POT vector lengths without scalarizing.

-Dave
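[Editor's note: Dave's "safe values" trick for trapping operations can be sketched like this. A <3 x i32> division is widened to 4 lanes, with a divisor of 1 placed in the dead lane so the widened op cannot fault; the typedef and function name are illustrative.]

```c
typedef int v4i __attribute__((vector_size(16)));

/* NPOT <3 x i32> division on 4-wide hardware: fill the unused lane
   with a safe divisor so the widened divide cannot trap, regardless
   of what garbage sits in lane 3 of b. */
v4i div3(v4i a, v4i b) {
    v4i sa = { a[0], a[1], a[2], 0 };
    v4i sb = { b[0], b[1], b[2], 1 };   /* lane 3: safe value */
    return sa / sb;
}
```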
On 23-Jul-08, at 12:15 PM, David Greene wrote:
> On Monday 21 July 2008 15:21, Stefanus Du Toit wrote:
>> We would like to extend the vector operations in llvm a bit. We're
>> hoping to get some feedback on the right way to go, or some starting
>> points. I had previously had some discussion on this list about a
>> subset of the changes we have in mind.
>
> Woohoo! We've been interested in talking about this for some time.

Glad to hear it!

>> ==
>> 1) Vector shl, lshr, ashr
>>
>> I think these are no-brainers. We would like to extend the semantics
>> of the shifting instructions to naturally apply to vectors as well.
>> One issue is that these operations often only support a single shift
>> amount for an entire vector. I assume it should be fairly
>
> So you're assuming a shift of a vector by a scalar? What about the
> general vector-by-vector version?

No, I am proposing a general vector-by-vector version; I'm just pointing out that many architectures do not support this natively, but that this should be a simple matter of legalization.

>> 2) Vector trunc, sext, zext, fptrunc and fpext
>>
>> Again, I think these are hopefully straightforward. Please let me
>> know if you expect any issues with vector operations that change
>> element sizes from the RHS to the LHS, e.g. around legalization.
>
> Is the assumption that all elements are changed in the same way?

Since these only depend on the types of the source and destination operands, and vectors are homogeneous, yes. The intention is that for any of these conversion ops,

  %A = op <N x S> %B to <N x T>

is semantically equivalent to this pseudo-IR:

  %A = undef <N x T>
  for i = 0 .. N - 1:
    %t1 = extractelement <N x S> %B, i
    %t2 = op S %t1 to T
    %A = insertelement <N x T> %A, %t2, i

>> 4) Vector select
>>
>> We consider a vector select extremely important for a number of
>> operations.
>> This would be an extension of select to support an <N x i1> vector
>> mask to select between elements of <N x T> vectors for some basic
>> type T. Vector min, max, sign, etc. can be built on top of this
>> operation.
>
> Yes, merge/blend is a very important operation. Also, it would be
> nice to think about generalizing this to apply masks to all vector
> operations, particularly loads and stores.

Can you elaborate on what you mean by this? I've built vector IRs in the past that have writemasking effectively built into all IR ops, and I don't know if I would recommend this approach generally. But I'm not sure that's what you mean.

>> I realize this is probably the most controversial change amongst
>> these. I gather there is some concern about representing "variable
>> width" i1s, but I would contend that that's the case even for i1s
>> which are not vectors.
>
> What do you mean by "variable width?"

At some point the i1 vectors may need to be converted to masks. The sizes for these masks will often depend on how they're used. On ISAs like SSE and CellSPU, a comparison generates a mask corresponding to the size of the operands being compared, and a selection requires a mask corresponding to the size of the operands being selected. The "simple" way to deal with this is to insert appropriate conversion code at uses of i1s, and pick a representation for a given i1 based on its SSA birthpoint. It's a little more ambiguous when you start adding phis into the mix, e.g.:

a:
  %a1 = fcmp olt <2 x float> %f1, %f2    ; yields <2 x i1>
  br label %c
b:
  %a2 = fcmp olt <2 x double> %d1, %d2   ; yields <2 x i1>
  br label %c
c:
  %a3 = phi <2 x i1> [%a1, %a], [%a2, %b]
  select <2 x i1> %a3, %s1, %s2          ; where s1, s2 are <2 x i16>
  select <2 x i1> %a3, %c1, %c2          ; where c1, c2 are <2 x i8>

The representation for %a1 may be <2 x i32>, and for %a2 <2 x i64>, but for %a3 it's less clear. I don't think this is a huge problem, but it's something to be aware of.
Note the same issue can affect even scalar i1 values. In fact, Scott Michel posted to llvm-dev recently about dealing with this issue for setcc in the CellSPU backend.

>> that only POT sizes are expected to work (although we've found some
>> bugs for things like <2 x float> that we're submitting). Ideally
>> things like <3 x float> operands would usually be rounded up to the
>> size supported by the machine directly. We can try to do this in the
>
> You might need mask support as well, especially if the operation can
> trap.

Hmm. Yes, divisions by zero etc. are something we should think about.

>> Please let me know what you think,
>
> Let's connect at the dev meeting along with others interested in this
> stuff and start thinking about how to proceed.

Sounds good.

--
Stefanus Du Toit <stefanus.dutoit at rapidmind.com>
RapidMind Inc.
phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463