Hi,

I have some questions as to the definition of various vector instructions. In particular, I believe there are some gaps and inconsistencies in the vector instructions, and I'm interested in hearing whether you agree that these should be improved or whether there are other ways to solve these problems.

==
1. Shufflevector only accepts vectors of the same type

Shufflevector seems overly restrictive right now in terms of its type requirements. In particular, it requires (this could be clearer in the language reference, but is supported by equivalent assertions in the source code) that the types of the sources and destinations match exactly. Obviously it makes sense for the element types to match, but the requirement that the number of elements in each source operand matches the number of elements in the destination appears overly restrictive.

I would propose to change the syntax from:

> <result> = shufflevector <n x <ty>> <v1>, <n x <ty>> <v2>, <n x i32> <mask> ; yields <n x <ty>>

to:

> <result> = shufflevector <a x <ty>> <v1>, <b x <ty>> <v2>, <d x i32> <mask> ; yields <d x <ty>>

with the requirement that the entries in the (still constant) mask are within the range [0, a + b - 1].

This allows things like taking a <2 x i32> and duplicating it into a <4 x i32>. This is very useful for frontends that provide general vector functionality. I think this is more consistent with the relaxed rules on numbers of vector elements.

2. vector select

SIMD-within-a-register ISAs generally provide some form of efficient vector select capability. It would be very convenient to have a select of the form:

> <result> = select <n x i1> <cond>, <n x <ty>> <val1>, <n x <ty>> <val2> ; yields <n x <ty>>

The semantics would be to set each element in the result to the corresponding element in val1 or val2, depending on whether the corresponding element in cond is set or not. This would make the vector types more consistent with the regular types. It's possible to do this with masking, but that requires building masks of the appropriate size in the IR each time one wants to do this kind of selection, which is cumbersome and will have to be matched by codegen to be turned into something like the Cell SPU's "selb" instruction.

3. vector trunc, sext, zext, fptrunc, fpext

These seem like obvious gaps in the IR. Conversions between integer and floating-point vectors are supported, but conversions from one integer vector type to another (and likewise for float) are not. Attempting to do this by hand is extremely cumbersome, and may hurt both codegen and optimization opportunities.

4. vector shl, lshr, ashr

Again, these seem to be simply missing (compared to the other bitwise ops). Not having vector versions of shift right, and not having vector trunc, makes things like converting bitmasks of various sizes very difficult. Modern SIMD architectures have good support for bitshifts of vector types. There are of course two ways to interpret a shift on a vector type: within elements and across elements. To us, shifts within elements (i.e. the equivalent of N scalar shifts) are more important than shifts (or rotates!) across elements, but the latter could be useful in some situations as well.
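For concreteness, here is a rough sketch of what items 1 through 4 might look like in IR. None of these forms is accepted by LLVM today; the spellings just follow the proposals above, and the value names are invented for illustration:

  ; 1. generalized shufflevector: duplicate a <2 x i32> into a <4 x i32>
  ;    (a = 2, b = 2, d = 4; all mask entries within [0, a + b - 1])
  %wide = shufflevector <2 x i32> %v, <2 x i32> undef, <4 x i32> <i32 0, i32 1, i32 0, i32 1>  ; yields <4 x i32>

  ; 2. vector select driven by a vector of i1
  %sel = select <4 x i1> %cond, <4 x i32> %a, <4 x i32> %b   ; yields <4 x i32>

  ; 3. element-wise conversion between integer vector types of the same length
  %ext = zext <4 x i16> %s to <4 x i32>                      ; yields <4 x i32>

  ; 4. within-element shift (the equivalent of four scalar shifts)
  %shr = lshr <4 x i32> %x, <i32 1, i32 1, i32 1, i32 1>     ; yields <4 x i32>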
5. vfcmp, vicmp return types

This topic came up recently on llvm-dev (thread "VFCmp failing when unordered or UnsafeFPMath on x86"). Having vector compares return a vector of integer elements with bit width equal to the element types being compared seems unnecessarily inconsistent with the icmp and fcmp instructions. Since only one bit is well-defined anyway, why not just return a vector of i1 elements? If after codegen this is expanded into some other width, that's fine -- the backend can do this, since only one bit of information is exposed at the IR level anyway.

Having these instructions return vectors of i1 would also be nicely consistent with a vector select operation. Vector bit shifts and trunc would make this easier, but it seems to me that the entire IR would be far more consistent if these operations simply returned i1 vectors.
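To make item 5 concrete, the contrast is roughly the following. The first line reflects my reading of the current vicmp definition; the last two lines use the proposed i1 return type together with the vector select from item 2 and are not valid IR today (value names invented):

  ; current: the result elements are as wide as the operand elements,
  ; with only one bit per element well-defined
  %mask = vicmp eq <4 x i32> %a, %b                      ; yields <4 x i32>

  ; proposed: return a vector of i1 and feed it straight into a vector select
  %m = vicmp eq <4 x i32> %a, %b                         ; would yield <4 x i1>
  %r = select <4 x i1> %m, <4 x i32> %x, <4 x i32> %y    ; would yield <4 x i32>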
==

I am interested in hearing feedback on whether these ideas make sense, or if there's some good reason for things being the way they are now that I'm not seeing. I would also be interested, especially in the short term, in suggestions as to how to work around the lack of these operations.

Thanks!

Stefanus

--
Stefanus Du Toit <stefanus.dutoit at rapidmind.com>
RapidMind Inc.
phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463


On Jun 26, 2008, at 1:56 PM, Stefanus Du Toit wrote:

> Hi,
>
> I have some questions as to the definition of various vector instructions. In particular, I believe there are some gaps and inconsistencies in the vector instructions, and I'm interested in hearing whether you agree that these should be improved or whether there are other ways to solve these problems.
>
> ==
> 1. Shufflevector only accepts vectors of the same type
>
> Shufflevector seems overly restrictive right now in terms of its type requirements. In particular, it requires (this could be clearer in the language reference, but is supported by equivalent assertions in the source code) that the types of the sources and destinations match exactly. Obviously it makes sense for the element types to match, but the requirement that the number of elements in each source operand matches the number of elements in the destination appears overly restrictive.
>
> I would propose to change the syntax from:
>
>> <result> = shufflevector <n x <ty>> <v1>, <n x <ty>> <v2>, <n x i32> <mask> ; yields <n x <ty>>
>
> to:
>
>> <result> = shufflevector <a x <ty>> <v1>, <b x <ty>> <v2>, <d x i32> <mask> ; yields <d x <ty>>
>
> with the requirement that the entries in the (still constant) mask are within the range [0, a + b - 1].
>
> This allows things like taking a <2 x i32> and duplicating it into a <4 x i32>.
>
> This is very useful for frontends that provide general vector functionality. I think this is more consistent with the relaxed rules on numbers of vector elements.

The alternative is to have the frontend synthesize the needed operations with extracts, inserts, and possibly shuffles if needed. LLVM is actually fairly well prepared to optimize code like this. I recommend giving this a try, and reporting any problems you encounter.

> 2. vector select
> [...]
> 3. vector trunc, sext, zext, fptrunc, fpext
> [...]
> 4. vector shl, lshr, ashr
> [...]

We agree that these would be useful. There are intentions to add them to LLVM; others can say more.

> 5. vfcmp, vicmp return types
>
> This topic came up recently on llvm-dev (thread "VFCmp failing when unordered or UnsafeFPMath on x86"). Having vector compares return a vector of integer elements with bit width equal to the element types being compared seems unnecessarily inconsistent with the icmp and fcmp instructions. Since only one bit is well-defined anyway, why not just return a vector of i1 elements? If after codegen this is expanded into some other width, that's fine -- the backend can do this, since only one bit of information is exposed at the IR level anyway.
>
> Having these instructions return vectors of i1 would also be nicely consistent with a vector select operation. Vector bit shifts and trunc would make this easier, but it seems to me that the entire IR would be far more consistent if these operations simply returned i1 vectors.

It turns out that having them return vectors of i1 would be somewhat complicated. For example, a <4 x i1> on an SSE2 target could expand either to one <4 x i32> or to two <2 x i64>s, and the optimal thing would be to make the decision based on the comparison that produced them, but LLVM isn't yet equipped for that.
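To illustrate the coupling Dan describes, here is a hypothetical sketch (it assumes the proposed i1-returning compares, which do not exist today; value names invented):

  ; two producers of a <4 x i1> value on an SSE2 target:
  %m1 = vfcmp olt <4 x float> %a, %b     ; would yield <4 x i1>; a cmpps leaves one
                                         ; <4 x i32>-shaped mask in a single register
  %m2 = vfcmp olt <4 x double> %c, %d    ; would yield <4 x i1>; two cmppd ops leave
                                         ; two <2 x i64>-shaped masks in two registers
  ; a later consumer sees only a <4 x i1>, so picking the cheap in-register form
  ; requires knowing which comparison produced it.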
vicmp and vfcmp are very much aimed at solving practical problems on popular architectures without needing significant new infrastructure. They're relatively new, and as you say, they'll be more useful when combined with vector shifts and friends, and once we start teaching LLVM to recognize popular idioms with them.

One interesting point here is that if we ever do want to have vector comparisons that return vectors of i1, one way to do that would be to overload plain icmp and fcmp, which would be neatly consistent with the way plain mul and add are overloaded for vector types instead of having separate vadd and vmul instructions.

Dan
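As a concrete illustration of the extract/insert approach recommended above, here is one way to express the <2 x i32>-to-<4 x i32> duplication from item 1 in IR that is accepted today, leaving it to the optimizers and codegen to recognize it as a shuffle (function and value names invented):

define <4 x i32> @duplicate2to4(<2 x i32> %v) {
  ; pull out the two source elements
  %e0 = extractelement <2 x i32> %v, i32 0
  %e1 = extractelement <2 x i32> %v, i32 1
  ; build the widened vector, repeating the pair
  %t0 = insertelement <4 x i32> undef, i32 %e0, i32 0
  %t1 = insertelement <4 x i32> %t0, i32 %e1, i32 1
  %t2 = insertelement <4 x i32> %t1, i32 %e0, i32 2
  %t3 = insertelement <4 x i32> %t2, i32 %e1, i32 3
  ret <4 x i32> %t3
}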
Hi Dan,

Thanks for your comments. I've responded inline below.

On 26-Jun-08, at 6:49 PM, Dan Gohman wrote:

> On Jun 26, 2008, at 1:56 PM, Stefanus Du Toit wrote:
>
>> ==
>> 1. Shufflevector only accepts vectors of the same type
>>
>> I would propose to change the syntax from:
>>
>>> <result> = shufflevector <n x <ty>> <v1>, <n x <ty>> <v2>, <n x i32> <mask> ; yields <n x <ty>>
>>
>> to:
>>
>>> <result> = shufflevector <a x <ty>> <v1>, <b x <ty>> <v2>, <d x i32> <mask> ; yields <d x <ty>>
>>
>> with the requirement that the entries in the (still constant) mask are within the range [0, a + b - 1].
>
> The alternative is to have the frontend synthesize the needed operations with extracts, inserts, and possibly shuffles if needed. LLVM is actually fairly well prepared to optimize code like this. I recommend giving this a try, and reporting any problems you encounter.

That certainly appears to be the only option at the moment, and we'll have a look to see how that works out. However, note that a sufficiently generalized shufflevector would remove the need for insertelement and extractelement to exist at all.

>> 2. vector select
>> 3. vector trunc, sext, zext, fptrunc, fpext
>> 4. vector shl, lshr, ashr
> [...]
>
> We agree that these would be useful. There are intentions to add them to LLVM; others can say more.

OK. I'd love to hear more, especially if someone is planning to do this in the short term.

>> 5. vfcmp, vicmp return types
>>
>> This topic came up recently on llvm-dev (thread "VFCmp failing when unordered or UnsafeFPMath on x86"). Having vector compares return a vector of integer elements with bit width equal to the element types being compared seems unnecessarily inconsistent with the icmp and fcmp instructions. Since only one bit is well-defined anyway, why not just return a vector of i1 elements? If after codegen this is expanded into some other width, that's fine -- the backend can do this, since only one bit of information is exposed at the IR level anyway.
>>
>> Having these instructions return vectors of i1 would also be nicely consistent with a vector select operation. Vector bit shifts and trunc would make this easier, but it seems to me that the entire IR would be far more consistent if these operations simply returned i1 vectors.
>
> It turns out that having them return vectors of i1 would be somewhat complicated. For example, a <4 x i1> on an SSE2 target could expand either to one <4 x i32> or to two <2 x i64>s, and the optimal thing would be to make the decision based on the comparison that produced them, but LLVM isn't yet equipped for that.

Can you expand on this a bit? I'm guessing you're referring to specific mechanics in LLVM's code generation framework making this difficult to do?

From a theoretical point of view (keeping in mind current architectures) there can definitely be a need to change the representation at various points in the program. For example, a <2 x i1> generated from comparing some <2 x float> values could later be used to select between two <2 x double> values, and on most current architectures this would require a mask to be expanded.

There are three cases of instructions at which representations for i1 vectors must be chosen: instructions which generate vectors of i1 (e.g. vfcmp), instructions which consume them (e.g. a vector select), and instructions which do both (e.g. shufflevector, phi statements).

I would expect generating instructions to generally have only one fixed choice (since common ISAs generally create masks of the size of what's being compared), and consumers to generally dictate a representation, possibly necessitating the insertion of conversion code. For instructions that transform vectors of i1 in some way, there may be some additional opportunity to optimize away conversion code. I would also presume that standard optimizations such as a form of CSE or value numbering and code motion would help reduce the number of conversions.
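A hypothetical sketch of the <2 x float> / <2 x double> example above, again assuming the proposed i1-returning compare and vector select (not valid IR today; value names invented), with comments marking where the representation change would happen on most current architectures:

  %c = vfcmp olt <2 x float> %a, %b                          ; would yield <2 x i1>; the float
                                                             ; compare naturally produces a mask
                                                             ; with 32 bits per element
  %r = select <2 x i1> %c, <2 x double> %x, <2 x double> %y  ; the select needs 64 bits per
                                                             ; element, so codegen would have to
                                                             ; widen (sign-extend) the mask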
In fact, this issue of representing i1 vectors isn't even limited to vector types. Scalar code which stores i1 values in regular registers needs the same considerations (e.g. on the Cell SPU, where there is no condition register).

Perhaps all that was way off topic, and you're just referring to a simple mechanical issue preventing this right now. My point is just that the representation of booleans in registers is an important, architecture-specific issue to handle, and the frontend is (in my humble opinion) probably not the best place for this.

> vicmp and vfcmp are very much aimed at solving practical problems on popular architectures without needing significant new infrastructure. They're relatively new, and as you say, they'll be more useful when combined with vector shifts and friends, and once we start teaching LLVM to recognize popular idioms with them.

Can you give me examples of how they are used today at all? I'm having a really hard time figuring out a good use for them (one that doesn't involve effectively scalarizing them immediately) without these other instructions.

> One interesting point here is that if we ever do want to have vector comparisons that return vectors of i1, one way to do that would be to overload plain icmp and fcmp, which would be neatly consistent with the way plain mul and add are overloaded for vector types instead of having separate vadd and vmul instructions.

Agreed, this would be nice.

--
Stefanus Du Toit <stefanus.dutoit at rapidmind.com>
RapidMind Inc.
phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463