thr3ads.net - llvm dev - [LLVMdev] Generalizing shuffle vector [Sep 2008]

If this information is useful, please help other people find it:
Share via:

Mon Ping Wang

2008-Sep-30 03:11 UTC

[LLVMdev] Generalizing shuffle vector

Hi,

The current definition of shuffle vector is
      <result> = shufflevector <n x <ty>> <v1>, <n x
<ty>> <v2>, <n x
i32> <mask>    ; yields <n x <ty>>

The first two operands of a 'shufflevector' instruction are vectors  
with types that match each other and types that match the result of  
the instruction. The third argument is a shuffle mask, which has the  
same number of elements as the other vector type, but whose element  
type is always 'i32'. The shuffle mask operand is required to be a  
constant vector with either constant integer or undef values.

I am proposing to extend the shuffle vector definition to be
   <result> = shufflevector <n x <ty>> <v1>, <n x
<ty>> <v2>, <m x
i32> <mask>    ; yields <m x <ty>>

The third argument is the shuffle mask and it is not required to have  
the same number of elements as the other vector type but the element  
type is still always 'i32'

The reason for this change is to allow us to handle vectors much more  
naturally,  allows the generated IR to reflect the structure of the  
user code, and most importantly do a better job in code generation.   
Today, building a vector for different lengths requires a sequence of  
extracting each element in the vector and inserting each element into  
a new vector.  With this new form, it is more straightforward to write  
and reason about

typedef __attribute__(( ext_vector_type(4) )) float float4;
typedef __attribute__(( ext_vector_type(8) )) float float8;
float8 f8;
float4 f4a, f4b, f4c;
f4a = f8.hi;
f8.hi = f4b; f8.lo = f4c;

where hi and lo represent the high half and low half of the vector.   
The outgoing IR is
   %f4a = shufflevector <8xf32>%f8, undef, <4xi32> <0, 1, 2,
3>
   %f8 = shufflevector <4xf32>%f4b, <4xf32>%f4c, <8xi32>
<0, 1, 2, 3,
4, 5, 6, 7>

The problem with generating insert and extracts is that we can  
generate poor code
        %tmp16 = extractelement <4 x float> %f4b, i32 0
        %f8a = insertelement <8 x float> %f8a, float %tmp16, i32 0
        %tmp18 = extractelement <4 x float> %f4b, i32 1
        %f8c = insertelement <8 x float> %f8b, float %tmp18, i32 1
       ...
For X86, legalize will convert each insertelement to become a vector  
shuffle.  We are very careful in combining vector shuffles because we  
don't want to produce a vector shuffle whose mask is illegal or hard  
to code gen so we end up in this code to generate a sequence of unpcks  
and movhlps for this.  With the new form, Legalize will divide the  
8xf32 vector into two 4xf32 and since the two sides are the same, it  
will generate quad word moves to copy the values.

There are other cases when a user write vector code, the generation of  
extract element and insert elements will cause us to lose the  
structure of the original vector code and without this structure, the  
code generator will not generate code for it.

Please let me know if you have any comments, suggestions, or concerns,

Thanks,
   -- Mon Ping



-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20080929/f8d1e9ce/attachment.html>

Stefanus Du Toit

2008-Sep-30 13:44 UTC

head link

[LLVMdev] Generalizing shuffle vector

Hi Mon Ping,

Generalizing shufflevector would be great. I have an additional  
suggestion below.

On 29-Sep-08, at 11:11 PM, Mon Ping Wang wrote:> I am proposing to extend the shuffle vector definition to be
> <result> = shufflevector <n x <ty>> <v1>, <n x
<ty>> <v2>, <m x i32>
> <mask> ; yields <m x <ty>>
>
> The third argument is the shuffle mask and it is not required to  
> have the same number of elements as the other vector type but the  
> element type is still always 'i32'
I've previously proposed an even further generalization of  
shufflevector to:

<result> = shufflevector <n1 x <ty>> <v1>, <n2 x
<ty>> <v2>, <m x i32>
<mask> ; yields <m x <ty>>

where n1 does not have the be the same as n2. This makes shufflevector  
completely generic, and in fact allows even insertelement and  
extractelement to be represented using it (not that I necessarily  
think they should be, it's just a nice side effect).

If this is feasible, it would be nice to extend it all the way. This  
lets you do things like:

float3 x;
float4 y;

  // ...

y.xyz = x;

as a single shufflevector, e.g.:

   %y2 = shufflevector <4xf32> %y1, <3xf32> %x, <4, 5, 6, 3>

I assume my proposed generalization can't hurt codegen, since it could  
always be turned into a sequence of insert and extracts which would  
provide the same behaviour as today.

--
Stefanus Du Toit <stefanus.dutoit at rapidmind.com>
   RapidMind Inc.
   phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463

Evan Cheng

2008-Sep-30 17:36 UTC

head link

[LLVMdev] Generalizing shuffle vector

I agree further generalization seems like a very good idea. But I'd  
like to see what Mon Ping proposed implemented first so we have a  
better idea of the implementation cost.

Thanks,

Evan

On Sep 30, 2008, at 6:44 AM, Stefanus Du Toit wrote:
> Hi Mon Ping,
>
> Generalizing shufflevector would be great. I have an additional
> suggestion below.
>
> On 29-Sep-08, at 11:11 PM, Mon Ping Wang wrote:
>> I am proposing to extend the shuffle vector definition to be
>> <result> = shufflevector <n x <ty>> <v1>, <n
x <ty>> <v2>, <m x i32>
>> <mask> ; yields <m x <ty>>
>>
>> The third argument is the shuffle mask and it is not required to
>> have the same number of elements as the other vector type but the
>> element type is still always 'i32'
>
> I've previously proposed an even further generalization of
> shufflevector to:
>
> <result> = shufflevector <n1 x <ty>> <v1>, <n2 x
<ty>> <v2>, <m x i32>
> <mask> ; yields <m x <ty>>
>
> where n1 does not have the be the same as n2. This makes shufflevector
> completely generic, and in fact allows even insertelement and
> extractelement to be represented using it (not that I necessarily
> think they should be, it's just a nice side effect).
>
> If this is feasible, it would be nice to extend it all the way. This
> lets you do things like:
>
> float3 x;
> float4 y;
>
>  // ...
>
> y.xyz = x;
>
> as a single shufflevector, e.g.:
>
>   %y2 = shufflevector <4xf32> %y1, <3xf32> %x, <4, 5, 6,
3>
>
> I assume my proposed generalization can't hurt codegen, since it could
> always be turned into a sequence of insert and extracts which would
> provide the same behaviour as today.
>
> --
> Stefanus Du Toit <stefanus.dutoit at rapidmind.com>
>   RapidMind Inc.
>   phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Eli Friedman

2008-Sep-30 20:24 UTC

head link

[LLVMdev] Generalizing shuffle vector

On Mon, Sep 29, 2008 at 8:11 PM, Mon Ping Wang <wangmp at apple.com>
wrote:> The problem with generating insert and extracts is that we can generate
poor
> code
>        %tmp16 = extractelement <4 x float> %f4b, i32 0
>        %f8a = insertelement <8 x float> %f8a, float %tmp16, i32 0
>        %tmp18 = extractelement <4 x float> %f4b, i32 1
>        %f8c = insertelement <8 x float> %f8b, float %tmp18, i32 1
>       ...
> For X86, legalize will convert each insertelement to become a vector
> shuffle.  We are very careful in combining vector shuffles because we
don't
> want to produce a vector shuffle whose mask is illegal or hard to code gen
> so we end up in this code to generate a sequence of unpcks and movhlps for
> this.  With the new form, Legalize will divide the 8xf32 vector into two
> 4xf32 and since the two sides are the same, it will generate quad word
moves
> to copy the values.
I think this specific issue can be fixed without extending the
IL-level syntax; DAGCombiner could easily be made a lot more clever
about cases like this.  For example, before legalization, we can
transform an INSERT_VECTOR_ELT inserting an element into a constant
vector or a SCALAR_TO_VECTOR into a BUILD_VECTOR, and we can transform
BUILD_VECTOR into CONCAT_VECTORS or EXTRACT_SUBVECTOR for relevant
cases.  We can also make the lowering significantly more clever about
dealing with insertelement.

If what we currently have isn't sufficient, I think the first step is
to extend VECTOR_SHUFFLE to be more flexible, and implement the
relevant combines to make that effective.  I don't think it makes
sense to add this extension to the IL before it's proven that we can
do CodeGen on it reasonably.

I'm not necessarily against this extension, but since we're talking
about constructs which can already represented in the IL without
losing too much semantic information, we really want to be going about
this from the bottom up.  It's going to be hard to properly see the
concrete benefits without a solid baseline to compare against.

-Eli

Mon Ping Wang

2008-Sep-30 23:37 UTC

head link

[LLVMdev] Generalizing shuffle vector

Hi Eli,

I understand your concern and your question is a good one.  In this  
specific example, you are right about being more clever in  
DAGCombiner.  I did start to go down this road and implemented  
something similar to what you suggested and it does work reasonably  
well for that specific example.  There are other cases though that  
this approach doesn't work well where by generating insert/extracts,  
we lose the structure of the vector program.  Let me see if I can add  
some more motivation in favor of this change.  Consider the following  
case where we are doing a transpose by dividing a 16 wide vector into  
four 4 wide vectors.

typedef __attribute__(( ext_vector_type(4) )) float float4;
typedef __attribute__(( ext_vector_type(16) )) float float16;
float16 f16;
float4 f4a, f4b, f4c, f4d, t;
f4a = f16.hi.hi; f4b = f16.hi.lo; f4c = f16.lo.hi; f4d = f16.hi.hi;
f4a = __builtin_ia32_unpcklps( f4a, f4c );
t  = __builtin_ia32_unpckhps( t,  f4c );
f4c = f4b;
f4a = __builtin_ia32_unpcklps( f4a, f4d );
f4b = __builtin_ia32_unpckhps( f4b, f4d );
f16.lo.lo = __builtin_ia32_unpcklps( f4a, f4b );
f16.lo.hi = __builtin_ia32_unpckhps( f4a, f4b );
f16.hi.lo = __builtin_ia32_unpcklps( t,  f4c );
f16.hi.hi = __builtin_ia32_unpckhps( t,  f4c );

All this code is that if you have <0, 1, 2, 3, 4..., 15>, you will get  
<0, 4, 8, 12, 1, 5, 9, 13, 2 ..., 15>.  Each of builtins maps to a  
vector shuffle.  After opt is run, we end up with a extract of each  
element and an insert into the correct location.  It is difficult for  
the code generator to rebuild the correct series of shuffles as it  
requires that doing shuffles for one float4 of the result, you are  
also doing some work for another float4.  It is possible to retain the  
structure of the vectors by preventing opt from optimizing away any  
shuffle operations but we might miss some optimizations opportunities  
when the code is really extracting a few elements and using it in a  
scalar context.

 From a different perspective, the insert and extracts are breaking up  
vectors by scalarizing it elements and then in code generation, we are  
rebuilding the vectors again and their operation.  Assuming that the  
vector code is well written, it will be advantageous for us to retain  
the structure of the incoming program and optimize and change it when  
we know we can do better than that.  With insert and extract, it is  
easier for us to lose the structure of the incoming vector as we  
optimize it. It just seems cleaner to me that when we manipulate  
vectors that they can remain vectors.

   -- Mon Ping

On Sep 30, 2008, at 1:24 PM, Eli Friedman wrote:
> On Mon, Sep 29, 2008 at 8:11 PM, Mon Ping Wang <wangmp at apple.com>
> wrote:
>> The problem with generating insert and extracts is that we can  
>> generate poor
>> code
>>       %tmp16 = extractelement <4 x float> %f4b, i32 0
>>       %f8a = insertelement <8 x float> %f8a, float %tmp16, i32 0
>>       %tmp18 = extractelement <4 x float> %f4b, i32 1
>>       %f8c = insertelement <8 x float> %f8b, float %tmp18, i32 1
>>      ...
>> For X86, legalize will convert each insertelement to become a vector
>> shuffle.  We are very careful in combining vector shuffles because  
>> we don't
>> want to produce a vector shuffle whose mask is illegal or hard to  
>> code gen
>> so we end up in this code to generate a sequence of unpcks and  
>> movhlps for
>> this.  With the new form, Legalize will divide the 8xf32 vector  
>> into two
>> 4xf32 and since the two sides are the same, it will generate quad  
>> word moves
>> to copy the values.
>
> I think this specific issue can be fixed without extending the
> IL-level syntax; DAGCombiner could easily be made a lot more clever
> about cases like this.  For example, before legalization, we can
> transform an INSERT_VECTOR_ELT inserting an element into a constant
> vector or a SCALAR_TO_VECTOR into a BUILD_VECTOR, and we can transform
> BUILD_VECTOR into CONCAT_VECTORS or EXTRACT_SUBVECTOR for relevant
> cases.  We can also make the lowering significantly more clever about
> dealing with insertelement.
>
> If what we currently have isn't sufficient, I think the first step is
> to extend VECTOR_SHUFFLE to be more flexible, and implement the
> relevant combines to make that effective.  I don't think it makes
> sense to add this extension to the IL before it's proven that we can
> do CodeGen on it reasonably.
>
> I'm not necessarily against this extension, but since we're talking
> about constructs which can already represented in the IL without
> losing too much semantic information, we really want to be going about
> this from the bottom up.  It's going to be hard to properly see the
> concrete benefits without a solid baseline to compare against.
>
> -Eli
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Apparently Analagous Threads

Search for more seemingly similar threads

llvm dev - Sep 2008 - [LLVMdev] Generalizing shuffle vector

[LLVMdev] Generalizing shuffle vector

[LLVMdev] Generalizing shuffle vector

[LLVMdev] Generalizing shuffle vector

[LLVMdev] Generalizing shuffle vector

[LLVMdev] Generalizing shuffle vector

Apparently Analagous Threads