thr3ads.net - llvm dev - [LLVMdev] SelectionDAG scalarizes vector operations. [Feb 2012]


Rotem, Nadav

2012-Feb-08 08:27 UTC

[LLVMdev] SelectionDAG scalarizes vector operations.

Duncan, 

I had a few thoughts regarding our short discussion yesterday. 

I am not sure how we can lower SEXT into the vpmovsx family of instructions.
For the ZEXT and ANYEXT nodes I propose the following strategy. First, we let
the Type Legalizer/VectorOpLegalizer scalarize the code. Next, we allow the
dag-combiner to convert the resulting BUILD_VECTOR node into a shuffle. This is
possible because all of the inputs of the build vector come from two values
(src and either undef or zero). Finally, the shuffle lowering code lowers the
new shuffle node into UNPCKLPS. This sequence should be optimal for all of the
sane types. Once we implement ZEXT and ANYEXT, we could issue a
SIGN_EXTEND_INREG node to support SEXT. Unfortunately, v2i64 SRA is not
supported by the hardware and the code will be scalarized ...
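
A minimal sketch of the lane arithmetic behind this proposal, written as plain
Python rather than LLVM code. The helper names (unpack_low, as_i64_lanes,
zext_low_2xi32, sext_inreg_32_in_64) are hypothetical, lanes are modeled as
unsigned integers in little-endian lane order, and the i32 to i64 case is only
one illustrative choice of types:

```python
MASK64 = (1 << 64) - 1

def unpack_low(a, b):
    """Interleave the low halves of two equal-length lane lists (UNPCKL-style)."""
    half = len(a) // 2
    out = []
    for x, y in zip(a[:half], b[:half]):
        out += [x, y]
    return out

def as_i64_lanes(lanes32):
    """Reinterpret consecutive pairs of 32-bit lanes as 64-bit lanes."""
    return [lo | (hi << 32) for lo, hi in zip(lanes32[0::2], lanes32[1::2])]

def zext_low_2xi32(src):
    """Zero-extend lanes 0 and 1 of a <4 x i32> by unpacking against zero."""
    return as_i64_lanes(unpack_low(src, [0, 0, 0, 0]))

def sext_inreg_32_in_64(lanes64):
    """Sign-extend the low 32 bits of each 64-bit lane: shl 32, then sra 32."""
    out = []
    for v in lanes64:
        shifted = (v << 32) & MASK64
        # Interpret as signed so that >> behaves as an arithmetic shift.
        signed = shifted - (1 << 64) if shifted >> 63 else shifted
        out.append((signed >> 32) & MASK64)
    return out
```

The last helper needs a 64-bit arithmetic shift right in every lane, which is
exactly the v2i64 SRA that is missing from the hardware.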

Currently we promote vector elements to wider types until we hit the _first_
legal register type. For AVX, where YMM registers extend XMM registers, it is
not clear to me why we stop at XMM-sized registers. In some cases, masks of
type <4 x i1> are legalized to <4 x i32> in XMM registers even if they are the
result of a vector compare of <4 x i64> types. I also had a second
observation, which contradicts the first one. In many cases we 'over promote'.
Consider the <2 x i32> type. Promoting the elements to <2 x i64> makes us use
types which are not fully supported by the instruction set. For example, not
all of the shift operations are implemented for vector i64 types. Maybe a
different strategy would be to promote vector elements up to i32, which is the
common element type for most processors, and widen the vector from this point
onwards. I am not sure how we can implement vector compare/select with this
approach.
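
To make the two strategies concrete, here is a toy model in Python. It is not
the LLVM type legalizer: the LEGAL set (128-bit XMM integer types only) and
the function names promote_elements and promote_then_widen are assumptions
made for illustration, and None simply means "some other legalization action
would be needed".

```python
# (lanes, element bits) pairs assumed legal: 128-bit XMM integer vectors only.
LEGAL = {(16, 8), (8, 16), (4, 32), (2, 64)}
WIDTHS = [8, 16, 32, 64]

def promote_elements(lanes, bits):
    """Widen the element type until the first legal vector type is reached."""
    for w in WIDTHS:
        if w >= bits and (lanes, w) in LEGAL:
            return (lanes, w)
    return None

def promote_then_widen(lanes, bits, cap=32):
    """Promote elements to at most `cap` bits, then add lanes until legal."""
    if (lanes, bits) in LEGAL:
        return (lanes, bits)
    w = max(bits, cap)
    n = lanes
    while n * w <= 128:
        if (n, w) in LEGAL:
            return (n, w)
        n *= 2
    return None

print(promote_elements(2, 32))    # element promotion picks <2 x i64>
print(promote_then_widen(2, 32))  # widening picks <4 x i32> instead
```

In this model <4 x i1> lands on <4 x i32> under both strategies, so the model
says nothing about the compare/select question.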

Thanks,
Nadav

> nadav: in my experience a lot of trouble comes from this kind of thing:
> there is an x86 instruction that takes the first two elements of <4 x i32>,
> extends them from i32 to i64, and returns <2 x i64>
> ^ all one instruction
> how to represent that in LLVM IR? in LLVM IR it ends up as two IR
> instructions
> first a shuffle that extracts <2 x i32> from <4 x i32> then some kind of
> extension from <2 x i32> to <2 x i64>
> currently codegen doesn't do anything sensible with either of these two,
> let alone realize that together they correspond to a single processor
> instruction
> nadav: anyway, what I'm saying is that a bunch of extensions seen in the
> IR/SDag may be due to this kind of thing
> it certainly happens all the time with IR coming from the gcc vectorizers
> we need to somehow turn the multiple nodes into one processor instruction
> in fact this is pretty much the only way you can get extending casts of
> vectors with IR coming from the GCC vectorizer

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Duncan Sands

2012-Feb-08 08:36 UTC

[LLVMdev] SelectionDAG scalarizes vector operations.

Hi Nadav,
> I had a few thoughts regarding our short discussion yesterday.
>
> I am not sure how we can lower SEXT into the vpmovsx family of
> instructions. I propose the following strategy for the ZEXT and ANYEXT
> family of functions.

What I would like to understand first is why there are any vector xEXT nodes
at all!  As I tried to explain on IRC, I don't think you ever get these from
the GCC autovectorizer except as part of a shuffle-extend pair.  Where do you
get these nodes from?  Does the Intel auto-vectorizer produce them more often
than the GCC one?
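
A small Python model of the shuffle-extend pair, as a sketch only. The names
shufflevector, sext, two_node_form and pmovsxdq_model are hypothetical
stand-ins that mimic the semantics of the IR operations and of the SSE4.1
PMOVSXDQ instruction on lists of unsigned lane values; this is not LLVM code.

```python
MASK64 = (1 << 64) - 1

def shufflevector(a, b, mask):
    """Pick lanes from the concatenation of a and b, as the IR shuffle does."""
    both = a + b
    return [both[i] for i in mask]

def sext(lanes, from_bits, to_bits):
    """Sign-extend every lane from from_bits to to_bits."""
    out = []
    for v in lanes:
        if v >> (from_bits - 1):      # sign bit set
            v -= 1 << from_bits
        out.append(v & ((1 << to_bits) - 1))
    return out

def two_node_form(v4i32):
    """What the IR contains: extract <2 x i32>, then extend to <2 x i64>."""
    low = shufflevector(v4i32, v4i32, [0, 1])
    return sext(low, 32, 64)

def pmovsxdq_model(v4i32):
    """What one instruction does: sign-extend the low two i32 lanes to i64."""
    return [(v - (1 << 32) if v & 0x80000000 else v) & MASK64
            for v in v4i32[:2]]

v = [5, 0xFFFFFFFE, 123, 456]
assert two_node_form(v) == pmovsxdq_model(v)
```

Both functions compute the same result, so a combine that recognizes the pair
could in principle select the single instruction.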

Ciao, Duncan.


Rotem, Nadav

2012-Feb-08 09:02 UTC

[LLVMdev] SelectionDAG scalarizes vector operations.

We generate xEXT nodes in many cases.  Unlike GCC, which vectorizes inner loops,
we vectorize the implicit outermost loop of data-parallel workloads (also called
whole function vectorization).  We vectorize code even if the user uses xEXT
instructions, mixes types, etc.  We choose a vectorization factor which is
likely to generate more legal vector types, but if the user mixes types then we
are forced to make a decision.  We rely on the LLVM code generator to produce
quality code.  As I understand it, the GCC vectorizer does not vectorize code
if it finds even a single operation that it cannot handle.
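
As a rough illustration of how a mixed-type kernel turns into a vector xEXT,
here is a Python sketch. The kernel, the vectorization factor VF = 4 and all
function names are made up for this example; values are modeled as unsigned
bit patterns.

```python
VF = 4                      # assumed vectorization factor
MASK64 = (1 << 64) - 1

def sext32(v):
    """Interpret a 32-bit pattern as a signed value (scalar i32 -> i64 sext)."""
    return v - (1 << 32) if v & 0x80000000 else v

def scalar_kernel(a, b, i):
    """One work-item: the user widens i32 inputs to i64 before multiplying."""
    return (sext32(a[i]) * sext32(b[i])) & MASK64

def vector_sext(lanes):
    """The scalar sext becomes one vector sext covering VF work-items."""
    return [sext32(v) for v in lanes]

def vector_kernel(a, b, base):
    """VF consecutive work-items executed as a single vectorized body."""
    va = vector_sext(a[base:base + VF])   # sext <4 x i32> to <4 x i64>
    vb = vector_sext(b[base:base + VF])
    return [(x * y) & MASK64 for x, y in zip(va, vb)]

a = [1, 0xFFFFFFFF, 2, 0x80000000]
b = [3, 2, 0xFFFFFFFE, 1]
assert vector_kernel(a, b, 0) == [scalar_kernel(a, b, i) for i in range(VF)]
```

With VF = 4 the i32 inputs fit a legal <4 x i32>, but the extended value is a
<4 x i64>, which no longer fits an XMM register.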


