Duncan,

I had a few thoughts regarding our short discussion yesterday.

I am not sure how we can lower SEXT into the vpmovsx family of instructions. I propose the following strategy for the ZEXT and ANYEXT family of functions. First, we let the Type Legalizer/VectorOpLegalizer scalarize the code. Next, we allow the dag-combiner to convert the BUILD_VECTOR node into a shuffle. This is possible because all of the inputs of the build vector come from two values (src and either undef or zero). Finally, the shuffle lowering code lowers the new shuffle node into UNPCKLPS. This sequence should be optimal for all of the sane types.

Once we implement ZEXT and ANYEXT we could issue an INREG_SEXT instruction to support SEXT. Unfortunately, v2i64 SRA is not supported by the hardware and the code will be scalarized ...

Currently we promote vector elements to the widest possible type, until we hit the _first_ legal register type. For AVX, where YMM registers extend XMM registers, it is not clear to me why we stop at XMM sized registers. In some cases, masks of type <4 x i1> are legalized to <4 x i32> in XMM registers even if they are the result of a vector-compare of <4 x i64> types.

I also had a second observation, which contradicts the first one. In many cases we 'over promote'. Consider the <2 x i32> type. Promoting the elements to <2 x i64> forces us to use types which are not supported by the instruction set. For example, not all of the shift operations are implemented for vector i64 types.

Maybe a different strategy would be to promote vector elements up to i32, which is the common element type for most processors, and widen the vector from this point onwards. I am not sure how we can implement vector compare/select with this approach.

Thanks,
Nadav

> nadav: in my experience a lot of trouble comes from this kind of thing: there is an x86 instruction that takes the first two elements of <4 x i32>,
> extends them from i32 to i64, and returns <2 x i64>
> ^ all one instruction
> how to represent that in LLVM IR? in LLVM IR it ends up as two IR instructions
> first a shuffle that extracts <2 x i32> from <4 x i32> then some kind of extension from <2 x i32> to <2 x i64>
> currently codegen doesn't do anything sensible with either of these two, let alone realize that together they correspond to a single processor instruction
> nadav: anyway, what I'm saying is that a bunch of extensions seen in the IR/SDag may be due to this kind of thing
> it certainly happens all the time with IR coming from the gcc vectorizers
> we need to somehow turn the multiple nodes into one processor instruction
> in fact this is pretty much the only way you can get extending casts of vectors with IR coming from the GCC vectorizer

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
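
The ZEXT-as-shuffle idea and the shift-based SEXT idea from the message above can be sketched at the lane level. This is only an illustrative Python model, not LLVM code: the function names (`shuffle`, `zext_via_shuffle`, `sext_inreg`) are made up for the example, and it assumes little-endian lane order and a <4 x i16> source held in the low half of an 8-lane register. The message names UNPCKLPS; the model uses the same unpack-low interleave pattern on 16-bit lanes.

```python
# Illustrative lane-level model only; none of these names exist in LLVM.
# Assumption: little-endian lanes, <4 x i16> extended to <4 x i32>.

def shuffle(a, b, mask):
    """Two-input vector shuffle: indices >= len(a) select from b."""
    n = len(a)
    return [a[i] if i < n else b[i - n] for i in mask]

def zext_via_shuffle(src):
    """Zero-extend the low four i16 lanes of src to four i32 lanes."""
    zero = [0] * 8
    # Unpack-low pattern: src0, zero0, src1, zero1, ...
    mask = [0, 8, 1, 9, 2, 10, 3, 11]
    lanes16 = shuffle(src, zero, mask)
    # Reinterpret each pair of i16 lanes as one little-endian i32 lane.
    return [lanes16[2 * i] | (lanes16[2 * i + 1] << 16) for i in range(4)]

def sext_inreg(lanes, from_bits, to_bits):
    """Sign-extend in register: shift left, then arithmetic shift right."""
    shift = to_bits - from_bits
    full = (1 << to_bits) - 1
    out = []
    for x in lanes:
        shifted = (x << shift) & full
        if shifted >> (to_bits - 1):      # treat as a signed to_bits value
            shifted -= 1 << to_bits
        out.append((shifted >> shift) & full)  # Python >> is arithmetic
    return out

src = [1, 0xFFFF, 0x8000, 7, 0, 0, 0, 0]
zext = zext_via_shuffle(src)
sext = sext_inreg(zext, 16, 32)
```

The second step is where the v2i64 problem mentioned above shows up: `sext_inreg` needs an arithmetic right shift at the destination element width, so it only maps to vector code when the target has that shift for the widened type.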

Hi Nadav,

> I had a few thoughts regarding our short discussion yesterday.
>
> I am not sure how we can lower SEXT into the vpmovsx family of instructions. I propose the following strategy for the ZEXT and ANYEXT family of functions.

what I would like to understand first is why there are any vector xEXT nodes at all! As I tried to explain on IRC, I don't think you ever get these from the GCC autovectorizer except as part of a shuffle-extend pair. Where do you get these nodes from? Does the intel auto-vectorizer produce them more often than the GCC one?

Ciao, Duncan.
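
The shuffle-extend pair Duncan describes, and the wish to fold it into one node, can be sketched with a toy expression tree. This is a hedged illustration only: the node names (`extract_low`, `sext`, `sext_low`) and the `combine`/`evaluate` helpers are hypothetical and do not correspond to SelectionDAG opcodes or APIs. The fused `sext_low` node stands in for a single instruction, such as a member of the vpmovsx family, that sign-extends the low two i32 lanes of a <4 x i32> to <2 x i64>.

```python
# Toy model: a node is either a leaf name (str) or a tuple (op, operand).
# All op names here are hypothetical, chosen for the illustration.

def sext32to64(x):
    """Sign-extend a 32-bit pattern to a 64-bit pattern."""
    if x & 0x80000000:
        x -= 1 << 32
    return x & ((1 << 64) - 1)

def combine(node):
    """Fold sext(extract_low(x)) into one sext_low(x) node."""
    if not isinstance(node, tuple):
        return node
    node = tuple(combine(n) if isinstance(n, tuple) else n for n in node)
    if (node[0] == "sext" and isinstance(node[1], tuple)
            and node[1][0] == "extract_low"):
        return ("sext_low", node[1][1])
    return node

def evaluate(node, env):
    """Evaluate a tree over lists of lanes."""
    if isinstance(node, str):
        return env[node]
    op, arg = node
    v = evaluate(arg, env)
    if op == "extract_low":
        return v[: len(v) // 2]
    if op == "sext":
        return [sext32to64(x) for x in v]
    if op == "sext_low":
        return [sext32to64(x) for x in v[: len(v) // 2]]
    raise ValueError(op)

dag = ("sext", ("extract_low", "x"))
env = {"x": [0xFFFFFFFF, 5, 9, 9]}
fused = combine(dag)
```

Evaluating `dag` and `fused` with the same `env` gives the same lanes, which is the property a real combine would have to preserve when replacing the two nodes with one.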

We generate xEXT nodes in many cases. Unlike GCC, which vectorizes inner loops, we vectorize the implicit outermost loop of data-parallel workloads (also called whole function vectorization). We vectorize code even if the user uses xEXT instructions, uses mixed types, etc. We choose a vectorization factor which is likely to generate more legal vector types, but if the user mixes types then we are forced to make a decision. We rely on the LLVM code generator to produce quality code. To my understanding, the GCC vectorizer does not vectorize code if it thinks that even a single operation is missing.

-----Original Message-----
From: Duncan Sands [mailto:baldrick at free.fr]
Sent: Wednesday, February 08, 2012 10:36
To: Rotem, Nadav
Cc: llvmdev at cs.uiuc.edu
Subject: Re: SelectionDAG scalarizes vector operations.