Saito, Hideki via llvm-dev
2016-Jun-29 21:43 UTC
[llvm-dev] Question about VectorLegalizer::ExpandStore() with v4i1
Rob, Ahmed, and Jingu,

[I'm sorry if my point of view is too x86-centric.]

>> the tricky part about fixing it is the need to settle on a memory layout for these vectors
>> (packed vs byte per i1; packed would be compatible with AVX512, I think).

I agree with Ahmed here, in principle. It's actually more than that, since a vector compare in AVX2 and below produces the same bit width per element as the compared data. For example, in mixed-data-type code, it isn't rare for the result of an integer vector compare (0/FFFFFFFF, not even 0/1) to be consumed by a double-precision blend (or compute), and vice versa, so a mask conversion between 32 bits per element and 64 bits per element has to happen. We need to minimize conversion between 0/1 logic and 0/-1 logic, and also conversion between different element sizes. Doing so for AVX2 and below is challenging enough.

The introduction of AVX512F in Xeon Phi added another challenge for vectorizer developers. The addition of AVX512BW and VL should make it easier. Without AVX512BW and VL (i.e., on all of today's x86 targets), the optimal representation of a compare result is determined by how it is consumed, and it is not a good idea to have such an optimization in multiple different places. If the legalizer has to blindly legalize v4i1 without knowing how it is consumed, it is best to look at what happens to v8i1. We can then let the same optimizer work to get the optimal ASM code out in the end, whether the vectorization factor is 4 or 8. In the end, I may be agreeing with Rob, but not for the reasons Rob mentioned.

One of the headaches is that movmskps/pmovmskb do not have a quick reverse instruction (MIC-AVX512 and below). I do not know LLVM's X86 CodeGen well enough to say whether it internally has mask-to/from-vector nodes.
If it does, I'd hope X86 CodeGen can cancel out such things in a peephole manner very efficiently, so that blindly going for i1-per-elem (at type legalization time) is good enough for most (if not all) cases, and I also hope that is good (or good enough) for other (i.e., non-x86) backends.

Thanks,
Hideki Saito
Vectorizer Technical Lead
Intel Compiler and Languages

-----------------------------------------------------
Message: 8
Date: Tue, 28 Jun 2016 10:57:09 -0700 (PDT)
From: Rob Cameron via llvm-dev <llvm-dev at lists.llvm.org>
To: Ahmed Bougacha <ahmed.bougacha at gmail.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Subject: Re: [llvm-dev] Question about VectorLegalizer::ExpandStore() with v4i1
Message-ID: <1150997581.449524.1467136629022.JavaMail.zimbra at sfu.ca>
Content-Type: text/plain; charset=utf-8

Hi, Ahmed.

A packed representation, one bit per i1, is natural and best for our work, for sure. In the Parabix project, we produced very fast text and byte stream processing applications using packed bit streams, stored 128 bits at a time for SSE/Neon/Altivec registers, 256 bits at a time for AVX, 512 bits at a time for AVX512.

I also think that the one bit per i1 approach is best and most consistent overall. Vectors are not arrays. Vectors are intended to be treated as single values. Whereas an array of i1 could reasonably be viewed as an array of bytes, a vector of i1 should be packed. The use of vector types in general should signify that efficient loading, storing and manipulating of vectors is more important than manipulation of individual elements. The entire point is to provide a natural model for SIMD instruction sets, it seems to me.

As you say, the packed representation makes a lot of sense for AVX512. But even the existing SSE and AVX instruction sets use a packed representation in many cases. For example, the SSE operation movmskps produces a 4xi1 and pmovmskb produces a 16xi1, both in packed form.
In addition, any icmp or fcmp operation can be easily implemented using two instructions to produce packed i1 values. Our software relies on this packed representation extensively.

>
> JinGu,
>
> Your analysis is correct, vectors of i1 are incorrectly legalized.
> This is a known issue (http://llvm.org/PR22603); the tricky part about
> fixing it is the need to settle on a memory layout for these vectors
> (packed vs byte per i1; packed would be compatible with AVX512, I
> think).
>
> -Ahmed
>
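
For readers following the thread, the representations being compared can be sketched in plain C. This is only an illustration under stated assumptions, not code from LLVM or from any of the posters: all function names are hypothetical, and the scalar loops merely emulate what a cmpltps-style compare (0/FFFFFFFF per lane), a movmskps-style pack (one bit per lane), and the reverse expansion would compute for a 4-element vector.

```c
#include <assert.h>
#include <stdint.h>

/* All names below are hypothetical; none of them are LLVM APIs. */

/* Emulates a packed float compare: each 32-bit lane becomes
 * all-ones (true) or all-zeros (false), i.e. 0/-1 logic. */
void cmp_lt_v4f32(const float a[4], const float b[4], uint32_t lanes[4]) {
    for (int i = 0; i < 4; ++i)
        lanes[i] = (a[i] < b[i]) ? 0xFFFFFFFFu : 0u;
}

/* Packed layout: element i is stored in bit i of one byte.
 * This mirrors what movmskps leaves in a general-purpose register. */
uint8_t pack_v4i1(const uint32_t lanes[4]) {
    uint8_t bits = 0;
    for (int i = 0; i < 4; ++i)
        bits = (uint8_t)(bits | ((lanes[i] >> 31) << i)); /* take sign bit */
    return bits;
}

/* Byte-per-i1 layout: each element occupies its own byte (0 or 1). */
void store_v4i1_bytes(const uint32_t lanes[4], uint8_t out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = (uint8_t)(lanes[i] >> 31);
}

/* Reverse direction: expand packed bits back to 0/-1 lanes.
 * The thread notes this has no quick single instruction on older x86. */
void unpack_v4i1(uint8_t bits, uint32_t lanes[4]) {
    assert(bits < 16); /* only the low four bits are meaningful for v4i1 */
    for (int i = 0; i < 4; ++i)
        lanes[i] = ((bits >> i) & 1u) ? 0xFFFFFFFFu : 0u;
}

/* Mask conversion from 32 bits per element to 64 bits per element,
 * as needed when an integer compare feeds a double-precision blend. */
void widen_mask_32_to_64(const uint32_t in[4], uint64_t out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = (uint64_t)0 - (uint64_t)(in[i] >> 31); /* 0 or all-ones */
}
```

The same four boolean results take 16 bytes as compare output, 4 bytes in the byte-per-i1 layout, and 4 bits in the packed layout; each function above stands in for one of the conversions that the messages above suggest minimizing.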