thr3ads.net - llvm dev - [llvm-dev] Question about VectorLegalizer::ExpandStore() with v4i1 [Jun 2016]


Saito, Hideki via llvm-dev

2016-Jun-29 21:43 UTC

[llvm-dev] Question about VectorLegalizer::ExpandStore() with v4i1

Rob, Ahmed, and Jingu,

[I'm sorry if my point of view is too x86-centric.]
>> the tricky part about fixing it is the need to settle on a memory layout
>> for these vectors (packed vs byte per i1; packed would be compatible with
>> AVX512, I think).
I agree with Ahmed here, in principle. It's actually more than that, since a
vector compare in AVX2 and below produces the same bit width per element as
the compared data. For example, in mixed-data-type code it isn't rare for the
result of an integer vector compare (0/FFFFFFFF, not even 0/1) to be consumed
by a double-precision blend (or compute), and vice versa, so a mask conversion
between 32 bits per element and 64 bits per element has to happen. We need to
minimize conversion between 0/1 logic and 0/-1 logic, and also conversion
between different element sizes. Doing so for AVX2 and below is challenging
enough. The introduction of AVX512F in Xeon Phi added another challenge for
vectorizer developers. The addition of AVX512BW and VL should make it easier.
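
The mask conversions described above can be modelled in portable C++. This is
only a scalar sketch of the lane semantics, not the instructions a compiler
would emit; the names compare_gt, widen_mask, blend and to_zero_one are
hypothetical helpers introduced for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Lane masks in the 0/-1 convention (hypothetical type names).
using Mask32x4 = std::array<std::int32_t, 4>;
using Mask64x4 = std::array<std::int64_t, 4>;

// Integer compare in the SSE/AVX2 style: each result lane is all ones (-1)
// where a[i] > b[i] and 0 elsewhere, at the width of the compared data.
Mask32x4 compare_gt(const std::array<std::int32_t, 4>& a,
                    const std::array<std::int32_t, 4>& b) {
    Mask32x4 m{};
    for (std::size_t i = 0; i < 4; ++i) m[i] = (a[i] > b[i]) ? -1 : 0;
    return m;
}

// Convert a 32-bit-per-element mask to 64 bits per element.
// Sign extension maps 0 to 0 and -1 to -1, so the 0/-1 logic is preserved.
Mask64x4 widen_mask(const Mask32x4& m) {
    Mask64x4 out{};
    for (std::size_t i = 0; i < 4; ++i) {
        assert(m[i] == 0 || m[i] == -1);
        out[i] = static_cast<std::int64_t>(m[i]);
    }
    return out;
}

// Double-precision blend: pick if_set[i] where the mask lane's sign bit is
// set, otherwise if_clear[i]. The mask must already be 64 bits per element.
std::array<double, 4> blend(const Mask64x4& m,
                            const std::array<double, 4>& if_set,
                            const std::array<double, 4>& if_clear) {
    std::array<double, 4> out{};
    for (std::size_t i = 0; i < 4; ++i)
        out[i] = (m[i] < 0) ? if_set[i] : if_clear[i];
    return out;
}

// Convert 0/-1 logic to 0/1 logic by keeping only the low bit of each lane.
Mask32x4 to_zero_one(const Mask32x4& m) {
    Mask32x4 out{};
    for (std::size_t i = 0; i < 4; ++i) out[i] = m[i] & 1;
    return out;
}
```

Each helper corresponds to one conversion that costs extra work when the
producer and the consumer of a mask disagree on element size or on the
0/1 versus 0/-1 convention.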

Without AVX512BW and VL (i.e., all of today's x86 targets), the optimal
representation of a compare result is determined by how it is consumed, and it
is not a good idea to have such an optimization in multiple different places.
If the legalizer has to blindly legalize v4i1 without knowing how it is
consumed, it is best to look at what happens to v8i1. We can then let the same
optimizer work to get the optimal ASM code out in the end, whether the
vectorization factor is 4 or 8.

In the end, I may be agreeing with Rob, but not for the reasons Rob mentioned.
One of the headaches is that movmskps/pmovmskb do not have a quick reverse
instruction (MIC-AVX512 and below). I do not know LLVM's X86 CodeGen well
enough to say whether it internally has mask-to/from-vector nodes. If it does,
I'd hope X86 CodeGen can cancel out such things in a peephole manner very
efficiently, so that blindly going for i1-per-elem (at type legalization time)
is good enough for most (if not all) cases, and I also hope that is good (or
good enough) for other (i.e., non-x86) backends.
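
To make the mask-to/from-vector round trip concrete, here is a minimal scalar
model in portable C++. It assumes movmskps-like semantics (the sign bit of
32-bit lane i goes to bit i of the result); pack_sign_bits and unpack_to_lanes
are hypothetical names and do not correspond to any LLVM node or API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Vector-to-mask: model of a movmskps-style operation on four 32-bit lanes.
// The sign bit of lane i becomes bit i of the returned value.
std::uint8_t pack_sign_bits(const std::array<std::int32_t, 4>& lanes) {
    std::uint8_t bits = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        const std::uint32_t sign = static_cast<std::uint32_t>(lanes[i]) >> 31;
        bits = static_cast<std::uint8_t>(bits | (sign << i));
    }
    return bits;
}

// Mask-to-vector: the reverse direction, expanding bit i into a 0/-1 lane.
// This is the direction that has no single quick instruction before
// AVX512BW/VL, so here it is simply written as a loop.
std::array<std::int32_t, 4> unpack_to_lanes(std::uint8_t bits) {
    assert(bits < 16);  // only four lanes are modelled
    std::array<std::int32_t, 4> lanes{};
    for (std::size_t i = 0; i < 4; ++i)
        lanes[i] = ((bits >> i) & 1u) ? -1 : 0;
    return lanes;
}
```

For lanes that already hold 0 or -1, unpack_to_lanes(pack_sign_bits(v)) gives
back v, and pack_sign_bits(unpack_to_lanes(b)) gives back b. Adjacent pairs of
this kind are what a peephole pass could cancel.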

Thanks,
Hideki Saito
Vectorizer Technical Lead
Intel Compiler and Languages

-----------------------------------------------------
Message: 8
Date: Tue, 28 Jun 2016 10:57:09 -0700 (PDT)
From: Rob Cameron via llvm-dev <llvm-dev at lists.llvm.org>
To: Ahmed Bougacha <ahmed.bougacha at gmail.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Subject: Re: [llvm-dev] Question about VectorLegalizer::ExpandStore() with v4i1
Message-ID: <1150997581.449524.1467136629022.JavaMail.zimbra at sfu.ca>
Content-Type: text/plain; charset=utf-8

Hi, Ahmed.

A packed representation, one bit per i1, is natural and best for our
work, for sure.   In the Parabix project, we produced very fast text
and byte stream processing applications using packed bit streams,
stored 128 bits at a time for SSE/Neon/Altivec registers, 256 bits at
a time for AVX, 512 bits at a time for AVX 512.   

I also think that the one bit per i1 approach is best and most consistent
overall.   Vectors are not arrays.   Vectors are intended to be treated
as single values.  Whereas an array of i1 could reasonably be viewed as
an array of bytes, a vector of i1 should be packed. 

The use of vector types in general should signify that efficient loading,
storing and manipulating of vectors is more important than manipulation of
individual elements.   The entire point is to provide a natural model for
SIMD instruction sets, it seems to me.

As you say, the packed representation makes a lot of sense for AVX512.
But even the existing SSE and AVX instruction sets use a packed representation
in many cases.   For example, the SSE operation movmskps produces a 4xi1
and pmovmskb produces 16xi1, both in packed form.   In addition, any
icmp or fcmp operation can be easily implemented using two instructions
to produce packed i1 values.   Our software relies on this packed
representation extensively.
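
The two candidate memory layouts under discussion (packed vs byte per i1) can
be sketched in portable C++ as follows. This is an illustration only: the bit
order within the packed byte (element i in bit i) is an assumption, since the
thread notes the layout had not been settled, and the type and function names
are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// A four-element vector of i1 values, modelled as four bools.
using V4i1 = std::array<bool, 4>;

// Packed layout: the whole vector occupies the low four bits of one byte.
// Assumption: element i is stored in bit i.
std::uint8_t store_packed(const V4i1& v) {
    std::uint8_t byte = 0;
    for (std::size_t i = 0; i < 4; ++i)
        byte = static_cast<std::uint8_t>(byte | ((v[i] ? 1u : 0u) << i));
    return byte;
}

// Byte-per-element layout: each i1 takes a full byte holding 0 or 1.
std::array<std::uint8_t, 4> store_byte_per_element(const V4i1& v) {
    std::array<std::uint8_t, 4> bytes{};
    for (std::size_t i = 0; i < 4; ++i) bytes[i] = v[i] ? 1 : 0;
    return bytes;
}

// Load from the packed layout, the inverse of store_packed.
V4i1 load_packed(std::uint8_t byte) {
    assert(byte < 16);  // the upper four bits are unused in this sketch
    V4i1 v{};
    for (std::size_t i = 0; i < 4; ++i) v[i] = ((byte >> i) & 1u) != 0;
    return v;
}
```

The packed form stores a v4i1 in one byte and matches what a movmskps-style
result looks like in a general-purpose register; the byte-per-element form
takes four bytes but lets each element be addressed individually.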

> 
> JinGu,
> 
> Your analysis is correct, vectors of i1 are incorrectly legalized.
> This is a known issue (http://llvm.org/PR22603); the tricky part about
> fixing it is the need to settle on a memory layout for these vectors
> (packed vs byte per i1;  packed would be compatible with AVX512, I
> think).
> 
> -Ahmed
>
