thr3ads.net - llvm dev - [LLVMdev] Bug #16941 [Oct 2013]

Dmitry Babokin

2013-Oct-25 21:16 UTC

[LLVMdev] Bug #16941

Nadav,

The problem appears only for vectors longer than the available hardware
register (counted in doubleword elements, i.e. more than 4 on SSE4 and more
than 8 on AVX). Select does a weird thing: the <8 x i1> mask arrives as two
XMM registers, select converts them to a single XMM register (i.e. 8 x 16
bit), then immediately converts back to two XMM registers and does the blend.
The back-and-forth conversion has a huge overhead.

I'm attaching 3 files with vectors of length 4, 8 and 16. Try 4 on SSE4 and
you'll see that both cases work well; 8 demonstrates the difference on
SSE4. The same holds on AVX (8 vs 16).
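
The register arithmetic behind "more than 4 on SSE4 and more than 8 on AVX" can be sketched as below. This is only an illustrative model, not something from the attached files: the helper name `registers_needed` and the `REGISTER_BITS` table are assumptions made for this example.

```python
# Illustrative model only: the names below are hypothetical, not LLVM APIs.
REGISTER_BITS = {"sse4": 128, "avx": 256}  # XMM and YMM register widths


def registers_needed(num_elements, target, element_bits=32):
    """Return how many registers of `target` hold `num_elements` lanes."""
    lanes_per_register = REGISTER_BITS[target] // element_bits
    # Ceiling division: a partly filled register still counts as one.
    return -(-num_elements // lanes_per_register)


# Vector lengths used by the three attached files.
for n in (4, 8, 16):
    print(n, registers_needed(n, "sse4"), registers_needed(n, "avx"))
```

Any length that needs more than one register on a given target is a case where the vector has to be split, which is where the slowdown described above shows up.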




On Wed, Oct 23, 2013 at 1:41 AM, Nadav Rotem <nrotem at apple.com> wrote:
>
> On Oct 21, 2013, at 12:09 PM, Dmitry Babokin <babokin at gmail.com> wrote:
>
>> By the way, I'm curious, is there any reason why you focus on SSE4, not
>> AVX? It seems that the vectorizer should care the most about the latest
>> silicon.
>
> I am interested in looking at the SSE4 code because lowering of AVX code
> is more complicated, especially for masks.  The problem is that <8 x i1>
> can be legalized to <8 x i32> for YMM, or <8 x i16> for XMM.  ISPC worked
> around this limitation by explicitly extending the mask. The SEXT
> canonicalization reverted the code pattern that ISPC generated.
>
> Thanks,
> Nadav
-------------- next part --------------
A non-text attachment was scrubbed...
Name: v4.ll
Type: application/octet-stream
Size: 464 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131026/59618dad/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: v8.ll
Type: application/octet-stream
Size: 464 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131026/59618dad/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: v16.ll
Type: application/octet-stream
Size: 482 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131026/59618dad/attachment-0002.obj>

Nadav Rotem

2013-Oct-26 00:25 UTC

Hi Dmitry, 

Yes, this is a known problem with legalizing vector masks. On SSE, the type
<8 x i1> is legalized to <8 x i16>, but your operands are legalized to
<4 x i32>.  Type legalization is performed per node, and we don't have a good
way to support instructions that mix the mask and operand types.  Why does
ISPC generate illegal vector types?  Does ISPC rely on the LLVM codegen to
split the vectors to increase ILP? In that case ISPC should generate two
vector operations.
 
Thanks,
Nadav
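
The mismatch described in this reply, and the suggested manual split into two vector operations, can be modelled roughly as follows. This is a simplified sketch and not how LLVM's type legalizer is implemented: the function names are hypothetical, and the mask model is only meaningful for 4, 8, or 16 lanes on a 128-bit target.

```python
# Simplified model of the legalization mismatch; all names are hypothetical.
XMM_BITS = 128  # width of one SSE register


def legalize_mask(num_lanes):
    """Promote an <N x i1> mask so it fills exactly one XMM register.

    Only meaningful for N in (4, 8, 16). Returns (lanes, bits_per_lane).
    """
    return (num_lanes, XMM_BITS // num_lanes)


def legalize_operand(num_lanes, lane_bits=32):
    """Split an <N x i32> operand into XMM-sized pieces."""
    lanes_per_piece = XMM_BITS // lane_bits
    return [(lanes_per_piece, lane_bits)] * (num_lanes // lanes_per_piece)


def select(mask, a, b):
    """Lane-wise select: take a[i] where mask[i] is set, else b[i]."""
    return [x if m else y for m, x, y in zip(mask, a, b)]


def select_split(mask, a, b, lanes_per_piece=4):
    """The same select, issued as one operation per XMM-sized piece."""
    result = []
    for start in range(0, len(a), lanes_per_piece):
        piece = slice(start, start + lanes_per_piece)
        result.extend(select(mask[piece], a[piece], b[piece]))
    return result


print(legalize_mask(8))     # layout chosen for the mask
print(legalize_operand(8))  # layout chosen for each operand
```

In this model a 4-lane select gets the same layout for mask and operands, while an 8-lane select gets a single 16-bit-per-lane mask against two 32-bit-per-lane operand halves, which is the case that forces the conversions. `select_split` shows that issuing the select per half yields the same lanes as the full-width select.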



Dmitry Babokin

2013-Oct-26 13:36 UTC

Hi Nadav,

ISPC has been generating long vectors (on the corresponding ISPC targets)
this way since the very beginning of ISPC, as far as I know. There is no
such thing as "illegal vectors" in the official LLVM documents, so people do
expect that arbitrarily long vectors are supported and generated reasonably
well. Note: not super-optimal, but reasonably well. Keeping it this way
allows LLVM to be considered a good vehicle for experiments with vector code
generation. I may be mistaken in this statement, but this was my impression.

You are right, we do it to increase ILP (hoping that we do not run out of
registers) and rely on LLVM to split the vectors. Redesigning this approach
to split manually in LLVM IR would be quite a significant effort for us,
while the only issue that we are aware of is this "select" problem. And
actually we are not using this select ourselves; it is LLVM that decides it
is beneficial. All other arithmetic instructions work quite well. To avoid
ambiguity in the mask representation, we do not carry it around as
<8 x i1>; we convert it immediately to <8 x i32>. So we'd like LLVM to allow
us to do this efficiently.

I'm not familiar with the LLVM codegen, but it seems to me that the
conversion to an 8 x 16-bit mask and back happens *within* a single node, and
this may be fixable in the codegen.

So I propose fixing it in one of the following ways:
a) avoid doing the "sext+and" => "select" transformation for vectors longer
than the architectural register.
b) fix select to avoid the internal conversion to 8 x 16 bit (doing
operations on 8 x 32 bit is far more efficient anyway).

Dmitry.
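
For readers unfamiliar with the two forms mentioned in proposal (a), here is a small sketch of why a mask-based blend and a select compute the same lanes. The exact pattern ISPC emitted is not shown in this thread, so the and/or blend below is an assumed form, and all function names are hypothetical.

```python
# Lane-wise model on unsigned 32-bit values; names are hypothetical.
MASK32 = 0xFFFFFFFF


def sext_i1_to_i32(bit):
    """Sign-extend a 1-bit condition to 32 bits: all ones or all zeros."""
    return MASK32 if bit else 0


def blend_with_and(cond, a, b):
    """Assumed mask form: (sext(c) & a) | (~sext(c) & b) for each lane."""
    out = []
    for c, x, y in zip(cond, a, b):
        m = sext_i1_to_i32(c)
        out.append((m & x) | (~m & MASK32 & y))
    return out


def blend_with_select(cond, a, b):
    """Select form: pick a[i] when cond[i] is set, otherwise b[i]."""
    return [x if c else y for c, x, y in zip(cond, a, b)]


cond = [1, 0, 1, 1, 0, 0, 1, 0]
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [101, 102, 103, 104, 105, 106, 107, 108]
print(blend_with_and(cond, a, b) == blend_with_select(cond, a, b))
```

Since both forms produce identical values, choosing between them only affects the quality of the generated code, which is why either proposal could address the overhead.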

