thr3ads.net - llvm dev - [LLVMdev] Bug #16941 [Oct 2013]

Dmitry Babokin

2013-Oct-21 19:09 UTC

[LLVMdev] Bug #16941

Nadav,

You are right, ISPC may issue intrinsics as a result of AST selection.
Though I believe that we should stick to LLVM IR whenever possible.
Intrinsics may act as boundaries for optimizations (on both data and
control flow) and are generally not optimizable. LLVM may improve over time
from a performance standpoint and we would benefit from it (or it may play
against us, like in this case). We can change our IR generation, but not in
favor of intrinsics (in the long term, though we may use them as a
workaround, of course).

I'm not sure that select is really a canonical form of this operation, as
it really assumes AND in this case. But this is a philosophical question,
so no point to argue :) In any case it should lead to more efficient code.
Which means that either a) this transformation should not happen or b) code
generation for this instruction combination should be tuned. This should
benefit LLVM in general IMHO. It may also be the case that this leads
to bad code only in our specific environment, but at this point that
doesn't seem to be the case.
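
For readers following the select-versus-AND question, here is a minimal Python sketch of the lane semantics involved. It assumes the pattern under discussion is an AND of a sign-extended `<N x i1>` mask with a vector of 32-bit values, which is equivalent to a select between that vector and zero; the thread does not spell out the exact IR, so treat this as an illustration only. The function names are hypothetical.

```python
from typing import List

LANE_MASK = 0xFFFFFFFF  # model each lane as an unsigned 32-bit value


def sext_i1_to_i32(mask: List[bool]) -> List[int]:
    """Sign-extend each i1 lane: True -> all ones, False -> zero."""
    return [LANE_MASK if bit else 0 for bit in mask]


def and_with_sext_mask(mask: List[bool], values: List[int]) -> List[int]:
    """Form 1 (assumed ISPC-style pattern): sext the mask, then AND lane-wise."""
    wide = sext_i1_to_i32(mask)
    return [m & (v & LANE_MASK) for m, v in zip(wide, values)]


def select_or_zero(mask: List[bool], values: List[int]) -> List[int]:
    """Form 2 (select form): pick the value where the mask is set, else zero."""
    return [(v & LANE_MASK) if bit else 0 for bit, v in zip(mask, values)]
```

Both functions return the same lanes for any input, which is why a canonicalizer can rewrite one form into the other; the disagreement in this thread is about which form produces better machine code, not about correctness.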

I'll try to come up with a small SSE4 reproducer.

By the way, I'm curious, is there any reason why you focus on SSE4, not AVX?
It seems that the vectorizer should care the most about the latest silicon.

Dmitry.

On Mon, Oct 21, 2013 at 10:18 PM, Nadav Rotem <nrotem at apple.com> wrote:
> Hi Dmitry,
>
> ISPC does some instruction selection as part of vectorization (on ASTs!)
> by placing intrinsics for specific operations.  The SEXT to i32 pattern was
> implemented because LLVM did not support vector-selects when this code was
> written.
>
> Can you submit a small SSE4 test case that demonstrates the problem?
> Select is the canonical form of this operation, and SEXT is usually more
> difficult to lower.
>
> Thanks,
> Nadav
>
> On Oct 21, 2013, at 11:12 AM, Dmitry Babokin <babokin at gmail.com> wrote:
>
> Nadav,
>
> You are absolutely right, it's an ISPC workload. I've checked SSE4 and it's
> also severely affected.
>
> We use intrinsics only for conversion <N x i32> <=> i32, i.e. movmsk.ps.
> For the rest we use general LLVM instructions. And I actually would really
> like to stick this way. We rely on LLVM's ability to produce efficient code
> from general LLVM IR. Relying on intrinsics too much would be a crutch and
> a path to nowhere for many reasons :)
>
> What is the reason for this transformation, if it doesn't lead to
> efficient code?
>
> Dmitry.
>
>
>
> On Mon, Oct 21, 2013 at 7:01 PM, Nadav Rotem <nrotem at apple.com> wrote:
>
>> Hi Dmitry.
>>
>> This looks like an ISPC workload. ISPC works around a limitation in
>> selection dag which does not know how to legalize mask types when both 128
>> and 256 bit registers are available. ISPC works around this problem by
>> expanding the mask to i32s and using intrinsics. Can you please verify that
>> this regression only happens on AVX? Can you change ISPC to use intrinsics?
>>
>> Thanks
>> Nadav
>>
>> Sent from my iPhone
>>
>> > On Oct 21, 2013, at 4:04, Dmitry Babokin <babokin at gmail.com> wrote:
>> >
>> > Nadav,
>> >
>> > Could you please have a look at bug #16941 and let us know what you
>> > think about it? It's a performance regression after one of your commits.
>> >
>> > Thanks.
>> >
>> > Dmitry.
>>
>
>
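
The quoted message above mentions that ISPC uses an intrinsic only for the `<N x i32>` to `i32` mask conversion (movmsk.ps). As a rough illustration of what that conversion computes, the Python sketch below packs the sign bit of each 32-bit lane into one integer, with lane 0 landing in bit 0. The function name is hypothetical and this models only the data semantics, not how ISPC or LLVM emit the instruction.

```python
from typing import List


def movmsk_ps(lanes: List[int]) -> int:
    """Collect bit 31 (the sign bit) of each 32-bit lane into an integer.

    Lane i contributes bit i of the result, so a 4-lane vector yields
    a value in the range 0..15.
    """
    result = 0
    for index, lane in enumerate(lanes):
        sign_bit = ((lane & 0xFFFFFFFF) >> 31) & 1
        result |= sign_bit << index
    return result
```

With a mask stored as all-ones or all-zeros lanes, a result of zero means no lane is active, which is the kind of scalar test a masked program needs for control flow.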

Nadav Rotem

2013-Oct-22 21:41 UTC

[LLVMdev] Bug #16941

On Oct 21, 2013, at 12:09 PM, Dmitry Babokin <babokin at gmail.com> wrote:
> By the way, I'm curious, is there any reason why you focus on SSE4, not
> AVX? Seems that vectorizer should care the most about the latest silicon.
>
I am interested in looking at the SSE4 code because lowering of AVX code is more
complicated, especially for masks.  The problem is that <8 x i1> can be
legalized to <8 x i32> for YMM, or <8 x i16> for XMM.  ISPC worked
around this limitation by explicitly extending the mask. The SEXT
canonicalization reverted the code pattern that ISPC generated.
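
The two legalization choices above follow from simple arithmetic: spreading an N-lane mask across a whole register gives each lane register_bits / N bits. The Python sketch below only reproduces that arithmetic for the 128-bit XMM and 256-bit YMM cases; it is not a model of LLVM's actual type legalizer, and the function name is hypothetical.

```python
def promoted_mask_lane_bits(num_lanes: int, register_bits: int) -> int:
    """Width of each mask lane if an <num_lanes x i1> mask fills one register."""
    if num_lanes <= 0 or register_bits % num_lanes != 0:
        raise ValueError("lane count must evenly divide the register width")
    return register_bits // num_lanes


XMM_BITS = 128  # SSE register width
YMM_BITS = 256  # AVX register width
```

For example, `promoted_mask_lane_bits(8, YMM_BITS)` gives 32 and `promoted_mask_lane_bits(8, XMM_BITS)` gives 16, matching the `<8 x i32>` and `<8 x i16>` alternatives named in the message.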

Thanks,
Nadav   

Dmitry Babokin

2013-Oct-25 21:16 UTC

[LLVMdev] Bug #16941

Nadav,

The problem appears only for vectors longer than the available hardware
register (in doubleword elements, i.e. more than 4 on SSE4 and more than 8
on AVX). Select does a weird thing: the <8 x i1> mask comes in as two XMM
registers, select converts them to a single XMM register (i.e. 8 x 16 bit),
and immediately afterwards converts it back to two XMM registers and does
the blend. Converting back and forth has a huge overhead.
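
To make the round trip concrete, here is a small Python model of the data movement described above: a mask held as two 4 x 32-bit halves is packed into one 8 x 16-bit vector, widened back into two halves, and then used for a blend. The instruction sequence LLVM actually emitted is not shown in this thread, so the function names and the lane-level modelling are assumptions for illustration.

```python
from typing import List, Tuple


def pack_to_i16(lo: List[int], hi: List[int]) -> List[int]:
    """Two 4 x i32 mask halves -> one 8 x i16 vector (truncate each lane)."""
    return [lane & 0xFFFF for lane in lo + hi]


def sext16_to_32(lane: int) -> int:
    """Sign-extend a 16-bit lane to 32 bits (unsigned representation)."""
    return (lane | 0xFFFF0000) if lane & 0x8000 else lane


def unpack_to_i32(packed: List[int]) -> Tuple[List[int], List[int]]:
    """One 8 x i16 vector -> two 4 x i32 halves again."""
    wide = [sext16_to_32(lane) for lane in packed]
    half = len(wide) // 2
    return wide[:half], wide[half:]


def blend(mask: List[int], a: List[int], b: List[int]) -> List[int]:
    """Per lane: take a where the mask is all ones, b where it is zero."""
    return [(m & x) | (~m & 0xFFFFFFFF & y) for m, x, y in zip(mask, a, b)]
```

For all-ones or all-zeros mask lanes, `unpack_to_i32(pack_to_i16(lo, hi))` returns the original halves unchanged, so in this model the pack and unpack steps are pure overhead before the blend.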

I'm attaching 3 files with vectors of length 4, 8 and 16. Try 4 on SSE4 and
you'll see that both cases work well; 8 demonstrates the difference on
SSE4. The same holds on AVX (8 vs 16).
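
The condition described in this message (a vector of doubleword elements that is longer than one hardware register) can be written down as a quick check. The Python sketch below assumes 32-bit elements, 128-bit registers for SSE4 and 256-bit registers for AVX; the function name is hypothetical.

```python
ELEMENT_BITS = 32  # doubleword elements, as in the message

REGISTER_BITS = {"sse4": 128, "avx": 256}


def exceeds_one_register(num_lanes: int, target: str) -> bool:
    """True if a vector of num_lanes 32-bit elements needs more than one register."""
    lanes_per_register = REGISTER_BITS[target] // ELEMENT_BITS
    return num_lanes > lanes_per_register
```

Under these assumptions, lengths 4, 8 and 16 give False, True, True on SSE4 and False, False, True on AVX, so length 8 is the smallest affected case on SSE4 and length 16 the smallest on AVX.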




On Wed, Oct 23, 2013 at 1:41 AM, Nadav Rotem <nrotem at apple.com> wrote:
>
> On Oct 21, 2013, at 12:09 PM, Dmitry Babokin <babokin at gmail.com> wrote:
>
> By the way, I'm curious, is there any reason why you focus on SSE4, not AVX?
> Seems that vectorizer should care the most about the latest silicon.
>
>
> I am interested in looking at the SSE4 code because lowering of AVX code
> is more complicated, especially for masks.  The problem is that <8 x i1> can
> be legalized to <8 x i32> for YMM, or <8 x i16> for XMM.  ISPC worked
> around this limitation by explicitly extending the mask. The SEXT
> canonicalization reverted the code pattern that ISPC generated.
>
> Thanks,
> Nadav
Attachments (scrubbed by the list archive):

v4.ll (application/octet-stream, 464 bytes)
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131026/59618dad/attachment.obj>

v8.ll (application/octet-stream, 464 bytes)
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131026/59618dad/attachment-0001.obj>

v16.ll (application/octet-stream, 482 bytes)
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131026/59618dad/attachment-0002.obj>
