Roland Scheidegger <sroland <at> vmware.com> writes:

> This looks quite similar to something I filed a bug on (12312). Michael
> Liao submitted fixes for this, so I think if you change it to
>
>   %16 = fcmp ogt <4 x float> %15, %cr
>   %17 = sext <4 x i1> %16 to <4 x i32>
>   %18 = bitcast <4 x i32> %17 to i128
>   %19 = icmp ne i128 %18, 0
>   br i1 %19, label %true1, label %false2
>
> it should do the trick (one cmpps + one ptest + one br instruction).
> This, however, requires sse41, which I don't know if you have - you say
> the extractelements go through memory, which I've never seen; then again,
> our code didn't try to extract the i1 directly. (Even without fixes for
> ptest, the above sequence will result in only 2 extraction steps instead
> of 4 if you're on x64 and the cpu supports sse41, but I guess without
> sse41, and hence no pextrd/q, it probably also will go through memory.)
> On altivec this sequence might not produce anything good, though; the
> free sext requires llvm 2.7 on x86 to work at all (certainly shouldn't
> be a problem nowadays, but on other backends it might be different), and
> for the ptest sequence very recent svn is required.
> I don't think the current code can generate movmskps + test (probably
> the next best thing without sse41) instead of ptest, though, if you've
> only got sse.

Thanks Roland, sign extending gets me part of the way at least.
I'm on version 3.1 and, as you say in the bug report, there are a
few extraneous instructions. For the record, casting to a <4 x i8>
seems to do a better job for x86 (shuffle, movd, test, jump). Using
<4 x i32> seems to issue a pextrd for each element. For x64, it seems
to be the same for either. I suppose it's all academic seeing as the
ptest patch looks good.

Looking at it again, I'm not sure how I saw memory spills; certainly
I can't reproduce them without using -O0. It's possible I did that
accidentally when investigating the issue.

Thanks, Stephen.
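[A minimal sketch of the <4 x i8> variant described above, reusing the
value and label names from Roland's example; with the narrower lanes the
mask bitcasts to i32 rather than i128:]

  %16 = fcmp ogt <4 x float> %15, %cr
  %17 = sext <4 x i1> %16 to <4 x i8>
  %18 = bitcast <4 x i8> %17 to i32
  %19 = icmp ne i32 %18, 0
  br i1 %19, label %true1, label %false2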
On 05.09.2012 00:24, Stephen wrote:
> Roland Scheidegger <sroland <at> vmware.com> writes:
>> [...]
>> I don't think the current code can generate movmskps + test (probably
>> the next best thing without sse41) instead of ptest, though, if you've
>> only got sse.
>
> Thanks Roland, sign extending gets me part of the way at least.
> I'm on version 3.1 and, as you say in the bug report, there are a
> few extraneous instructions. For the record, casting to a <4 x i8>
> seems to do a better job for x86 (shuffle, movd, test, jump). Using
> <4 x i32> seems to issue a pextrd for each element. For x64, it seems
> to be the same for either. I suppose it's all academic seeing as the
> ptest patch looks good.

Yes, the <4 x i8> cast looks like a good idea. Just be careful if you
also need to target cpus without ssse3; IIRC, without pshufb this will
create some horrible code (though that could have been with an older
llvm version). Then again, if you don't have ssse3 you also won't have
pextrd, which means more shuffling to extract the values if you
sign-extend them to <4 x i32> too. (If you're targeting altivec, there's
probably no such issue, as I think it doesn't have such blatantly
missing shuffle instructions.)

But yes, ptest looks like the obvious winner. For cpus without sse41
(and there are tons of them still in use, not to mention still being
sold) it would be nice if llvm could come up with
pmovmskb/movmskps/movmskpd + test; these instructions look like they
were intended for exactly that use case, after all. Still, the <4 x i8>
sign-extend solution shouldn't hurt performance too much either, if
you've got ssse3.

Roland
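[A minimal sketch of forcing the movmskps + test sequence by hand on
SSE-only targets, using the llvm.x86.sse.movmsk.ps intrinsic instead of
relying on the backend to pattern-match it; the function and label names
are assumed for illustration:]

  declare i32 @llvm.x86.sse.movmsk.ps(<4 x float>)

  define void @branch_if_any(<4 x float> %v, <4 x float> %cr) {
  entry:
    %cmp  = fcmp ogt <4 x float> %v, %cr
    %ext  = sext <4 x i1> %cmp to <4 x i32>
    %asf  = bitcast <4 x i32> %ext to <4 x float>   ; movmskps takes floats
    %mask = call i32 @llvm.x86.sse.movmsk.ps(<4 x float> %asf)
    %any  = icmp ne i32 %mask, 0                    ; any sign bit set?
    br i1 %any, label %true1, label %false2
  true1:
    ret void
  false2:
    ret void
  }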
On Wed, Sep 5, 2012 at 9:07 AM, Roland Scheidegger <sroland at vmware.com> wrote:
> On 05.09.2012 00:24, Stephen wrote:
>> [...]
>
> [...]
> But yes, ptest looks like the obvious winner. For cpus without sse41
> (and there are tons of them still in use, not to mention still being
> sold) it would be nice if llvm could come up with
> pmovmskb/movmskps/movmskpd + test; these instructions look like they
> were intended for exactly that use case, after all. Still, the <4 x i8>
> sign-extend solution shouldn't hurt performance too much either, if
> you've got ssse3.

If all you need is to test whether any flag is set among the elements,
we could add pseudo PTEST support for CPUs without SSE4.1, i.e. we could
replace

  cmpltps %xmm0, %xmm1
  ptest   %xmm1, %xmm1
  jz      LABEL

with

  cmpltps  %xmm0, %xmm1
  movmskps %xmm1, %r8d
  test     %r8d, %r8d
  jz       LABEL

That looks much more efficient to me and relies only on SSE. But we have
to ensure the 2 operands to PTEST are the same and that the value is
generated from a packed CMP; I am figuring out how to simplify the
checking of these pre-conditions.

One more, off-topic, issue: most vector IR so far operates element-wise,
i.e. vertically.
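[For the complementary "all lanes true" test - which ptest can also
express, with suitable operands, via its carry flag - a movmskps-based
sketch under the same SSE-only assumption; names are illustrative, not
from Michael's patch:]

  declare i32 @llvm.x86.sse.movmsk.ps(<4 x float>)

  define i1 @all_lanes(<4 x float> %a, <4 x float> %b) {
    %cmp  = fcmp olt <4 x float> %a, %b
    %ext  = sext <4 x i1> %cmp to <4 x i32>
    %asf  = bitcast <4 x i32> %ext to <4 x float>
    %mask = call i32 @llvm.x86.sse.movmsk.ps(<4 x float> %asf)
    %all  = icmp eq i32 %mask, 15   ; all four sign bits set
    ret i1 %all
  }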
The generalized issue, from here and from PR12312, is that we have no
simple way to express horizontal operations, e.g. primitives like

  float %s = reduce fadd <N x float> %x
  i32   %m = reduce max <N x i32> %x
  i1    %c = any <N x i1> %x    ; or: i1 %c = reduce or <N x i1> %x
  i1    %c = all <N x i1> %x    ; or: i1 %c = reduce and <N x i1> %x

One more interesting example would be scan - a horizontal operation that
still produces a vector:

  <N x i32> %s = scan add <N x i32> %x, 0  ; exclusive scan
  <N x i32> %s = scan add <N x i32> %x, 1  ; inclusive scan

With these primitives, some workloads could be expressed more simply in
IR, and backends (like X86) could support some of them directly.

- michael
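[Absent such primitives, a horizontal reduction has to be spelled out as
a log2-depth shuffle tree; a minimal sketch of "reduce add" for
<4 x i32>, with assumed names and not taken from any actual proposal:]

  define i32 @reduce_add(<4 x i32> %x) {
    ; fold the upper half onto the lower half
    %hi1 = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
    %s1  = add <4 x i32> %x, %hi1      ; lanes 0+2, 1+3
    ; fold lane 1 onto lane 0
    %hi2 = shufflevector <4 x i32> %s1, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
    %s2  = add <4 x i32> %s1, %hi2     ; lane 0 now holds the total
    %r   = extractelement <4 x i32> %s2, i32 0
    ret i32 %r
  }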