thr3ads.net - llvm dev - [LLVMdev] SIMD for sdiv <2 x i64> [Jul 2015]


zhi chen

2015-Jul-24 06:06 UTC

[LLVMdev] SIMD for sdiv <2 x i64>

It seems that it's hard to vectorize int64 operations in LLVM. For example,
LLVM 3.4 generates very complicated code for the following IR. I am running
on a Haswell processor. Is it because there are no alternative AVX/AVX2
instructions for int64? The same thing also happens with
zext <2 x i32> -> <2 x i64> and trunc <2 x i64> -> <2 x i32>. Any ideas on
how to optimize these instructions?
Thanks.

%sub.ptr.sub.i6.i.i.i.i = sub <2 x i64> %sub.ptr.lhs.cast.i4.i.i.i.i, %sub.ptr.rhs.cast.i5.i.i.i.i
%sub.ptr.div.i7.i.i.i.i = sdiv <2 x i64> %sub.ptr.sub.i6.i.i.i.i, <i64 24, i64 24>

Assembly:
    vpsubq  %xmm6, %xmm5, %xmm5
    vmovq   %xmm5, %rax
    movabsq $3074457345618258603, %rbx # imm = 0x2AAAAAAAAAAAAAAB
    imulq   %rbx
    movq    %rdx, %rcx
    movq    %rcx, %rax
    shrq    $63, %rax
    shrq    $2, %rcx
    addl    %eax, %ecx
    vpextrq $1, %xmm5, %rax
    imulq   %rbx
    movq    %rdx, %rax
    shrq    $63, %rax
    shrq    $2, %rdx
    addl    %eax, %edx
    movslq  %edx, %rax
    vmovq   %rax, %xmm5
    movslq  %ecx, %rax
    vmovq   %rax, %xmm6
    vpunpcklqdq %xmm5, %xmm6, %xmm5 # xmm5 = xmm6[0],xmm5[0]

Benjamin Kramer

2015-Jul-24 10:42 UTC


AVX2 doesn't have integer vector division instructions, and LLVM lowers
divides by constants into (128-bit) multiplies. However, AVX2 doesn't have a
way to get at the upper 64 bits of a 64x64->128-bit multiply either, so LLVM
extracts each lane and uses the scalar imulq instruction to do that. There's
not much room to optimize here given the limitations of AVX2.
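
For readers following along, here is a minimal C sketch of what the scalar
sequence computes for one lane. It is an illustration, not LLVM's code: the
function name sdiv24 is made up, and it assumes a GCC/Clang-style compiler
that provides __int128 and performs arithmetic right shifts on negative
signed values. The constant is the same one loaded by the movabsq in the
assembly.

```c
#include <stdint.h>

/* Signed 64-bit division by 24 using a multiply-high, truncating toward
 * zero. Assumes __int128 and arithmetic >> on signed values (GCC/Clang). */
static int64_t sdiv24(int64_t n) {
    /* ceil(2^66 / 24); the immediate from the movabsq above. */
    const int64_t magic = 0x2AAAAAAAAAAAAAABLL;

    /* Upper 64 bits of the 128-bit product: what imulq leaves in %rdx. */
    int64_t hi = (int64_t)(((__int128)n * magic) >> 64);

    /* Divide the high half by a further 2^2 (floor for negatives). */
    int64_t q = hi >> 2;

    /* Add the sign bit so negative quotients round toward zero. */
    q += (int64_t)((uint64_t)hi >> 63);
    return q;
}
```

The step that has no AVX2 equivalent is the computation of hi. For a
<2 x i64> operand the assembly above does this once per lane, moving each
element out with vmovq/vpextrq and packing the results back with
vpunpcklqdq.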

You seem to be subtracting pointers though, so if you can guarantee that the
pointers are aligned, you could set the 'exact' bit on your 'sdiv'
instruction. That should give better code.
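
As a rough illustration of why that helps: 'sdiv exact' asserts that the
division leaves no remainder, and under that assumption a divide by 24 can
be done with a shift and an ordinary low-half multiply by the modular
inverse of 3, so no high half of a 128-bit product is needed. The C sketch
below shows the arithmetic for one lane. The function name sdiv24_exact is
hypothetical, the result is only meaningful when n is a multiple of 24, and
whether LLVM emits exactly this sequence for a given version and target is
an assumption here, not something stated in this thread.

```c
#include <stdint.h>

/* Division by 24 that is only valid when n is an exact multiple of 24.
 * Assumes arithmetic >> on signed values and wrapping unsigned-to-signed
 * conversion (true for GCC/Clang). */
static int64_t sdiv24_exact(int64_t n) {
    /* Multiplicative inverse of 3 modulo 2^64: 3 * inv3 == 1 (mod 2^64). */
    const uint64_t inv3 = 0xAAAAAAAAAAAAAAABULL;

    /* 24 = 8 * 3: strip the factor of 8 with a shift (no bits are lost
     * because n is a multiple of 8). */
    int64_t t = n >> 3;

    /* t is a multiple of 3, so multiplying by inv3 modulo 2^64 yields t / 3.
     * Only the low 64 bits of the product are used. */
    return (int64_t)((uint64_t)t * inv3);
}
```

In the IR this corresponds to writing the instruction as
'sdiv exact <2 x i64> %x, <i64 24, i64 24>'.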

- Ben

Philip Reames

2015-Jul-24 17:16 UTC


On 07/24/2015 03:42 AM, Benjamin Kramer wrote:
> You seem to be subtracting pointers though, so if you can guarantee that
> the pointers are aligned you could set the exact bit on your 'sdiv'
> instruction. That should give better code.

Depending on what you're using the result of the divide for, there might
be other optimizations which could be applied as well. Can you give a
slightly larger context for your source IR? (1-2 levels of uses/defs out
from the instructions would help.)
