thr3ads.net - llvm dev - [llvm-dev] Where's the optimiser gone (part 11): use the proper instruction for sign extension [Mar 2019]


Stefan Kanthak via llvm-dev

2019-Mar-04 07:06 UTC

[llvm-dev] Where's the optimiser gone (part 11): use the proper instruction for sign extension

Compile with -O3 -m32 (see <https://godbolt.org/z/yCpBpM>):

long lsign(long x)
{
    return (x > 0) - (x < 0);
}


long long llsign(long long x)
{
    return (x > 0) - (x < 0);
}


While the code generated for the "long" version of this function is quite
OK, the code for the "long long" version misses an obvious optimisation
(clang's output on the left, the suggested code on the right):


lsign: # @lsign
    mov     eax, dword ptr [esp + 4]    |    mov     eax, dword ptr [esp + 4]
    xor     ecx, ecx                    |
    test    eax, eax                    |    cdq
    setg    cl                          |    neg     eax
    sar     eax, 31                     |    adc     edx, edx
    add     eax, ecx                    |    mov     eax, edx
    ret                                 |    ret

llsign: # @llsign
    xor     ecx, ecx                    |    xor     edx, edx
    mov     eax, dword ptr [esp + 8]    |    mov     eax, dword ptr [esp + 8]
    cmp     ecx, dword ptr [esp + 4]    |    cmp     edx, dword ptr [esp + 4]
    sbb     ecx, eax                    |    sbb     edx, eax
    setl    cl                          |    cdq
    sar     eax, 31                     |    setl    al
    movzx   ecx, cl                     |    movzx   eax, al
    add     eax, ecx                    |    add     eax, edx
    mov     edx, eax                    |    ret
    sar     edx, 31
    ret
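To double-check the right-hand column, the following C sketch emulates the two suggested sequences instruction by instruction on 32-bit register values and compares them with the C functions above. The helper names (lsign_model, llsign_model, check_sign_models) are hypothetical. The sketch assumes the usual x86 flag rules: NEG sets CF iff its operand is nonzero, SBB sets OF by the standard signed-overflow formula, and CDQ leaves EFLAGS untouched, so SETL still sees the flags produced by SBB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference semantics from the post; with -m32, long is 32 bits and
   long long is 64 bits, so fixed-width types are used here. */
static int32_t lsign_ref(int32_t x)  { return (x > 0) - (x < 0); }
static int64_t llsign_ref(int64_t x) { return (x > 0) - (x < 0); }

/* Suggested lsign: cdq / neg eax / adc edx, edx / mov eax, edx.
   Returns the bit pattern left in EAX. */
static uint32_t lsign_model(int32_t x)
{
    uint32_t eax = (uint32_t)x;
    uint32_t edx = (eax >> 31) ? 0xFFFFFFFFu : 0u; /* cdq: EDX = sign of EAX */
    uint32_t cf  = (eax != 0);      /* neg eax: CF = 1 iff EAX != 0 (negated value unused) */
    edx = edx + edx + cf;           /* adc edx, edx */
    return edx;                     /* mov eax, edx */
}

/* Suggested llsign: xor edx,edx / cmp edx,lo / sbb edx,hi / cdq /
   setl al / movzx eax,al / add eax,edx.  Returns EDX:EAX as 64 bits. */
static uint64_t llsign_model(int64_t x)
{
    uint64_t u  = (uint64_t)x;
    uint32_t lo = (uint32_t)u, hi = (uint32_t)(u >> 32);

    uint32_t cf  = (lo != 0);       /* cmp 0, lo: borrow iff lo != 0 */
    uint32_t r   = 0u - hi - cf;    /* sbb: high half of 0 - x */
    uint32_t sf  = r >> 31;
    uint32_t of  = (hi & r) >> 31;  /* ((a ^ b) & (a ^ r)) >> 31 with a = 0 */
    uint32_t eax = sf ^ of;         /* setl al + movzx eax, al: 1 iff 0 < x */

    uint32_t edx = (hi >> 31) ? 0xFFFFFFFFu : 0u; /* cdq on EAX = hi */
    eax += edx;                     /* add eax, edx */
    return ((uint64_t)edx << 32) | eax;
}

/* Compares both models with the reference functions on boundary values. */
static void check_sign_models(void)
{
    static const int64_t samples[] = {
        0, 1, -1, 2, -2, 42, -42, INT32_MAX, INT32_MIN,
        0xFFFFFFFFLL, (int64_t)1 << 32, -((int64_t)1 << 32),
        INT64_MAX, INT64_MIN
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        int64_t v = samples[i];
        assert(llsign_model(v) == (uint64_t)llsign_ref(v));
        if (v >= INT32_MIN && v <= INT32_MAX) {
            int32_t w = (int32_t)v;
            assert(lsign_model(w) == (uint32_t)lsign_ref(w));
        }
    }
}
```

Calling check_sign_models() from a test main() exercises both models; it is an emulation aid for reasoning about the listings, not a substitute for running the actual machine code.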

NOTE: not just here, but wherever it occurs, this sequence SHOULD be replaced with

    mov     edx, eax                    |    cdq
    sar     edx, 31

Although CDQ is the proper instruction for sign extension, LLVM/clang
doesn't seem to like it.
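A minimal C illustration of why the two sequences are interchangeable, assuming EAX is the source and EDX the destination as in the listing: both replicate the sign bit of EAX across all 32 bits of EDX, which is exactly the high half of the 64-bit sign extension EDX:EAX. The function names are hypothetical, and the SAR version relies on GCC/Clang performing an arithmetic right shift on negative signed values, which ISO C leaves implementation-defined.

```c
#include <stdint.h>

/* "mov edx, eax ; sar edx, 31": copy, then shift in the sign bit.
   Assumes >> on a negative int32_t is arithmetic (GCC and Clang). */
static int32_t sign_mask_mov_sar(int32_t eax)
{
    int32_t edx = eax;  /* mov edx, eax */
    edx >>= 31;         /* sar edx, 31  */
    return edx;
}

/* "cdq": EDX = all ones if EAX is negative, else zero. */
static int32_t sign_mask_cdq(int32_t eax)
{
    return eax < 0 ? -1 : 0;
}

/* High half of the 64-bit sign extension (int64_t)EAX, computed with
   well-defined unsigned shifts. */
static uint32_t sext_high_half(int32_t eax)
{
    return (uint32_t)((uint64_t)(int64_t)eax >> 32);
}

/* Returns 1 if all three formulations agree for this input. */
static int sign_masks_agree(int32_t eax)
{
    uint32_t a = (uint32_t)sign_mask_mov_sar(eax);
    uint32_t b = (uint32_t)sign_mask_cdq(eax);
    return a == b && b == sext_high_half(eax);
}
```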

stay tuned
Stefan Kanthak

Craig Topper via llvm-dev

2019-Mar-04 08:18 UTC

It's fairly difficult to use CDQ in LLVM without tying the hands of the
register allocator. It would potentially require a new post-RA combine pass
to detect the "mov edx, eax; sar edx, 31" pattern. It's going to be even
harder to bias register allocation in hopes of using CDQ for the lsign case.
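To make the post-RA combine idea concrete, here is a toy sketch of the pattern match over a hypothetical, already register-allocated instruction list; the struct, enums and combine_cdq() are invented for illustration and are not LLVM's MachineInstr API. One condition goes beyond the register check: SAR updates EFLAGS while CDQ affects no flags, so the rewrite is only safe when the flags written by the SAR are not read afterwards (modelled here by a hypothetical flags_live_out bit).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy IR for register-allocated x86 instructions. */
enum opcode { OP_MOV_RR, OP_SAR_RI, OP_CDQ, OP_OTHER };
enum reg { EAX, ECX, EDX };

struct insn {
    enum opcode op;
    enum reg dst, src;   /* for OP_CDQ: implicit EDX <- sign of EAX */
    int imm;             /* shift count for OP_SAR_RI */
    bool flags_live_out; /* hypothetical: are the EFLAGS it writes read later? */
};

/* Rewrites every "mov edx, eax ; sar edx, 31" into "cdq" in place,
   but only when the SAR's flags are dead.  Returns the new length. */
static size_t combine_cdq(struct insn *code, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n &&
            code[i].op == OP_MOV_RR && code[i].dst == EDX && code[i].src == EAX &&
            code[i + 1].op == OP_SAR_RI && code[i + 1].dst == EDX &&
            code[i + 1].imm == 31 && !code[i + 1].flags_live_out) {
            code[out++] = (struct insn){ OP_CDQ, EDX, EAX, 0, false };
            i++;                    /* the SAR is consumed as well */
        } else {
            code[out++] = code[i];  /* copy unchanged */
        }
    }
    return out;
}
```

In a real backend the liveness information would come from the compiler's own analyses; the sketch only shows the shape of the match, not the register-allocation problem Craig describes.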

CDQ is implemented in the shifter unit on at least the last several
generations of Intel CPUs, so it's going to perform similarly to SAR. And the
move only requires decoder bandwidth and no execution resources on recent
CPUs. Do you have performance data for this optimization?

~Craig


> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Stefan Kanthak via llvm-dev

2019-Mar-04 09:08 UTC

"Craig Topper" <craig.topper at gmail.com> wrote:
> It's fairly difficult to use CDQ in LLVM without tying the hands of the
> register allocator.
But its hands are already tied when it has to return a quadword in EDX:EAX,
or when it uses the DIV/IDIV and MUL/IMUL instructions or any shift with a
variable shift count.
In the case of llsign(), it uses 3 registers, although the job can be done
with just EAX and EDX.
> It would potentially require a new post-RA combine pass
> to detect the "mov edx, eax; sar edx, 31" pattern. It's going to be even
> harder to bias register allocation in hopes of using CDQ for the lsign case.
> 
> CDQ is implemented in the shifter unit on at least the last several
> generations of Intel CPUs, so it's going to perform similarly to SAR. And the
> move only requires decoder bandwidth and no execution resources on recent
> CPUs. Do you have performance data for this optimization?
No, I don't have such data.

Regarding the llsign() function: instead of "mov edx, eax; sar edx, 31"
the compiler SHOULD generate EITHER a "cdq" OR a "mov edx, ecx" here.
Except for this final step AND the use of ECX instead of EDX, it did a
pretty good job; compare the generated code against GCC's, ICC's or MSVC's,
which emit AWFUL code in that instance.

Regarding the lsign() function: SETcc r8 and other operations on partial
registers are typically slower than operations on the full registers, or
introduce dependencies.

regards
Stefan
