Stefan Kanthak via llvm-dev
2019-Mar-04 07:06 UTC
[llvm-dev] Where's the optimiser gone (part 11): use the proper instruction for sign extension
Compile with -O3 -m32 (see <https://godbolt.org/z/yCpBpM>):

    long lsign(long x)
    {
        return (x > 0) - (x < 0);
    }

    long long llsign(long long x)
    {
        return (x > 0) - (x < 0);
    }

While the code generated for the "long" version of this function is quite
OK, the code for the "long long" version misses an obvious optimisation
(clang's output on the left, the suggested code on the right):

    lsign: # @lsign
        mov   eax, dword ptr [esp + 4]   |   mov   eax, dword ptr [esp + 4]
        xor   ecx, ecx                   |
        test  eax, eax                   |   cdq
        setg  cl                         |   neg   eax
        sar   eax, 31                    |   adc   edx, edx
        add   eax, ecx                   |   mov   eax, edx
        ret                              |   ret

    llsign: # @llsign
        xor   ecx, ecx                   |   xor   edx, edx
        mov   eax, dword ptr [esp + 8]   |   mov   eax, dword ptr [esp + 8]
        cmp   ecx, dword ptr [esp + 4]   |   cmp   edx, dword ptr [esp + 4]
        sbb   ecx, eax                   |   sbb   edx, eax
        setl  cl                         |   cdq
        sar   eax, 31                    |   setl  al
        movzx ecx, cl                    |   movzx eax, al
        add   eax, ecx                   |   add   eax, edx
        mov   edx, eax                   |   ret
        sar   edx, 31
        ret

NOTE: not just here, this sequence SHOULD be replaced with CDQ:

        mov   edx, eax                   |   cdq
        sar   edx, 31

Although CDQ is the proper instruction for sign extension, LLVM/clang
doesn't seem to like it.

stay tuned
Stefan Kanthak
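The two lsign columns above can be checked against each other with a small C model. This is only a sketch: the names sign_ref, sign_sar_model, sign_cdq_model and check_sign_models are made up for illustration, the registers are modelled as uint32_t values, and the carry flag after NEG is modelled as "operand was non-zero".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference: the function from the post, on a fixed-width type. */
static int32_t sign_ref(int32_t x) { return (x > 0) - (x < 0); }

/* Convert a 32-bit register image to int32_t without relying on
   implementation-defined narrowing; only 0, 1 and all-ones occur here. */
static int32_t reg_to_i32(uint32_t r) {
    return (r == 0xFFFFFFFFu) ? -1 : (int32_t)r;
}

/* Model of the left column: setg cl; sar eax, 31; add eax, ecx. */
static int32_t sign_sar_model(int32_t x) {
    uint32_t ecx = (x > 0) ? 1u : 0u;           /* test eax, eax; setg cl */
    uint32_t eax = (x < 0) ? 0xFFFFFFFFu : 0u;  /* sar eax, 31 */
    eax = (uint32_t)(eax + ecx);                /* add eax, ecx */
    return reg_to_i32(eax);
}

/* Model of the right column: cdq; neg eax; adc edx, edx; mov eax, edx. */
static int32_t sign_cdq_model(int32_t x) {
    uint32_t eax = (uint32_t)x;
    uint32_t edx = (x < 0) ? 0xFFFFFFFFu : 0u;  /* cdq: EDX = sign mask of EAX */
    uint32_t cf  = (eax != 0u) ? 1u : 0u;       /* neg eax: CF set iff EAX != 0 */
    edx = (uint32_t)(edx + edx + cf);           /* adc edx, edx */
    return reg_to_i32(edx);                     /* mov eax, edx */
}

/* Compare both models with the reference on a few edge cases. */
static void check_sign_models(void) {
    static const int32_t samples[] = { INT32_MIN, -2, -1, 0, 1, 2, INT32_MAX };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        assert(sign_sar_model(samples[i]) == sign_ref(samples[i]));
        assert(sign_cdq_model(samples[i]) == sign_ref(samples[i]));
    }
}
```

Calling check_sign_models() from a test driver exercises both sequences. The model says nothing about speed; it only shows that the two instruction sequences compute the same value.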
Craig Topper via llvm-dev
2019-Mar-04 08:18 UTC
[llvm-dev] Where's the optimiser gone (part 11): use the proper instruction for sign extension
It's fairly difficult to use CDQ in LLVM without tying the hands of the register allocator. It would potentially require a new post-RA combine pass to detect the "mov edx, eax; sar edx, 31" pattern. It's going to be even harder to bias register allocation in hopes of using CDQ for the lsign case.

CDQ is implemented in the shifter unit on at least the last several generations of Intel CPUs, so it's going to perform similarly to SAR. And the move only requires decoder bandwidth and no execution resources on recent CPUs. Do you have performance data for this optimization?

~Craig
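To make the "post-RA combine" idea concrete, here is a toy peephole in C. It is a sketch under loud assumptions: struct insn, the opcode and register enums, the flags_dead_after field and fold_cdq are all hypothetical and have nothing to do with LLVM's actual MachineInstr API. The point it illustrates is that the pattern is only recognisable once physical registers are known, and that SAR writes EFLAGS while CDQ does not, so the rewrite is only safe when nothing reads the flags SAR produced.

```c
#include <stddef.h>

/* Hypothetical toy instruction representation, NOT LLVM's MachineInstr. */
enum reg    { EAX, ECX, EDX };
enum opcode { OP_MOV_RR, OP_SAR_RI, OP_CDQ, OP_NOP };

struct insn {
    enum opcode op;
    enum reg    dst;
    enum reg    src;               /* used by OP_MOV_RR */
    int         imm;               /* used by OP_SAR_RI */
    int         flags_dead_after;  /* assumed liveness fact: EFLAGS unused afterwards */
};

/* Rewrite "mov edx, eax; sar edx, 31" into "cdq" in place.
   Returns the number of pairs that were folded. */
static size_t fold_cdq(struct insn *code, size_t n) {
    size_t folded = 0;
    for (size_t i = 0; i + 1 < n; ++i) {
        const struct insn *mov = &code[i];
        const struct insn *sar = &code[i + 1];
        if (mov->op == OP_MOV_RR && mov->dst == EDX && mov->src == EAX &&
            sar->op == OP_SAR_RI && sar->dst == EDX && sar->imm == 31 &&
            sar->flags_dead_after) {
            code[i].op     = OP_CDQ;  /* EDX = sign mask of EAX, flags untouched */
            code[i + 1].op = OP_NOP;  /* the shift is no longer needed */
            ++folded;
        }
    }
    return folded;
}
```

The same two instructions with ECX as the destination are left alone by this sketch, which is the register-allocation dependence described above: CDQ only helps when the allocator happened to pick EAX and EDX.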
Stefan Kanthak via llvm-dev
2019-Mar-04 09:08 UTC
[llvm-dev] Where's the optimiser gone (part 11): use the proper instruction for sign extension
"Craig Topper" <craig.topper at gmail.com> wrote:

> It's fairly difficult to use CDQ in LLVM without tying the hands of the
> register allocator.

But its hands are already tied when it has to return a quadword in EDX:EAX, use the DIV/IDIV and MUL/IMUL instructions, or perform any shift with a variable shift count. In the case of llsign() it uses 3 registers, although the job can be done with just EAX and EDX.

> It would potentially require a new post-RA combine pass
> to detect the "mov edx, eax; sar edx, 31" pattern. It's going to be even
> harder to bias register allocation in hopes of using CDQ for the lsign case.
>
> CDQ is implemented in the shifter unit on at least the last several
> generations of Intel CPUs, so it's going to perform similarly to SAR. And the
> move only requires decoder bandwidth and no execution resources on recent
> CPUs. Do you have performance data for this optimization?

No, I don't have such data.

Regarding the llsign() function: instead of "mov edx, eax; sar edx, 31" the compiler SHOULD generate EITHER a "cdq" OR a "mov edx, ecx" here. Except for this final step AND the use of ECX instead of EDX it did a pretty good job; compare the generated code against GCC's, ICC's or MSVC's, which emit AWFUL code in that instance.

Regarding the lsign() function: setCC r8 and other operations on partial registers are typically slower than operations on the full registers, or introduce dependencies.

regards
Stefan
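The claim that llsign() needs only EAX and EDX can be illustrated with a C model of the two-register sequence suggested in the first message of this thread. Again a sketch with stated assumptions: regpair, llsign_model and pair_to_i64 are hypothetical names, and the "cmp edx, lo; sbb edx, hi" pair is modelled as one 64-bit signed comparison of 0 with x rather than flag by flag.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the EDX:EAX pair used for a 64-bit return value. */
typedef struct { uint32_t eax, edx; } regpair;

/* Reference: the function from the post, on a fixed-width type. */
static int64_t llsign_ref(int64_t x) { return (x > 0) - (x < 0); }

/* Model of: xor edx, edx; mov eax, hi; cmp edx, lo; sbb edx, eax;
             cdq; setl al; movzx eax, al; add eax, edx */
static regpair llsign_model(int64_t x) {
    uint32_t hi = (uint32_t)((uint64_t)x >> 32);
    regpair r;
    int less = (0 < x);                                 /* cmp/sbb: 0 < x as a 64-bit compare */
    r.edx = (hi & 0x80000000u) ? 0xFFFFFFFFu : 0u;      /* cdq on EAX = high half of x */
    r.eax = (uint32_t)less;                             /* setl al; movzx eax, al */
    r.eax = (uint32_t)(r.eax + r.edx);                  /* add eax, edx */
    return r;                                           /* result is EDX:EAX */
}

/* Reassemble EDX:EAX into int64_t without implementation-defined narrowing. */
static int64_t pair_to_i64(regpair r) {
    uint64_t bits = ((uint64_t)r.edx << 32) | r.eax;
    if (bits & UINT64_C(0x8000000000000000))
        return -(int64_t)(~bits) - 1;
    return (int64_t)bits;
}

/* Compare the model with the reference on a few edge cases. */
static void check_llsign_model(void) {
    static const int64_t samples[] = { INT64_MIN, -4294967296LL, -1, 0, 1,
                                       4294967296LL, INT64_MAX };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(pair_to_i64(llsign_model(samples[i])) == llsign_ref(samples[i]));
}
```

In this model EDX already holds the sign mask when the function returns, which is why the trailing "mov edx, eax; sar edx, 31" of the three-register version has no counterpart.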