Stefan Kanthak via llvm-dev
2019-Mar-15 14:52 UTC
[llvm-dev] Where's the optimiser gone (part 12): superfluous 64-bit integer divisions and multiplications
Pick <https://www.hackersdelight.org/hdcodetxt/divmnu64.c.txt>, perform some MINOR modifications -- remove the code for the case n==1, replace nlz() with __builtin_clz() -- then compile it for a 32-bit machine: see <https://godbolt.org/z/BynK7S> For a slight variation see <https://godbolt.org/z/DZdn9l>. Deficiency #1 (lines 35 to 39) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ qhat = (un[j+n]*b + un[j+n-1])/vn[n-1]; rhat = (un[j+n]*b + un[j+n-1]) - qhat*vn[n-1]; clang does NOT recognize the explicit computation of the remainder rhat directly following the quotient qhat and generates a call to __udivdi3() instead of a call to __udivmoddi4() (which __udivdi3() calls anyway!) which provides both quotient and remainder. Even after defining the macro OPTIMIZE which produces qhat = (un[j+n]*b + un[j+n-1])/vn[n-1]; rhat = (un[j+n]*b + un[j+n-1])%vn[n-1]; clang gets no clue! JFTR: GCC and even MSVC recognize this sequence and generate calls to __udivmoddi4() and _aulldvrm() respectively! Deficiency #2 (line 42) ~~~~~~~~~~~~~~~~~~~~~~~ if (qhat >= b || qhat*vn[n-2] > b*rhat + un[j+n-2]) The second term is evaluated only if qhat is LESS than 2**32, so a single 32x32-bit multiplication producing a 64-bit product is sufficient here; clang but performs a 64x64-bit multiplication. JFTR: with OPTIMIZE defined, clang (like MSVC; GCC but fails) generates a single MUL here. Deficiency #3 (line 51) ~~~~~~~~~~~~~~~~~~~~~~~ The "loop" in lines 42 to 46 leaves qhat LESS than 2**32, so for p = qhat*vn[i]; again a single 32x32-bit multiplication producing a 64-bit product is sufficient; clang but performs a 64x64-bit multiplication. JFTR: with OPTIMIZE defined, clang (like MSVC and GCC) generates a single MUL here. stay tuned Stefan Kanthak