thr3ads.net - llvm dev - [llvm-dev] [cfe-dev] Portable multiplication 64 x 64 -> 128 for int128 reimplementation [Jan 2019]

Arthur O'Dwyer via llvm-dev

2018-Dec-31 21:20 UTC

[llvm-dev] [cfe-dev] Portable multiplication 64 x 64 -> 128 for int128 reimplementation

On Sun, Dec 30, 2018 at 4:46 PM Paweł Bylica <chfast at gmail.com> wrote:
> Hi Arthur, Craig,
>
> Thanks for your comments about GCC/Clang intrinsics. I never considered
> using them, but they might be a better alternative to inline assembly.
> Is there one for regular MUL?
>
I'm not sure, but I think there currently is no intrinsic that generates
the top half of a 64x64=128 multiply, except for `_mulx_u64`.
If Clang stopped requiring `-mbmi2`, I would then expect the `_mulx_u64`
intrinsic to generate a regular MUL instruction, similar to how
`_addcarry_u64` generates ADCX/ADOX when available/useful and a regular
ADC otherwise.
MSVC calls this intrinsic `_umul128`
<https://docs.microsoft.com/en-us/cpp/intrinsics/umul128?view=vs-2017>,
and on MSVC it does generate a regular MUL instruction rather than forcing
MULX.
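
To make the comparison concrete, here is a minimal sketch of a wrapper that picks whichever of these facilities is available. The name `mul64x64` and the preprocessor checks are assumptions made for illustration, not something proposed in this thread; the `__int128` fallback is exactly what the original question wants to avoid, and is included only so the sketch compiles everywhere.

```c
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>      /* _umul128 */
#elif defined(__BMI2__) && defined(__x86_64__)
#include <immintrin.h>   /* _mulx_u64, only declared when BMI2 is enabled */
#endif

/* Hypothetical helper: returns the low 64 bits of a*b and stores the
 * high 64 bits in *hi. */
static inline uint64_t mul64x64(uint64_t a, uint64_t b, uint64_t *hi)
{
#if defined(_MSC_VER) && defined(_M_X64)
    return _umul128(a, b, hi);
#elif defined(__BMI2__) && defined(__x86_64__)
    /* The intrinsic takes unsigned long long*, which need not be the
     * same type as uint64_t*, so go through a temporary. */
    unsigned long long h;
    unsigned long long l = _mulx_u64(a, b, &h);
    *hi = h;
    return l;
#elif defined(__SIZEOF_INT128__)
    unsigned __int128 p = (unsigned __int128)a * b;
    *hi = (uint64_t)(p >> 64);
    return (uint64_t)p;
#else
#error "this sketch has no 128-bit multiply for the current target"
#endif
}
```

Which instruction each branch ends up as depends on the compiler and its version, so the generated assembly is worth checking rather than assuming.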


> Anyway, I want to go the opposite direction. [...] mulx() helper without
> using __int128 type in a way that a compiler would recognize that it should
> use MUL/MULX instruction.
> A possible implementation looks like [SNIPPED]
>
Interesting trivia: There are at least three ways to write the final
"return" statement in this function. Clang generates different code for
each one of them. If someone does pursue writing an InstCombine
optimization for this, it would be good to generate the same efficient code
for all three versions.
https://godbolt.org/z/-Cozee (LLVM IR: https://godbolt.org/z/_1pDoz)
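
Since the implementation under discussion was snipped, here is a minimal sketch of the usual schoolbook approach for readers without the Godbolt links at hand. It is not necessarily the code from the original message: the names `u128_parts` and `mul64_portable` are hypothetical, and it shows only one way of combining the partial products, not the three "return" variants mentioned above.

```c
#include <stdint.h>

/* Hypothetical result type: a 128-bit value as two 64-bit halves. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_parts;

/* 64 x 64 -> 128 multiplication using only 64-bit arithmetic. */
static u128_parts mul64_portable(uint64_t x, uint64_t y)
{
    /* Split each operand into 32-bit halves. */
    uint64_t xl = x & 0xFFFFFFFFu, xh = x >> 32;
    uint64_t yl = y & 0xFFFFFFFFu, yh = y >> 32;

    /* Four 32 x 32 -> 64 partial products; none can overflow. */
    uint64_t ll = xl * yl;
    uint64_t lh = xl * yh;
    uint64_t hl = xh * yl;
    uint64_t hh = xh * yh;

    /* Sum of the terms that land in bits 32..63. Three values below
     * 2^32 are added, so the sum stays far below 2^64. */
    uint64_t mid = (ll >> 32) + (lh & 0xFFFFFFFFu) + (hl & 0xFFFFFFFFu);

    u128_parts r;
    r.lo = (mid << 32) | (ll & 0xFFFFFFFFu);
    r.hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    return r;
}
```

The hope expressed in this thread is that the optimizer would recognize a pattern of this kind and emit a single MUL (or MULX) instead of four multiplies plus the carry handling.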

–Arthur

Craig Topper via llvm-dev

2018-Dec-31 21:41 UTC

On trunk we never generate MULX. We used to blindly use it anytime bmi2 was
enabled, but it's a longer encoding and isn't a guaranteed register
allocation improvement. So I took it out a few weeks ago. We need a more
precise heuristic for when to use it.

LLVM trunk will never generate ADCX/ADOX either. This was removed in
September. We used to inconsistently generate them when adx was enabled
unless we could use the RMW form or the immediate form of ADC. But that
didn't really make any sense. The only reason to use ADCX or ADOX is when
you want to carefully manage the flags to have two interleaved dependency
chains. But that would require a special analysis to determine when to do
that and we don't have that.
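
For context, a multi-limb addition has only one carry chain, which is the case where plain ADC is sufficient. The sketch below is an illustration only: the helper name `add_limbs`, the macro `HAVE_ADDCARRY_U64` and the target checks are assumptions, and the fallback branch exists so the code also builds on targets without the intrinsic.

```c
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>   /* _addcarry_u64 */
#define HAVE_ADDCARRY_U64 1
#endif

/* Hypothetical helper: r = a + b over n 64-bit limbs (least significant
 * limb first). Returns the carry out of the top limb. */
static unsigned char add_limbs(unsigned long long *r,
                               const unsigned long long *a,
                               const unsigned long long *b,
                               int n)
{
    unsigned char carry = 0;
    for (int i = 0; i < n; ++i) {
#ifdef HAVE_ADDCARRY_U64
        /* One serial chain: each step consumes the previous carry. */
        carry = _addcarry_u64(carry, a[i], b[i], &r[i]);
#else
        /* Portable fallback: detect wraparound of each partial sum. */
        unsigned long long s = a[i] + carry;
        unsigned char c1 = (unsigned char)(s < a[i]);
        unsigned long long t = s + b[i];
        carry = (unsigned char)(c1 | (t < s));
        r[i] = t;
#endif
    }
    return carry;
}
```

ADCX and ADOX would only pay off if two such chains were interleaved, one carried through CF and the other through OF, and that is the analysis described above as missing.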

~Craig

Paweł Bylica via llvm-dev

2019-Jan-02 20:27 UTC

Thanks again for all the comments.

As suggested, I created my first pattern match in AggressiveInstCombine:
https://reviews.llvm.org/D56214.

Suggestions welcome (preferably in the review).

Bests,
Paweł
