thr3ads.net - llvm dev - [llvm-dev] [RFC] carry-less multiplication instruction [Jul 2020]

If this information is useful, please help other people find it:
Share via:

Nicolai Hähnle via llvm-dev

2020-Jul-05 12:12 UTC

[llvm-dev] [RFC] carry-less multiplication instruction

On 05.07.20 12:21, Roman Lebedev via llvm-dev wrote:> On Sun, Jul 5, 2020 at 12:18 PM Shawn Landden via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>>
>>
>>
>> Carry-less multiplication[1] instructions exist (at least optionally)
on many architectures: armv8, RISC-V, x86_64, POWER, SPARC, C64x, and possibly
more.
>>
>> This proposal is to add a llvm.clmul instruction. Or if that is
contentious, llvm.experimental.bitmanip.clmul instruction. It takes two integer
operands of the same width, and returns an integer with twice the width of the
operands. (Is there a good reason to make these the same width, as all the other
operations do even when it doesn’t really make sense for the mathematical
operation–like multiplication or ctpop/ctlz/cttz?)
>>
>> If the CPU does not have a dedication clmul operation, it can be
lowered to regular multiplication, by using holes to avoid carrys.
>>
>> ==Where is clmul used?=>>
>> While somewhat specialized, the RISC-V manual documents many uses: [2]
>>
>> The classic applications forclmulare Cyclic Redundancy Check (CRC) [11,
26]
>>
>> and Galois/CounterMode (GCM), but more applications exist, including
the following examples.There are obvious applications in hashing and pseudo
random number generations. For exam-ple, it has been reported that hashes based
on carry-less multiplications can outperform Google’sCityHash [17].
>>
>> clmulof a number with itself inserts zeroes between each input bit.
This can be useful for generatingMorton code [23].
>>
>> clmulof a number with -1 calculates the prefix XOR operation. This can
be useful for decodinggray codes.Another application of XOR prefix sums
calculated withclmulis branchless tracking of quotedstrings in high-performance
parsers. [16]
>>
>> Carry-less multiply can also be used to implement Erasure code
efficiently. [14]
>>
>> ==clmul lowering without hardware support=>> A 8x8=>16 clmul
can also be lowered to a 32x32=>64 multiplication when there is no
specialized instruction (also 15x15=>30, to a 60x60=>120, or if bitreverse
is available 16x16=>32 to TWO 64x64=>64 multiplications)[3].
>>
>> [1] https://en.wikipedia.org/wiki/Carry-less_product
>> [2] (page 30)
https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf
>> [3] https://www.bearssl.org/constanttime.html
> 
> What benefit would this intrinsic would bring to the middle-end IR,
> over it's current naive expanded form?
Isn't a "naive" expansion of NxN carryless multiply extremely
involved?
I'd expect something like 2N shifts, N truncs, N selects, and N xors.

That link mentions an alternative that is more efficient, but I wouldn't 
exactly call it naive...

Cheers,
Nicolai

> 
> Note that teaching backends to produce it, or even adding it to
> backend (ISD opcodes)
> and matching it in DAGCombiner has much lower barrier of entry, i
> would suggest to start there.
> 
>> (First posted to discord
>>
>> --
>> Shawn Landden
> Roman
> 
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 
-- 
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.

Chris Lattner via llvm-dev

2020-Jul-05 20:07 UTC

head link

[llvm-dev] [RFC] carry-less multiplication instruction

> On Jul 5, 2020, at 5:12 AM, Nicolai Hähnle via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
>>> [1] https://en.wikipedia.org/wiki/Carry-less_product
<https://en.wikipedia.org/wiki/Carry-less_product>
>>> [2] (page 30)
https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf
<https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf>
>>> [3] https://www.bearssl.org/constanttime.html
<https://www.bearssl.org/constanttime.html>
>> What benefit would this intrinsic would bring to the middle-end IR,
>> over it's current naive expanded form?
> 
> Isn't a "naive" expansion of NxN carryless multiply extremely
involved? I'd expect something like 2N shifts, N truncs, N selects, and N
xors.
> 
> That link mentions an alternative that is more efficient, but I
wouldn't exactly call it naive…

No, the typical expansion (used by the existing LLVM code generator) uses “grade
school math” to decompose large multiplications into smaller ones.

Wide multiplications (e.g. those on X86) are pretty easy to pattern match from
“sign/zero extend operands from 64 bits to 128 bits, then do a 128x128=128
multiply” for example.  This is all already handled by the existing selection
dag infra, my understanding is that it works well in practice.

-Chris

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200705/5b92964b/attachment.html>

Roman Lebedev via llvm-dev

2020-Jul-05 20:09 UTC

head link

[llvm-dev] [RFC] carry-less multiplication instruction

On Sun, Jul 5, 2020 at 11:07 PM Chris Lattner <clattner at nondot.org>
wrote:>
>
>
> On Jul 5, 2020, at 5:12 AM, Nicolai Hähnle via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
>
> [1] https://en.wikipedia.org/wiki/Carry-less_product
> [2] (page 30)
https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf
> [3] https://www.bearssl.org/constanttime.html
>
> What benefit would this intrinsic would bring to the middle-end IR,
> over it's current naive expanded form?
>
>
> Isn't a "naive" expansion of NxN carryless multiply extremely
involved? I'd expect something like 2N shifts, N truncs, N selects, and N
xors.
>
> That link mentions an alternative that is more efficient, but I
wouldn't exactly call it naive…
>
>
> No, the typical expansion (used by the existing LLVM code generator) uses
“grade school math” to decompose large multiplications into smaller ones.
>
> Wide multiplications (e.g. those on X86) are pretty easy to pattern match
from “sign/zero extend operands from 64 bits to 128 bits, then do a 128x128=128
multiply” for example.That is what i meant, yes.
> This is all already handled by the existing selection dag infra, my
understanding is that it works well in practice.
>
> -ChrisRoman

James Y Knight via llvm-dev

2020-Jul-05 21:28 UTC

head link

[llvm-dev] [RFC] carry-less multiplication instruction

"Carry-less multiplication" isn't a wider bit width
multiplication, it's an
entirely different operation. (See the Wikipedia page linked in the
proposal).

On Sun, Jul 5, 2020, 4:07 PM Chris Lattner via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
>
>
> On Jul 5, 2020, at 5:12 AM, Nicolai Hähnle via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> [1] https://en.wikipedia.org/wiki/Carry-less_product
> [2] (page 30)
>
https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf
> [3] https://www.bearssl.org/constanttime.html
>
> What benefit would this intrinsic would bring to the middle-end IR,
> over it's current naive expanded form?
>
>
> Isn't a "naive" expansion of NxN carryless multiply extremely
involved?
> I'd expect something like 2N shifts, N truncs, N selects, and N xors.
>
> That link mentions an alternative that is more efficient, but I
wouldn't
> exactly call it naive…
>
>
> No, the typical expansion (used by the existing LLVM code generator) uses
> “grade school math” to decompose large multiplications into smaller ones.
>
> Wide multiplications (e.g. those on X86) are pretty easy to pattern match
> from “sign/zero extend operands from 64 bits to 128 bits, then do a
> 128x128=128 multiply” for example.  This is all already handled by the
> existing selection dag infra, my understanding is that it works well in
> practice.
>
> -Chris
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200705/2cb09fd3/attachment.html>

Shawn Landden via llvm-dev

2020-Jul-06 10:44 UTC

head link

[llvm-dev] [RFC] carry-less multiplication instruction

<div> </div><div> </div><div>05.07.2020, 07:12,
"Nicolai Hähnle"
<nhaehnle@gmail.com>:</div><blockquote><p>On 05.07.20
12:21, Roman Lebedev via llvm-dev wrote:</p><blockquote> On Sun, Jul
5, 2020 at 12:18 PM Shawn Landden via llvm-dev<br /> <<a
href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>>
wrote:<blockquote> This proposal is to add a llvm.clmul
instruction.</blockquote><br /> What benefit would this intrinsic
would bring to the middle-end IR,<br /> over it's current naive
expanded form?</blockquote><p><br />Isn't a
"naive" expansion of NxN carryless multiply extremely
involved?</p></blockquote><div><br />Yes it is. And this
is then sped up with a table (such as in the official GCM spec), however using a
table can introduce key-dependent loads and security problems. The 32+32->64
or 64+64->64 multiplication lowering is generally constant-time and does not
have these security problems.</div><blockquote><p>I'd
expect something like 2N shifts, N truncs, N selects, and N xors.<br
/><br />That link mentions an alternative that is more efficient, but I
wouldn't<br />exactly call it naive...<br /><br
/>Cheers,<br />Nicolai<br /><br
/> </p><blockquote><br /> Note that teaching backends to
produce it, or even adding it to<br /> backend (ISD opcodes)<br
/> and matching it in DAGCombiner has much lower barrier of entry, i<br
/> would suggest to start there.<br /> <blockquote> (First posted
to discord<br /><br /> --<br /> Shawn
Landden</blockquote> Roman<br
/> <blockquote> _______________________________________________<br
/> LLVM Developers mailing list<br /> <a
href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br
/> <a
href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a></blockquote> _______________________________________________<br
/> LLVM Developers mailing list<br /> <a
href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br
/> <a
href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br
/> </blockquote><p> </p>--<br />Lerne, wie die Welt
wirklich ist,<br />Aber vergiss niemals, wie sie sein
sollte.</blockquote><div> </div><div> </div><div>-- </div><div>Shawn
Landden</div><div> </div>

Shawn Landden via llvm-dev

2020-Jul-06 11:41 UTC

head link

[llvm-dev] [RFC] carry-less multiplication instruction

05.07.2020, 07:12, "Nicolai Hähnle" <nhaehnle at
gmail.com>:> On 05.07.20 12:21, Roman Lebedev via llvm-dev wrote:
>>  On Sun, Jul 5, 2020 at 12:18 PM Shawn Landden via llvm-dev
>>  <llvm-dev at lists.llvm.org> wrote:
>>>  This proposal is to add a llvm.clmul instruction.
>>
>>  What benefit would this intrinsic would bring to the middle-end IR,
>>  over it's current naive expanded form?
>
> Isn't a "naive" expansion of NxN carryless multiply extremely
involved?
> I'd expect something like 2N shifts, N truncs, N selects, and N xors.Yes it is. And this is then sped up with a table (such as in the official GCM
spec), however using a table can introduce key-dependent loads and security
problems. The 32+32->64 or 64+64->64 multiplication lowering is generally
constant-time and does not have these security problems.>
> That link mentions an alternative that is more efficient, but I
wouldn't
> exactly call it naive...
>
> Cheers,
> Nicolai

llvm dev - Jul 2020 - [RFC] carry-less multiplication instruction

[llvm-dev] [RFC] carry-less multiplication instruction

[llvm-dev] [RFC] carry-less multiplication instruction

[llvm-dev] [RFC] carry-less multiplication instruction

[llvm-dev] [RFC] carry-less multiplication instruction

[llvm-dev] [RFC] carry-less multiplication instruction

[llvm-dev] [RFC] carry-less multiplication instruction