Nicolai Hähnle via llvm-dev
2020-Jul-05 12:12 UTC
[llvm-dev] [RFC] carry-less multiplication instruction
On 05.07.20 12:21, Roman Lebedev via llvm-dev wrote:> On Sun, Jul 5, 2020 at 12:18 PM Shawn Landden via llvm-dev > <llvm-dev at lists.llvm.org> wrote: >> >> >> >> Carry-less multiplication[1] instructions exist (at least optionally) on many architectures: armv8, RISC-V, x86_64, POWER, SPARC, C64x, and possibly more. >> >> This proposal is to add a llvm.clmul instruction. Or if that is contentious, llvm.experimental.bitmanip.clmul instruction. It takes two integer operands of the same width, and returns an integer with twice the width of the operands. (Is there a good reason to make these the same width, as all the other operations do even when it doesn’t really make sense for the mathematical operation–like multiplication or ctpop/ctlz/cttz?) >> >> If the CPU does not have a dedication clmul operation, it can be lowered to regular multiplication, by using holes to avoid carrys. >> >> ==Where is clmul used?=>> >> While somewhat specialized, the RISC-V manual documents many uses: [2] >> >> The classic applications forclmulare Cyclic Redundancy Check (CRC) [11, 26] >> >> and Galois/CounterMode (GCM), but more applications exist, including the following examples.There are obvious applications in hashing and pseudo random number generations. For exam-ple, it has been reported that hashes based on carry-less multiplications can outperform Google’sCityHash [17]. >> >> clmulof a number with itself inserts zeroes between each input bit. This can be useful for generatingMorton code [23]. >> >> clmulof a number with -1 calculates the prefix XOR operation. This can be useful for decodinggray codes.Another application of XOR prefix sums calculated withclmulis branchless tracking of quotedstrings in high-performance parsers. [16] >> >> Carry-less multiply can also be used to implement Erasure code efficiently. [14] >> >> ==clmul lowering without hardware support=>> A 8x8=>16 clmul can also be lowered to a 32x32=>64 multiplication when there is no specialized instruction (also 15x15=>30, to a 60x60=>120, or if bitreverse is available 16x16=>32 to TWO 64x64=>64 multiplications)[3]. >> >> [1] https://en.wikipedia.org/wiki/Carry-less_product >> [2] (page 30) https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf >> [3] https://www.bearssl.org/constanttime.html > > What benefit would this intrinsic would bring to the middle-end IR, > over it's current naive expanded form?Isn't a "naive" expansion of NxN carryless multiply extremely involved? I'd expect something like 2N shifts, N truncs, N selects, and N xors. That link mentions an alternative that is more efficient, but I wouldn't exactly call it naive... Cheers, Nicolai> > Note that teaching backends to produce it, or even adding it to > backend (ISD opcodes) > and matching it in DAGCombiner has much lower barrier of entry, i > would suggest to start there. > >> (First posted to discord >> >> -- >> Shawn Landden > Roman > >> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org >> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-- Lerne, wie die Welt wirklich ist, Aber vergiss niemals, wie sie sein sollte.
Chris Lattner via llvm-dev
2020-Jul-05 20:07 UTC
[llvm-dev] [RFC] carry-less multiplication instruction
> On Jul 5, 2020, at 5:12 AM, Nicolai Hähnle via llvm-dev <llvm-dev at lists.llvm.org> wrote: > >>> [1] https://en.wikipedia.org/wiki/Carry-less_product <https://en.wikipedia.org/wiki/Carry-less_product> >>> [2] (page 30) https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf <https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf> >>> [3] https://www.bearssl.org/constanttime.html <https://www.bearssl.org/constanttime.html> >> What benefit would this intrinsic would bring to the middle-end IR, >> over it's current naive expanded form? > > Isn't a "naive" expansion of NxN carryless multiply extremely involved? I'd expect something like 2N shifts, N truncs, N selects, and N xors. > > That link mentions an alternative that is more efficient, but I wouldn't exactly call it naive…No, the typical expansion (used by the existing LLVM code generator) uses “grade school math” to decompose large multiplications into smaller ones. Wide multiplications (e.g. those on X86) are pretty easy to pattern match from “sign/zero extend operands from 64 bits to 128 bits, then do a 128x128=128 multiply” for example. This is all already handled by the existing selection dag infra, my understanding is that it works well in practice. -Chris -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200705/5b92964b/attachment.html>
Roman Lebedev via llvm-dev
2020-Jul-05 20:09 UTC
[llvm-dev] [RFC] carry-less multiplication instruction
On Sun, Jul 5, 2020 at 11:07 PM Chris Lattner <clattner at nondot.org> wrote:> > > > On Jul 5, 2020, at 5:12 AM, Nicolai Hähnle via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > [1] https://en.wikipedia.org/wiki/Carry-less_product > [2] (page 30) https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf > [3] https://www.bearssl.org/constanttime.html > > What benefit would this intrinsic would bring to the middle-end IR, > over it's current naive expanded form? > > > Isn't a "naive" expansion of NxN carryless multiply extremely involved? I'd expect something like 2N shifts, N truncs, N selects, and N xors. > > That link mentions an alternative that is more efficient, but I wouldn't exactly call it naive… > > > No, the typical expansion (used by the existing LLVM code generator) uses “grade school math” to decompose large multiplications into smaller ones. > > Wide multiplications (e.g. those on X86) are pretty easy to pattern match from “sign/zero extend operands from 64 bits to 128 bits, then do a 128x128=128 multiply” for example.That is what i meant, yes.> This is all already handled by the existing selection dag infra, my understanding is that it works well in practice. > > -ChrisRoman
James Y Knight via llvm-dev
2020-Jul-05 21:28 UTC
[llvm-dev] [RFC] carry-less multiplication instruction
"Carry-less multiplication" isn't a wider bit width multiplication, it's an entirely different operation. (See the Wikipedia page linked in the proposal). On Sun, Jul 5, 2020, 4:07 PM Chris Lattner via llvm-dev < llvm-dev at lists.llvm.org> wrote:> > > On Jul 5, 2020, at 5:12 AM, Nicolai Hähnle via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > [1] https://en.wikipedia.org/wiki/Carry-less_product > [2] (page 30) > https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf > [3] https://www.bearssl.org/constanttime.html > > What benefit would this intrinsic would bring to the middle-end IR, > over it's current naive expanded form? > > > Isn't a "naive" expansion of NxN carryless multiply extremely involved? > I'd expect something like 2N shifts, N truncs, N selects, and N xors. > > That link mentions an alternative that is more efficient, but I wouldn't > exactly call it naive… > > > No, the typical expansion (used by the existing LLVM code generator) uses > “grade school math” to decompose large multiplications into smaller ones. > > Wide multiplications (e.g. those on X86) are pretty easy to pattern match > from “sign/zero extend operands from 64 bits to 128 bits, then do a > 128x128=128 multiply” for example. This is all already handled by the > existing selection dag infra, my understanding is that it works well in > practice. > > -Chris > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200705/2cb09fd3/attachment.html>
Shawn Landden via llvm-dev
2020-Jul-06 10:44 UTC
[llvm-dev] [RFC] carry-less multiplication instruction
<div> </div><div> </div><div>05.07.2020, 07:12, "Nicolai Hähnle" <nhaehnle@gmail.com>:</div><blockquote><p>On 05.07.20 12:21, Roman Lebedev via llvm-dev wrote:</p><blockquote> On Sun, Jul 5, 2020 at 12:18 PM Shawn Landden via llvm-dev<br /> <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<blockquote> This proposal is to add a llvm.clmul instruction.</blockquote><br /> What benefit would this intrinsic would bring to the middle-end IR,<br /> over it's current naive expanded form?</blockquote><p><br />Isn't a "naive" expansion of NxN carryless multiply extremely involved?</p></blockquote><div><br />Yes it is. And this is then sped up with a table (such as in the official GCM spec), however using a table can introduce key-dependent loads and security problems. The 32+32->64 or 64+64->64 multiplication lowering is generally constant-time and does not have these security problems.</div><blockquote><p>I'd expect something like 2N shifts, N truncs, N selects, and N xors.<br /><br />That link mentions an alternative that is more efficient, but I wouldn't<br />exactly call it naive...<br /><br />Cheers,<br />Nicolai<br /><br /> </p><blockquote><br /> Note that teaching backends to produce it, or even adding it to<br /> backend (ISD opcodes)<br /> and matching it in DAGCombiner has much lower barrier of entry, i<br /> would suggest to start there.<br /> <blockquote> (First posted to discord<br /><br /> --<br /> Shawn Landden</blockquote> Roman<br /> <blockquote> _______________________________________________<br /> LLVM Developers mailing list<br /> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br /> <a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a></blockquote> _______________________________________________<br /> LLVM Developers mailing list<br /> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br /> <a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br /> </blockquote><p> </p>--<br />Lerne, wie die Welt wirklich ist,<br />Aber vergiss niemals, wie sie sein sollte.</blockquote><div> </div><div> </div><div>-- </div><div>Shawn Landden</div><div> </div>
Shawn Landden via llvm-dev
2020-Jul-06 11:41 UTC
[llvm-dev] [RFC] carry-less multiplication instruction
05.07.2020, 07:12, "Nicolai Hähnle" <nhaehnle at gmail.com>:> On 05.07.20 12:21, Roman Lebedev via llvm-dev wrote: >> On Sun, Jul 5, 2020 at 12:18 PM Shawn Landden via llvm-dev >> <llvm-dev at lists.llvm.org> wrote: >>> This proposal is to add a llvm.clmul instruction. >> >> What benefit would this intrinsic would bring to the middle-end IR, >> over it's current naive expanded form? > > Isn't a "naive" expansion of NxN carryless multiply extremely involved? > I'd expect something like 2N shifts, N truncs, N selects, and N xors.Yes it is. And this is then sped up with a table (such as in the official GCM spec), however using a table can introduce key-dependent loads and security problems. The 32+32->64 or 64+64->64 multiplication lowering is generally constant-time and does not have these security problems.> > That link mentions an alternative that is more efficient, but I wouldn't > exactly call it naive... > > Cheers, > Nicolai