thr3ads.net - llvm dev - [llvm-dev] [RFC] carry-less multiplication instruction [Jul 2020]

If this information is useful, please help other people find it:
Share via:

Shawn Landden via llvm-dev

2020-Jul-09 16:39 UTC

[llvm-dev] [RFC] carry-less multiplication instruction

05.07.2020, 05:22, "Roman Lebedev" <lebedev.ri at
gmail.com>:> On Sun, Jul 5, 2020 at 12:18 PM Shawn Landden via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>>  Carry-less multiplication[1] instructions exist (at least optionally)
on many architectures: armv8, RISC-V, x86_64, POWER, SPARC, C64x, and possibly
more.
>>
>>  This proposal is to add a llvm.clmul instruction. Or if that is
contentious, llvm.experimental.bitmanip.clmul instruction. It takes two integer
operands of the same width, and returns an integer with twice the width of the
operands. (Is there a good reason to make these the same width, as all the other
operations do even when it doesn’t really make sense for the mathematical
operation–like multiplication or ctpop/ctlz/cttz?)
>>
>>  If the CPU does not have a dedication clmul operation, it can be
lowered to regular multiplication, by using holes to avoid carrys.
>>
>>  ==Where is clmul used?=>>
>>  While somewhat specialized, the RISC-V manual documents many uses: [2]
>>
>>  The classic applications forclmulare Cyclic Redundancy Check (CRC)
[11, 26]
>>
>>  and Galois/CounterMode (GCM), but more applications exist, including
the following examples.There are obvious applications in hashing and pseudo
random number generations. For exam-ple, it has been reported that hashes based
on carry-less multiplications can outperform Google’sCityHash [17].
>>
>>  clmulof a number with itself inserts zeroes between each input bit.
This can be useful for generatingMorton code [23].
>>
>>  clmulof a number with -1 calculates the prefix XOR operation. This can
be useful for decodinggray codes.Another application of XOR prefix sums
calculated withclmulis branchless tracking of quotedstrings in high-performance
parsers. [16]
>>
>>  Carry-less multiply can also be used to implement Erasure code
efficiently. [14]
>>
>>  ==clmul lowering without hardware support=>>  A 8x8=>16 clmul
can also be lowered to a 32x32=>64 multiplication when there is no
specialized instruction (also 15x15=>30, to a 60x60=>120, or if bitreverse
is available 16x16=>32 to TWO 64x64=>64 multiplications)[3].
>>
>>  [1] https://en.wikipedia.org/wiki/Carry-less_product
>>  [2] (page 30)
https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf
>>  [3] https://www.bearssl.org/constanttime.html
>
> What benefit would this intrinsic would bring to the middle-end IR,
> over it's current naive expanded form?
>
> Note that teaching backends to produce it, or even adding it to
> backend (ISD opcodes)
> and matching it in DAGCombiner has much lower barrier of entry, i
> would suggest to start there.
It cannot be matched.>
>>  (First posted to discord
>>
>>  --
>>  Shawn Landden
>
> Roman
>
>>  _______________________________________________
>>  LLVM Developers mailing list
>>  llvm-dev at lists.llvm.org
>>  https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-- 
Shawn Landden

Krzysztof Parzyszek via llvm-dev

2020-Jul-09 17:27 UTC

head link

[llvm-dev] [RFC] carry-less multiplication instruction

FWIW, Hexagon has a pass to recognize polynomial multiplication:
  llvm/lib/Target/Hexagon/HexagonLoopIdiomRecognition.cpp
See "PolynomialMultiplyRecognize"

--
Krzysztof Parzyszek  kparzysz at quicinc.com   AI tools development
> -----Original Message-----
> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of
Shawn Landden
> via llvm-dev
> Sent: Thursday, July 9, 2020 11:40 AM
> To: Roman Lebedev <lebedev.ri at gmail.com>
> Cc: llvm-dev at lists.llvm.org
> Subject: [EXT] Re: [llvm-dev] [RFC] carry-less multiplication instruction
>
>
>
> 05.07.2020, 05:22, "Roman Lebedev" <lebedev.ri at
gmail.com>:
> > On Sun, Jul 5, 2020 at 12:18 PM Shawn Landden via llvm-dev
> > <llvm-dev at lists.llvm.org> wrote:
> >>  Carry-less multiplication[1] instructions exist (at least
optionally) on
> many architectures: armv8, RISC-V, x86_64, POWER, SPARC, C64x, and possibly
> more.
> >>
> >>  This proposal is to add a llvm.clmul instruction. Or if that is
> >> contentious, llvm.experimental.bitmanip.clmul instruction. It
takes
> >> two integer operands of the same width, and returns an integer
with
> >> twice the width of the operands. (Is there a good reason to make
> >> these the same width, as all the other operations do even when it
> >> doesn’t really make sense for the mathematical operation–like
> >> multiplication or ctpop/ctlz/cttz?)
> >>
> >>  If the CPU does not have a dedication clmul operation, it can be
lowered
> to regular multiplication, by using holes to avoid carrys.
> >>
> >>  ==Where is clmul used?=> >>
> >>  While somewhat specialized, the RISC-V manual documents many
uses:
> >> [2]
> >>
> >>  The classic applications forclmulare Cyclic Redundancy Check
(CRC)
> >> [11, 26]
> >>
> >>  and Galois/CounterMode (GCM), but more applications exist,
including the
> following examples.There are obvious applications in hashing and pseudo
> random number generations. For exam-ple, it has been reported that hashes
> based on carry-less multiplications can outperform Google’sCityHash [17].
> >>
> >>  clmulof a number with itself inserts zeroes between each input
bit. This
> can be useful for generatingMorton code [23].
> >>
> >>  clmulof a number with -1 calculates the prefix XOR operation.
This
> >> can be useful for decodinggray codes.Another application of XOR
> >> prefix sums calculated withclmulis branchless tracking of
> >> quotedstrings in high-performance parsers. [16]
> >>
> >>  Carry-less multiply can also be used to implement Erasure code
> >> efficiently. [14]
> >>
> >>  ==clmul lowering without hardware support=> >>  A
8x8=>16 clmul can also be lowered to a 32x32=>64 multiplication when
> there is no specialized instruction (also 15x15=>30, to a 60x60=>120,
or if
> bitreverse is available 16x16=>32 to TWO 64x64=>64
multiplications)[3].
> >>
> >>  [1] https://en.wikipedia.org/wiki/Carry-less_product
> >>  [2] (page 30)
> >>
https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmani
> >> p-0.92.pdf
> >>  [3] https://www.bearssl.org/constanttime.html
> >
> > What benefit would this intrinsic would bring to the middle-end IR,
> > over it's current naive expanded form?
> >
> > Note that teaching backends to produce it, or even adding it to
> > backend (ISD opcodes) and matching it in DAGCombiner has much lower
> > barrier of entry, i would suggest to start there.
> It cannot be matched.
> >
> >>  (First posted to discord
> >>
> >>  --
> >>  Shawn Landden
> >
> > Roman
> >
> >>  _______________________________________________
> >>  LLVM Developers mailing list
> >>  llvm-dev at lists.llvm.org
> >>  https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
> --
> Shawn Landden
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Craig Topper via llvm-dev

2020-Jul-09 19:15 UTC

head link

[llvm-dev] [RFC] carry-less multiplication instruction

I think i'd prefer to have the output type match the input type. Makes it
more similar to other intrinsics and binary operators. We should be able to
zero extend the inputs if you want something like 64x64->128.

Are you planning to expose this to C through clang? What types would we
expose?

~Craig


On Thu, Jul 9, 2020 at 10:28 AM Krzysztof Parzyszek via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> FWIW, Hexagon has a pass to recognize polynomial multiplication:
>   llvm/lib/Target/Hexagon/HexagonLoopIdiomRecognition.cpp
> See "PolynomialMultiplyRecognize"
>
> --
> Krzysztof Parzyszek  kparzysz at quicinc.com   AI tools development
>
> > -----Original Message-----
> > From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of
Shawn
> Landden
> > via llvm-dev
> > Sent: Thursday, July 9, 2020 11:40 AM
> > To: Roman Lebedev <lebedev.ri at gmail.com>
> > Cc: llvm-dev at lists.llvm.org
> > Subject: [EXT] Re: [llvm-dev] [RFC] carry-less multiplication
instruction
> >
> >
> >
> > 05.07.2020, 05:22, "Roman Lebedev" <lebedev.ri at
gmail.com>:
> > > On Sun, Jul 5, 2020 at 12:18 PM Shawn Landden via llvm-dev
> > > <llvm-dev at lists.llvm.org> wrote:
> > >>  Carry-less multiplication[1] instructions exist (at least
> optionally) on
> > many architectures: armv8, RISC-V, x86_64, POWER, SPARC, C64x, and
> possibly
> > more.
> > >>
> > >>  This proposal is to add a llvm.clmul instruction. Or if that
is
> > >> contentious, llvm.experimental.bitmanip.clmul instruction. It
takes
> > >> two integer operands of the same width, and returns an
integer with
> > >> twice the width of the operands. (Is there a good reason to
make
> > >> these the same width, as all the other operations do even
when it
> > >> doesn’t really make sense for the mathematical operation–like
> > >> multiplication or ctpop/ctlz/cttz?)
> > >>
> > >>  If the CPU does not have a dedication clmul operation, it
can be
> lowered
> > to regular multiplication, by using holes to avoid carrys.
> > >>
> > >>  ==Where is clmul used?=> > >>
> > >>  While somewhat specialized, the RISC-V manual documents many
uses:
> > >> [2]
> > >>
> > >>  The classic applications forclmulare Cyclic Redundancy Check
(CRC)
> > >> [11, 26]
> > >>
> > >>  and Galois/CounterMode (GCM), but more applications exist,
including
> the
> > following examples.There are obvious applications in hashing and
pseudo
> > random number generations. For exam-ple, it has been reported that
hashes
> > based on carry-less multiplications can outperform Google’sCityHash
[17].
> > >>
> > >>  clmulof a number with itself inserts zeroes between each
input bit.
> This
> > can be useful for generatingMorton code [23].
> > >>
> > >>  clmulof a number with -1 calculates the prefix XOR
operation. This
> > >> can be useful for decodinggray codes.Another application of
XOR
> > >> prefix sums calculated withclmulis branchless tracking of
> > >> quotedstrings in high-performance parsers. [16]
> > >>
> > >>  Carry-less multiply can also be used to implement Erasure
code
> > >> efficiently. [14]
> > >>
> > >>  ==clmul lowering without hardware support=> > >>
A 8x8=>16 clmul can also be lowered to a 32x32=>64 multiplication
> when
> > there is no specialized instruction (also 15x15=>30, to a
60x60=>120, or
> if
> > bitreverse is available 16x16=>32 to TWO 64x64=>64
multiplications)[3].
> > >>
> > >>  [1] https://en.wikipedia.org/wiki/Carry-less_product
> > >>  [2] (page 30)
> > >>
https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmani
> > >> p-0.92.pdf
> > >>  [3] https://www.bearssl.org/constanttime.html
> > >
> > > What benefit would this intrinsic would bring to the middle-end
IR,
> > > over it's current naive expanded form?
> > >
> > > Note that teaching backends to produce it, or even adding it to
> > > backend (ISD opcodes) and matching it in DAGCombiner has much
lower
> > > barrier of entry, i would suggest to start there.
> > It cannot be matched.
> > >
> > >>  (First posted to discord
> > >>
> > >>  --
> > >>  Shawn Landden
> > >
> > > Roman
> > >
> > >>  _______________________________________________
> > >>  LLVM Developers mailing list
> > >>  llvm-dev at lists.llvm.org
> > >>  https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >
> > --
> > Shawn Landden
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200709/13f0ae9f/attachment.html>

llvm dev - Jul 2020 - [RFC] carry-less multiplication instruction

[llvm-dev] [RFC] carry-less multiplication instruction

[llvm-dev] [RFC] carry-less multiplication instruction

[llvm-dev] [RFC] carry-less multiplication instruction