thr3ads.net - llvm dev - [llvm-dev] [RFC] carry-less multiplication instruction [Jul 2020]

If this information is useful, please help other people find it:
Share via:

Shawn Landden via llvm-dev

2020-Jul-05 09:18 UTC

[llvm-dev] [RFC] carry-less multiplication instruction

<div> </div><div><div><p>Carry-less
multiplication[1] instructions exist (at least optionally) on many
architectures: armv8, RISC-V, x86_64, POWER, SPARC, C64x, and possibly
more.</p><p>This proposal is to add a
<code>llvm.clmul</code> instruction. Or if that is contentious,
<code>llvm.experimental.bitmanip.clmul</code> instruction. It takes
two integer operands of the same width, and returns an integer with twice the
width of the operands. (Is there a good reason to make these the same width, as
all the other operations do even when it doesn’t really make sense for the
mathematical operation–like multiplication or
ctpop/ctlz/cttz?)</p><p>If the CPU does not have a dedication clmul
operation, it can be lowered to regular multiplication, by using holes to avoid
carrys.</p><p>==Where is clmul used?==</p><p>While
somewhat specialized, the RISC-V manual documents many uses:
[2]</p><p>The classic applications forclmulare Cyclic Redundancy
Check (CRC) [11, 26]</p><p>and Galois/CounterMode (GCM), but more
applications exist, including the following examples.There are obvious
applications in hashing and pseudo random number generations. For exam-ple, it
has been reported that hashes based on carry-less multiplications can outperform
Google’sCityHash [17].</p><p>clmulof a number with itself inserts
zeroes between each input bit. This can be useful for generatingMorton code
[23].</p><p>clmulof a number with -1 calculates the prefix XOR
operation. This can be useful for decodinggray codes.Another application of XOR
prefix sums calculated withclmulis branchless tracking of quotedstrings in
high-performance parsers. [16]</p><p>Carry-less multiply can also be
used to implement Erasure code efficiently. [14]</p><p>==clmul
lowering without hardware support==<br />A 8x8=>16 clmul can also be
lowered to a 32x32=>64 multiplication when there is no specialized
instruction (also 15x15=>30, to a 60x60=>120, or if bitreverse is
available 16x16=>32 to TWO 64x64=>64
multiplications)[3].</p><p>[1] <a
href="https://en.wikipedia.org/wiki/Carry-less_product">https://en.wikipedia.org/wiki/Carry-less_product</a><br
/><a
href="https://en.wikipedia.org/wiki/Carry-less_product%5B2%5D%20(page%2030)%20https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf%5B3%5D%20https://www.bearssl.org/constanttime.html">[2]
(page 30) </a><a
href="https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf">https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf</a><br
/><a
href="https://en.wikipedia.org/wiki/Carry-less_product%5B2%5D%20(page%2030)%20https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf%5B3%5D%20https://www.bearssl.org/constanttime.html">[3]
</a><a
href="https://www.bearssl.org/constanttime.html">https://www.bearssl.org/constanttime.html</a></p><p> </p><p>(First
posted to
discord</p></div></div><div>-- </div><div>Shawn
Landden</div><div> </div>

Roman Lebedev via llvm-dev

2020-Jul-05 10:21 UTC

head link

[llvm-dev] [RFC] carry-less multiplication instruction

On Sun, Jul 5, 2020 at 12:18 PM Shawn Landden via llvm-dev
<llvm-dev at lists.llvm.org> wrote:>
>
>
> Carry-less multiplication[1] instructions exist (at least optionally) on
many architectures: armv8, RISC-V, x86_64, POWER, SPARC, C64x, and possibly
more.
>
> This proposal is to add a llvm.clmul instruction. Or if that is
contentious, llvm.experimental.bitmanip.clmul instruction. It takes two integer
operands of the same width, and returns an integer with twice the width of the
operands. (Is there a good reason to make these the same width, as all the other
operations do even when it doesn’t really make sense for the mathematical
operation–like multiplication or ctpop/ctlz/cttz?)
>
> If the CPU does not have a dedication clmul operation, it can be lowered to
regular multiplication, by using holes to avoid carrys.
>
> ==Where is clmul used?=>
> While somewhat specialized, the RISC-V manual documents many uses: [2]
>
> The classic applications forclmulare Cyclic Redundancy Check (CRC) [11, 26]
>
> and Galois/CounterMode (GCM), but more applications exist, including the
following examples.There are obvious applications in hashing and pseudo random
number generations. For exam-ple, it has been reported that hashes based on
carry-less multiplications can outperform Google’sCityHash [17].
>
> clmulof a number with itself inserts zeroes between each input bit. This
can be useful for generatingMorton code [23].
>
> clmulof a number with -1 calculates the prefix XOR operation. This can be
useful for decodinggray codes.Another application of XOR prefix sums calculated
withclmulis branchless tracking of quotedstrings in high-performance parsers.
[16]
>
> Carry-less multiply can also be used to implement Erasure code efficiently.
[14]
>
> ==clmul lowering without hardware support=> A 8x8=>16 clmul can also
be lowered to a 32x32=>64 multiplication when there is no specialized
instruction (also 15x15=>30, to a 60x60=>120, or if bitreverse is
available 16x16=>32 to TWO 64x64=>64 multiplications)[3].
>
> [1] https://en.wikipedia.org/wiki/Carry-less_product
> [2] (page 30)
https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf
> [3] https://www.bearssl.org/constanttime.html
What benefit would this intrinsic would bring to the middle-end IR,
over it's current naive expanded form?

Note that teaching backends to produce it, or even adding it to
backend (ISD opcodes)
and matching it in DAGCombiner has much lower barrier of entry, i
would suggest to start there.
> (First posted to discord
>
> --
> Shawn LanddenRoman
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Nicolai Hähnle via llvm-dev

2020-Jul-05 12:12 UTC

head link

[llvm-dev] [RFC] carry-less multiplication instruction

On 05.07.20 12:21, Roman Lebedev via llvm-dev wrote:> On Sun, Jul 5, 2020 at 12:18 PM Shawn Landden via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>>
>>
>>
>> Carry-less multiplication[1] instructions exist (at least optionally)
on many architectures: armv8, RISC-V, x86_64, POWER, SPARC, C64x, and possibly
more.
>>
>> This proposal is to add a llvm.clmul instruction. Or if that is
contentious, llvm.experimental.bitmanip.clmul instruction. It takes two integer
operands of the same width, and returns an integer with twice the width of the
operands. (Is there a good reason to make these the same width, as all the other
operations do even when it doesn’t really make sense for the mathematical
operation–like multiplication or ctpop/ctlz/cttz?)
>>
>> If the CPU does not have a dedication clmul operation, it can be
lowered to regular multiplication, by using holes to avoid carrys.
>>
>> ==Where is clmul used?=>>
>> While somewhat specialized, the RISC-V manual documents many uses: [2]
>>
>> The classic applications forclmulare Cyclic Redundancy Check (CRC) [11,
26]
>>
>> and Galois/CounterMode (GCM), but more applications exist, including
the following examples.There are obvious applications in hashing and pseudo
random number generations. For exam-ple, it has been reported that hashes based
on carry-less multiplications can outperform Google’sCityHash [17].
>>
>> clmulof a number with itself inserts zeroes between each input bit.
This can be useful for generatingMorton code [23].
>>
>> clmulof a number with -1 calculates the prefix XOR operation. This can
be useful for decodinggray codes.Another application of XOR prefix sums
calculated withclmulis branchless tracking of quotedstrings in high-performance
parsers. [16]
>>
>> Carry-less multiply can also be used to implement Erasure code
efficiently. [14]
>>
>> ==clmul lowering without hardware support=>> A 8x8=>16 clmul
can also be lowered to a 32x32=>64 multiplication when there is no
specialized instruction (also 15x15=>30, to a 60x60=>120, or if bitreverse
is available 16x16=>32 to TWO 64x64=>64 multiplications)[3].
>>
>> [1] https://en.wikipedia.org/wiki/Carry-less_product
>> [2] (page 30)
https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf
>> [3] https://www.bearssl.org/constanttime.html
> 
> What benefit would this intrinsic would bring to the middle-end IR,
> over it's current naive expanded form?
Isn't a "naive" expansion of NxN carryless multiply extremely
involved?
I'd expect something like 2N shifts, N truncs, N selects, and N xors.

That link mentions an alternative that is more efficient, but I wouldn't 
exactly call it naive...

Cheers,
Nicolai

> 
> Note that teaching backends to produce it, or even adding it to
> backend (ISD opcodes)
> and matching it in DAGCombiner has much lower barrier of entry, i
> would suggest to start there.
> 
>> (First posted to discord
>>
>> --
>> Shawn Landden
> Roman
> 
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 
-- 
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.

James Y Knight via llvm-dev

2020-Jul-05 14:54 UTC

head link

[llvm-dev] [RFC] carry-less multiplication instruction

It'd be useful in your proposal to to note which of the existing llvm
target specific intrinsics this generic intrinsic can effectively
supersede. (E.g. llvm.x86.pclmulqdq for x86.)

When we are already supporting a given function via target specific
intrinsics for a number of different targets, that seems a pretty good
argument for making it available as a more generic target independent
intrinsic.

On Sun, Jul 5, 2020, 5:18 AM Shawn Landden via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
>
>
> Carry-less multiplication[1] instructions exist (at least optionally) on
> many architectures: armv8, RISC-V, x86_64, POWER, SPARC, C64x, and possibly
> more.
>
> This proposal is to add a llvm.clmul instruction. Or if that is
> contentious, llvm.experimental.bitmanip.clmul instruction. It takes two
> integer operands of the same width, and returns an integer with twice the
> width of the operands. (Is there a good reason to make these the same
> width, as all the other operations do even when it doesn’t really make
> sense for the mathematical operation–like multiplication or
> ctpop/ctlz/cttz?)
>
> If the CPU does not have a dedication clmul operation, it can be lowered
> to regular multiplication, by using holes to avoid carrys.
>
> ==Where is clmul used?=>
> While somewhat specialized, the RISC-V manual documents many uses: [2]
>
> The classic applications forclmulare Cyclic Redundancy Check (CRC) [11, 26]
>
> and Galois/CounterMode (GCM), but more applications exist, including the
> following examples.There are obvious applications in hashing and pseudo
> random number generations. For exam-ple, it has been reported that hashes
> based on carry-less multiplications can outperform Google’sCityHash [17].
>
> clmulof a number with itself inserts zeroes between each input bit. This
> can be useful for generatingMorton code [23].
>
> clmulof a number with -1 calculates the prefix XOR operation. This can be
> useful for decodinggray codes.Another application of XOR prefix sums
> calculated withclmulis branchless tracking of quotedstrings in
> high-performance parsers. [16]
>
> Carry-less multiply can also be used to implement Erasure code
> efficiently. [14]
>
> ==clmul lowering without hardware support=> A 8x8=>16 clmul can also
be lowered to a 32x32=>64 multiplication when
> there is no specialized instruction (also 15x15=>30, to a 60x60=>120,
or if
> bitreverse is available 16x16=>32 to TWO 64x64=>64
multiplications)[3].
>
> [1] https://en.wikipedia.org/wiki/Carry-less_product
> [2] (page 30)
>
<https://en.wikipedia.org/wiki/Carry-less_product%5B2%5D%20(page%2030)%20https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf%5B3%5D%20https://www.bearssl.org/constanttime.html>
>
https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf
> [3]
>
<https://en.wikipedia.org/wiki/Carry-less_product%5B2%5D%20(page%2030)%20https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf%5B3%5D%20https://www.bearssl.org/constanttime.html>
> https://www.bearssl.org/constanttime.html
>
>
>
> (First posted to discord
> --
> Shawn Landden
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200705/a56891ea/attachment.html>

Craig Topper via llvm-dev

2020-Jul-05 18:44 UTC

head link

[llvm-dev] [RFC] carry-less multiplication instruction

Shawn,

Are you able to summarize the different instructions from the various
targets. It looks like there different implementation choices made for each
target. For example, X86 takes two v2i64 inputs and picks either an even or
odd element from each to multiply to produce a v1i128 result. It looks like
RISC-V has instructions to produce either the high half of the result or
the low half of the result. Those are the only two I checked.

Will a common intrinsic need custom handling for each target or is there a
common version that multiple targets use that we should choose for the
intrinsic?

~Craig


On Sun, Jul 5, 2020 at 7:55 AM James Y Knight via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> It'd be useful in your proposal to to note which of the existing llvm
> target specific intrinsics this generic intrinsic can effectively
> supersede. (E.g. llvm.x86.pclmulqdq for x86.)
>
> When we are already supporting a given function via target specific
> intrinsics for a number of different targets, that seems a pretty good
> argument for making it available as a more generic target independent
> intrinsic.
>
> On Sun, Jul 5, 2020, 5:18 AM Shawn Landden via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>>
>>
>> Carry-less multiplication[1] instructions exist (at least optionally)
on
>> many architectures: armv8, RISC-V, x86_64, POWER, SPARC, C64x, and
possibly
>> more.
>>
>> This proposal is to add a llvm.clmul instruction. Or if that is
>> contentious, llvm.experimental.bitmanip.clmul instruction. It takes two
>> integer operands of the same width, and returns an integer with twice
the
>> width of the operands. (Is there a good reason to make these the same
>> width, as all the other operations do even when it doesn’t really make
>> sense for the mathematical operation–like multiplication or
>> ctpop/ctlz/cttz?)
>>
>> If the CPU does not have a dedication clmul operation, it can be
lowered
>> to regular multiplication, by using holes to avoid carrys.
>>
>> ==Where is clmul used?=>>
>> While somewhat specialized, the RISC-V manual documents many uses: [2]
>>
>> The classic applications forclmulare Cyclic Redundancy Check (CRC) [11,
>> 26]
>>
>> and Galois/CounterMode (GCM), but more applications exist, including
the
>> following examples.There are obvious applications in hashing and pseudo
>> random number generations. For exam-ple, it has been reported that
hashes
>> based on carry-less multiplications can outperform Google’sCityHash
[17].
>>
>> clmulof a number with itself inserts zeroes between each input bit.
This
>> can be useful for generatingMorton code [23].
>>
>> clmulof a number with -1 calculates the prefix XOR operation. This can
be
>> useful for decodinggray codes.Another application of XOR prefix sums
>> calculated withclmulis branchless tracking of quotedstrings in
>> high-performance parsers. [16]
>>
>> Carry-less multiply can also be used to implement Erasure code
>> efficiently. [14]
>>
>> ==clmul lowering without hardware support=>> A 8x8=>16 clmul
can also be lowered to a 32x32=>64 multiplication when
>> there is no specialized instruction (also 15x15=>30, to a
60x60=>120, or if
>> bitreverse is available 16x16=>32 to TWO 64x64=>64
multiplications)[3].
>>
>> [1] https://en.wikipedia.org/wiki/Carry-less_product
>> [2] (page 30)
>>
<https://en.wikipedia.org/wiki/Carry-less_product%5B2%5D%20(page%2030)%20https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf%5B3%5D%20https://www.bearssl.org/constanttime.html>
>>
https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf
>> [3]
>>
<https://en.wikipedia.org/wiki/Carry-less_product%5B2%5D%20(page%2030)%20https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf%5B3%5D%20https://www.bearssl.org/constanttime.html>
>> https://www.bearssl.org/constanttime.html
>>
>>
>>
>> (First posted to discord
>> --
>> Shawn Landden
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200705/7d5f05ff/attachment.html>

Steve (Numerics) Canon via llvm-dev

2020-Jul-08 16:23 UTC

head link

[llvm-dev] [RFC] carry-less multiplication instruction

FWIW, this seems like a no-brainer to me (as llvm.experimental initially),
assuming that it can be designed in such a way that it would eliminate the need
for intrinsics on at least two targets (I think it should be possible to do so,
with a small amount of back-end work).

– Steve
> On Jul 5, 2020, at 5:18 AM, Shawn Landden via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> Carry-less multiplication[1] instructions exist (at least optionally) on
many architectures: armv8, RISC-V, x86_64, POWER, SPARC, C64x, and possibly
more.
> 
> This proposal is to add a llvm.clmul instruction. Or if that is
contentious, llvm.experimental.bitmanip.clmul instruction. It takes two integer
operands of the same width, and returns an integer with twice the width of the
operands. (Is there a good reason to make these the same width, as all the other
operations do even when it doesn’t really make sense for the mathematical
operation–like multiplication or ctpop/ctlz/cttz?)
> 
> If the CPU does not have a dedication clmul operation, it can be lowered to
regular multiplication, by using holes to avoid carrys.
> 
> ==Where is clmul used?=> 
> While somewhat specialized, the RISC-V manual documents many uses: [2]
> 
> The classic applications forclmulare Cyclic Redundancy Check (CRC) [11, 26]
> 
> and Galois/CounterMode (GCM), but more applications exist, including the
following examples.There are obvious applications in hashing and pseudo random
number generations. For exam-ple, it has been reported that hashes based on
carry-less multiplications can outperform Google’sCityHash [17].
> 
> clmulof a number with itself inserts zeroes between each input bit. This
can be useful for generatingMorton code [23].
> 
> clmulof a number with -1 calculates the prefix XOR operation. This can be
useful for decodinggray codes.Another application of XOR prefix sums calculated
withclmulis branchless tracking of quotedstrings in high-performance parsers.
[16]
> 
> Carry-less multiply can also be used to implement Erasure code efficiently.
[14]
> 
> ==clmul lowering without hardware support=> A 8x8=>16 clmul can also
be lowered to a 32x32=>64 multiplication when there is no specialized
instruction (also 15x15=>30, to a 60x60=>120, or if bitreverse is
available 16x16=>32 to TWO 64x64=>64 multiplications)[3].
> 
> [1] https://en.wikipedia.org/wiki/Carry-less_product
<https://en.wikipedia.org/wiki/Carry-less_product>
> [2] (page 30) 
<https://en.wikipedia.org/wiki/Carry-less_product%5B2%5D%20(page%2030)%20https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf%5B3%5D%20https://www.bearssl.org/constanttime.html>https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf
<https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf>
> [3] 
<https://en.wikipedia.org/wiki/Carry-less_product%5B2%5D%20(page%2030)%20https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf%5B3%5D%20https://www.bearssl.org/constanttime.html>https://www.bearssl.org/constanttime.html
<https://www.bearssl.org/constanttime.html>
>  
> (First posted to discord
> 
> -- 
> Shawn Landden
>  
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200708/05fc4700/attachment.html>

Roman Lebedev via llvm-dev

2020-Jul-09 14:41 UTC

head link

[llvm-dev] [RFC] carry-less multiplication instruction

(As per IRC discussion)

I understand that the carry-less multiplication algorithm has it's uses
since/and it is implemented as an instruction in many architectures
and that adding it as a general-purpose intrinsic will allow us
to drop target-specific intrinsics as by-product.

What i do *NOT* understand is: what is the actual/main goal/driving
factor of adding an LLVM intrinsic for it?

The use that was mentioned is crypto, and i'm personally not really
registering anything else. Am i just misreading it?
The crypto use-case doesn't make sense to me, because
as of this moment LLVM "explicitly" has zero constant-time
guarantees for LLVM IR instructions/intrinsics.

I feel like it's a really important question.
If there isn't interest in crypto/constant-time here, i think it would
be best to explicitly state so. If there is, i think it may be good
to hear from Chandler (who IIRC is driving some constant-time
work for C++/LLVM, that is not yet widely public)

Roman

On Wed, Jul 8, 2020 at 7:23 PM Steve (Numerics) Canon via llvm-dev
<llvm-dev at lists.llvm.org> wrote:>
> FWIW, this seems like a no-brainer to me (as llvm.experimental initially),
assuming that it can be designed in such a way that it would eliminate the need
for intrinsics on at least two targets (I think it should be possible to do so,
with a small amount of back-end work).
>
> – Steve
>
> On Jul 5, 2020, at 5:18 AM, Shawn Landden via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
>
> Carry-less multiplication[1] instructions exist (at least optionally) on
many architectures: armv8, RISC-V, x86_64, POWER, SPARC, C64x, and possibly
more.
>
> This proposal is to add a llvm.clmul instruction. Or if that is
contentious, llvm.experimental.bitmanip.clmul instruction. It takes two integer
operands of the same width, and returns an integer with twice the width of the
operands. (Is there a good reason to make these the same width, as all the other
operations do even when it doesn’t really make sense for the mathematical
operation–like multiplication or ctpop/ctlz/cttz?)
>
> If the CPU does not have a dedication clmul operation, it can be lowered to
regular multiplication, by using holes to avoid carrys.
>
> ==Where is clmul used?=>
> While somewhat specialized, the RISC-V manual documents many uses: [2]
>
> The classic applications forclmulare Cyclic Redundancy Check (CRC) [11, 26]
>
> and Galois/CounterMode (GCM), but more applications exist, including the
following examples.There are obvious applications in hashing and pseudo random
number generations. For exam-ple, it has been reported that hashes based on
carry-less multiplications can outperform Google’sCityHash [17].
>
> clmulof a number with itself inserts zeroes between each input bit. This
can be useful for generatingMorton code [23].
>
> clmulof a number with -1 calculates the prefix XOR operation. This can be
useful for decodinggray codes.Another application of XOR prefix sums calculated
withclmulis branchless tracking of quotedstrings in high-performance parsers.
[16]
>
> Carry-less multiply can also be used to implement Erasure code efficiently.
[14]
>
> ==clmul lowering without hardware support=> A 8x8=>16 clmul can also
be lowered to a 32x32=>64 multiplication when there is no specialized
instruction (also 15x15=>30, to a 60x60=>120, or if bitreverse is
available 16x16=>32 to TWO 64x64=>64 multiplications)[3].
>
> [1] https://en.wikipedia.org/wiki/Carry-less_product
> [2] (page 30)
https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf
> [3] https://www.bearssl.org/constanttime.html
>
>
>
> (First posted to discord
>
> --
> Shawn Landden
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Shawn Landden via llvm-dev

2020-Jul-09 16:39 UTC

head link

[llvm-dev] [RFC] carry-less multiplication instruction

05.07.2020, 05:22, "Roman Lebedev" <lebedev.ri at
gmail.com>:> On Sun, Jul 5, 2020 at 12:18 PM Shawn Landden via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>>  Carry-less multiplication[1] instructions exist (at least optionally)
on many architectures: armv8, RISC-V, x86_64, POWER, SPARC, C64x, and possibly
more.
>>
>>  This proposal is to add a llvm.clmul instruction. Or if that is
contentious, llvm.experimental.bitmanip.clmul instruction. It takes two integer
operands of the same width, and returns an integer with twice the width of the
operands. (Is there a good reason to make these the same width, as all the other
operations do even when it doesn’t really make sense for the mathematical
operation–like multiplication or ctpop/ctlz/cttz?)
>>
>>  If the CPU does not have a dedication clmul operation, it can be
lowered to regular multiplication, by using holes to avoid carrys.
>>
>>  ==Where is clmul used?=>>
>>  While somewhat specialized, the RISC-V manual documents many uses: [2]
>>
>>  The classic applications forclmulare Cyclic Redundancy Check (CRC)
[11, 26]
>>
>>  and Galois/CounterMode (GCM), but more applications exist, including
the following examples.There are obvious applications in hashing and pseudo
random number generations. For exam-ple, it has been reported that hashes based
on carry-less multiplications can outperform Google’sCityHash [17].
>>
>>  clmulof a number with itself inserts zeroes between each input bit.
This can be useful for generatingMorton code [23].
>>
>>  clmulof a number with -1 calculates the prefix XOR operation. This can
be useful for decodinggray codes.Another application of XOR prefix sums
calculated withclmulis branchless tracking of quotedstrings in high-performance
parsers. [16]
>>
>>  Carry-less multiply can also be used to implement Erasure code
efficiently. [14]
>>
>>  ==clmul lowering without hardware support=>>  A 8x8=>16 clmul
can also be lowered to a 32x32=>64 multiplication when there is no
specialized instruction (also 15x15=>30, to a 60x60=>120, or if bitreverse
is available 16x16=>32 to TWO 64x64=>64 multiplications)[3].
>>
>>  [1] https://en.wikipedia.org/wiki/Carry-less_product
>>  [2] (page 30)
https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf
>>  [3] https://www.bearssl.org/constanttime.html
>
> What benefit would this intrinsic would bring to the middle-end IR,
> over it's current naive expanded form?
>
> Note that teaching backends to produce it, or even adding it to
> backend (ISD opcodes)
> and matching it in DAGCombiner has much lower barrier of entry, i
> would suggest to start there.
It cannot be matched.>
>>  (First posted to discord
>>
>>  --
>>  Shawn Landden
>
> Roman
>
>>  _______________________________________________
>>  LLVM Developers mailing list
>>  llvm-dev at lists.llvm.org
>>  https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-- 
Shawn Landden

Shawn Landden via llvm-dev

2020-Jul-19 15:23 UTC

head link

[llvm-dev] Difficulty matching zext in tablegen/SelectionDAG (was: [RFC] Carry-less multiplication)

In trying to implement my RFC I ran into difficulty selecting
zero-extends, either in a custom lowering, or in tablegen.

For example, Power 8 has <2 x i32> -> <2 x i64>, <4 x i16>
-> <4 x
i32>, <1 x i64> -> <1 x i128> instructions.

And the generated MIR looks different for the different cases:

LLVM ERROR: Cannot select: t10: v2i64 = clmul t36, t35
  t36: v2i64 = bitcast t33
    t33: v16i8 vector_shuffle<0,1,2,3,16,17,18,19,4,5,6,7,20,21,22,23>
t32, t29
      t32: v16i8 = bitcast t2
        t2: v4i32,ch = CopyFromReg t0, Register:v4i32 %0
          t1: v4i32 = Register %0
      t29: v16i8 = bitcast t19
        t19: v4i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>,
Constant:i32<0>, Constant:i32<0>
          t18: i32 = Constant<0>
          t18: i32 = Constant<0>
          t18: i32 = Constant<0>
          t18: i32 = Constant<0>
  t35: v2i64 = bitcast t30
    t30: v16i8 vector_shuffle<0,1,2,3,16,17,18,19,4,5,6,7,20,21,22,23>
t28, t29
      t28: v16i8 = bitcast t4
        t4: v4i32,ch = CopyFromReg t0, Register:v4i32 %1
          t3: v4i32 = Register %1
      t29: v16i8 = bitcast t19
        t19: v4i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>,
Constant:i32<0>, Constant:i32<0>
          t18: i32 = Constant<0>
          t18: i32 = Constant<0>
          t18: i32 = Constant<0>
          t18: i32 = Constant<0>

LLVM ERROR: Cannot select: t9: v1i128 = clmul t24, t28
  t24: v1i128 = bitcast t23
    t23: v2i64 = BUILD_VECTOR t2, Constant:i64<0>
      t2: i64,ch = CopyFromReg t0, Register:i64 %0
        t1: i64 = Register %0
      t13: i64 = Constant<0>
  t28: v1i128 = bitcast t27
    t27: v2i64 = BUILD_VECTOR t4, Constant:i64<0>
      t4: i64,ch = CopyFromReg t0, Register:i64 %1
        t3: i64 = Register %1
      t13: i64 = Constant<0>


I might be able to use demandedBits in a custom rule, but it sure
would nice if zext actually worked, like it would in LLVM-IR. The
burden of learning totally distinct tools for different levels of the
compiler (in addition to the assembly of the target) is quite
annoying.

Possibly Parallel Threads

Search for more maybe matching threads

llvm dev - Jul 2020 - [RFC] carry-less multiplication instruction

[llvm-dev] [RFC] carry-less multiplication instruction

[llvm-dev] [RFC] carry-less multiplication instruction

[llvm-dev] [RFC] carry-less multiplication instruction

[llvm-dev] [RFC] carry-less multiplication instruction

[llvm-dev] [RFC] carry-less multiplication instruction

[llvm-dev] [RFC] carry-less multiplication instruction

[llvm-dev] [RFC] carry-less multiplication instruction

[llvm-dev] [RFC] carry-less multiplication instruction

[llvm-dev] Difficulty matching zext in tablegen/SelectionDAG (was: [RFC] Carry-less multiplication)

Possibly Parallel Threads