Alexey Lapshin via llvm-dev
2020-May-18  12:28 UTC
[llvm-dev] [AARCH64][NEON] Do we need extra builtin for vmull_high_p64?
Folks, we encountered a problem: the vmull_high_p64 intrinsic does not generate the PMULL2 instruction.
This happens because vmull_high_p64 is implemented in terms of vmull_p64:
arm_neon.h:
__ai poly128_t vmull_high_p64(poly64x2_t __p0, poly64x2_t __p1) {
  poly128_t __ret;
  __ret = vmull_p64((poly64_t)(vget_high_p64(__p0)),
                    (poly64_t)(vget_high_p64(__p1)));
  return __ret;
}

__ai poly128_t vmull_p64(poly64_t __p0, poly64_t __p1) {
  poly128_t __ret;
  __ret = (poly128_t) __builtin_neon_vmull_p64(__p0, __p1);
  return __ret;
}
There is also a pattern to convert this into PMULL2:

def : Pat<(int_aarch64_neon_pmull64 (extractelt (v2i64 V128:$Rn), (i64 1)),
                                    (extractelt (v2i64 V128:$Rm), (i64 1))),
          (PMULLv2i64 V128:$Rn, V128:$Rm)>;
The problem is that ISel applies that pattern only when all of the corresponding IR is within a single basic block.
Some optimizations, such as loop-invariant code motion, can hoist the extract operations out of the current basic block.
As a result, PMULL2 is not emitted.
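A minimal LLVM IR sketch of the situation (block and value names are illustrative, not taken from a real test case):

```llvm
; Before LICM: both extractelements sit next to the pmull64 call,
; so the single-block pattern above matches and selects PMULL2.
loop:
  %hi.a = extractelement <2 x i64> %a, i64 1
  %hi.b = extractelement <2 x i64> %b, i64 1
  %r = call <16 x i8> @llvm.aarch64.neon.pmull64(i64 %hi.a, i64 %hi.b)

; After LICM hoists the loop-invariant extractelement of %b into the
; preheader, ISel sees only a plain i64 operand inside the loop, the
; pattern no longer matches, and codegen falls back to moving the lane
; to a scalar register and using PMULL instead of PMULL2.
preheader:
  %hi.b = extractelement <2 x i64> %b, i64 1
  br label %loop
loop:
  %hi.a = extractelement <2 x i64> %a, i64 1
  %r = call <16 x i8> @llvm.aarch64.neon.pmull64(i64 %hi.a, i64 %hi.b)
```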
GlobalISel could resolve that problem, but it does not handle this pattern yet and is enabled by default only at -O0.
Another way to get PMULL2 is to create a specific builtin for the vmull_high_p64 intrinsic.
Would it be OK to add an extra builtin for vmull_high_p64
(__builtin_neon_vmull_high_p64 / llvm.aarch64.neon.pmull_high_64) to resolve this problem?
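If we went that route, the new intrinsic could take full vector operands, so the lane extraction is implicit and the lowering to PMULLv2i64 no longer depends on where any extractelement lives. A rough sketch (this declaration is part of the proposal, not an existing API):

```llvm
; Hypothetical: a dedicated high-half intrinsic taking <2 x i64> vectors,
; which would map 1:1 onto PMULLv2i64 regardless of basic-block placement.
declare <16 x i8> @llvm.aarch64.neon.pmull_high_64(<2 x i64>, <2 x i64>)
```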
Thank you, Alexey.
Eli Friedman via llvm-dev
2020-May-18  19:25 UTC
[llvm-dev] [AARCH64][NEON] Do we need extra builtin for vmull_high_p64?
For this specific sort of issue, we have some code in
CodeGenPrepare::tryToSinkFreeOperands that tries to rearrange the IR so the
necessary instructions end up in the same basic block.
If we can't make that work, we could consider adding a separate intrinsic.
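To illustrate the idea: CodeGenPrepare::tryToSinkFreeOperands asks the target (via TargetLowering::shouldSinkOperands) which operands are cheap to duplicate, and clones them back into the user's block. A sketch of the effect on the hoisted IR, not the exact code:

```llvm
; If the AArch64 backend reports the extractelement as free to sink,
; CodeGenPrepare clones it next to its user before ISel runs:
preheader:
  %hi.b = extractelement <2 x i64> %b, i64 1   ; may become dead
  br label %loop
loop:
  %hi.a = extractelement <2 x i64> %a, i64 1
  %hi.b.sunk = extractelement <2 x i64> %b, i64 1  ; sunk copy
  %r = call <16 x i8> @llvm.aarch64.neon.pmull64(i64 %hi.a, i64 %hi.b.sunk)
  ; both extracts are now in the same block, so the PMULL2 pattern matches
```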
-Eli