thr3ads.net - llvm dev - [llvm-dev] Vectorization of math function failed? [Aug 2020]

Alexandre Bique via llvm-dev

2020-Aug-31 21:42 UTC

[llvm-dev] Vectorization of math function failed?

Hi,

After reading
https://llvm.org/docs/Vectorizers.html#vectorization-of-function-calls
I decided to write the following C++ program:

#include <cmath>

using v4f32 = float __attribute__((__vector_size__(16)));

v4f32 fct1(v4f32 x)
{
  v4f32 y;
  y[0] = std::sin(x[0]);
  y[1] = std::sin(x[1]);
  y[2] = std::sin(x[2]);
  y[3] = std::sin(x[3]);
  return y;
}

v4f32 fct2(v4f32 x)
{
  v4f32 y;
  for (int i = 0; i < 4; ++i)
    y[i] = std::sin(x[i]);
  return y;
}

void fct3(float *x)
{
#pragma clang loop vectorize(enable)
  for (int i = 0; i < 16; ++i)
    x[i] = sinf(x[i]);
}

Which I compiled with:

clang++ -O3 -march=native -mtune=native -c -o vec.o vec.cc -lmvec -fno-math-errno

And here is what I get:

vec.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <_Z4fct1Dv4_f>:
   0: 48 83 ec 48          sub    $0x48,%rsp
   4: c5 f8 29 04 24        vmovaps %xmm0,(%rsp)
   9: e8 00 00 00 00        callq  e <_Z4fct1Dv4_f+0xe>
   e: c5 f8 29 44 24 30    vmovaps %xmm0,0x30(%rsp)
  14: c5 fa 16 04 24        vmovshdup (%rsp),%xmm0
  19: e8 00 00 00 00        callq  1e <_Z4fct1Dv4_f+0x1e>
  1e: c5 f8 29 44 24 20    vmovaps %xmm0,0x20(%rsp)
  24: c4 e3 79 05 04 24 01 vpermilpd $0x1,(%rsp),%xmm0
  2b: e8 00 00 00 00        callq  30 <_Z4fct1Dv4_f+0x30>
  30: c5 f9 29 44 24 10    vmovapd %xmm0,0x10(%rsp)
  36: c4 e3 79 04 04 24 e7 vpermilps $0xe7,(%rsp),%xmm0
  3d: e8 00 00 00 00        callq  42 <_Z4fct1Dv4_f+0x42>
  42: c5 f8 28 4c 24 30    vmovaps 0x30(%rsp),%xmm1
  48: c4 e3 71 21 4c 24 20 vinsertps $0x10,0x20(%rsp),%xmm1,%xmm1
  4f: 10
  50: c4 e3 71 21 4c 24 10 vinsertps $0x20,0x10(%rsp),%xmm1,%xmm1
  57: 20
  58: c4 e3 71 21 c0 30    vinsertps $0x30,%xmm0,%xmm1,%xmm0
  5e: 48 83 c4 48          add    $0x48,%rsp
  62: c3                    retq
  63: 66 2e 0f 1f 84 00 00 nopw   %cs:0x0(%rax,%rax,1)
  6a: 00 00 00
  6d: 0f 1f 00              nopl   (%rax)

0000000000000070 <_Z4fct2Dv4_f>:
  70: 48 83 ec 48          sub    $0x48,%rsp
  74: c5 f8 29 04 24        vmovaps %xmm0,(%rsp)
  79: e8 00 00 00 00        callq  7e <_Z4fct2Dv4_f+0xe>
  7e: c5 f8 29 44 24 30    vmovaps %xmm0,0x30(%rsp)
  84: c5 fa 16 04 24        vmovshdup (%rsp),%xmm0
  89: e8 00 00 00 00        callq  8e <_Z4fct2Dv4_f+0x1e>
  8e: c5 f8 29 44 24 20    vmovaps %xmm0,0x20(%rsp)
  94: c4 e3 79 05 04 24 01 vpermilpd $0x1,(%rsp),%xmm0
  9b: e8 00 00 00 00        callq  a0 <_Z4fct2Dv4_f+0x30>
  a0: c5 f9 29 44 24 10    vmovapd %xmm0,0x10(%rsp)
  a6: c4 e3 79 04 04 24 e7 vpermilps $0xe7,(%rsp),%xmm0
  ad: e8 00 00 00 00        callq  b2 <_Z4fct2Dv4_f+0x42>
  b2: c5 f8 28 4c 24 30    vmovaps 0x30(%rsp),%xmm1
  b8: c4 e3 71 21 4c 24 20 vinsertps $0x10,0x20(%rsp),%xmm1,%xmm1
  bf: 10
  c0: c4 e3 71 21 4c 24 10 vinsertps $0x20,0x10(%rsp),%xmm1,%xmm1
  c7: 20
  c8: c4 e3 71 21 c0 30    vinsertps $0x30,%xmm0,%xmm1,%xmm0
  ce: 48 83 c4 48          add    $0x48,%rsp
  d2: c3                    retq
  d3: 66 2e 0f 1f 84 00 00 nopw   %cs:0x0(%rax,%rax,1)
  da: 00 00 00
  dd: 0f 1f 00              nopl   (%rax)

00000000000000e0 <_Z4fct3Pf>:
  e0: 53                    push   %rbx
  e1: 48 83 ec 10          sub    $0x10,%rsp
  e5: 48 89 fb              mov    %rdi,%rbx
  e8: c5 fa 10 07          vmovss (%rdi),%xmm0
  ec: c5 fa 10 4f 04        vmovss 0x4(%rdi),%xmm1
  f1: c5 fa 11 4c 24 0c    vmovss %xmm1,0xc(%rsp)
  f7: e8 00 00 00 00        callq  fc <_Z4fct3Pf+0x1c>
  fc: c5 fa 11 03          vmovss %xmm0,(%rbx)
 100: c5 fa 10 44 24 0c    vmovss 0xc(%rsp),%xmm0
 106: e8 00 00 00 00        callq  10b <_Z4fct3Pf+0x2b>
 10b: c5 fa 11 43 04        vmovss %xmm0,0x4(%rbx)
 110: c5 fa 10 43 08        vmovss 0x8(%rbx),%xmm0
 115: e8 00 00 00 00        callq  11a <_Z4fct3Pf+0x3a>
 11a: c5 fa 11 43 08        vmovss %xmm0,0x8(%rbx)
 11f: c5 fa 10 43 0c        vmovss 0xc(%rbx),%xmm0
 124: e8 00 00 00 00        callq  129 <_Z4fct3Pf+0x49>
 129: c5 fa 11 43 0c        vmovss %xmm0,0xc(%rbx)
 12e: c5 fa 10 43 10        vmovss 0x10(%rbx),%xmm0
 133: e8 00 00 00 00        callq  138 <_Z4fct3Pf+0x58>
 138: c5 fa 11 43 10        vmovss %xmm0,0x10(%rbx)
 13d: c5 fa 10 43 14        vmovss 0x14(%rbx),%xmm0
 142: e8 00 00 00 00        callq  147 <_Z4fct3Pf+0x67>
 147: c5 fa 11 43 14        vmovss %xmm0,0x14(%rbx)
 14c: c5 fa 10 43 18        vmovss 0x18(%rbx),%xmm0
 151: e8 00 00 00 00        callq  156 <_Z4fct3Pf+0x76>
 156: c5 fa 11 43 18        vmovss %xmm0,0x18(%rbx)
 15b: c5 fa 10 43 1c        vmovss 0x1c(%rbx),%xmm0
 160: e8 00 00 00 00        callq  165 <_Z4fct3Pf+0x85>
 165: c5 fa 11 43 1c        vmovss %xmm0,0x1c(%rbx)
 16a: c5 fa 10 43 20        vmovss 0x20(%rbx),%xmm0
 16f: e8 00 00 00 00        callq  174 <_Z4fct3Pf+0x94>
 174: c5 fa 11 43 20        vmovss %xmm0,0x20(%rbx)
 179: c5 fa 10 43 24        vmovss 0x24(%rbx),%xmm0
 17e: e8 00 00 00 00        callq  183 <_Z4fct3Pf+0xa3>
 183: c5 fa 11 43 24        vmovss %xmm0,0x24(%rbx)
 188: c5 fa 10 43 28        vmovss 0x28(%rbx),%xmm0
 18d: e8 00 00 00 00        callq  192 <_Z4fct3Pf+0xb2>
 192: c5 fa 11 43 28        vmovss %xmm0,0x28(%rbx)
 197: c5 fa 10 43 2c        vmovss 0x2c(%rbx),%xmm0
 19c: e8 00 00 00 00        callq  1a1 <_Z4fct3Pf+0xc1>
 1a1: c5 fa 11 43 2c        vmovss %xmm0,0x2c(%rbx)
 1a6: c5 fa 10 43 30        vmovss 0x30(%rbx),%xmm0
 1ab: e8 00 00 00 00        callq  1b0 <_Z4fct3Pf+0xd0>
 1b0: c5 fa 11 43 30        vmovss %xmm0,0x30(%rbx)
 1b5: c5 fa 10 43 34        vmovss 0x34(%rbx),%xmm0
 1ba: e8 00 00 00 00        callq  1bf <_Z4fct3Pf+0xdf>
 1bf: c5 fa 11 43 34        vmovss %xmm0,0x34(%rbx)
 1c4: c5 fa 10 43 38        vmovss 0x38(%rbx),%xmm0
 1c9: e8 00 00 00 00        callq  1ce <_Z4fct3Pf+0xee>
 1ce: c5 fa 11 43 38        vmovss %xmm0,0x38(%rbx)
 1d3: c5 fa 10 43 3c        vmovss 0x3c(%rbx),%xmm0
 1d8: e8 00 00 00 00        callq  1dd <_Z4fct3Pf+0xfd>
 1dd: c5 fa 11 43 3c        vmovss %xmm0,0x3c(%rbx)
 1e2: 48 83 c4 10          add    $0x10,%rsp
 1e6: 5b                    pop    %rbx
 1e7: c3                    retq

As you can see, there is no call to a vectorized version of sin.
Did I do something wrong?

By the way, I am on Linux with glibc 2.32, which has libmvec.

Regards,
-- 
Alexandre Bique

Brian Cain via llvm-dev

2020-Sep-01 02:05 UTC

If you're using clang, you could check whether it emits any hints about
missed optimizations via the optimization remarks:

https://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports

Alexandre Bique via llvm-dev

2020-Sep-01 06:46 UTC

I tried:

clang++ -O3 -march=native -mtune=native \
-Rpass=loop-vectorize,slp-vectorize \
-Rpass-missed=loop-vectorize,slp-vectorize \
-Rpass-analysis=loop-vectorize,slp-vectorize \
-ffast-math -ffp-model=fast -ffp-exception-behavior=ignore -ffp-contract=fast \
-c -o vec.o vec.cc

But I got no feedback.

-- 
Alexandre Bique
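
Editorial note for readers who hit the same result. Two things in the thread are worth checking, and both are offered as assumptions rather than a confirmed diagnosis. First, the -Rpass family of options takes a regular expression, so a comma-separated value such as loop-vectorize,slp-vectorize may match no pass name at all; an alternation like -Rpass='loop-vectorize|slp-vectorize' is the usual form. Second, the disassembly of fct3 shows sixteen scalar calls, which suggests the fixed-count loop was fully unrolled, possibly before the loop vectorizer had a loop to work on. The sketch below uses a runtime trip count instead. The function name sin_array and its parameters are hypothetical and do not come from the thread. Newer Clang releases also accept -fveclib=libmvec, which is not used anywhere above; whether it produces a libmvec call on a given toolchain should be verified with the remarks or the disassembly.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper (not from the thread): writes sin(in[i]) to out[i].
// The trip count n is only known at run time, so the compiler cannot
// simply unroll the loop into a fixed sequence of scalar sinf calls.
// __restrict tells the compiler that out and in do not alias, which
// removes one common obstacle to vectorization.
//
// Assumed build line (flags are an assumption, check your Clang version):
//   clang++ -O3 -march=native -fno-math-errno -fveclib=libmvec \
//           -Rpass='loop-vectorize' -c sin_array.cc
void sin_array(float *__restrict out, const float *__restrict in, std::size_t n)
{
#pragma clang loop vectorize(enable)
  for (std::size_t i = 0; i < n; ++i)
    out[i] = std::sin(in[i]);  // float overload of std::sin
}
```

The function behaves the same whether or not the loop is vectorized; only the generated code differs, so the result of the experiment has to be read from the optimization remarks or from objdump -dr output, where the relocation entries show which symbol each call targets.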
