Alexandre Bique via llvm-dev
2020-Aug-31 21:42 UTC
[llvm-dev] Vectorization of math function failed?
Hi, After reading https://llvm.org/docs/Vectorizers.html#vectorization-of-function-calls I decided to write the following C++ program: #include <cmath> using v4f32 = float __attribute__((__vector_size__(16))); v4f32 fct1(v4f32 x) { v4f32 y; y[0] = std::sin(x[0]); y[1] = std::sin(x[1]); y[2] = std::sin(x[2]); y[3] = std::sin(x[3]); return y; } v4f32 fct2(v4f32 x) { v4f32 y; for (int i = 0; i < 4; ++i) y[i] = std::sin(x[i]); return y; } void fct3(float *x) { #pragma clang loop vectorize(enable) for (int i = 0; i < 16; ++i) x[i] = sinf(x[i]); } Which I compiled with: clang++ -O3 -march=native -mtune=native -c -o vec.o vec.cc -lmvec -fno-math-errno And here is what I get: vec.o: file format elf64-x86-64 Disassembly of section .text: 0000000000000000 <_Z4fct1Dv4_f>: 0: 48 83 ec 48 sub $0x48,%rsp 4: c5 f8 29 04 24 vmovaps %xmm0,(%rsp) 9: e8 00 00 00 00 callq e <_Z4fct1Dv4_f+0xe> e: c5 f8 29 44 24 30 vmovaps %xmm0,0x30(%rsp) 14: c5 fa 16 04 24 vmovshdup (%rsp),%xmm0 19: e8 00 00 00 00 callq 1e <_Z4fct1Dv4_f+0x1e> 1e: c5 f8 29 44 24 20 vmovaps %xmm0,0x20(%rsp) 24: c4 e3 79 05 04 24 01 vpermilpd $0x1,(%rsp),%xmm0 2b: e8 00 00 00 00 callq 30 <_Z4fct1Dv4_f+0x30> 30: c5 f9 29 44 24 10 vmovapd %xmm0,0x10(%rsp) 36: c4 e3 79 04 04 24 e7 vpermilps $0xe7,(%rsp),%xmm0 3d: e8 00 00 00 00 callq 42 <_Z4fct1Dv4_f+0x42> 42: c5 f8 28 4c 24 30 vmovaps 0x30(%rsp),%xmm1 48: c4 e3 71 21 4c 24 20 vinsertps $0x10,0x20(%rsp),%xmm1,%xmm1 4f: 10 50: c4 e3 71 21 4c 24 10 vinsertps $0x20,0x10(%rsp),%xmm1,%xmm1 57: 20 58: c4 e3 71 21 c0 30 vinsertps $0x30,%xmm0,%xmm1,%xmm0 5e: 48 83 c4 48 add $0x48,%rsp 62: c3 retq 63: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1) 6a: 00 00 00 6d: 0f 1f 00 nopl (%rax) 0000000000000070 <_Z4fct2Dv4_f>: 70: 48 83 ec 48 sub $0x48,%rsp 74: c5 f8 29 04 24 vmovaps %xmm0,(%rsp) 79: e8 00 00 00 00 callq 7e <_Z4fct2Dv4_f+0xe> 7e: c5 f8 29 44 24 30 vmovaps %xmm0,0x30(%rsp) 84: c5 fa 16 04 24 vmovshdup (%rsp),%xmm0 89: e8 00 00 00 00 callq 8e <_Z4fct2Dv4_f+0x1e> 8e: c5 f8 29 44 24 20 vmovaps %xmm0,0x20(%rsp) 94: c4 e3 79 05 04 24 01 vpermilpd $0x1,(%rsp),%xmm0 9b: e8 00 00 00 00 callq a0 <_Z4fct2Dv4_f+0x30> a0: c5 f9 29 44 24 10 vmovapd %xmm0,0x10(%rsp) a6: c4 e3 79 04 04 24 e7 vpermilps $0xe7,(%rsp),%xmm0 ad: e8 00 00 00 00 callq b2 <_Z4fct2Dv4_f+0x42> b2: c5 f8 28 4c 24 30 vmovaps 0x30(%rsp),%xmm1 b8: c4 e3 71 21 4c 24 20 vinsertps $0x10,0x20(%rsp),%xmm1,%xmm1 bf: 10 c0: c4 e3 71 21 4c 24 10 vinsertps $0x20,0x10(%rsp),%xmm1,%xmm1 c7: 20 c8: c4 e3 71 21 c0 30 vinsertps $0x30,%xmm0,%xmm1,%xmm0 ce: 48 83 c4 48 add $0x48,%rsp d2: c3 retq d3: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1) da: 00 00 00 dd: 0f 1f 00 nopl (%rax) 00000000000000e0 <_Z4fct3Pf>: e0: 53 push %rbx e1: 48 83 ec 10 sub $0x10,%rsp e5: 48 89 fb mov %rdi,%rbx e8: c5 fa 10 07 vmovss (%rdi),%xmm0 ec: c5 fa 10 4f 04 vmovss 0x4(%rdi),%xmm1 f1: c5 fa 11 4c 24 0c vmovss %xmm1,0xc(%rsp) f7: e8 00 00 00 00 callq fc <_Z4fct3Pf+0x1c> fc: c5 fa 11 03 vmovss %xmm0,(%rbx) 100: c5 fa 10 44 24 0c vmovss 0xc(%rsp),%xmm0 106: e8 00 00 00 00 callq 10b <_Z4fct3Pf+0x2b> 10b: c5 fa 11 43 04 vmovss %xmm0,0x4(%rbx) 110: c5 fa 10 43 08 vmovss 0x8(%rbx),%xmm0 115: e8 00 00 00 00 callq 11a <_Z4fct3Pf+0x3a> 11a: c5 fa 11 43 08 vmovss %xmm0,0x8(%rbx) 11f: c5 fa 10 43 0c vmovss 0xc(%rbx),%xmm0 124: e8 00 00 00 00 callq 129 <_Z4fct3Pf+0x49> 129: c5 fa 11 43 0c vmovss %xmm0,0xc(%rbx) 12e: c5 fa 10 43 10 vmovss 0x10(%rbx),%xmm0 133: e8 00 00 00 00 callq 138 <_Z4fct3Pf+0x58> 138: c5 fa 11 43 10 vmovss %xmm0,0x10(%rbx) 13d: c5 fa 10 43 14 vmovss 0x14(%rbx),%xmm0 142: e8 00 00 00 00 callq 147 <_Z4fct3Pf+0x67> 147: c5 fa 11 43 14 vmovss %xmm0,0x14(%rbx) 14c: c5 fa 10 43 18 vmovss 0x18(%rbx),%xmm0 151: e8 00 00 00 00 callq 156 <_Z4fct3Pf+0x76> 156: c5 fa 11 43 18 vmovss %xmm0,0x18(%rbx) 15b: c5 fa 10 43 1c vmovss 0x1c(%rbx),%xmm0 160: e8 00 00 00 00 callq 165 <_Z4fct3Pf+0x85> 165: c5 fa 11 43 1c vmovss %xmm0,0x1c(%rbx) 16a: c5 fa 10 43 20 vmovss 0x20(%rbx),%xmm0 16f: e8 00 00 00 00 callq 174 <_Z4fct3Pf+0x94> 174: c5 fa 11 43 20 vmovss %xmm0,0x20(%rbx) 179: c5 fa 10 43 24 vmovss 0x24(%rbx),%xmm0 17e: e8 00 00 00 00 callq 183 <_Z4fct3Pf+0xa3> 183: c5 fa 11 43 24 vmovss %xmm0,0x24(%rbx) 188: c5 fa 10 43 28 vmovss 0x28(%rbx),%xmm0 18d: e8 00 00 00 00 callq 192 <_Z4fct3Pf+0xb2> 192: c5 fa 11 43 28 vmovss %xmm0,0x28(%rbx) 197: c5 fa 10 43 2c vmovss 0x2c(%rbx),%xmm0 19c: e8 00 00 00 00 callq 1a1 <_Z4fct3Pf+0xc1> 1a1: c5 fa 11 43 2c vmovss %xmm0,0x2c(%rbx) 1a6: c5 fa 10 43 30 vmovss 0x30(%rbx),%xmm0 1ab: e8 00 00 00 00 callq 1b0 <_Z4fct3Pf+0xd0> 1b0: c5 fa 11 43 30 vmovss %xmm0,0x30(%rbx) 1b5: c5 fa 10 43 34 vmovss 0x34(%rbx),%xmm0 1ba: e8 00 00 00 00 callq 1bf <_Z4fct3Pf+0xdf> 1bf: c5 fa 11 43 34 vmovss %xmm0,0x34(%rbx) 1c4: c5 fa 10 43 38 vmovss 0x38(%rbx),%xmm0 1c9: e8 00 00 00 00 callq 1ce <_Z4fct3Pf+0xee> 1ce: c5 fa 11 43 38 vmovss %xmm0,0x38(%rbx) 1d3: c5 fa 10 43 3c vmovss 0x3c(%rbx),%xmm0 1d8: e8 00 00 00 00 callq 1dd <_Z4fct3Pf+0xfd> 1dd: c5 fa 11 43 3c vmovss %xmm0,0x3c(%rbx) 1e2: 48 83 c4 10 add $0x10,%rsp 1e6: 5b pop %rbx 1e7: c3 retq As you can see there is no call to a vectorized version of sin. Did I do something wrong? By the way I am on Linux with glibc 2.32 which has libmvec. Regards, -- Alexandre Bique
Brian Cain via llvm-dev
2020-Sep-01 02:05 UTC
[llvm-dev] Vectorization of math function failed?
If you're using clang you could try to see if it emits any hints about optimizations using the remarks: https://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports On Mon, Aug 31, 2020, 4:43 PM Alexandre Bique via llvm-dev < llvm-dev at lists.llvm.org> wrote:> Hi, > > After reading > https://llvm.org/docs/Vectorizers.html#vectorization-of-function-calls > I decided to write the following C++ program: > > #include <cmath> > > using v4f32 = float __attribute__((__vector_size__(16))); > > v4f32 fct1(v4f32 x) > { > v4f32 y; > y[0] = std::sin(x[0]); > y[1] = std::sin(x[1]); > y[2] = std::sin(x[2]); > y[3] = std::sin(x[3]); > return y; > } > > v4f32 fct2(v4f32 x) > { > v4f32 y; > for (int i = 0; i < 4; ++i) > y[i] = std::sin(x[i]); > return y; > } > > void fct3(float *x) > { > #pragma clang loop vectorize(enable) > for (int i = 0; i < 16; ++i) > x[i] = sinf(x[i]); > } > > Which I compiled with: clang++ -O3 -march=native -mtune=native -c -o > vec.o vec.cc -lmvec -fno-math-errno > > And here is what I get: > > vec.o: file format elf64-x86-64 > > > Disassembly of section .text: > > 0000000000000000 <_Z4fct1Dv4_f>: > 0: 48 83 ec 48 sub $0x48,%rsp > 4: c5 f8 29 04 24 vmovaps %xmm0,(%rsp) > 9: e8 00 00 00 00 callq e <_Z4fct1Dv4_f+0xe> > e: c5 f8 29 44 24 30 vmovaps %xmm0,0x30(%rsp) > 14: c5 fa 16 04 24 vmovshdup (%rsp),%xmm0 > 19: e8 00 00 00 00 callq 1e <_Z4fct1Dv4_f+0x1e> > 1e: c5 f8 29 44 24 20 vmovaps %xmm0,0x20(%rsp) > 24: c4 e3 79 05 04 24 01 vpermilpd $0x1,(%rsp),%xmm0 > 2b: e8 00 00 00 00 callq 30 <_Z4fct1Dv4_f+0x30> > 30: c5 f9 29 44 24 10 vmovapd %xmm0,0x10(%rsp) > 36: c4 e3 79 04 04 24 e7 vpermilps $0xe7,(%rsp),%xmm0 > 3d: e8 00 00 00 00 callq 42 <_Z4fct1Dv4_f+0x42> > 42: c5 f8 28 4c 24 30 vmovaps 0x30(%rsp),%xmm1 > 48: c4 e3 71 21 4c 24 20 vinsertps $0x10,0x20(%rsp),%xmm1,%xmm1 > 4f: 10 > 50: c4 e3 71 21 4c 24 10 vinsertps $0x20,0x10(%rsp),%xmm1,%xmm1 > 57: 20 > 58: c4 e3 71 21 c0 30 vinsertps $0x30,%xmm0,%xmm1,%xmm0 > 5e: 48 83 c4 48 add $0x48,%rsp > 62: c3 retq > 63: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1) > 6a: 00 00 00 > 6d: 0f 1f 00 nopl (%rax) > > 0000000000000070 <_Z4fct2Dv4_f>: > 70: 48 83 ec 48 sub $0x48,%rsp > 74: c5 f8 29 04 24 vmovaps %xmm0,(%rsp) > 79: e8 00 00 00 00 callq 7e <_Z4fct2Dv4_f+0xe> > 7e: c5 f8 29 44 24 30 vmovaps %xmm0,0x30(%rsp) > 84: c5 fa 16 04 24 vmovshdup (%rsp),%xmm0 > 89: e8 00 00 00 00 callq 8e <_Z4fct2Dv4_f+0x1e> > 8e: c5 f8 29 44 24 20 vmovaps %xmm0,0x20(%rsp) > 94: c4 e3 79 05 04 24 01 vpermilpd $0x1,(%rsp),%xmm0 > 9b: e8 00 00 00 00 callq a0 <_Z4fct2Dv4_f+0x30> > a0: c5 f9 29 44 24 10 vmovapd %xmm0,0x10(%rsp) > a6: c4 e3 79 04 04 24 e7 vpermilps $0xe7,(%rsp),%xmm0 > ad: e8 00 00 00 00 callq b2 <_Z4fct2Dv4_f+0x42> > b2: c5 f8 28 4c 24 30 vmovaps 0x30(%rsp),%xmm1 > b8: c4 e3 71 21 4c 24 20 vinsertps $0x10,0x20(%rsp),%xmm1,%xmm1 > bf: 10 > c0: c4 e3 71 21 4c 24 10 vinsertps $0x20,0x10(%rsp),%xmm1,%xmm1 > c7: 20 > c8: c4 e3 71 21 c0 30 vinsertps $0x30,%xmm0,%xmm1,%xmm0 > ce: 48 83 c4 48 add $0x48,%rsp > d2: c3 retq > d3: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1) > da: 00 00 00 > dd: 0f 1f 00 nopl (%rax) > > 00000000000000e0 <_Z4fct3Pf>: > e0: 53 push %rbx > e1: 48 83 ec 10 sub $0x10,%rsp > e5: 48 89 fb mov %rdi,%rbx > e8: c5 fa 10 07 vmovss (%rdi),%xmm0 > ec: c5 fa 10 4f 04 vmovss 0x4(%rdi),%xmm1 > f1: c5 fa 11 4c 24 0c vmovss %xmm1,0xc(%rsp) > f7: e8 00 00 00 00 callq fc <_Z4fct3Pf+0x1c> > fc: c5 fa 11 03 vmovss %xmm0,(%rbx) > 100: c5 fa 10 44 24 0c vmovss 0xc(%rsp),%xmm0 > 106: e8 00 00 00 00 callq 10b <_Z4fct3Pf+0x2b> > 10b: c5 fa 11 43 04 vmovss %xmm0,0x4(%rbx) > 110: c5 fa 10 43 08 vmovss 0x8(%rbx),%xmm0 > 115: e8 00 00 00 00 callq 11a <_Z4fct3Pf+0x3a> > 11a: c5 fa 11 43 08 vmovss %xmm0,0x8(%rbx) > 11f: c5 fa 10 43 0c vmovss 0xc(%rbx),%xmm0 > 124: e8 00 00 00 00 callq 129 <_Z4fct3Pf+0x49> > 129: c5 fa 11 43 0c vmovss %xmm0,0xc(%rbx) > 12e: c5 fa 10 43 10 vmovss 0x10(%rbx),%xmm0 > 133: e8 00 00 00 00 callq 138 <_Z4fct3Pf+0x58> > 138: c5 fa 11 43 10 vmovss %xmm0,0x10(%rbx) > 13d: c5 fa 10 43 14 vmovss 0x14(%rbx),%xmm0 > 142: e8 00 00 00 00 callq 147 <_Z4fct3Pf+0x67> > 147: c5 fa 11 43 14 vmovss %xmm0,0x14(%rbx) > 14c: c5 fa 10 43 18 vmovss 0x18(%rbx),%xmm0 > 151: e8 00 00 00 00 callq 156 <_Z4fct3Pf+0x76> > 156: c5 fa 11 43 18 vmovss %xmm0,0x18(%rbx) > 15b: c5 fa 10 43 1c vmovss 0x1c(%rbx),%xmm0 > 160: e8 00 00 00 00 callq 165 <_Z4fct3Pf+0x85> > 165: c5 fa 11 43 1c vmovss %xmm0,0x1c(%rbx) > 16a: c5 fa 10 43 20 vmovss 0x20(%rbx),%xmm0 > 16f: e8 00 00 00 00 callq 174 <_Z4fct3Pf+0x94> > 174: c5 fa 11 43 20 vmovss %xmm0,0x20(%rbx) > 179: c5 fa 10 43 24 vmovss 0x24(%rbx),%xmm0 > 17e: e8 00 00 00 00 callq 183 <_Z4fct3Pf+0xa3> > 183: c5 fa 11 43 24 vmovss %xmm0,0x24(%rbx) > 188: c5 fa 10 43 28 vmovss 0x28(%rbx),%xmm0 > 18d: e8 00 00 00 00 callq 192 <_Z4fct3Pf+0xb2> > 192: c5 fa 11 43 28 vmovss %xmm0,0x28(%rbx) > 197: c5 fa 10 43 2c vmovss 0x2c(%rbx),%xmm0 > 19c: e8 00 00 00 00 callq 1a1 <_Z4fct3Pf+0xc1> > 1a1: c5 fa 11 43 2c vmovss %xmm0,0x2c(%rbx) > 1a6: c5 fa 10 43 30 vmovss 0x30(%rbx),%xmm0 > 1ab: e8 00 00 00 00 callq 1b0 <_Z4fct3Pf+0xd0> > 1b0: c5 fa 11 43 30 vmovss %xmm0,0x30(%rbx) > 1b5: c5 fa 10 43 34 vmovss 0x34(%rbx),%xmm0 > 1ba: e8 00 00 00 00 callq 1bf <_Z4fct3Pf+0xdf> > 1bf: c5 fa 11 43 34 vmovss %xmm0,0x34(%rbx) > 1c4: c5 fa 10 43 38 vmovss 0x38(%rbx),%xmm0 > 1c9: e8 00 00 00 00 callq 1ce <_Z4fct3Pf+0xee> > 1ce: c5 fa 11 43 38 vmovss %xmm0,0x38(%rbx) > 1d3: c5 fa 10 43 3c vmovss 0x3c(%rbx),%xmm0 > 1d8: e8 00 00 00 00 callq 1dd <_Z4fct3Pf+0xfd> > 1dd: c5 fa 11 43 3c vmovss %xmm0,0x3c(%rbx) > 1e2: 48 83 c4 10 add $0x10,%rsp > 1e6: 5b pop %rbx > 1e7: c3 retq > > As you can see there is no call to a vectorized version of sin. > Did I do something wrong? > > By the way I am on Linux with glibc 2.32 which has libmvec. > > Regards, > -- > Alexandre Bique > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200831/a4d03079/attachment.html>
Alexandre Bique via llvm-dev
2020-Sep-01 06:46 UTC
[llvm-dev] Vectorization of math function failed?
I've tried to do: clang++ -O3 -march=native -mtune=native \ -Rpass=loop-vectorize,slp-vectorize -Rpass-missed=loop-vectorize,slp-vectorize -Rpass-analysis=loop-vectorize,slp-vectorize \ -ffast-math -ffp-model=fast -ffp-exception-behavior=ignore -ffp-contract=fast \ -c -o vec.o vec.cc But I've got no feedback. -- Alexandre Bique
Possibly Parallel Threads
- [LLVMdev] Unexpected spilling of vector register during lane extraction on some x86_64 targets
- [LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
- [LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
- [LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
- [LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!