Alexandre Bique via llvm-dev
2020-Aug-31 21:21 UTC
[llvm-dev] Should llvm optimize 1.0 / x ?
Hi,

Here is a small C++ program:

vec.cc:

#include <cmath>

using v4f32 = float __attribute__((__vector_size__(16)));

v4f32 fct1(v4f32 x)
{
  return 1.0 / x;
}

v4f32 fct2(v4f32 x)
{
  return __builtin_ia32_rcpps(x);
}

Which is compiled to:

vec.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <_Z4fct1Dv4_f>:
   0:   c4 e2 79 18 0d 00 00    vbroadcastss 0x0(%rip),%xmm1        # 9 <_Z4fct1Dv4_f+0x9>
   7:   00 00
   9:   c5 f0 5e c0             vdivps %xmm0,%xmm1,%xmm0
   d:   c3                      retq
   e:   66 90                   xchg   %ax,%ax

0000000000000010 <_Z4fct2Dv4_f>:
  10:   c5 f8 53 c0             vrcpps %xmm0,%xmm0
  14:   c3                      retq

As you can see, 1.0 / x is not turned into vrcpps. Is it because of precision or a missing optimization?

Regards,
--
Alexandre Bique
Quentin Colombet via llvm-dev
2020-Aug-31 22:59 UTC
[llvm-dev] Should llvm optimize 1.0 / x ?
Hi Alexandre,

Have you tried to compile this with fast-math enabled (`-ffast-math` https://clang.llvm.org/docs/UsersManual.html#controlling-floating-point-behavior)?

I would expect LLVM to require the `arcp` flag to perform this optimization (https://www.llvm.org/docs/LangRef.html#fast-math-flags).

Cheers,
-Quentin
Alexandre Bique via llvm-dev
2020-Sep-01 06:44 UTC
[llvm-dev] Should llvm optimize 1.0 / x ?
Hi Quentin,

You are correct: I managed to get clang to use vrcpps, but not in a satisfying way:

clang++ -O3 -march=native -mtune=native \
  -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize \
  -ffast-math -ffp-model=fast -ffp-exception-behavior=ignore -ffp-contract=fast \
  -c -o vec.o vec.cc

0000000000000140 <_Z4fct4Dv4_f>:
 140:   c5 f8 53 c8             vrcpps %xmm0,%xmm1
 144:   c4 e2 79 18 15 00 00    vbroadcastss 0x0(%rip),%xmm2        # 14d <_Z4fct4Dv4_f+0xd>
 14b:   00 00
 14d:   c4 e2 71 ac c2          vfnmadd213ps %xmm2,%xmm1,%xmm0
 152:   c4 e2 71 98 c1          vfmadd132ps %xmm1,%xmm1,%xmm0
 157:   c3                      retq
 158:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
 15f:   00

0000000000000160 <_Z4fct5Dv4_f>:
 160:   c5 f8 53 c0             vrcpps %xmm0,%xmm0
 164:   c3                      retq

As you can see, fct4 is not equivalent to fct5: fct4 follows vrcpps with two fused multiply-add instructions, while fct5 is a bare vrcpps.

Regards,
Alexandre Bique
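The two extra instructions in fct4 look like one Newton-Raphson refinement step applied to the vrcpps estimate: given an estimate r of 1/x, compute r + r * (1 - x * r). Below is a minimal scalar sketch of that step in portable C++. It assumes the constant loaded by vbroadcastss is 1.0, which the disassembly does not show because the displacement is still an unresolved 0x0(%rip). The names refine_reciprocal, crude_reciprocal, relative_error and check_refinement are illustrative and do not come from the thread; crude_reciprocal is a hypothetical stand-in for vrcpps that simply injects a relative error of 2^-12, roughly the accuracy of the hardware estimate.

```cpp
#include <cassert>
#include <cmath>

// One Newton-Raphson step for the reciprocal: given an estimate r of 1/x,
// return r + r * (1 - x * r). The relative error is roughly squared.
inline float refine_reciprocal(float x, float r) {
    // e = 1 - x*r; mirrors vfnmadd213ps, assuming the broadcast constant is 1.0.
    const float e = std::fma(-x, r, 1.0f);
    // r + e*r; mirrors vfmadd132ps.
    return std::fma(e, r, r);
}

// Hypothetical stand-in for vrcpps (an assumption, not the real instruction):
// the correctly rounded reciprocal, perturbed by a relative error of 2^-12.
inline float crude_reciprocal(float x) {
    return (1.0f / x) * (1.0f + 1.0f / 4096.0f);
}

// Relative error of r as an estimate of 1/x, evaluated in double precision.
inline double relative_error(float x, float r) {
    return std::fabs(static_cast<double>(x) * static_cast<double>(r) - 1.0);
}

// Small self-check: one refinement step should shrink the error.
inline void check_refinement(float x) {
    const float r0 = crude_reciprocal(x);
    const float r1 = refine_reciprocal(x, r0);
    assert(relative_error(x, r1) < relative_error(x, r0));
}
```

Calling check_refinement(3.0f) exercises the step once. Under this model, the refinement is the price of bringing the estimate close to single-precision accuracy, which would explain why fast-math does not reduce 1.0 / x to a bare vrcpps.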