Alexandre Bique via llvm-dev
2020-Sep-01 06:44 UTC
[llvm-dev] Should llvm optimize 1.0 / x ?
Hi Quentin,

You are correct, I could manage to get clang to use vrcpps, but not in
a satisfying way:

clang++ -O3 -march=native -mtune=native \
  -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize \
  -ffast-math -ffp-model=fast -ffp-exception-behavior=ignore -ffp-contract=fast \
  -c -o vec.o vec.cc

0000000000000140 <_Z4fct4Dv4_f>:
 140: c5 f8 53 c8             vrcpps %xmm0,%xmm1
 144: c4 e2 79 18 15 00 00    vbroadcastss 0x0(%rip),%xmm2  # 14d <_Z4fct4Dv4_f+0xd>
 14b: 00 00
 14d: c4 e2 71 ac c2          vfnmadd213ps %xmm2,%xmm1,%xmm0
 152: c4 e2 71 98 c1          vfmadd132ps %xmm1,%xmm1,%xmm0
 157: c3                      retq
 158: 0f 1f 84 00 00 00 00    nopl 0x0(%rax,%rax,1)
 15f: 00

0000000000000160 <_Z4fct5Dv4_f>:
 160: c5 f8 53 c0             vrcpps %xmm0,%xmm0
 164: c3                      retq

As you can see, fct4 is not equivalent to fct5.

Regards,
Alexandre Bique

On Tue, Sep 1, 2020 at 12:59 AM Quentin Colombet <qcolombet at apple.com> wrote:
>
> Hi Alexandre,
>
> Have you tried to compile this with fast-math enabled (`-ffast-math`
> https://clang.llvm.org/docs/UsersManual.html#controlling-floating-point-behavior)?
>
> I would expect LLVM to require the `arcp` flag to perform this optimization
> (https://www.llvm.org/docs/LangRef.html#fast-math-flags).
>
> Cheers,
> -Quentin
>
> > On Aug 31, 2020, at 2:21 PM, Alexandre Bique via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> >
> > Hi,
> >
> > Here is a small C++ program:
> >
> > vec.cc:
> >
> > #include <cmath>
> >
> > using v4f32 = float __attribute__((__vector_size__(16)));
> >
> > v4f32 fct1(v4f32 x)
> > {
> >   return 1.0 / x;
> > }
> >
> > v4f32 fct2(v4f32 x)
> > {
> >   return __builtin_ia32_rcpps(x);
> > }
> >
> > Which is compiled to:
> >
> > vec.o: file format elf64-x86-64
> >
> > Disassembly of section .text:
> >
> > 0000000000000000 <_Z4fct1Dv4_f>:
> >    0: c4 e2 79 18 0d 00 00    vbroadcastss 0x0(%rip),%xmm1  # 9 <_Z4fct1Dv4_f+0x9>
> >    7: 00 00
> >    9: c5 f0 5e c0             vdivps %xmm0,%xmm1,%xmm0
> >    d: c3                      retq
> >    e: 66 90                   xchg %ax,%ax
> >
> > 0000000000000010 <_Z4fct2Dv4_f>:
> >   10: c5 f8 53 c0             vrcpps %xmm0,%xmm0
> >   14: c3                      retq
> >
> > As you can see, 1.0 / x is not turned into vrcpps. Is it because of
> > precision or a missing optimization?
> >
> > Regards,
> > --
> > Alexandre Bique
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
On 9/1/20 1:44 AM, Alexandre Bique via llvm-dev wrote:
> As you can see, fct4 is not equivalent to fct5.

Perhaps it's better ;)

It looks like the compiler has generated one Newton iteration after the
estimate to increase the precision of the answer. The reciprocal
estimate is, after all, only an estimate, and for many applications, is
not sufficient on its own.

This behavior is generally adjustable. Try using -mrecip=vec-divf:0 (or
-mrecip=all:0) to turn off all of the Newton iterations.

-Hal

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
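For readers following the fct4 disassembly: the two FMA instructions after vrcpps match one Newton-Raphson step for the reciprocal, e1 = e0 + e0 * (1 - x * e0). Below is a minimal scalar sketch in portable C++. It assumes the broadcast constant is 1.0 (the listing only shows an unresolved 0x0(%rip) load), and the name refine_reciprocal is made up for illustration; it is not an LLVM or clang function.

```cpp
#include <cassert>
#include <cmath>

// One Newton-Raphson step for 1/x, starting from a rough estimate e0.
// Mirrors the instruction pair seen in fct4:
//   vfnmadd213ps: r  = 1 - x * e0
//   vfmadd132ps:  e1 = r * e0 + e0
// Hypothetical helper for illustration only.
inline float refine_reciprocal(float x, float e0) {
    // An infinite or NaN estimate (e.g. for x == 0) would come out as NaN.
    assert(std::isfinite(e0));
    float r = std::fma(-x, e0, 1.0f);  // residual: 1 - x*e0
    return std::fma(r, e0, e0);        // refined: e0 + e0*(1 - x*e0)
}
```

Calling refine_reciprocal(3.0f, 0.333f) takes an estimate with a relative error of roughly 1e-3 and returns one with a relative error of roughly 1e-6, which is the kind of gain the extra instructions in fct4 buy over the bare vrcpps in fct5.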
Alexandre Bique via llvm-dev
2020-Sep-01 07:35 UTC
[llvm-dev] Should llvm optimize 1.0 / x ?
On Tue, Sep 1, 2020 at 9:05 AM Hal Finkel <hfinkel at anl.gov> wrote:
> Perhaps it's better ;)
>
> It looks like the compiler has generated one Newton iteration after the
> estimate to increase the precision of the answer. The reciprocal
> estimate is, after all, only an estimate, and for many applications, is
> not sufficient on its own.

Yes.

> This behavior is generally adjustable. Try using -mrecip=vec-divf:0 (or
> -mrecip=all:0) to turn off all of the Newton iterations.

Thank you very much! It did the job!

-- 
Alexandre BIQUE
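A short worked error estimate may help weigh the trade-off of turning the Newton iterations off. It assumes the bound Intel documents for RCPPS (relative error at most 1.5 * 2^-12) and ignores rounding inside the FMA steps; here e_0 is the raw estimate, e_1 the value after one Newton step, and \varepsilon the relative error of e_0.

```latex
\begin{align*}
e_0 &= \frac{1+\varepsilon}{x}, \qquad |\varepsilon| \le 1.5 \cdot 2^{-12} \approx 3.7 \times 10^{-4} \\
e_1 &= e_0 + e_0\,(1 - x e_0) = e_0\,(1 - \varepsilon) = \frac{1 - \varepsilon^{2}}{x} \\
\varepsilon^{2} &\le 2.25 \cdot 2^{-24} \approx 1.3 \times 10^{-7}
\end{align*}
```

Under these assumptions, -mrecip=vec-divf:0 leaves roughly 11 to 12 accurate bits, while a single Newton step gets to about 22 bits, close to the 24-bit significand of a float.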