thr3ads.net - llvm dev - [llvm-dev] Should llvm optimize 1.0 / x ? [Sep 2020]

Alexandre Bique via llvm-dev

2020-Sep-01 06:44 UTC

[llvm-dev] Should llvm optimize 1.0 / x ?

Hi Quentin,

You are correct: I managed to get clang to use vrcpps, but not in
a satisfying way:

clang++ -O3 -march=native -mtune=native \
-Rpass=loop-vectorize -Rpass-missed=loop-vectorize \
-Rpass-analysis=loop-vectorize \
-ffast-math -ffp-model=fast -ffp-exception-behavior=ignore -ffp-contract=fast \
-c -o vec.o vec.cc

0000000000000140 <_Z4fct4Dv4_f>:
 140: c5 f8 53 c8          vrcpps %xmm0,%xmm1
 144: c4 e2 79 18 15 00 00 vbroadcastss 0x0(%rip),%xmm2        # 14d <_Z4fct4Dv4_f+0xd>
 14b: 00 00
 14d: c4 e2 71 ac c2        vfnmadd213ps %xmm2,%xmm1,%xmm0
 152: c4 e2 71 98 c1        vfmadd132ps %xmm1,%xmm1,%xmm0
 157: c3                    retq
 158: 0f 1f 84 00 00 00 00 nopl   0x0(%rax,%rax,1)
 15f: 00

0000000000000160 <_Z4fct5Dv4_f>:
 160: c5 f8 53 c0          vrcpps %xmm0,%xmm0
 164: c3                    retq

As you can see, fct4 is not equivalent to fct5.

Regards,
Alexandre Bique

On Tue, Sep 1, 2020 at 12:59 AM Quentin Colombet <qcolombet at apple.com> wrote:
> Hi Alexandre,
>
> Have you tried to compile this with fast-math enabled (`-ffast-math`
> https://clang.llvm.org/docs/UsersManual.html#controlling-floating-point-behavior)?
>
> I would expect LLVM to require the `arcp` flag to perform this optimization
> (https://www.llvm.org/docs/LangRef.html#fast-math-flags).
>
> Cheers,
> -Quentin
>
>
> > On Aug 31, 2020, at 2:21 PM, Alexandre Bique via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> >
> > Hi,
> >
> > Here is a small C++ program:
> >
> > vec.cc:
> >
> > #include <cmath>
> >
> > using v4f32 = float __attribute__((__vector_size__(16)));
> >
> > v4f32 fct1(v4f32 x)
> > {
> >  return 1.0 / x;
> > }
> >
> > v4f32 fct2(v4f32 x)
> > {
> >  return __builtin_ia32_rcpps(x);
> > }
> >
> > Which is compiled to:
> >
> > vec.o:     file format elf64-x86-64
> >
> >
> > Disassembly of section .text:
> >
> > 0000000000000000 <_Z4fct1Dv4_f>:
> >   0: c4 e2 79 18 0d 00 00 vbroadcastss 0x0(%rip),%xmm1        # 9 <_Z4fct1Dv4_f+0x9>
> >   7: 00 00
> >   9: c5 f0 5e c0          vdivps %xmm0,%xmm1,%xmm0
> >   d: c3                    retq
> >   e: 66 90                xchg   %ax,%ax
> >
> > 0000000000000010 <_Z4fct2Dv4_f>:
> >  10: c5 f8 53 c0          vrcpps %xmm0,%xmm0
> >  14: c3                    retq
> >
> >
> > As you can see, 1.0 / x is not turned into vrcpps. Is it because of
> > precision or a missing optimization?
> >
> > Regards,
> > --
> > Alexandre Bique
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

Hal Finkel via llvm-dev

2020-Sep-01 07:05 UTC

[llvm-dev] Should llvm optimize 1.0 / x ?

On 9/1/20 1:44 AM, Alexandre Bique via llvm-dev wrote:
> Hi Quentin,
>
> You are correct, I could manage to get clang to use vrcpps, but not in
> a satisfying way:
>
> clang++ -O3 -march=native -mtune=native \
> -Rpass=loop-vectorize -Rpass-missed=loop-vectorize \
> -Rpass-analysis=loop-vectorize \
> -ffast-math -ffp-model=fast -ffp-exception-behavior=ignore -ffp-contract=fast \
> -c -o vec.o vec.cc
>
> 0000000000000140 <_Z4fct4Dv4_f>:
>   140: c5 f8 53 c8          vrcpps %xmm0,%xmm1
>   144: c4 e2 79 18 15 00 00 vbroadcastss 0x0(%rip),%xmm2        # 14d <_Z4fct4Dv4_f+0xd>
>   14b: 00 00
>   14d: c4 e2 71 ac c2        vfnmadd213ps %xmm2,%xmm1,%xmm0
>   152: c4 e2 71 98 c1        vfmadd132ps %xmm1,%xmm1,%xmm0
>   157: c3                    retq
>   158: 0f 1f 84 00 00 00 00 nopl   0x0(%rax,%rax,1)
>   15f: 00
>
> 0000000000000160 <_Z4fct5Dv4_f>:
>   160: c5 f8 53 c0          vrcpps %xmm0,%xmm0
>   164: c3                    retq
>
> As you can see, fct4 is not equivalent to fct5.

Perhaps it's better ;)

It looks like the compiler has generated one Newton iteration after the 
estimate to increase the precision of the answer. The reciprocal 
estimate is, after all, only an estimate, and for many applications, is 
not sufficient on its own.
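
As a rough illustration of the Newton iteration described above, here is a minimal C++ sketch of one refinement step, x1 = x0 + x0 * (1 - a * x0), written as two fused multiply-adds in the same shape as the vfnmadd213ps / vfmadd132ps pair in the fct4 listing. The helper names are hypothetical, and `crude_reciprocal` is only an assumed stand-in for a hardware estimate (the exact reciprocal perturbed by a relative error of 2^-12), not the real vrcpps result.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for a hardware reciprocal estimate such as vrcpps:
// the exact reciprocal, deliberately perturbed by a relative error of 2^-12.
// This is an assumption for illustration, not the instruction's real output.
float crude_reciprocal(float a) {
    assert(a != 0.0f);
    return (1.0f / a) * (1.0f + std::ldexp(1.0f, -12));
}

// One Newton-Raphson step for f(x) = 1/x - a:
//   x1 = x0 + x0 * (1 - a * x0)
// expressed as two fused multiply-adds.
float refine_reciprocal(float a, float x0) {
    float residual = std::fma(-a, x0, 1.0f);  // 1 - a*x0
    return std::fma(x0, residual, x0);        // x0 + x0*residual
}

// Relative error of an approximation x to 1/a, evaluated in double precision.
double relative_error(float a, float x) {
    double exact = 1.0 / static_cast<double>(a);
    return std::fabs((static_cast<double>(x) - exact) / exact);
}
```

If the estimate has relative error e, one step leaves an error of about e squared, so calling `refine_reciprocal(a, crude_reciprocal(a))` should bring a 2^-12 estimate close to single-precision accuracy. That is the trade-off between the longer fct4 sequence and the bare vrcpps in fct5.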

This behavior is generally adjustable. Try using -mrecip=vec-divf:0 (or 
-mrecip=all:0) to turn off all of the Newton iterations.

  -Hal

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

Alexandre Bique via llvm-dev

2020-Sep-01 07:35 UTC

[llvm-dev] Should llvm optimize 1.0 / x ?

On Tue, Sep 1, 2020 at 9:05 AM Hal Finkel <hfinkel at anl.gov> wrote:
> Perhaps it's better ;)
>
> It looks like the compiler has generated one Newton iteration after the
> estimate to increase the precision of the answer. The reciprocal
> estimate is, after all, only an estimate, and for many applications, is
> not sufficient on its own.

Yes.

> This behavior is generally adjustable. Try using -mrecip=vec-divf:0 (or
> -mrecip=all:0) to turn off all of the Newton iterations.

Thank you very much! It did the job!
-- 
Alexandre BIQUE
