Samuel F Antao
2014-Jul-31 15:50 UTC
[LLVMdev] FPOpFusion = Fast and Multiply-and-add combines
Hi Tim,

Thanks for the thorough explanation. It makes perfect sense.

I was not aware that, unless fast-math is given, the compiler is supposed to prevent more precision from being used than what is in the standard.

I came across this issue while looking into the output of different compilers. The XL and Microsoft compilers seem to have that turned on by default, but I had assumed that clang follows what gcc does and has it turned off.

Thanks again,
Samuel

Tim Northover <t.p.northover at gmail.com> wrote on 07/31/2014 09:54:55 AM:

> Hi Samuel,
>
> On 30 July 2014 22:37, Samuel F Antao <sfantao at us.ibm.com> wrote:
> > In the DAGCombiner, during the combination of mul and add/subtract into
> > multiply-and-add/subtract, this option is expected to be Fast in order to
> > enable the combine. This means that by default no multiply-and-add opcodes
> > are going to be generated. If I understand it correctly, this is undesirable
> > given that multiply-and-add for targets like PPC (I am not sure about all
> > the other targets) does not pose any rounding problem and it can even be
> > more accurate than performing the two operations separately.
>
> That extra precision is actually what we're being very careful to
> avoid unless specifically told we're allowed. It can be just as
> harmful to carefully written floating-point code as dropping precision
> would be.
>
> > Also, in TargetOptions.h I read:
> >
> >   Standard, // Only allow fusion of 'blessed' ops (currently just fmuladd)
> >
> > which made me suspect that the check against Fast in the DAGCombiner is not
> > correct.
>
> I think it's OK. In the IR there are 3 different ways to express mul + add:
>
> 1. fmul + fadd. This must not be fused into a single step without
> intermediate rounding (unless we're in Fast mode).
> 2. call @llvm.fmuladd. This *may* be fused or not, depending on
> profitability (unless we're in Strict mode, in which case it's
> separate).
> 3. call @llvm.fma. This must not be split into two operations (unless
> we're in Fast mode).
>
> That middle one is there because C actually allows you to allow &
> disallow contraction within a limited region with "#pragma STDC
> FP_CONTRACT ON". So we need a way to represent the idea that it's not
> usually OK to fuse them (i.e. not Fast mode), but this particular one
> actually is OK.
>
> Cheers.
>
> Tim.
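For readers following along, the three IR forms Tim lists correspond roughly to the C-level constructs sketched below. This is only an illustration: whether a given clang version honors the FP_CONTRACT pragma, and whether fmaf() is lowered to @llvm.fma rather than left as a libm call, depends on the compiler version and math flags, so treat the mapping in the comments as intended behavior rather than a guarantee.

#include <math.h>

/* 1. Plain mul + add: with contraction disabled (e.g. -ffp-contract=off)
 *    this stays fmul + fadd, i.e. two separately rounded operations.     */
float separate_ops(float x, float y, float z) { return x * y + z; }

/* 2. Same expression, but with contraction explicitly permitted for this
 *    region: the front end may emit @llvm.fmuladd, which the backend is
 *    then free to fuse or split depending on profitability.              */
float contractible(float x, float y, float z) {
#pragma STDC FP_CONTRACT ON
    return x * y + z;
}

/* 3. An explicit fma() call: this is the usual route to @llvm.fma and is
 *    meant to stay fused (a single rounding) regardless of the
 *    contraction setting.                                                */
float always_fused(float x, float y, float z) { return fmaf(x, y, z); }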
Sanjay Patel
2014-Aug-06 18:30 UTC
[LLVMdev] FPOpFusion = Fast and Multiply-and-add combines
Hi Samuel,

I don't think clang follows what gcc does regarding FMA - at least by default. I don't have a PPC compiler to test with, but for x86-64 using clang trunk and gcc 4.9:

$ cat fma.c
float foo(float x, float y, float z) { return x * y + z; }

$ ./clang -march=core-avx2 -O2 -S fma.c -o - | grep ss
	vmulss	%xmm1, %xmm0, %xmm0
	vaddss	%xmm2, %xmm0, %xmm0

$ ./gcc -march=core-avx2 -O2 -S fma.c -o - | grep ss
	vfmadd132ss	%xmm1, %xmm2, %xmm0

----------------------------------------------------------------------
This was brought up in Dec 2013 on this list:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-December/068868.html

I don't see an answer as to whether this is a bug for all the other compilers, a deficiency in clang's default settings, or just an implementation choice.

Sanjay

On Thu, Jul 31, 2014 at 9:50 AM, Samuel F Antao <sfantao at us.ibm.com> wrote:
> I came across this issue while looking into the output of different
> compilers. The XL and Microsoft compilers seem to have that turned on by
> default, but I had assumed that clang follows what gcc does and has it
> turned off.
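For anyone reproducing Sanjay's experiment, clang also exposes the contraction setting directly through the -ffp-contract driver flag (flag names taken from the clang documentation; the exact FMA instruction chosen, if any, will vary by version and target):

$ ./clang -march=core-avx2 -O2 -ffp-contract=fast -S fma.c -o - | grep ss   # a fused vfmadd* form is expected
$ ./clang -march=core-avx2 -O2 -ffp-contract=off -S fma.c -o - | grep ss    # the separate vmulss/vaddss pair is expected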
Samuel F Antao
2014-Aug-07 02:37 UTC
[LLVMdev] FPOpFusion = Fast and Multiply-and-add combines
Hi Sanjay,

You are right. I tried XL and gcc 4.8.2 for PPC and I also got multiply-and-add operations.

I based my statement on what I read in the gcc man page. -ffast-math is used in clang to set fp-contract to fast (the default is standard), whereas in gcc it activates (among others) the flag -funsafe-math-optimizations, whose description includes: "Allow optimizations for floating-point arithmetic that (a) assume that arguments and results are valid and (b) may violate IEEE or ANSI standards."

I am not a floating-point expert; for the applications I care about, more precision is usually better, and that is what muladd provides. Given Tim's explanation, I thought that muladd would conflict with (b) and that some users would expect the exact roundings of the mul and the add. However, I found this statement in Section 5 of the IEEE floating-point standard:

"Each of the computational operations that return a numeric result specified by this standard shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that intermediate result, ..."

which perfectly fits what the muladd instructions in PPC and also in AVX2 are doing: using infinite precision after the multiply.

It may be that there is something in the C/C++ standards I am not aware of that makes the fusing illegal. As you said, another reason may be just an implementation choice. But in that case I believe we would be making a bad choice, as I suspect there are many more users looking for faster execution than users relying on a particular rounding property.

Maybe there is someone who can shed some light on this?

Thanks,
Samuel

Sanjay Patel <spatel at rotateright.com> wrote on 08/06/2014 02:30:17 PM:

> I don't see an answer as to whether this is a bug for all the other
> compilers, a deficiency in clang's default settings, or just an
> implementation choice.
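To make the precision point concrete, here is a small self-contained C sketch (not from the thread) that uses fmaf() from math.h to recover the rounding error of a single-precision product, something the separately rounded mul + add cannot do. The expected values in the comments assume round-to-nearest-even, a correctly fused fmaf(), and that the plain expression is not itself contracted (build e.g. with cc -std=c99 -ffp-contract=off recover.c -lm; the file name and exact flags are just an example):

#include <math.h>
#include <stdio.h>

int main(void) {
    /* x = y = 1 + 2^-12: the exact product 1 + 2^-11 + 2^-24 needs more
     * than float's 24 significand bits, so a plain multiply must round. */
    float x = 1.0f + 0x1p-12f;
    float y = x;

    float p = x * y;              /* product rounded once to float         */
    float fused = fmaf(x, y, -p); /* x*y - p with a single rounding: this
                                   * recovers the rounding error exactly   */
    float separate = x * y - p;   /* x*y rounds to p first, so this is 0   */

    printf("fused    = %a\n", fused);    /* expected: 0x1p-24 */
    printf("separate = %a\n", separate); /* expected: 0x0p+0  */
    return 0;
}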