Hi all,

Thanks for the info. A few observations from my side:

LLVM:

cortex-a8 vfpv3: no vmla or vfma instruction emitted
cortex-a8 vfpv4: no vmla or vfma instruction emitted (this combination is invalid, though, as cortex-a8 does not have vfpv4)
cortex-a8 vfpv4 with ffp-contract=fast: vfma instruction emitted (this seems like a bug to me! If cortex-a8 doesn't come with vfpv4, then the generated vfma instructions will be invalid)
cortex-a15 vfpv4: vmla instruction emitted (which is a NEON instruction)
cortex-a15 vfpv4 with ffp-contract=fast: vfma instruction emitted

GCC:

cortex-a8 vfpv3: vmla instruction emitted
cortex-a15 vfpv4: vfma instruction emitted

I agree that NEON and VFP instructions shouldn't be used interchangeably.

However, if gcc emits a vmla (NEON) instruction for cortex-a8, then shouldn't LLVM also emit a vmla (NEON) instruction? Can someone please clarify this point? The performance gain with the vmla instruction is huge. I read somewhere that LLVM prefers precision over performance. Is this true, and is that why LLVM is not emitting vmla instructions for cortex-a8?

On Thu, Dec 19, 2013 at 6:41 AM, Kay Tiong Khoo <kkhoo at perfwizard.com> wrote:

> Just to clarify: gcc 4.8.1 generates that fma at -O2; no FP relaxation or
> other flags specified.
>
> On Wed, Dec 18, 2013 at 6:02 PM, Kay Tiong Khoo <kkhoo at perfwizard.com> wrote:
>
>> Thanks for the explanation, Tim!
>>
>> gcc 4.8.1 *does* generate an fma for your code example for an x86 target
>> that supports fma. I'd bet that the HW vendors' compilers do the same, but
>> I don't have any of those installed at the moment to test that theory. So
>> this is a bug in those compilers? Do you know how they justify it?
>>
>> I see section 6.5 "Expressions" in the C standard, and I can see that
>> 6.5.8 would seem to agree with you assuming that a "floating expression" is
>> a subset of "expression"...is there any other part of the standard that you
>> know of that I can reference?
>>
>> This is made a little weirder by the fact that gcc and clang have a
>> 'fast' setting for fp-contract, but the C standard that I'm looking at
>> states that it is just an "on-off-switch".
>>
>> On Wed, Dec 18, 2013 at 11:17 AM, Tim Northover <t.p.northover at gmail.com> wrote:
>>
>>> > http://llvm.org/bugs/show_bug.cgi?id=17188
>>> > http://llvm.org/bugs/show_bug.cgi?id=17211
>>>
>>> Ah, thanks. That makes a lot more sense now.
>>>
>>> > Correct - clang is different than gcc, icc, msvc, xlc, etc. on this. Still
>>> > haven't seen any explanation for how this is better though...
>>>
>>> That would be because it follows what C tells us a compiler has to do
>>> by default but provides overrides in either direction if you know what
>>> you're doing.
>>>
>>> The key point is that LLVM (currently) has no notion of statement
>>> boundaries, so it would fuse the operations in this function:
>>>
>>> float foo(float accum, float lhs, float rhs) {
>>>   float product = lhs * rhs;
>>>   return accum + product;
>>> }
>>>
>>> This isn't allowed even under FP_CONTRACT=on (the multiply and add do
>>> not occur within a single expression), so LLVM can't in good
>>> conscience enable these optimisations by default.
>>>
>>> Cheers.
>>>
>>> Tim.

--
With regards,
Suyog Sarda
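To make the rounding difference behind the quoted foo() example concrete, here is a minimal sketch in C. It is an illustration under stated assumptions, not what either compiler does internally: the function names are hypothetical, the input values are hand-picked so that the two strategies disagree, and the single-rounding path is emulated in double rather than produced by a real fused instruction (the standard fmaf from <math.h> would be the usual way to request one).

```c
#include <assert.h>

/* Two roundings: the product is rounded to float before the add.
   The volatile qualifier is only there to stop the compiler from
   contracting the two statements into one fused operation. */
float mul_then_add(float accum, float lhs, float rhs) {
    volatile float product = lhs * rhs;
    return accum + product;
}

/* Emulated single rounding (an assumption of this sketch): the product
   of two floats is exact in double, so for the inputs used below only
   the final conversion back to float could round. In general the double
   addition can round as well, so this is not a drop-in for fmaf. */
float mul_add_single_rounding(float accum, float lhs, float rhs) {
    double wide = (double)lhs * (double)rhs + (double)accum;
    return (float)wide;
}

/* Returns 1 if the two strategies disagree for the hand-picked inputs. */
int contraction_changes_result(void) {
    const float lhs = 1.0f + 1.0f / 4096.0f;      /* 1 + 2^-12 */
    const float accum = -(1.0f + 1.0f / 2048.0f); /* -(1 + 2^-11) */
    float separate = mul_then_add(accum, lhs, lhs);
    float fused = mul_add_single_rounding(accum, lhs, lhs);
    assert(fused >= separate); /* the low bit survives only when fused */
    return separate != fused;
}
```

With these inputs the exact product is 1 + 2^-11 + 2^-24, which sits halfway between two adjacent floats. Rounding it first discards the 2^-24 term, so the two-step version returns 0, while the single-rounding version returns 2^-24. That visible change in results is why contraction across statement boundaries is treated as something the user has to opt into.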
> cortex-a8 vfpv4 with ffp-contract=fast : vfma instruction emitted ( this
> seems a bug to me!! If cortex-a8 doesn't come with vfpv4 then vfma
> instructions generated will be invalid )

If I'm understanding correctly, you've specifically told it this Cortex-A8 *does* come with vfpv4. Those kinds of odd combinations can be useful sometimes (if only for tests), so I'm not sure policing them is a good idea.

> cortex-a15 vfpv4 : vmla instruction emitted (which is a NEON instruction)

I get a VFP vmla here rather than a NEON one (clang -target armv7-linux-gnueabihf -mcpu=cortex-a15): "vmla.f32 s0, s1, s2". Are you seeing something different?

> However, if gcc emits vmla (NEON) instruction with cortex-a8 then shouldn't
> LLVM also emit vmla (NEON) instruction?

It appears we've decided in the past that vmla just isn't worth it on Cortex-A8. There's this comment in the source:

    // Some processors have FP multiply-accumulate instructions that don't
    // play nicely with other VFP / NEON instructions, and it's generally better
    // to just not use them.

Sufficient benchmarking evidence could overturn that decision, but I assume the people who added it in the first place didn't do so on a whim.

> The performance gain with vmla instruction is huge.

Is it, on Cortex-A8? The TRM refers to them jumping across pipelines in odd ways, and that was a very primitive core so it's almost certainly not going to be just as good as a vmul (in fact if I'm reading correctly, it takes pretty much exactly the same time as separate vmul and vadd instructions, 10 cycles vs 2 * 5).

Cheers.

Tim.
Hi,

One more addition to the above observations:

LLVM: cortex-a15 + vfpv4-d16 + ffast-math option WITHOUT ffp-contract=fast option also emits a vfma instruction.

--
With regards,
Suyog Sarda
Hi Tim,

> > cortex-a15 vfpv4 : vmla instruction emitted (which is a NEON instruction)
>
> I get a VFP vmla here rather than a NEON one (clang -target
> armv7-linux-gnueabihf -mcpu=cortex-a15): "vmla.f32 s0, s1, s2". Are
> you seeing something different?

As per Renato's comment above, vmla is a NEON instruction while vfma is a VFP instruction. Correct me if I am wrong on this.

> > However, if gcc emits vmla (NEON) instruction with cortex-a8 then shouldn't
> > LLVM also emit vmla (NEON) instruction?
>
> It appears we've decided in the past that vmla just isn't worth it on
> Cortex-A8. There's this comment in the source:
>
> // Some processors have FP multiply-accumulate instructions that don't
> // play nicely with other VFP / NEON instructions, and it's generally better
> // to just not use them.
>
> Sufficient benchmarking evidence could overturn that decision, but I
> assume the people who added it in the first place didn't do so on a
> whim.
>
> > The performance gain with vmla instruction is huge.
>
> Is it, on Cortex-A8? The TRM refers to them jumping across pipelines
> in odd ways, and that was a very primitive core so it's almost
> certainly not going to be just as good as a vmul (in fact if I'm
> reading correctly, it takes pretty much exactly the same time as
> separate vmul and vadd instructions, 10 cycles vs 2 * 5).

It may seem that the total number of cycles is more or less the same for a single vmla and for vmul+vadd. However, when the vmul+vadd combination is used instead of vmla, intermediate results are generated which need to be stored in memory for future access. This leads to a lot of load/store ops being inserted, which degrades performance. Correct me if I am wrong on this, but my observations to date have shown this.

> Cheers.
>
> Tim.

--
With regards,
Suyog Sarda
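Whether the separate vmul + vadd sequence really adds loads and stores, and whether vmla is a net win on a given core, can be checked on a small kernel. The following is a minimal sketch of such a kernel in C; the function name dot is hypothetical and the code is not taken from any benchmark mentioned in the thread. The multiply and add sit in a single expression, so a compiler is free to pick vmla, vfma, or vmul + vadd depending on the CPU and FPU settings it is given.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply-accumulate kernel. Each iteration computes acc + a[i] * b[i]
   as one expression, which is the pattern that vmla/vfma selection
   applies to. */
float dot(const float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc = acc + a[i] * b[i];
    }
    return acc;
}
```

Compiling this with the command line quoted above (clang -target armv7-linux-gnueabihf -mcpu=cortex-a15), once per CPU/FPU combination, and inspecting the loop body shows which instructions are selected and whether the intermediate product goes to memory or stays in a register. Adding -O2 -S to emit optimised assembly is my assumption about how one would run it. Timing the same loop on real hardware would be one way to gather the benchmarking evidence asked for.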