thr3ads.net - llvm dev - [LLVMdev] LLVM ARM VMLA instruction [Dec 2013]


suyog sarda

2013-Dec-19 08:00 UTC

[LLVMdev] LLVM ARM VMLA instruction

Hi all,


Thanks for the info. A few observations from my side:


LLVM:


cortex-a8 vfpv3: no vmla or vfma instruction emitted

cortex-a8 vfpv4: no vmla or vfma instruction emitted (this combination is
invalid though, as cortex-a8 does not have vfpv4)

cortex-a8 vfpv4 with -ffp-contract=fast: vfma instruction emitted (this
seems like a bug to me! If cortex-a8 doesn't come with vfpv4, then the
generated vfma instructions will be invalid)


cortex-a15 vfpv4: vmla instruction emitted (which is a NEON instruction)

cortex-a15 vfpv4 with -ffp-contract=fast: vfma instruction emitted


GCC:


cortex-a8 vfpv3: vmla instruction emitted

cortex-a15 vfpv4: vfma instruction emitted


I agree that NEON and VFP instructions shouldn't be used
interchangeably.


However, if gcc emits a vmla (NEON) instruction for cortex-a8, shouldn't
LLVM also emit a vmla (NEON) instruction? Can someone please clarify this
point? The performance gain with the vmla instruction is huge. I read
somewhere that LLVM prefers precision over performance. Is this true, and
is that why LLVM is not emitting vmla instructions for cortex-a8?



On Thu, Dec 19, 2013 at 6:41 AM, Kay Tiong Khoo <kkhoo at perfwizard.com> wrote:
> Just to clarify: gcc 4.8.1 generates that fma at -O2; no FP relaxation or
> other flags specified.
>
>
> On Wed, Dec 18, 2013 at 6:02 PM, Kay Tiong Khoo <kkhoo at perfwizard.com> wrote:
>
>> Thanks for the explanation, Tim!
>>
>> gcc 4.8.1 *does* generate an fma for your code example for an x86 target
>> that supports fma. I'd bet that the HW vendors' compilers do the same, but
>> I don't have any of those installed at the moment to test that theory. So
>> this is a bug in those compilers? Do you know how they justify it?
>>
>> I see section 6.5 "Expressions" in the C standard, and I can see that
>> 6.5.8 would seem to agree with you assuming that a "floating expression" is
>> a subset of "expression"...is there any other part of the standard that you
>> know of that I can reference?
>>
>> This is made a little weirder by the fact that gcc and clang have a
>> 'fast' setting for fp-contract, but the C standard that I'm looking at
>> states that it is just an "on-off-switch".
>>
>>
>> On Wed, Dec 18, 2013 at 11:17 AM, Tim Northover <t.p.northover at gmail.com> wrote:
>>
>>> > http://llvm.org/bugs/show_bug.cgi?id=17188
>>> > http://llvm.org/bugs/show_bug.cgi?id=17211
>>>
>>> Ah, thanks. That makes a lot more sense now.
>>>
>>> > Correct - clang is different than gcc, icc, msvc, xlc, etc. on this.
>>> > Still haven't seen any explanation for how this is better though...
>>>
>>> That would be because it follows what C tells us a compiler has to do
>>> by default but provides overrides in either direction if you know what
>>> you're doing.
>>>
>>> The key point is that LLVM (currently) has no notion of statement
>>> boundaries, so it would fuse the operations in this function:
>>>
>>> float foo(float accum, float lhs, float rhs) {
>>>   float product = lhs * rhs;
>>>   return accum + product;
>>> }
>>>
>>> This isn't allowed even under FP_CONTRACT=on (the multiply and add do
>>> not occur within a single expression), so LLVM can't in good
>>> conscience enable these optimisations by default.
>>>
>>> Cheers.
>>>
>>> Tim.
>>>
>>
>>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>

-- 
With regards,
Suyog Sarda

Tim Northover

2013-Dec-19 08:35 UTC


[LLVMdev] LLVM ARM VMLA instruction

> cortex-a8 vfpv4 with ffp-contract=fast : vfma instruction emitted ( this
> seems a bug to me!! If cortex-a8 doesn't come with vfpv4 then vfma
> instructions generated will be invalid )

If I'm understanding correctly, you've specifically told it this
Cortex-A8 *does* come with vfpv4. Those kinds of odd combinations can
be useful sometimes (if only for tests), so I'm not sure policing them
is a good idea.

> cortex-a15 vfpv4 : vmla instruction emitted (which is a NEON instruction)

I get a VFP vmla here rather than a NEON one (clang -target
armv7-linux-gnueabihf -mcpu=cortex-a15): "vmla.f32 s0, s1, s2". Are
you seeing something different?

> However, if gcc emits vmla (NEON) instruction with cortex-a8 then shouldn't
> LLVM also emit vmla (NEON) instruction?

It appears we've decided in the past that vmla just isn't worth it on
Cortex-A8. There's this comment in the source:

// Some processors have FP multiply-accumulate instructions that don't
// play nicely with other VFP / NEON instructions, and it's generally better
// to just not use them.

Sufficient benchmarking evidence could overturn that decision, but I
assume the people who added it in the first place didn't do so on a
whim.

> The performance gain with vmla instruction is huge.

Is it, on Cortex-A8? The TRM refers to them jumping across pipelines
in odd ways, and that was a very primitive core so it's almost
certainly not going to be just as good as a vmul (in fact if I'm
reading correctly, it takes pretty much exactly the same time as
separate vmul and vadd instructions, 10 cycles vs 2 * 5).
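
For anyone who wants to gather the benchmarking evidence mentioned above, the kind of kernel where the vmla versus vmul+vadd choice matters is a dependent multiply-accumulate chain. The following is a minimal sketch, not code from the thread; the function name `dot` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical kernel: each iteration multiplies two elements and adds
   the product to a running sum, so every accumulate depends on the
   previous one. */
float dot(const float *x, const float *y, size_t n) {
    float accum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        accum += x[i] * y[i];
    }
    return accum;
}
```

One way to inspect the generated code is to add -O2 -S to the command line quoted above (clang -target armv7-linux-gnueabihf -mcpu=cortex-a15) and then repeat with -mcpu=cortex-a8. What each compiler version actually emits is something to check rather than assume.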

Cheers.

Tim.

suyog sarda

2013-Dec-19 08:36 UTC


[LLVMdev] LLVM ARM VMLA instruction

Hi,

One more addition to the above observations:

LLVM :

cortex-a15 + vfpv4-d16 with the -ffast-math option but WITHOUT the
-ffp-contract=fast option also emits a vfma instruction.
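
When comparing flag combinations like the ones above, it can help to test both source forms that the thread distinguishes: a multiply and add within a single expression, and the same operations split across two statements (the `foo` example quoted earlier in the thread). The following is a minimal sketch under that assumption; the function names are hypothetical.

```c
/* Multiply and add inside one expression: per the discussion in this
   thread, this is the form the C standard lets a compiler contract
   when FP_CONTRACT is on. */
float mac_single_expression(float accum, float lhs, float rhs) {
    return accum + lhs * rhs;
}

/* Multiply and add in separate statements: per the thread, fusing
   these is not allowed under FP_CONTRACT=on, only under looser
   settings such as -ffp-contract=fast. */
float mac_two_statements(float accum, float lhs, float rhs) {
    float product = lhs * rhs;
    return accum + product;
}
```

Compiling this file to assembly with and without -ffast-math and with the different -ffp-contract settings should show which of the two functions, if either, ends up with a vfma on a given compiler version.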





-- 
With regards,
Suyog Sarda

suyog sarda

2013-Dec-19 08:50 UTC


[LLVMdev] LLVM ARM VMLA instruction

Hi Tim,





> > cortex-a15 vfpv4 : vmla instruction emitted (which is a NEON instruction)
>
> I get a VFP vmla here rather than a NEON one (clang -target
> armv7-linux-gnueabihf -mcpu=cortex-a15): "vmla.f32 s0, s1, s2". Are
> you seeing something different?
>
As per Renato's comment above, vmla is a NEON instruction while
vfma is a VFP instruction. Correct me if I am wrong on this.

>
> > However, if gcc emits vmla (NEON) instruction with cortex-a8 then shouldn't
> > LLVM also emit vmla (NEON) instruction?
>
> It appears we've decided in the past that vmla just isn't worth it on
> Cortex-A8. There's this comment in the source:
>
> // Some processors have FP multiply-accumulate instructions that don't
> // play nicely with other VFP / NEON instructions, and it's generally better
> // to just not use them.
>
> Sufficient benchmarking evidence could overturn that decision, but I
> assume the people who added it in the first place didn't do so on a
> whim.
>
> > The performance gain with vmla instruction is huge.
>
> Is it, on Cortex-A8? The TRM refers to them jumping across pipelines
> in odd ways, and that was a very primitive core so it's almost
> certainly not going to be just as good as a vmul (in fact if I'm
> reading correctly, it takes pretty much exactly the same time as
> separate vmul and vadd instructions, 10 cycles vs 2 * 5).
>
It may seem that the total number of cycles is more or less the same for a
single vmla and for vmul+vadd. However, when the vmul+vadd combination is
used instead of vmla, intermediate results are generated which need to be
stored in memory for future access. This leads to a lot of load/store ops
being inserted, which degrades performance. Correct me if I am wrong on
this, but my observations to date have shown this.

>
> Cheers.
>
> Tim.
>


-- 
With regards,
Suyog Sarda
