thr3ads.net - llvm dev - [LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ? [Feb 2013]


Bob Wilson

2013-Feb-11 18:21 UTC

[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?

In theory, the backend should choose the best instructions for the selected
target processor.  VMLA is not always the best choice.  Lang Hames did some
measurements a while back to come up with the current behavior, but I don't
remember exactly what he found.  CC'ing Lang.

On Feb 11, 2013, at 8:12 AM, Renato Golin <renato.golin at linaro.org> wrote:
> On 11 February 2013 15:51, Sebastien DELDON-GNB <sebastien.deldon at st.com> wrote:
> > Indeed, the problem is with the generation of vmla.f64. The affected benchmark
> > is MILC from the SPEC 2006 suite, and disabling vmlx forwarding gives a 10%
> > speed-up on complete benchmark execution! So it is worth a try.
>
> Hi Sebastien,
>
> Indeed, worth having a look. Including Bob Wilson (who introduced the code
> in the first place, and is a connoisseur of NEON in LLVM) to see if he has a
> better idea of the problem.
>
> > Now going back to vmla generation through LLVM intrinsic usage. I’ve looked
> > at the .td file and it seems to me that when there is a “pattern” to generate
> > an instruction, no intrinsic is defined to generate it, correct?
>
> Correct.
>
> > Is it possible, for an instruction that is generated through a “pattern”, to
> > also add an LLVM intrinsic? My goal here is to not rely on LLVM to generate
> > VMLA but rather to have my front-end generate a call to a VMLA intrinsic I
> > would have defined when it thinks it’s appropriate to generate one.
>
> No, and I'm not sure we should have one.
>
> I understand why you want one, but that's too much back-end knowledge
> in a front-end, and any pass that can transform a pair of VMLAs into an
> intrinsic call can also transform them into VMLA+VMUL+VADD. In this case,
> disabling the optimization is probably the best course of action.
>
> In your compiler, you may prefer to leave it always disabled; in that case
> you should set it when creating the Target.
>
> If we find that this optimization produces worse code in more cases than
> not, then we should leave it disabled by default and let the user enable it
> when necessary. I'll let Bob follow up on that, since I don't know what
> benchmarks he used.
>
> cheers,
> --renato

Lang Hames

2013-Feb-11 22:18 UTC

[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?

Hi Bob, Seb, Renato,

My VMLA performance work was on Swift, rather than Cortex-A9.

Sebastien - is vmlx-forwarding really the only variable you changed between
your tests?

As far as I can see the VMLx forwarding attribute only exists to restrict
the application of one DAG combine optimization: PerformVMULCombine in
ARMISelLowering.cpp, which turns (A + B) * C into (A * C) + (B * C). This
combine only ever triggers when vmlx-forwarding is on. I'd usually expect
this to increase vmla formation, rather than decrease it, but under some
circumstances (e.g. when the (A * C) and (B * C) expressions have existing
uses) it might block their formation.

If you want to narrow the conditions for when PerformVMULCombine applies,
please feel free. Please don't remove the dependence of this optimization
on vmlx-forwarding though - we don't want it applying to targets that
don't have that feature.

Regards,
Lang.




Sebastien DELDON-GNB

2013-Feb-12 09:05 UTC

[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?

Hi all,

Sorry for my naïve question, but what is Swift?
Yes, vmlx-forwarding is the only variable I changed in my tests.
I ran the experiment on another popular FP benchmark and observed a 14% speed-up
just by disabling vmlx-forwarding.

Best Regards
Seb



