thr3ads.net - similar to: "[LLVMdev] LLVM ARM VMLA instruction"

Displaying 20 results from an estimated 1000 matches similar to: "[LLVMdev] LLVM ARM VMLA instruction"

2013 Dec 19

[LLVMdev] LLVM ARM VMLA instruction

>> Test case name : llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c -
>> This is a 4x4 matrix multiplication, we can make small changes to make it a 3x3 matrix multiplication for making things simple to understand.
>
> This is one very specific case. How does that behave on all other cases?
> Normally, every big improvement comes with

2013 Dec 19

[LLVMdev] LLVM ARM VMLA instruction

On 19 December 2013 08:50, suyog sarda <sardask01 at gmail.com> wrote:
> It may seem that total number of cycles are more or less same for single vmla and vmul+vadd. However, when vmul+vadd combination is used instead of vmla, then intermediate results will be generated which needs to be stored in memory for future access. This will lead to lot of load/store ops being

2013 Dec 19

[LLVMdev] LLVM ARM VMLA instruction

On Thu, Dec 19, 2013 at 4:36 PM, Renato Golin <renato.golin at linaro.org> wrote:
> On 19 December 2013 08:50, suyog sarda <sardask01 at gmail.com> wrote:
>
>> It may seem that total number of cycles are more or less same for single vmla and vmul+vadd. However, when vmul+vadd combination is used instead of vmla, then intermediate results will be generated

2013 Dec 19

[LLVMdev] LLVM ARM VMLA instruction

Hi Tim,

>> cortex-a15 vfpv4 : vmla instruction emitted (which is a NEON instruction)
>
> I get a VFP vmla here rather than a NEON one (clang -target armv7-linux-gnueabihf -mcpu=cortex-a15): "vmla.f32 s0, s1, s2". Are you seeing something different?

As per Renato's comment above, vmla instruction is NEON instruction while vfma is VFP instruction. Correct

2013 Dec 19

[LLVMdev] LLVM ARM VMLA instruction

On Thu, Dec 19, 2013 at 2:43 PM, Tim Northover <t.p.northover at gmail.com> wrote:
>> As per Renato comment above, vmla instruction is NEON instruction while vfma is VFP instruction. Correct me if i am wrong on this.
>
> My version of the ARM architecture reference manual (v7 A & R) lists versions requiring NEON and versions requiring VFP. (Section A8.8.337).

2013 Dec 19

[LLVMdev] LLVM ARM VMLA instruction

On 19 December 2013 11:16, suyog sarda <sardask01 at gmail.com> wrote:
> Test case name : llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c -
> This is a 4x4 matrix multiplication, we can make small changes to make it a 3x3 matrix multiplication for making things simple to understand.

This is one very specific case. How does that behave on all

2013 Dec 19

[LLVMdev] LLVM ARM VMLA instruction

Hi all,

Thanks for the info. Few observations from my side:

LLVM:
cortex-a8 vfpv3 : no vmla or vfma instruction emitted
cortex-a8 vfpv4 : no vmla or vfma instruction emitted (This is invalid though, as cortex-a8 does not have vfpv4)
cortex-a8 vfpv4 with ffp-contract=fast : vfma instruction emitted (this seems a bug to me!! If cortex-a8 doesn't come with vfpv4 then vfma instructions

2013 Dec 18

[LLVMdev] LLVM ARM VMLA instruction

Hi, I was going through the code of LLVM's instruction code generation for ARM. I came across VMLA instruction hazards (floating point multiply and accumulate). I was comparing assembly code emitted by LLVM and GCC, where I saw that GCC was happily using the VMLA instruction for floating point while LLVM never used it; instead it used a pair of VMUL and VADD instructions. I wanted to know if there is

2013 Feb 08

[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?

Hi Renato, Thanks for the answer, it confirms what I was suspecting. My problem is that this behavior is controlled by vmlx forwarding on cortex-a9, for which, despite asking on this list, I couldn't get a clear understanding of what this option is meant for. So here are my new questions: Why is vmlx-forwarding enabled by default for cortex-a9? Is it to guarantee correctness or for performance

2013 Feb 11

[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?

Hi Bob, Seb, Renato, My VMLA performance work was on Swift, rather than Cortex-A9. Sebastian - is vmlx-forwarding really the only variable you changed between your tests? As far as I can see the VMLx forwarding attribute only exists to restrict the application of one DAG combine optimization: PerformVMULCombine in ARMISelLowering.cpp, which turns (A + B) * C into (A * C) + (B * C). This
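The rewrite mentioned in this snippet, (A + B) * C into (A * C) + (B * C), can be written out by hand as a rough sketch. The function names below are hypothetical and are not LLVM identifiers; the code only illustrates the two expression shapes, not what PerformVMULCombine does internally.

```c
/* Illustrative only: the two expression shapes named in the snippet above.
   Function names are hypothetical, not taken from LLVM. */

/* Original shape: one add feeding one multiply. */
float add_then_mul(float a, float b, float c) {
    return (a + b) * c;
}

/* Distributed shape: a plain multiply, then a second multiply whose
   result is accumulated onto the first. The second step is the kind of
   expression a multiply-accumulate instruction can cover. */
float mul_then_mla(float a, float b, float c) {
    float acc = a * c;   /* plain multiply */
    return acc + b * c;  /* multiply-accumulate candidate */
}
```

The two forms are algebraically equal, but in floating point they are not guaranteed to round identically for all inputs.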

2013 Feb 11

[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?

In theory, the backend should choose the best instructions for the selected target processor. VMLA is not always the best choice. Lang Hames did some measurements a while back to come up with the current behavior, but I don't remember exactly what he found. CC'ing Lang.

On Feb 11, 2013, at 8:12 AM, Renato Golin <renato.golin at linaro.org> wrote:
> On 11 February 2013 15:51,

2013 Dec 20

[LLVMdev] LLVM ARM VMLA instruction

Hi Suyog,

> I tested it on A15, i don't have access to A8 right now, but i intend to test it for A8 as well.

That's extremely dodgy, the two processors are very different.

> I don't think i will get A8 hardware soon, can someone please check it on A8 hardware as well (Sorry for the trouble)?

I've got a BeagleBone hanging around, and tested Clang against a

2013 Feb 08

[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?

On 8 February 2013 10:40, Sebastien DELDON-GNB <sebastien.deldon at st.com> wrote:
> Hi all,
>
> Everything is in the title, I would like to enforce generation of vmla.f32 instruction for scalar operations on cortex-a9, so is there a LLVM neon intrinsic available for that?

Hi Sebastien,

LLVM doesn't use intrinsics when there is a

2013 Dec 19

[LLVMdev] LLVM ARM VMLA instruction

> cortex-a8 vfpv4 with ffp-contract=fast : vfma instruction emitted (this seems a bug to me!! If cortex-a8 doesn't come with vfpv4 then vfma instructions generated will be invalid)

If I'm understanding correctly, you've specifically told it this Cortex-A8 *does* come with vfpv4. Those kinds of odd combinations can be useful sometimes (if only for tests), so I'm not

2013 Dec 19

[LLVMdev] LLVM ARM VMLA instruction

> As per Renato comment above, vmla instruction is NEON instruction while vfma is VFP instruction. Correct me if i am wrong on this.

My version of the ARM architecture reference manual (v7 A & R) lists versions requiring NEON and versions requiring VFP. (Section A8.8.337). Split in just the way you'd expect (SIMD variants need NEON).

> It may seem that total number of cycles are

2013 Feb 08

[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?

On 8 February 2013 12:28, Sebastien DELDON-GNB <sebastien.deldon at st.com> wrote:
> Why for cortex-a9 vmlx-forwarding is enabled by default ? Is it to guarantee correctness or for performance purpose ? I’ve made some experiments and DISABLING vmlx-forwarding for cortex-a9 leads to generation of more vmla/vmls .f32 and significantly improve some benchmarks. I’ve not

2013 Feb 11

[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?

On 11 February 2013 15:51, Sebastien DELDON-GNB <sebastien.deldon at st.com> wrote:
> Indeed problem is with generation of vmla.f64. Affected benchmark is MILC from SPEC 2006 suite and disabling vmlx forwarding gives a 10% speed-up on complete benchmark execution ! So it is worth a try.

Hi Sebastien,

Indeed, worth having a look. Including Bob Wilson (who introduced the

2013 Dec 18

[LLVMdev] LLVM ARM VMLA instruction

> I believe that's the NEON VMLA, not the VFP one.

Turns out I was misreading the assembly. I wish "vmla" and "vfma" weren't so similar-looking.

For Suyog that means the option "-ffp-contract=fast" is needed to get vfma when needed. Sorry about the bad information earlier.

Cheers.

Tim.
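For readers who want to try this, a minimal C sketch of the source pattern under discussion follows. The function name is hypothetical. One way to inspect the generated assembly is to reuse the target options quoted elsewhere in this thread, for example `clang -target armv7-linux-gnueabihf -mcpu=cortex-a15 -O2 -S`, with and without `-ffp-contract=fast`. Which of vmul+vadd, vmla, or vfma appears is an assumption to check, since it depends on the compiler version, the CPU and the FPU selected.

```c
/* Hypothetical helper: a separate multiply and add in the source.
   Whether the compiler keeps them separate or merges them into a
   single multiply-accumulate instruction depends on the target and
   on floating-point contraction settings such as -ffp-contract. */
float mul_add(float acc, float a, float b) {
    return acc + a * b;
}
```

A fused form rounds once instead of twice, so results can differ in the last bit from the separate multiply and add; that is why contraction is controlled by a flag rather than always applied.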

2013 Feb 12

[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?

Hi all,

Sorry for my naïve question but what is Swift? Yes, vmlx-forwarding is the only variable I changed in my tests. I did the experiment on another popular FP benchmark and observed a 14% speed-up only by disabling vmlx-forwarding.

Best Regards
Seb

> My VMLA performance work was on Swift, rather than Cortex-A9. Sebastian - is vmlx-forwarding really the only variable you changed between your

2013 Feb 11

[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?

Hi Renato, Indeed, the problem is with generation of vmla.f64. The affected benchmark is MILC from the SPEC 2006 suite, and disabling vmlx forwarding gives a 10% speed-up on complete benchmark execution! So it is worth a try. Now going back to vmla generation through LLVM intrinsic usage. I've looked at the .td file and it seems to me that when there is a "pattern" to generate instruction, no
