search for: vmla

Displaying 20 results from an estimated 61 matches for "vmla".

2013 Dec 19
4
[LLVMdev] LLVM ARM VMLA instruction
Hi Tim, > > cortex-a15 vfpv4 : vmla instruction emitted (which is a NEON instruction) > > I get a VFP vmla here rather than a NEON one (clang -target > armv7-linux-gnueabihf -mcpu=cortex-a15): "vmla.f32 s0, s1, s2". Are > you seeing something different? > As per Renato's comment above, the vmla instruction is NEON...
2013 Dec 19
0
[LLVMdev] LLVM ARM VMLA instruction
...nstructions generated will be invalid ) If I'm understanding correctly, you've specifically told it this Cortex-A8 *does* come with vfpv4. Those kinds of odd combinations can be useful sometimes (if only for tests), so I'm not sure policing them is a good idea. > cortex-a15 vfpv4 : vmla instruction emitted (which is a NEON instruction) I get a VFP vmla here rather than a NEON one (clang -target armv7-linux-gnueabihf -mcpu=cortex-a15): "vmla.f32 s0, s1, s2". Are you seeing something different? > However, if gcc emits vmla (NEON) instruction with cortex-a8 then should...
2013 Dec 19
3
[LLVMdev] LLVM ARM VMLA instruction
Hi all, Thanks for the info. A few observations from my side. LLVM: cortex-a8 vfpv3: no vmla or vfma instruction emitted; cortex-a8 vfpv4: no vmla or vfma instruction emitted (this combination is invalid, though, as cortex-a8 does not have vfpv4); cortex-a8 vfpv4 with ffp-contract=fast: vfma instruction emitted (this seems a bug to me! If cortex-a8 doesn't come with vfpv4 then vfma instructions...
2013 Dec 19
0
[LLVMdev] LLVM ARM VMLA instruction
On 19 December 2013 08:50, suyog sarda <sardask01 at gmail.com> wrote: > It may seem that the total number of cycles is more or less the same for a single > vmla and for vmul+vadd. However, when the vmul+vadd combination is used instead of > vmla, intermediate results are generated which need to be stored > in memory for future access. This leads to a lot of load/store ops being > inserted, which degrades performance. Correct me if I am wrong on...
2013 Dec 19
2
[LLVMdev] LLVM ARM VMLA instruction
On Thu, Dec 19, 2013 at 4:36 PM, Renato Golin <renato.golin at linaro.org> wrote: > On 19 December 2013 08:50, suyog sarda <sardask01 at gmail.com> wrote: > >> It may seem that the total number of cycles is more or less the same for a single >> vmla and for vmul+vadd. However, when the vmul+vadd combination is used instead of >> vmla, intermediate results are generated which need to be stored >> in memory for future access. This leads to a lot of load/store ops being >> inserted, which degrades performance. Correct me if I...
2013 Dec 19
3
[LLVMdev] LLVM ARM VMLA instruction
...than in bits and pieces. I was basically comparing the performance of clang and gcc code for the benchmarks listed in llvm trunk. I found that wherever there were floating-point ops (specifically floating-point multiplication), performance with clang was bad. On analyzing those issues further, I came across the vmla instructions emitted by gcc. The test cases hit by bad performance of clang are: Test Case | No. of vmla instructions emitted by gcc (clang does not emit vmla for cortex-a8) | llvm/projects/test-suite/SingleSource/Benchmarks/Misc-C++/Large/sp...
2013 Feb 11
2
[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?
In theory, the backend should choose the best instructions for the selected target processor. VMLA is not always the best choice. Lang Hames did some measurements a while back to come up with the current behavior, but I don't remember exactly what he found. CC'ing Lang. On Feb 11, 2013, at 8:12 AM, Renato Golin <renato.golin at linaro.org> wrote: > On 11 February 2013 15:51,...
2013 Feb 08
2
[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?
...ear understanding what this option is meant for. So here are my new questions: Why is vmlx-forwarding enabled by default for cortex-a9? Is it to guarantee correctness or for performance purposes? I've run some experiments, and DISABLING vmlx-forwarding for cortex-a9 leads to the generation of more vmla/vmls .f32 instructions and significantly improves some benchmarks. I have not run into a case where it significantly degrades performance or gives incorrect answers. Thus my goal is to use my front-end to generate LLVM NEON intrinsics that map to LLVM vmla/vmls f32 when I think it is appropriate and not to...
2013 Feb 11
3
[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?
Hi Renato, Indeed, the problem is with the generation of vmla.f64. The affected benchmark is MILC from the SPEC 2006 suite, and disabling vmlx forwarding gives a 10% speed-up on complete benchmark execution! So it is worth a try. Now going back to vmla generation through LLVM intrinsic usage. I've looked at the .td file and it seems to me that when there is a "p...
2013 Dec 19
0
[LLVMdev] LLVM ARM VMLA instruction
> As per Renato's comment above, the vmla instruction is a NEON instruction while vfma is a VFP instruction. Correct me if I am wrong on this. My version of the ARM architecture reference manual (v7 A & R) lists versions requiring NEON and versions requiring VFP. (Section A8.8.337). Split in just the way you'd expect (SIMD variants ne...
2013 Dec 19
0
[LLVMdev] LLVM ARM VMLA instruction
...is small and that we decide to pay the price, but not until we know what the cost is. This was tested on real hardware. Time taken for a 4x4 matrix > multiplication: > What hardware? A7? A8? A9? A15? Also, as stated by Renato - "there is a pipeline stall between two > sequential VMLAs (possibly due to the need of re-use of some registers) and > this made code much slower than a sequence of VMLA+VMUL+VADD", when I use > -mcpu=cortex-a15 as an option, clang emits vmla instructions back to > back (sequential). Is there something different with cortex-a15 regarding >...
2013 Dec 18
2
[LLVMdev] LLVM ARM VMLA instruction
Hi, I was going through the code of LLVM instruction code generation for ARM. I came across VMLA instruction hazards (floating-point multiply and accumulate). I was comparing assembly code emitted by LLVM and GCC, where I saw that GCC was happily using the VMLA instruction for floating point while LLVM never used it; instead it used a pair of VMUL and VADD instructions. I wanted to know if there i...
2013 Feb 11
0
[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?
Hi Bob, Seb, Renato, My VMLA performance work was on Swift, rather than Cortex-A9. Sebastian - is vmlx-forwarding really the only variable you changed between your tests? As far as I can see the VMLx forwarding attribute only exists to restrict the application of one DAG combine optimization: PerformVMULCombine in ARMISelLow...
2013 Feb 08
0
[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?
...Sebastien DELDON-GNB <sebastien.deldon at st.com> wrote: > Why is vmlx-forwarding enabled by default for cortex-a9? Is it to > guarantee correctness or for performance purposes? I’ve run some > experiments, and DISABLING vmlx-forwarding for cortex-a9 leads to the generation > of more vmla/vmls .f32 instructions and significantly improves some benchmarks. I have not > run into a case where it significantly degrades performance or gives > incorrect answers. > I believe this is what you're looking for: http://article.gmane.org/gmane.comp.compilers.llvm.cvs/90709 Performance only, but...
2013 Dec 18
1
[LLVMdev] LLVM ARM VMLA instruction
Hi, I was going through the code of LLVM instruction code generation for ARM. I came across VMLA instruction hazards (floating-point multiply and accumulate). I was comparing assembly code emitted by LLVM and GCC, where I saw that GCC was happily using the VMLA instruction for floating point while LLVM never used it; instead it used a pair of VMUL and VADD instructions. I wanted to know if there i...
2013 Dec 20
0
[LLVMdev] LLVM ARM VMLA instruction
...hardware as > well (Sorry for the trouble)? I've got a BeagleBone hanging around, and tested Clang against a hacked version of itself (without the VMLx disabling on Cortex-A8). The results (for matmul_f64_4x4, -O3 -mcpu=cortex-a8) were: 1. vfpv3-d16, stock clang: 96.2s; 2. vfpv3-d16, clang + vmla: 95.7s; 3. vfpv3, stock clang: 82.9s; 4. vfpv3, clang + vmla: 81.1s. Worth investigating more, but as the others have said nowhere near enough data on its own. Especially since Evan clearly did some benchmarking himself before specifically disabling the vmla formation. > Also, I will > be glad...
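The result above times a benchmark called matmul_f64_4x4, whose source is not shown in the snippet. As a rough sketch only, assuming a straightforward row-major 4x4 double-precision multiply (the function body and layout below are an illustration, not the code that was actually measured), the inner loop is a chain of multiply-accumulate steps, which is the pattern where vmla.f64 versus vmul.f64+vadd.f64 selection matters:

```c
#include <stddef.h>

/* Hypothetical 4x4 double-precision matrix multiply, row-major storage.
 * Assumption: this is NOT the benchmark source from the thread, only a
 * sketch of the kind of kernel its name suggests. */
void matmul_f64_4x4(const double *a, const double *b, double *c) {
    for (size_t i = 0; i < 4; ++i) {
        for (size_t j = 0; j < 4; ++j) {
            double acc = 0.0;
            for (size_t k = 0; k < 4; ++k) {
                /* acc + x * y: a backend may pick vmla.f64 here, or a
                 * vmul.f64 followed by vadd.f64, depending on tuning. */
                acc += a[i * 4 + k] * b[k * 4 + j];
            }
            c[i * 4 + j] = acc;
        }
    }
}
```

Each output element depends on four accumulate steps in sequence, so any stall between back-to-back multiply-accumulates would show up directly in a kernel of this shape.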
2013 Feb 08
2
[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?
Hi all, Everything is in the title: I would like to enforce generation of the vmla.f32 instruction for scalar operations on cortex-a9, so is there an LLVM NEON intrinsic available for that? Thanks for your answers. Best Regards, Seb
2013 Dec 19
1
[LLVMdev] LLVM ARM VMLA instruction
On Thu, Dec 19, 2013 at 2:43 PM, Tim Northover <t.p.northover at gmail.com> wrote: > > As per Renato's comment above, the vmla instruction is a NEON instruction while > vfma is a VFP instruction. Correct me if I am wrong on this. > > My version of the ARM architecture reference manual (v7 A & R) lists > versions requiring NEON and versions requiring VFP. (Section > A8.8.337). Split in just the way you'd...
2013 Feb 11
0
[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?
On 11 February 2013 15:51, Sebastien DELDON-GNB <sebastien.deldon at st.com> wrote: > Indeed, the problem is with the generation of vmla.f64. The affected benchmark is MILC > from the SPEC 2006 suite, and disabling vmlx forwarding gives a 10% speed-up on > complete benchmark execution! So it is worth a try. > Hi Sebastien, Indeed, worth having a look. Including Bob Wilson (who introduced the code in the first place, and is a conno...
2013 Feb 08
0
[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?
On 8 February 2013 10:40, Sebastien DELDON-GNB <sebastien.deldon at st.com> wrote: > Hi all, > > Everything is in the title: I would like to enforce generation of the vmla.f32 > instruction for scalar operations on cortex-a9, so is there an LLVM NEON > intrinsic available for that? > > Hi Sebastien, LLVM doesn't use intrinsics when there is a clear way of representing the same thing on standard IR. In the case of VMLA, it is generated from a pat...
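The reply above says VMLA is formed from ordinary IR rather than requested through an intrinsic. A minimal sketch of what that looks like at the C level follows; the function name `mac_f32` and the file name `mac.c` are hypothetical and do not come from the thread. A plain multiply feeding an add is the shape a backend can select as either a single vmla.f32 or a vmul.f32+vadd.f32 pair:

```c
/* Scalar multiply-accumulate written as plain arithmetic, no intrinsics.
 * Assumption: the name mac_f32 is illustrative only. Whether this becomes
 * vmla.f32 or vmul.f32 + vadd.f32 is the compiler's choice for the
 * selected CPU; the source code does not force either. */
float mac_f32(float acc, float a, float b) {
    return acc + a * b;
}
```

One way to inspect the outcome is to compile to assembly with the target flags quoted elsewhere in these results, for example `clang -target armv7-linux-gnueabihf -mcpu=cortex-a15 -O2 -S mac.c`, and then repeat with `-mcpu=cortex-a8`. Which instructions appear will depend on the compiler version and its per-CPU tuning, so treat any single result as a data point rather than a rule.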