similar to: [LLVMdev] ARM NEON VMUL.f32 issue

Displaying 20 results from an estimated 2000 matches similar to: "[LLVMdev] ARM NEON VMUL.f32 issue"

2013 Mar 20
0
[LLVMdev] ARM NEON VMUL.f32 issue
Hi, | The question is: | * is this a problem with the test, that shouldn't be expecting values below FLT_MIN, or | * is it a bug in the lowering, that should only be lowering to NEON's VMUL when unsafe-math is on, or | * neither, and people should disable that when they want correctness? Note that if you go for the second option, IMO unsafe-math is _far_ too "aggressive" an
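For context, a minimal sketch of the behaviour under discussion, assuming a target where scalar single-precision multiplies are lowered to NEON's VMUL (which operates in flush-to-zero mode); the variable names are illustrative:

#include <float.h>
#include <stdio.h>

int main(void) {
    /* A subnormal value, well below FLT_MIN. */
    volatile float tiny = FLT_MIN / 4.0f;
    volatile float half = 0.5f;

    /* An IEEE-754 (VFP) multiply keeps a nonzero subnormal result here;
     * a NEON vmul.f32 flushes it to exactly 0.0f. */
    float r = tiny * half;

    printf("tiny * half = %a (flushed to zero: %s)\n",
           (double)r, r == 0.0f ? "yes" : "no");
    return 0;
}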
2013 Mar 19
0
[LLVMdev] ARM NEON VMUL.f32 issue
Hi Renato, You're right. Strictly speaking, using NEON for scalar floating point isn't completely safe for exactly this reason (also NaNs, IIRC). We generally do it anyway because on common cores (cortex-a8) VFP is pretty terrible and the NEON approximation is correct for the vast majority of use-cases that people care about. Yes, that's cutting some corners. Would you mind making
2013 Jun 07
3
[LLVMdev] NEON vector instructions and the fast math IR flags
On 7 June 2013 08:48, Tobias Grosser <tobias at grosser.es> wrote: > When to set which subtarget feature is a policy decision, where I honestly > don't have any opinion on for clang. The best is probably to mirror the gcc > behavior on linux targets. > Not really, since GCC has no special behaviour for Darwin, AFAIK. My change will only generate SP-FP on NEON for A5 and A8
2013 Jun 07
0
[LLVMdev] NEON vector instructions and the fast math IR flags
On 06/06/2013 11:58 PM, Renato Golin wrote: > On 7 June 2013 07:05, Owen Anderson <resistor at mac.com> wrote: Hi Owen, hi Renato, thanks for your replies. >> Darwin uses NEON for floating point, but does *not* (and should not) >> globally enable fast math flags. Use of NEON for FP needs to remain >> achievable without globally setting the fast math flags. Fast
2013 Jun 07
0
[LLVMdev] NEON vector instructions and the fast math IR flags
On Jun 7, 2013, at 3:14 AM, Renato Golin <renato.golin at linaro.org> wrote: > On 7 June 2013 08:48, Tobias Grosser <tobias at grosser.es> wrote: > When to set which subtarget feature is a policy decision, where I honestly don't have any opinion on for clang. The best is probably to mirror the gcc behavior on linux targets. > > Not really, since GCC has no special
2013 Jun 07
3
[LLVMdev] NEON vector instructions and the fast math IR flags
On 7 June 2013 07:05, Owen Anderson <resistor at mac.com> wrote: > Darwin uses NEON for floating point, but does *not* (and should not) > globally enable fast math flags. Use of NEON for FP needs to remain > achievable without globally setting the fast math flags. Fast math may > reasonably imply NEON, but the opposite direction is not accurate. > > That said, I
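To make the distinction concrete, here is a tiny example; the clang invocations in the comment are illustrative, and whether NEON is used for the scalar multiply depends on the target and its subtarget features, as discussed in this thread:

/* Scalar single-precision multiply. Lowering it onto NEON is a backend
 * policy choice (non-IEEE for subnormals) and does not require the IR
 * to carry fast-math flags:
 *
 *   clang -target armv7-linux-gnueabihf -mcpu=cortex-a8 -mfpu=neon -O2 -S mul.c
 *     may pick a NEON vmul.f32 for this function, while
 *   clang ... -ffast-math ...
 *     additionally attaches fast-math flags to the fmul in the IR.
 *
 * Fast math may reasonably imply NEON; NEON use does not imply fast math. */
float mul(float a, float b) {
    return a * b;
}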
2013 Feb 08
0
[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?
On 8 February 2013 10:40, Sebastien DELDON-GNB <sebastien.deldon at st.com> wrote: > Hi all, > > Everything is in the title, I would like to enforce generation of the vmla.f32 > instruction for scalar operations on cortex-a9, so is there an LLVM NEON > intrinsic available for that? > > Hi Sebastien, LLVM doesn't use intrinsics when there is a
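A sketch of the usual alternative to an intrinsic; whether the backend actually emits vmla.f32 still depends on the selected CPU and its heuristics, as the rest of this thread discusses:

#include <math.h>

/* Plain C multiply-accumulate: the ARM backend can pattern-match this
 * into vmla.f32 (or vfma.f32) when it considers that profitable for the
 * chosen -mcpu; no dedicated LLVM NEON intrinsic is needed for the
 * scalar case. */
float mac(float acc, float a, float b) {
    return acc + a * b;
}

/* If a fused multiply-add is wanted regardless of backend heuristics,
 * the portable spelling is fmaf, lowered to a fused instruction where
 * the target provides one. */
float mac_fused(float acc, float a, float b) {
    return fmaf(a, b, acc);
}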
2013 Jun 07
2
[LLVMdev] NEON vector instructions and the fast math IR flags
On 06/07/2013 06:49 AM, Arnold Schwaighofer wrote: > > On Jun 7, 2013, at 3:14 AM, Renato Golin <renato.golin at linaro.org> wrote: > >> On 7 June 2013 08:48, Tobias Grosser <tobias at grosser.es> wrote: >> When to set which subtarget feature is a policy decision, where I honestly don't have any opinion on for clang. The best is probably to mirror the gcc
2013 Feb 11
0
[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?
On 11 February 2013 15:51, Sebastien DELDON-GNB <sebastien.deldon at st.com> wrote: > Indeed, the problem is with the generation of vmla.f64. The affected benchmark is MILC > from the SPEC 2006 suite, and disabling vmlx forwarding gives a 10% speed-up on > the complete benchmark execution! So it is worth a try. > Hi Sebastien, Indeed, worth having a look. Including Bob Wilson (who introduced the
2013 Feb 11
2
[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?
In theory, the backend should choose the best instructions for the selected target processor. VMLA is not always the best choice. Lang Hames did some measurements a while back to come up with the current behavior, but I don't remember exactly what he found. CC'ing Lang. On Feb 11, 2013, at 8:12 AM, Renato Golin <renato.golin at linaro.org> wrote: > On 11 February 2013 15:51,
2013 Feb 08
2
[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?
Hi all, Everything is in the title, I would like to enforce generation of the vmla.f32 instruction for scalar operations on cortex-a9, so is there an LLVM NEON intrinsic available for that? Thanks for your answers. Best Regards, Seb
2013 Feb 11
0
[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?
Hi Bob, Seb, Renato, My VMLA performance work was on Swift, rather than Cortex-A9. Sebastien - is vmlx-forwarding really the only variable you changed between your tests? As far as I can see the VMLx forwarding attribute only exists to restrict the application of one DAG combine optimization: PerformVMULCombine in ARMISelLowering.cpp, which turns (A + B) * C into (A * C) + (B * C). This
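In source terms, the combine mentioned here rewrites roughly this shape (illustrative C only; the actual transformation happens on the SelectionDAG):

/* Before the combine: one add feeding one multiply. */
float before(float a, float b, float c) {
    return (a + b) * c;
}

/* After the combine: two multiplies feeding an add, a shape that can
 * then be matched as multiply-accumulate (vmla) sequences. Note that
 * distributing the multiply is not exact in IEEE floating point, and,
 * as described above, its application is tied to the vmlx-forwarding
 * subtarget attribute. */
float after(float a, float b, float c) {
    return (a * c) + (b * c);
}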
2013 Feb 11
3
[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?
Hi Renato, Indeed, the problem is with the generation of vmla.f64. The affected benchmark is MILC from the SPEC 2006 suite, and disabling vmlx forwarding gives a 10% speed-up on the complete benchmark execution! So it is worth a try. Now going back to vmla generation through LLVM intrinsic usage. I've looked at the .td file and it seems to me that when there is a "pattern" to generate an instruction, no
2014 Dec 07
3
[LLVMdev] NEON intrinsics preventing redundant load optimization?
Hi all, I’m not sure if this is the right list, so apologies if not. Doing some profiling I noticed some of my hand-tuned matrix multiply code with NEON intrinsics was much slower through a C++ template wrapper vs calling the intrinsics function directly. It turned out clang/LLVM was unable to eliminate a temporary even though the case seemed quite straightforward. Unfortunately any loads
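A reduced sketch of the kind of code being described (hypothetical names; the poster's actual wrapper is not shown in the excerpt), comparing direct intrinsic use with the same operation routed through a thin wrapper type:

#include <arm_neon.h>

/* Direct intrinsics: one load, one multiply, one store. */
void scale_direct(float *dst, const float *src, float s) {
    vst1q_f32(dst, vmulq_n_f32(vld1q_f32(src), s));
}

/* The same operation through a wrapper struct. The wrapper is logically
 * a no-op; the complaint in the thread is that, in the poster's larger
 * example, the temporaries it introduces were not eliminated, leaving
 * redundant loads/stores in the generated code. */
typedef struct { float32x4_t v; } vec4;

static inline vec4 vec4_load(const float *p)    { vec4 r = { vld1q_f32(p) };        return r; }
static inline vec4 vec4_scale(vec4 a, float s)  { vec4 r = { vmulq_n_f32(a.v, s) }; return r; }
static inline void vec4_store(float *p, vec4 a) { vst1q_f32(p, a.v); }

void scale_wrapped(float *dst, const float *src, float s) {
    vec4_store(dst, vec4_scale(vec4_load(src), s));
}

Comparing the -O2 -S output of the two functions is the quickest way to see whether the wrapper's temporaries have been folded away.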
2013 Dec 19
4
[LLVMdev] LLVM ARM VMLA instruction
Hi Tim, > > cortex-a15 vfpv4: vmla instruction emitted (which is a NEON instruction) > > I get a VFP vmla here rather than a NEON one (clang -target > armv7-linux-gnueabihf -mcpu=cortex-a15): "vmla.f32 s0, s1, s2". Are > you seeing something different? > As per Renato's comment above, the vmla instruction is a NEON instruction while vfma is a VFP instruction. Correct
2013 Dec 19
0
[LLVMdev] LLVM ARM VMLA instruction
On 19 December 2013 08:50, suyog sarda <sardask01 at gmail.com> wrote: > It may seem that the total number of cycles is more or less the same for a single > vmla and vmul+vadd. However, when the vmul+vadd combination is used instead of > vmla, then intermediate results will be generated which need to be stored > in memory for future access. This will lead to a lot of load/store ops being
2013 Feb 08
2
[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?
Hi Renato, Thanks for the answer, it confirms what I was suspecting. My problem is that this behavior is controlled by vmlx forwarding on cortex-a9, for which, despite asking on this list, I couldn't get a clear understanding of what this option is meant for. So here are my new questions: Why is vmlx-forwarding enabled by default for cortex-a9? Is it to guarantee correctness or for performance
2013 Dec 19
0
[LLVMdev] LLVM ARM VMLA instruction
> As per Renato's comment above, the vmla instruction is a NEON instruction while vfma is a VFP instruction. Correct me if I am wrong on this. My version of the ARM Architecture Reference Manual (v7 A & R) lists versions requiring NEON and versions requiring VFP (Section A8.8.337), split in just the way you'd expect (SIMD variants need NEON). > It may seem that the total number of cycles is
2013 Dec 19
0
[LLVMdev] LLVM ARM VMLA instruction
> cortex-a8 vfpv4 with -ffp-contract=fast: vfma instruction emitted (this > seems like a bug to me!! If cortex-a8 doesn't come with vfpv4 then the vfma > instructions generated will be invalid) If I'm understanding correctly, you've specifically told it this Cortex-A8 *does* come with vfpv4. Those kinds of odd combinations can be useful sometimes (if only for tests), so I'm not
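For reference, a tiny case that makes the contraction behaviour visible; the compile lines in the comment are examples only, and whether vfma can legally appear depends on the FPU the target really has, which is exactly the combination being questioned here:

/* a*b + c is a candidate for contraction into a single fused multiply-add.
 *
 *   clang -target armv7-linux-gnueabihf -mcpu=cortex-a15 -O2 -S madd.c
 *     -ffp-contract=fast  -> may emit vfma.f32 (requires VFPv4)
 *     -ffp-contract=off   -> keeps a separate vmul.f32 + vadd.f32
 */
float madd(float a, float b, float c) {
    return a * b + c;
}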
2016 Feb 11
4
Vectorization with fast-math on irregular ISA sub-sets
----- Original Message ----- > From: "Renato Golin" <renato.golin at linaro.org> > To: "Hal Finkel" <hfinkel at anl.gov> > Cc: "James Molloy" <James.Molloy at arm.com>, "Nadav Rotem" <nrotem at apple.com>, "Arnold Schwaighofer" > <aschwaighofer at apple.com>, "LLVM Dev" <llvm-dev at