similar to: [LLVMdev] speed up memcpy intrinsic using ARM Neon registers


2009 Nov 10
3
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
On Nov 9, 2009, at 5:59 PM, David Conrad wrote:
> On Nov 9, 2009, at 7:34 PM, Neel Nagar wrote:
>> I tried to speed up Dhrystone on ARM Cortex-A8 by optimizing the memcpy intrinsic. I used the Neon load multiple instruction to move up to 48 bytes at a time. Over 15 scalar instructions collapsed down into these 2 Neon instructions.
Nice. Thanks
2009 Nov 10
0
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
On Nov 9, 2009, at 7:34 PM, Neel Nagar wrote:
> I tried to speed up Dhrystone on ARM Cortex-A8 by optimizing the memcpy intrinsic. I used the Neon load multiple instruction to move up to 48 bytes at a time. Over 15 scalar instructions collapsed down into these 2 Neon instructions.
>
> fldmiad r3, {d0, d1, d2, d3, d4, d5} @ SrcLine dhrystone.c 359
> fstmiad
2009 Nov 11
0
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
On Nov 11, 2009, at 3:27 AM, Rodolph Perfetta wrote:
> If you know about the alignment, maybe use structured load/store (vst1.64/vld1.64 {dn-dm}). You may also want to work on whole cache lines (64 bytes on A8). You can find more in this discussion:
> http://groups.google.com/group/beagleboard/browse_thread/thread/12c7bd415fbc
>
2009 Nov 10
0
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
On Nov 9, 2009, at 11:13 PM, Evan Cheng wrote:
>> On the A8, an ARM store after NEON stores to the same 16-byte block incurs a ~20 cycle penalty since the NEON unit executes behind ARM. It's worse if the NEON store was split across a 16-byte boundary; then there could be a 50 cycle stall.
>>
>> See
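The 16-byte-block condition quoted above can be sketched as plain address arithmetic. This is only an illustrative model, not code from the thread: the helper names (`blocks_touched`, `is_split`, `shares_block`) are hypothetical, and the only figure taken from the message is the 16-byte block size. It says nothing about the actual cycle counts.

```python
BLOCK = 16  # block size in bytes, as quoted in the message above


def blocks_touched(addr: int, size: int) -> range:
    """Indices of the 16-byte blocks covered by a store of `size` bytes at `addr`."""
    if size <= 0:
        return range(0)
    first = addr // BLOCK
    last = (addr + size - 1) // BLOCK
    return range(first, last + 1)


def is_split(addr: int, size: int) -> bool:
    """True if the store straddles a 16-byte boundary."""
    return len(blocks_touched(addr, size)) > 1


def shares_block(neon_addr: int, neon_size: int, arm_addr: int, arm_size: int) -> bool:
    """True if an ARM store lands in a 16-byte block also written by a NEON store."""
    a = blocks_touched(neon_addr, neon_size)
    b = blocks_touched(arm_addr, arm_size)
    return a.start < b.stop and b.start < a.stop
```

For example, a 16-byte NEON store at 0x1008 covers two blocks, so `is_split(0x1008, 16)` is true, while the same store at 0x1000 stays within one block.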
2013 Jun 07
0
[LLVMdev] NEON vector instructions and the fast math IR flags
On Jun 6, 2013, at 8:35 PM, Tobias Grosser <grosser at google.com> wrote:
> I understand that some users do not require 754 compliant floating point behavior (clang on darwin?), which means they would probably not need this change. However, it should also not hurt them performance-wise as such users would probably set the relevant global fast-math flags to reduce the precision
2013 Jun 07
2
[LLVMdev] NEON vector instructions and the fast math IR flags
Hi, I was recently looking into the translation of LLVM-IR vector instructions to ARM NEON assembly. Specifically, when this is legal to do and when we need to be careful. I attached a very simple test case:
    define <4 x float> @fooP(<4 x float> %A, <4 x float> %B) {
      %C = fmul <4 x float> %A, %B
      ret <4 x float> %C
    }
If fooP is compiled with "llc -march=arm
2013 Jun 07
3
[LLVMdev] NEON vector instructions and the fast math IR flags
On 7 June 2013 07:05, Owen Anderson <resistor at mac.com> wrote:
> Darwin uses NEON for floating point, but does *not* (and should not) globally enable fast math flags. Use of NEON for FP needs to remain achievable without globally setting the fast math flags. Fast math may reasonably imply NEON, but the opposite direction is not accurate.
>
> That said, I
2013 Mar 19
4
[LLVMdev] ARM NEON VMUL.f32 issue
Hi folks, I just "fixed" a bug on ARM LNT regarding lowering of a VMUL.f32 as NEON and not VFP. The former is not IEEE 754 compliant, while the latter is, and that was failing TSVC. The question is:
* is this a problem with the test, that shouldn't be expecting values below FLT_MIN, or
* is it a bug in the lowering, that should only be lowering to NEON's VMUL when unsafe-math
2013 Feb 12
3
[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?
I did the initial work on vmla formation. The default settings for cortex-a8 / a9 are due to micro-architecture differences (I believe the a8 TRM talks about vmla hazards) and extensive testing. That said, given the limitation of the current pre-RA scheduling pass, it's likely the use of vmla can cause regressions. I'm not opposed to changing the setting for a9. However, it's not a good idea to
2013 Jun 07
3
[LLVMdev] NEON vector instructions and the fast math IR flags
On 7 June 2013 08:48, Tobias Grosser <tobias at grosser.es> wrote:
> When to set which subtarget feature is a policy decision, where I honestly don't have any opinion on for clang. The best is probably to mirror the gcc behavior on linux targets.
Not really, since GCC has no special behaviour for Darwin, AFAIK. My change will only generate SP-FP on NEON for A5 and A8
2013 Mar 19
0
[LLVMdev] ARM NEON VMUL.f32 issue
Hi Renato, You're right. Strictly speaking, using NEON for scalar floating point isn't completely safe for exactly this reason (also NaNs, IIRC). We generally do it anyway because on common cores (cortex-a8) VFP is pretty terrible and the NEON approximation is correct for the vast majority of use-cases that people care about. Yes, that's cutting some corners. Would you mind making
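The FLT_MIN point in this thread can be illustrated with a small model. It is a sketch under a stated assumption: that the non-compliance in question is flush-to-zero handling of denormals (values with magnitude below FLT_MIN), which is how ARMv7 NEON treats single-precision arithmetic. The function names are hypothetical, the code emulates float32 in software rather than running NEON or VFP instructions, and it does not handle overflow or the NaN differences mentioned above.

```python
import math
import struct

FLT_MIN = 2.0 ** -126  # smallest positive normal float32


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def flush(x: float) -> float:
    """Replace a denormal float32 value with a zero of the same sign."""
    if x != 0.0 and abs(x) < FLT_MIN:
        return math.copysign(0.0, x)
    return x


def mul_ieee(a: float, b: float) -> float:
    """float32 multiply that keeps denormal results (IEEE 754 behaviour)."""
    return to_f32(to_f32(a) * to_f32(b))


def mul_flush_to_zero(a: float, b: float) -> float:
    """float32 multiply that flushes denormal inputs and results to zero."""
    a = flush(to_f32(a))
    b = flush(to_f32(b))
    return flush(to_f32(a * b))
```

With `a = 2.0 ** -100` and `b = 2.0 ** -30`, the exact product `2.0 ** -130` is a float32 denormal: `mul_ieee` returns it unchanged, while `mul_flush_to_zero` returns zero. A test that expects values below FLT_MIN would therefore see different results from the two models.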
2013 Jun 07
0
[LLVMdev] NEON vector instructions and the fast math IR flags
On 06/06/2013 11:58 PM, Renato Golin wrote:
> On 7 June 2013 07:05, Owen Anderson <resistor at mac.com> wrote:
Hi Owen, hi Renato, thanks for your replies.
>> Darwin uses NEON for floating point, but does *not* (and should not) globally enable fast math flags. Use of NEON for FP needs to remain achievable without globally setting the fast math flags. Fast
2014 Nov 24
2
[RFC PATCHv1] cover: celt_pitch_xcorr: Introduce ARM neon intrinsics
>> >> a. Simplest use case to validate this optimization for correctness.
>> >> b. Simplest use case to validate this optimization for performance.
>> >>
>> >> Would prefer something like opusdec that can be executed on command line.
>> >
>> > The easiest thing to use is probably opus_demo (opusdec
2014 Nov 24
3
[RFC PATCHv1] cover: celt_pitch_xcorr: Introduce ARM neon intrinsics
On 21 November 2014 at 18:06, Timothy B. Terriberry <tterribe at xiph.org> wrote:
> Viswanath Puttagunta wrote:
>> a. Simplest use case to validate this optimization for correctness.
>> b. Simplest use case to validate this optimization for performance.
>>
>> Would prefer something like opusdec that can be executed on command line.
>
2014 Dec 19
3
[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Viswanath Puttagunta wrote:
> I responded to your feedback before I started on RFCv3.. and took your silence as approval :).. I guess that email got lost in your inbox sea some where.. so re-posting the responses.
Sorry, I did see it but I guess I read it rather more quickly than I thought. Apologies for that.
> guidance. I wouldn't know where else to put this. Without
2013 Jun 07
0
[LLVMdev] NEON vector instructions and the fast math IR flags
On Jun 7, 2013, at 3:14 AM, Renato Golin <renato.golin at linaro.org> wrote:
> On 7 June 2013 08:48, Tobias Grosser <tobias at grosser.es> wrote:
> When to set which subtarget feature is a policy decision, where I honestly don't have any opinion on for clang. The best is probably to mirror the gcc behavior on linux targets.
>
> Not really, since GCC has no special
2009 Jul 03
4
[LLVMdev] llvm-gcc cross compiler for ARM Linux failing
I suspect that my llvm-gcc cross compiler is using the wrong assembler because it does not recognize "-mcpu=cortex-a8". I was trying to build a cross compiler for a Mac host. Now I am trying to build on x86_64 Linux. I am targeting a Beagle board with an ARM Cortex-A8 and Angstrom Linux. TRIED: to use the script in llvm/utils/crosstool/ARM/build-install-linux.sh I used the recommended
2014 Dec 18
2
[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Almost there... just a few nits left.
Viswanath Puttagunta wrote:
> +if OPUS_ARM_NEON_INTR
> +CELT_SOURCES += $(CELT_SOURCES_ARM_NEON_INTR)
> +OPUS_ARM_NEON_INTR_CPPFLAGS = -mfpu=neon -O3
I'll repeat: I don't think you should change the optimization level here.
> + /* Just unroll the rest of the loop */
I saw you decided to keep this unrolled, but you didn't actually
2013 Feb 08
2
[LLVMdev] Is there any llvm neon intrinsic that maps to vmla.f32 instruction ?
Hi all, Everything is in the title: I would like to enforce generation of the vmla.f32 instruction for scalar operations on cortex-a9, so is there an LLVM NEON intrinsic available for that? Thanks for your answers. Best Regards, Seb
2012 Jun 25
2
[LLVMdev] RE : Is llc broken for Cortex-A9 + neon ?
Hi Anton, You're right, it fails with a different message with llc 3.0. Anyway, thanks for your help. Best Regards, Seb
> -----Original Message-----
> From: Anton Korobeynikov [mailto:anton at korobeynikov.info]
> Sent: Monday, June 25, 2012 3:39 PM
> To: Sebastien DELDON-GNB
> Cc: LLVMdev at cs.uiuc.edu; Rotem, Nadav
> Subject: Re: RE : [LLVMdev] Is llc broken for Cortex-A9