Displaying 20 results from an estimated 1000 matches similar to: "[LLVMdev] NEON intrinsics preventing redundant load optimization?"

2010 Nov 12 (2 replies)
[LLVMdev] Simple NEON optimization
Hi folks, me again. So, I want to implement a simple optimization in a NEON case I've seen these days, mostly as a matter of exercise, but it also simplifies (just a bit) the code generated. The case is simple: uint32x2_t x, res; res = vceq_u32(x, vcreate_u32(0)); This will generate the following code: ; zero d16 vmov.i32 d16, #0x0 ; load a
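A minimal compilable version of the pattern quoted above, assuming a plain armv7/NEON target (the wrapper function name is illustrative, not taken from the thread):

    #include <arm_neon.h>

    /* Compare a 64-bit vector against zero, as in the thread; the question
       raised is whether the separate "vmov.i32 dN, #0" for the zero operand
       can be folded into the compare itself. */
    uint32x2_t is_zero(uint32x2_t x)
    {
        return vceq_u32(x, vcreate_u32(0));
    }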
2012 Jul 05 (2 replies)
[LLVMdev] RE : Vector argument passing abi for ARM ?
Hi Duncan, I also thought it was a bug, especially since it worked with LLVM 3.0, but since it is not defined by the ABI, I was not sure whether I needed to submit it as a BUG. I wanted to be sure that it is an actual BUG before submitting it and got the not-a-bug answer. Here is a small example to reproduce the problem I'm experiencing: ; ModuleID = 'bugparam.ll' target datalayout =
2012 Jul 05 (0 replies)
[LLVMdev] RE : Vector argument passing abi for ARM ?
Hi Sebastien, > I also thought it was a bug, especially since it worked with LLVM 3.0, but since it is not defined by ABI, I was not sure if I need to submit it as a BUG. yes it is a bug. > I wanted to be sure that it is an actual BUG before submitting it and got the not-a-bug answer. I didn't read Nadav's reply as saying there was no bug, in fact he explicitly said in his email
2013 Feb 04 (6 replies)
[LLVMdev] Vectorizer using Instruction, not opcodes
On 4 February 2013 18:25, Arnold Schwaighofer <aschwaighofer at apple.com>wrote: > For cases where this approach breaks really badly we could consider adding > a specialized api or parameters (like the type of a user/use). But we > should do so only as a last resort and backed by actual code that would > benefit from doing so. > Very sensible, more or less what I had in
2015 Jan 05 (2 replies)
[LLVMdev] NEON intrinsics preventing redundant load optimization?
On 4 Jan 2015, at 21:06, Tim Northover <t.p.northover at gmail.com> wrote: >>> I’ve managed to replace the load/store intrinsics with pointer dereferences (along with a typedef to get the alignment correct). This generates 100% the same IR + asm as the auto-vectorized C version (both using -O3), and works with the toolchain in the latest XCode. Are there any concerns around doing
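For context, a sketch of the pointer-dereference replacement being described, under the assumption that the data is only 4-byte aligned (the typedef and function names are illustrative, not from the thread):

    #include <arm_neon.h>

    /* A vector type whose alignment matches what a plain float* guarantees. */
    typedef float32x4_t unaligned_f32x4 __attribute__((aligned(4)));

    void scale4(float *dst, const float *src, float32x4_t k)
    {
        float32x4_t v = *(const unaligned_f32x4 *)src;  /* instead of vld1q_f32(src) */
        *(unaligned_f32x4 *)dst = vmulq_f32(v, k);      /* instead of vst1q_f32(dst, ...) */
    }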
2014 Dec 10 (2 replies)
[LLVMdev] NEON intrinsics preventing redundant load optimization?
On 9 Dec 2014, at 02:20, Jim Grosbach <grosbach at apple.com> wrote: >> On Dec 8, 2014, at 1:05 AM, Simon Taylor <simontaylor1 at ntlworld.com> wrote: >> >> On 8 Dec 2014, at 00:13, Renato Golin <renato.golin at linaro.org> wrote: >> >>> On 7 December 2014 at 19:15, Simon Taylor <simontaylor1 at ntlworld.com> wrote: >>>> Is
2014 Dec 08 (2 replies)
[LLVMdev] NEON intrinsics preventing redundant load optimization?
On 8 Dec 2014, at 00:13, Renato Golin <renato.golin at linaro.org> wrote: > On 7 December 2014 at 19:15, Simon Taylor <simontaylor1 at ntlworld.com> wrote: >> Is there something about the use of intrinsics that prevents the compiler optimizing out the redundant store on the stack? Is there any hope for this improving in the future, or anything I can do now to improve the
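A hedged guess at the general shape of the code under discussion: a small vector wrapper whose operations go through vld1q/vst1q, so results pass through a temporary struct on the stack (names and layout are illustrative only):

    #include <arm_neon.h>

    typedef struct { float data[4]; } vec4;

    vec4 vec4_mul(vec4 a, vec4 b)
    {
        vec4 r;
        float32x4_t va = vld1q_f32(a.data);
        float32x4_t vb = vld1q_f32(b.data);
        /* The store into r.data followed by an immediate reload at the call
           site is the "redundant load" the subject line refers to. */
        vst1q_f32(r.data, vmulq_f32(va, vb));
        return r;
    }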
2015 Jan 05 (4 replies)
[LLVMdev] NEON intrinsics preventing redundant load optimization?
Hi all, Sorry for arriving late to the party. First, some context: vld1 is not the same as a pointer dereference. The alignment requirements are different (which I saw you hacked around in your testcase using attribute((aligned(4))) ), and in big-endian environments they do totally different things (VLD1 does element-wise byteswapping and a pointer dereference byteswaps the entire 128-bit
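The two forms being contrasted, as a sketch: on little-endian targets both load the same bytes, while on big-endian targets the element-wise VLD1 and the whole-vector dereference differ as described above (the typedef is an illustrative assumption):

    #include <arm_neon.h>

    uint32x4_t load_with_intrinsic(const uint32_t *p)
    {
        return vld1q_u32(p);             /* element-wise load, element alignment */
    }

    typedef uint32x4_t u32x4_align4 __attribute__((aligned(4)));

    uint32x4_t load_with_deref(const uint32_t *p)
    {
        return *(const u32x4_align4 *)p; /* whole 128-bit vector load */
    }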
2013 Feb 04 (0 replies)
[LLVMdev] Vectorizer using Instruction, not opcodes
Hi all, My take on this is that, as you state below, at the IR level we are only roughly estimating cost, at best (or we would have to lower the code and then estimate cost - something we don't want to do). I would propose estimating the "worst case costs" and see how far we get with this. My rationale here is that we don't want vectorization to decrease performance relative
2011 Sep 01 (6 replies)
[PATCH 0/5] ARM NEON optimization for samplerate converter
From: Jyri Sarha <jsarha at ti.com> I optimized the Speex resampler for NEON-capable ARM CPUs. The first patch should speed up resampling on any platform that can spare the increased memory usage. It would be nice to have these merged to the master branch. Please let me know if there is anything I can do to help with the merge. The patches have been rebased on top of the master branch in
2010 Nov 12 (0 replies)
[LLVMdev] Simple NEON optimization
On Nov 12, 2010, at 7:23 AM, Renato Golin wrote: > Hi folks, me again, > > So, I want to implement a simple optimization in a NEON case I've seen > these days, mostly as a matter of exercise, but it also simplifies (just > a bit) the code generated. > > The case is simple: > > uint32x2_t x, res; > res = vceq_u32(x, vcreate_u32(0)); > > This
2013 Feb 04 (2 replies)
[LLVMdev] Vectorizer using Instruction, not opcodes
Hi folks, I've been thinking about how to implement some of the costs, and there are a lot of instructions whose cost depends on other instructions around them. Casts are one obvious case, since arithmetic and memory instructions can, sometimes, cast values for free. The cost model receives Opcodes, which lose the info on the history of the values being vectorized, and I thought we could pass the whole
2012 Sep 05 (0 replies)
[LLVMdev] Unaligned vector memory access for ARM/NEON.
Hmmm. Well, it's entirely possible that it's LLVM that's confused about the alignment requirements here. :) I think I see, in general, where. I twiddled the IR to give it higher alignment (16 bytes) and get: extend: @ @extend @ BB#0: vldr d16, [r0] vmovl.s16 q8, d16 vstmia r1, {d16, d17} vldr d16, [r0, #8] add r0, r1, #16 vmovl.s16 q8, d16 vstmia
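A guess at the kind of C source behind the "extend" assembly quoted above: widening signed 16-bit elements to 32 bits (the function and parameter names are assumptions, not from the thread):

    #include <arm_neon.h>

    void extend(const int16_t *src, int32_t *dst)
    {
        int16x4_t lo = vld1_s16(src);       /* vldr/vld1 d16, [r0] */
        int16x4_t hi = vld1_s16(src + 4);   /* vld1 d16, [r0, #8] */
        vst1q_s32(dst,     vmovl_s16(lo));  /* vmovl.s16 q8, d16; store q8 */
        vst1q_s32(dst + 4, vmovl_s16(hi));
    }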
2013 Feb 04 (0 replies)
[LLVMdev] Vectorizer using Instruction, not opcodes
----- Original Message ----- > From: "Renato Golin" <renato.golin at linaro.org> > To: "Arnold Schwaighofer" <aschwaighofer at apple.com> > Cc: "LLVM Dev" <llvmdev at cs.uiuc.edu>, "Nadav Rotem" <nrotem at apple.com>, "Hal Finkel" <hfinkel at anl.gov> > Sent: Monday, February 4, 2013 1:38:03 PM >
2012 Jul 05 (0 replies)
[LLVMdev] Vector argument passing abi for ARM ?
Hi Sebastien, > Thanks for the quick answer, how do I know which type is legal/illegal with respect to calling convention ? the code generators are supposed to produce working code no matter what the parameter type is. The fact that the ARM ABI doesn't specify how <2 x i8> is passed just means that the code generators can pass it using whatever technique they feel like (since it
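For reference, a <2 x i8> parameter can be written in C with the GCC/Clang vector_size extension; this is only a sketch of the kind of type the thread is about, not the poster's actual code:

    /* Two-byte vector of unsigned chars, i.e. <2 x i8> in IR terms. */
    typedef unsigned char u8x2 __attribute__((vector_size(2)));

    u8x2 pass_through(u8x2 v)
    {
        /* How v crosses the call boundary is not pinned down by the ARM ABI,
           so the backend may choose its own scheme, as noted above. */
        return v;
    }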
2013 Feb 04 (0 replies)
[LLVMdev] Vectorizer using Instruction, not opcodes
The loop vectorizer does not estimate the cost of vectorization by looking at the IR you list below. It does not vectorize and then run the CostAnalysis pass. It estimates the cost itself before it even performs the vectorization. The way it works is that it looks at all the scalar instructions and asks: what is the cost if I execute the scalar instruction as a vector instruction? Therefore, it
2012 Sep 06 (2 replies)
[LLVMdev] Unaligned vector memory access for ARM/NEON.
Hello, Thanks again. We did try overestimating the alignment, and saw the vldr you reference here. It looks like a recent change (r161962?) did enable vld1 generation for this case (great!) on Darwin, but not Linux. I'm not sure whether the lowering of load <4 x i16>* align 2 to vld1.16 was an intentional effect of this change or not. If so, my question is what is the preferable way to
2014 Nov 09 (3 replies)
[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics
Hello, This patch introduces ARM NEON intrinsics to optimize the kf_bfly4 routine in the CELT part of libopus. Using the NEON-optimized kf_bfly4(_neon) routine helped improve the performance of the opus_fft_impl function by about 21.4%. The end use case was decoding a music Opus Ogg file, which saw a performance improvement of about 4.47%. This patch has 2 components: i. Actual NEON code to improve
2012 Jul 05 (3 replies)
[LLVMdev] Vector argument passing abi for ARM ?
Hi Rotem, Thanks for the quick answer, how do I know which type is legal/illegal with respect to calling convention ? Best Regards Seb > -----Original Message----- > From: Rotem, Nadav [mailto:nadav.rotem at intel.com] > Sent: Thursday, July 05, 2012 11:21 AM > To: Sebastien DELDON-GNB; llvmdev at cs.uiuc.edu > Subject: RE: Vector argument passing abi for ARM ? > > The
2014 Mar 10 (4 replies)
[LLVMdev] neon registers llvm using
Hi, everyone: Can anyone let me know the default NEON registers LLVM is going to use with armv7 devices? For example, are d10 and d11 treated as default zero? I am using Xcode 5 + LLVM and I have a case where the compiler generates the NEON code " vst.8 {d10, d11}, [r1] " from the C code: "int aMV[4]; ...... aMV[0] = aMV[1] = aMV[2] = aMV[3] = 0; " and I
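A minimal reproduction of the quoted snippet. Whether the compiler zeroes the array with a NEON store, and which d registers it picks, depends on the target and on register allocation rather than on fixed defaults; the function name here is illustrative:

    void zero_mv(int aMV[4])
    {
        /* At -O2/-O3 on armv7 this may be compiled to a single 128-bit NEON
           store of a zeroed register pair, like the quoted "vst.8 {d10, d11}, [r1]". */
        aMV[0] = aMV[1] = aMV[2] = aMV[3] = 0;
    }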