thr3ads.net - search: "vld1.64"

Displaying 17 results from an estimated 17 matches for "vld1.64".

[LLVMdev] Question about LLVM NEON intrinsics

2012 Sep 21

On 21 September 2012 09:28, Sebastien DELDON-GNB <sebastien.deldon at st.com> wrote: > declare <16 x float> @llvm.arm.neon.vmaxs.v16f32(<16 x float>, <16 x float>) nounwind readnone > > llc fails with following message: > > SplitVectorResult #0: 0x2258350: v16f32 = llvm.arm.neon.vmaxs 0x2258250, 0x2258050, 0x2258150 [ORD=3] [ID=0] > > LLVM ERROR: Do not

[LLVMdev] Question about LLVM NEON intrinsics

2012 Sep 21

Hi all, I would like to know if LLVM Neon intrinsics are designed to support only 'Legal' types for NEON units. Using llc -march=arm -mcpu=cortex-a9 vmax4.ll -o vmax4.s on following ll code: ; ModuleID = 'vmax.ll' target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-n32" target triple =

[LLVMdev] RE : Question about LLVM NEON intrinsics

2012 Sep 21

Hello Renato, You're pointing me at ARM intrinsics related to loads. The problem that I reported in my original e-mail is not support for vector loads, but support for 'vmaxs'. For instance, there is no vector load of 16 floats in the ARM ISA, but it is legal to write in LLVM: ; ModuleID = 'vadd.ll' target datalayout =

[LLVMdev] Question about LLVM NEON intrinsics

2012 Sep 21

On Fri, Sep 21, 2012 at 1:28 AM, Sebastien DELDON-GNB <sebastien.deldon at st.com> wrote: > Hi all, > > I would like to know if LLVM Neon intrinsics are designed to support only 'Legal' types for NEON units. > Using llc -march=arm -mcpu=cortex-a9 vmax4.ll -o vmax4.s on following ll code: > > > ; ModuleID = 'vmax.ll' > target datalayout =

[LLVMdev] RE : Question about LLVM NEON intrinsics

2012 Sep 21

Hi Eli, Thanks for the answer, it clarifies the situation for me. Do you know if there is Pass in LLVM that could be adapted to 'legalize' intrinsics calls ? Or shall I define my own intrinsics for non supported types ? Best Regards Seb ________________________________________ From: Eli Friedman [eli.friedman at gmail.com] Sent: Friday, 21 September 2012 11:54 To: Sebastien

[LLVMdev] Question about LLVM NEON intrinsics

2012 Sep 21

On Sep 21, 2012, at 2:58 AM, Sebastien DELDON-GNB <sebastien.deldon at st.com> wrote: > Hi Eli, > > Thanks for the answer, it clarifies the situation for me. Do you know if there is Pass in LLVM that could be adapted to 'legalize' intrinsics calls ? > Or shall I define my own intrinsics for non supported types ? You should never generate these sorts of intrinsics with

[LLVMdev] Registers and Register Units

2012 May 31

You may have noticed Andy and me committing TableGen patches for "register units". I thought I'd better explain what they are. Some targets have instructions that operate on sequences of registers. I'll use ARM examples because it is the most notorious. ARM has, for example: vld1.64 {d1, d2}, [r0] The instruction loads two d-registers, but they must be consecutive. ARM also
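The overlap problem behind register units can be illustrated with a small model. This is only a sketch, not TableGen's actual data structure: it assumes the ARM layout in which d0-d15 each cover two consecutive s-registers and q0-q7 each cover four, and it treats the s-registers as the units. The names units, overlaps and list_units are hypothetical helpers introduced for illustration.

```python
# Simplified model (assumption): the "register units" are the S registers
# s0..s31; d0..d15 cover two units each and q0..q7 cover four units each.
# d16..d31 and q8..q15 are deliberately left out of this sketch.
_LIMIT = {"s": 32, "d": 16, "q": 8}
_WIDTH = {"s": 1, "d": 2, "q": 4}


def units(reg):
    """Return the set of unit indices covered by a register name like 'd1'."""
    kind, n = reg[0], int(reg[1:])
    if kind not in _WIDTH or not 0 <= n < _LIMIT[kind]:
        raise ValueError(f"register not covered by this sketch: {reg}")
    width = _WIDTH[kind]
    return set(range(n * width, (n + 1) * width))


def overlaps(a, b):
    """Two registers alias exactly when they share at least one unit."""
    return not units(a).isdisjoint(units(b))


def list_units(regs):
    """Units touched by a register list such as {d1, d2}."""
    covered = set()
    for reg in regs:
        covered |= units(reg)
    return covered


if __name__ == "__main__":
    # The consecutive pair {d1, d2} straddles q0 and q1.
    print(sorted(list_units(["d1", "d2"])))
    print(overlaps("d1", "q0"), overlaps("d2", "q1"), overlaps("d1", "d2"))
```

With this representation, a register list such as {d1, d2} is just a set of units, so checking it against q0 or q1 is a set intersection rather than a walk over alias lists.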

[LLVMdev] speed up memcpy intrinsic using ARM Neon registers

2009 Nov 11

On Nov 11, 2009, at 3:27 AM, Rodolph Perfetta wrote: > > If you know about the alignment, maybe use structured load/store > (vst1.64/vld1.64 {dn-dm}). You may also want to work on whole cache > lines > (64 bytes on A8). You can find more in this discussion: > http://groups.google.com/group/beagleboard/browse_thread/thread/12c7bd415fbc >

[LLVMdev] NEON intrinsics preventing redundant load optimization?

2015 Jan 05

On 4 Jan 2015, at 21:06, Tim Northover <t.p.northover at gmail.com> wrote: >>> I’ve managed to replace the load/store intrinsics with pointer dereferences (along with a typedef to get the alignment correct). This generates 100% the same IR + asm as the auto-vectorized C version (both using -O3), and works with the toolchain in the latest XCode. Are there any concerns around doing

[LLVMdev] speed up memcpy intrinsic using ARM Neon registers

2009 Nov 10

On Nov 9, 2009, at 5:59 PM, David Conrad wrote: > On Nov 9, 2009, at 7:34 PM, Neel Nagar wrote: > >> I tried to speed up Dhrystone on ARM Cortex-A8 by optimizing the >> memcpy intrinsic. I used the Neon load multiple instruction to move >> up >> to 48 bytes at a time . Over 15 scalar instructions collapsed down >> into these 2 Neon instructions. Nice. Thanks

[LLVMdev] Vectorizer using Instruction, not opcodes

2013 Feb 04

On 4 February 2013 18:25, Arnold Schwaighofer <aschwaighofer at apple.com> wrote: > For cases where this approach breaks really badly we could consider adding > a specialized api or parameters (like the type of a user/use). But we > should do so only as a last resort and backed by actual code that would > benefit from doing so. > Very sensible, more or less what I had in

[LLVMdev] NEON intrinsics preventing redundant load optimization?

2015 Jan 05

Hi all, Sorry for arriving late to the party. First, some context: vld1 is not the same as a pointer dereference. The alignment requirements are different (which I saw you hacked around in your testcase using __attribute__((aligned(4)))), and in big-endian environments they do totally different things (VLD1 does element-wise byteswapping, whereas a pointer dereference byteswaps the entire 128-bit
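The difference between the two byteswapping behaviours can be sketched at the byte level. This is a toy model of the two reorderings named in the message, not a description of the actual NEON register layout; the function names swap_elementwise and swap_whole are hypothetical and exist only for this illustration.

```python
def swap_elementwise(data: bytes, elem_size: int) -> bytes:
    """Reverse the bytes inside each element, keeping the element order."""
    if elem_size <= 0 or len(data) % elem_size != 0:
        raise ValueError("data length must be a multiple of elem_size")
    return b"".join(
        data[i:i + elem_size][::-1] for i in range(0, len(data), elem_size)
    )


def swap_whole(data: bytes) -> bytes:
    """Reverse all bytes of the vector as if it were one wide integer."""
    return data[::-1]


def lanes(data: bytes, elem_size: int):
    """Split a byte string into elem_size-byte lanes."""
    return [data[i:i + elem_size] for i in range(0, len(data), elem_size)]


if __name__ == "__main__":
    mem = bytes(range(16))  # 128 bits of memory, bytes 0x00..0x0f
    per_element = swap_elementwise(mem, 4)  # four 32-bit elements
    whole = swap_whole(mem)
    print(per_element.hex())
    print(whole.hex())
    # The whole-vector swap equals the element-wise swap with lane order reversed.
    print(lanes(whole, 4) == lanes(per_element, 4)[::-1])
```

In this model the two results differ only in lane order, which is why the two kinds of load agree when the element is as wide as the whole vector and disagree otherwise.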

[LLVMdev] speed up memcpy intrinsic using ARM Neon registers

2009 Nov 10

I tried to speed up Dhrystone on ARM Cortex-A8 by optimizing the memcpy intrinsic. I used the Neon load multiple instruction to move up to 48 bytes at a time. Over 15 scalar instructions collapsed down into these 2 Neon instructions. fldmiad r3, {d0, d1, d2, d3, d4, d5} @ SrcLine dhrystone.c 359 fstmiad r1, {d0, d1, d2, d3, d4, d5} It seems like this should be faster. But I did

[LLVMdev] Vectorizer using Instruction, not opcodes

2013 Feb 04

Hi all, My take on this is that, as you state below, at the IR level we are only roughly estimating cost, at best (or we would have to lower the code and then estimate cost - something we don't want to do). I would propose estimating the "worst case costs" and seeing how far we get with this. My rationale here is that we don't want vectorization to decrease performance relative

[LLVMdev] Vectorizer using Instruction, not opcodes

2013 Feb 04

----- Original Message ----- > From: "Renato Golin" <renato.golin at linaro.org> > To: "Arnold Schwaighofer" <aschwaighofer at apple.com> > Cc: "LLVM Dev" <llvmdev at cs.uiuc.edu>, "Nadav Rotem" <nrotem at apple.com>, "Hal Finkel" <hfinkel at anl.gov> > Sent: Monday, February 4, 2013 1:38:03 PM >

[LLVMdev] Vectorizer using Instruction, not opcodes

2013 Feb 04

Hi folks, I've been thinking about how to implement some of the costs, and there are a lot of instructions whose cost depends on the instructions around them. Casts are one obvious case, since arithmetic and memory instructions can, sometimes, cast values for free. The cost model receives Opcodes, which lose the info on the history of the values being vectorized, and I thought we could pass the whole

[LLVMdev] Vectorizer using Instruction, not opcodes

2013 Feb 04

The loop vectorizer does not estimate the cost of vectorization by looking at the IR you list below. It does not vectorize and then run the CostAnalysis pass. It estimates the cost itself before it even performs the vectorization. The way it works is that it looks at all the scalar instructions and asks: What is the cost if I execute the scalar instruction as a vector instruction? Therefore, it
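The estimation style described in this message can be sketched as a per-opcode cost lookup. This is a toy model, not LLVM's cost model: the cost table, the opcode names and the helpers body_cost, cost_per_scalar_iteration and pick_vf are all assumptions made up for illustration. It walks the scalar loop body, asks what each instruction would cost at a given vectorization factor (VF), and compares the totals per original iteration.

```python
# Hypothetical cost table keyed by (opcode, vectorization factor).
# The numbers are invented for this sketch and do not describe any real target.
HYPOTHETICAL_COST = {
    ("load", 1): 1, ("add", 1): 1, ("mul", 1): 1, ("store", 1): 1,
    ("load", 4): 1, ("add", 4): 1, ("mul", 4): 2, ("store", 4): 1,
    ("load", 8): 2, ("add", 8): 2, ("mul", 8): 8, ("store", 8): 2,
}


def body_cost(ops, vf):
    """Sum the cost of each scalar opcode if it were executed at width vf."""
    return sum(HYPOTHETICAL_COST[(op, vf)] for op in ops)


def cost_per_scalar_iteration(ops, vf):
    """One vector iteration covers vf scalar iterations, so divide by vf."""
    return body_cost(ops, vf) / vf


def pick_vf(ops, candidates=(1, 4, 8)):
    """Choose the candidate width with the lowest cost per scalar iteration."""
    return min(candidates, key=lambda vf: cost_per_scalar_iteration(ops, vf))


if __name__ == "__main__":
    loop_body = ["load", "load", "mul", "add", "store"]  # a[i] = b[i] * c[i] + k
    for vf in (1, 4, 8):
        print(vf, cost_per_scalar_iteration(loop_body, vf))
    print("chosen VF:", pick_vf(loop_body))
```

Because the lookup is keyed only by opcode and width, this sketch has the limitation raised elsewhere in the thread: it cannot tell when a neighbouring instruction would make a cast or similar operation free.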
