thr3ads.net - search: "d17"

2010 Nov 12

2

[LLVMdev] Simple NEON optimization

...days, mostly as a matter of exercise, but it also simplifies (just a bit) the code generated.

The case is simple:

    uint32x2_t x, res;
    res = vceq_u32(x, vcreate_u32(0));

This will generate the following code:

    ; zero d16
    vmov.i32 d16, #0x0
    ; load a into d17
    movw r0, :lower16:a
    movt r0, :upper16:a
    vld1.32 {d17}, [r0]
    ; compare two registers
    vceq.i32 d17, d17, d16

But, because the vector is zero, and there is a NEON instruction to compare against an immediate zero (VCEQZ), we could combine the two in...
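As a rough illustration of why the fold is safe, here is a scalar model of the two forms in Python. It assumes only the documented lane semantics of the compare (all ones where the lanes are equal, zero otherwise). The helper names `vceq_u32_model` and `vceqz_u32_model` are hypothetical and are not NEON or LLVM APIs.

```python
# Scalar model of a 2-lane, 32-bit NEON compare. Helper names are illustrative only.
MASK32 = 0xFFFFFFFF

def vceq_u32_model(a, b):
    """Lane-wise equality: all ones where a[i] == b[i], otherwise zero."""
    return [MASK32 if (x & MASK32) == (y & MASK32) else 0 for x, y in zip(a, b)]

def vceqz_u32_model(a):
    """Compare each lane against an immediate zero; no zero vector is needed."""
    return [MASK32 if (x & MASK32) == 0 else 0 for x in a]

x = [0, 7]
zero = [0, 0]  # what "vmov.i32 d16, #0x0" materializes in the listing
assert vceq_u32_model(x, zero) == vceqz_u32_model(x)
print([hex(v) for v in vceqz_u32_model(x)])  # ['0xffffffff', '0x0']
```

Both forms give the same lanes for any input, so the zero vector and the register holding it are only overhead.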

[LLVMdev] MI scheduler produce badly code with inline function

2013 Oct 15

0

On Oct 14, 2013, at 3:27 AM, Zakk <zakk0610 at gmail.com> wrote:

> Hi all,
> I hit this problem when compiling the STREAM benchmark (http://www.cs.virginia.edu/stream/FTP/Code/) with enable-misched.
>
> The small function is scheduled as good code, but if opt inlines this function, the inlined part is scheduled as bad code.

A bug for this is welcome. Pretty soon, I’ll

[LLVMdev] MI scheduler produce badly code with inline function

2013 Oct 14

2

Hi all,

I hit this problem when compiling the STREAM benchmark (http://www.cs.virginia.edu/stream/FTP/Code/) with enable-misched.

The small function is scheduled as good code, but if opt inlines this function, the inlined part is scheduled as bad code. So I rewrote it as a simple example at the attached link (foo.c) and compiled it with two different methods:

*method A:*
*$clang -O3 foo.c -static -S

[LLVMdev] MI scheduler produce badly code with inline function

2013 Oct 21

1

Hi Andy,

I'm working on defining a new machine model for my target, but I don't understand how to define an in-order machine (reservation tables) in the new model. For example, if the target has IF, ID, EX, and WB stages, should I do:

    let BufferSize=0 in {
    def IF: ProcResource<1>;
    def ID: ProcResource<1>;
    def EX: ProcResource<1>;
    def WB: ProcResource<1>;
    }
    def :

[LLVMdev] NEON intrinsics preventing redundant load optimization?

2014 Dec 07

3

    ...0; i < 4; ++i)
            result.data[i] = a.data[i] * b.data[i];
        return result;
    }

    void TestVec4Multiply(vec4& a, vec4& b, vec4& result)
    {
        result = a * b;
    }

With -O3 the loop gets vectorized and the code generated looks optimal:

    __Z16TestVec4MultiplyR4vec4S0_S0_:
    @ BB#0:
        vld1.32 {d16, d17}, [r1]
        vld1.32 {d18, d19}, [r0]
        vmul.f32 q8, q9, q8
        vst1.32 {d16, d17}, [r2]
        bx lr

However if I replace the operator* with a NEON intrinsic implementation (I know the vectorizer figured out optimal code in this case anyway, but that wasn't true for my real situation) then the temporary "...

[LLVMdev] Simple NEON optimization

2010 Nov 12

0

...ies (just
> a bit) the code generated.
>
> The case is simple:
>
> uint32x2_t x, res;
> res = vceq_u32(x, vcreate_u32(0));
>
> This will generate the following code:
>
> ; zero d16
> vmov.i32 d16, #0x0
> ; load a into d17
> movw r0, :lower16:a
> movt r0, :upper16:a
> vld1.32 {d17}, [r0]
> ; compare two registers
> vceq.i32 d17, d17, d16
>
> But, because the vector is zero, and there is a NEON instruction to
> compare against an immediate zero (...

[LLVMdev] MI scheduler produce badly code with inline function

2013 Oct 16

3

...ed -mllvm -scheditins=false per-operand cost model :

    Scale:
        push {lr}
        movw r12, :lower16:c
        movw lr, :lower16:b
        movw r3, #9216
        movt r12, :upper16:c
        mov r1, #0
        vmov.f64 d16, #3.000000e+00
        movt lr, :upper16:b
        movt r3, #244
    .LBB0_1:
        add r0, r12, r1
        add r2, lr, r1
        *vldr d17, [r0]*
        add r1, r1, #32
        vmul.f64 d17, d17, d16
        cmp r1, r3
        vstr d17, [r2]
        *vldr d17, [r0, #8]*
        vmul.f64 d17, d17, d16
        vstr d17, [r2, #8]
        *vldr d17, [r0, #16]*
        vmul.f64 d17, d17, d16
        vstr d17, [r2, #16]
        *vldr d17, [r0, #24]*
        vmul.f64 d17, d17, d16
        vstr d17, [r2,...
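Reading the listing, the loop appears to be the STREAM Scale kernel unrolled by four: each trip loads a double from `c`, multiplies it by 3.0, and stores it to `b`, advancing 32 bytes. Below is a minimal Python sketch of that shape. The function name `scale_unrolled` is hypothetical, and the loop bound is simply decoded from the `movw`/`movt` pair in the listing.

```python
# Loop bound built by "movw r3, #9216" followed by "movt r3, #244" in the listing.
LOOP_BYTES = (244 << 16) | 9216   # 16_000_000 bytes
ELEMENTS = LOOP_BYTES // 8        # 2_000_000 doubles of 8 bytes each

def scale_unrolled(b, c, scalar=3.0):
    """b[j] = scalar * c[j], four elements (32 bytes) per trip.

    Assumes len(b) == len(c) and that the length is a multiple of 4.
    """
    for j in range(0, len(c), 4):
        b[j] = scalar * c[j]          # vldr [r0]      ... vstr [r2]
        b[j + 1] = scalar * c[j + 1]  # vldr [r0, #8]  ... vstr [r2, #8]
        b[j + 2] = scalar * c[j + 2]  # vldr [r0, #16] ... vstr [r2, #16]
        b[j + 3] = scalar * c[j + 3]  # vldr [r0, #24]; the matching store is cut off in the listing
    return b

c = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [0.0] * len(c)
print(scale_unrolled(b, c))
```

In the listing every load, multiply, and store goes through the same register, d17, so each of the four steps has to finish before the next one can start.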

nested for loops

2010 Jul 05

2

...or your time and consideration.

for(d1 in 0:n){
for(d2 in 0:n){
for(d3 in 0:n){
for(d4 in 0:n){
for(d5 in 0:n){
for(d6 in 0:n){
for(d7 in 0:n){
for(d8 in 0:n){
for(d9 in 0:n){
for(d10 in 0:n){
for(d11 in 0:n){
for(d12 in 0:n){
for(d13 in 0:n){
for(d14 in 0:n){
for(d15 in 0:n){
for(d16 in 0:n){
for(d17 in 0:n){
for(d18 in 0:n){
for(d19 in 0:n){
for(d20 in 0:n){
list=c(d1,d2,d3,d4,d5,d6,d7,d8,d9,d10,d11,d12,d13,d14,d15,d16,d17,d18,d19,d20)
}}}}}}}}}}}}}}}}}}}}

[[alternative HTML version deleted]]
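The twenty nested loops enumerate every tuple with entries in 0..n, which is (n + 1)^20 tuples in total. As a sketch of the same enumeration without hand-written nesting, here is a Python version (not the poster's R) using the standard `itertools.product`. The function name `all_combinations` and the `depth` parameter are illustrative.

```python
from itertools import product

def all_combinations(n, depth=20):
    """Yield every tuple (d1, ..., d_depth) with each entry in 0..n.

    The order matches the nested loops: the last index varies fastest.
    """
    return product(range(n + 1), repeat=depth)

# Small demo with depth 3 instead of 20 so the output stays readable.
combos = list(all_combinations(1, depth=3))
print(len(combos))  # 8, i.e. (n + 1) ** depth
print(combos[:3])   # [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
```

Because `product` is lazy, the tuples can be consumed one at a time. The count still grows as (n + 1)^20, so a full enumeration is only practical for very small n.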

[LLVMdev] Question about LLVM NEON intrinsics

2012 Sep 21

5

    ...store <4 x float> %tmp3, <4 x float>* %C
      ret void
    }

    declare <4 x float> @llvm.arm.neon.vmaxs.v4f32(<4 x float>, <4 x float>) nounwind readnone

I've got following code generated:

    ...
    vmaxf32: @ @vmaxf32
    @ BB#0:
        vld1.64 {d16, d17}, [r2]
        vld1.64 {d18, d19}, [r1]
        vmax.f32 q8, q9, q8
        vst1.64 {d16, d17}, [r0]
        bx lr
    ...

Now if use <16 x float> vectors instead of <4 x float>:

    define void @vmaxf32(<16 x float> *%C, <16 x float>* %A, <16 x float>* %B) nounwind {
      %tmp1 = load <16 x float>...
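A q register holds 128 bits, so one `vmax.f32` covers four floats, and a <16 x float> operation has to be carried out as four such pieces. The Python sketch below models only that splitting. It is not what LLVM emits (the snippet is cut off before that), the helper names are hypothetical, and NaN handling is ignored.

```python
def vmax_f32x4(a, b):
    """Model of one 4-lane vmax.f32: lane-wise maximum of two 4-element lists."""
    assert len(a) == len(b) == 4
    return [max(x, y) for x, y in zip(a, b)]

def vmax_wide(a, b, lanes=4):
    """Apply the 4-lane operation to consecutive chunks of a wider vector.

    Assumes len(a) == len(b) and that the length is a multiple of `lanes`.
    """
    out = []
    for i in range(0, len(a), lanes):
        out.extend(vmax_f32x4(a[i:i + lanes], b[i:i + lanes]))
    return out

a = [float(i) for i in range(16)]          # 0.0 .. 15.0
b = [float(15 - i) for i in range(16)]     # 15.0 .. 0.0
print(vmax_wide(a, b))
```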

[LLVMdev] Unaligned vector memory access for ARM/NEON.

2012 Sep 06

1

...s confused about the alignment requirements here. :)
>
> I think I see, in general, where. I twiddled the IR to give it higher alignment (16 bytes) and get:
> extend: @ @extend
> @ BB#0:
> vldr d16, [r0]
> vmovl.s16 q8, d16
> vstmia r1, {d16, d17}
> vldr d16, [r0, #8]
> add r0, r1, #16
> vmovl.s16 q8, d16
> vstmia r0, {d16, d17}
> bx lr
>
> Note that we're using a plain vldr instruction here to load the d register, not a vld1 instruction. Similarly for the stores. According to the ARM ARM (DDI 0406C), you'...

[LLVMdev] Question about LLVM NEON intrinsics

2012 Sep 21

0

...; ret void
> }
>
> declare <4 x float> @llvm.arm.neon.vmaxs.v4f32(<4 x float>, <4 x float>) nounwind readnone
>
> I've got following code generated:
>
> ...
> vmaxf32: @ @vmaxf32
> @ BB#0:
> vld1.64 {d16, d17}, [r2]
> vld1.64 {d18, d19}, [r1]
> vmax.f32 q8, q9, q8
> vst1.64 {d16, d17}, [r0]
> bx lr
> ...
>
> Now if use <16 x float> vectors instead of <4 x float>:
>
> define void @vmaxf32(<16 x float> *%C, <16 x f...

[PATCH 3/5] resample: Add NEON optimized inner_product_single for fixed point

2011 Sep 01

0

..."
+ " vld1.16 {d16}, [%[b]]!\n"
+ " vld1.16 {d20}, [%[a]]!\n"
+ " subs %[remainder], %[remainder], #4\n"
+ " vmull.s16 q0, d16, d20\n"
+ " beq 5f\n"
+ " b 4f\n"
+ "1:"
+ " vld1.16 {d16, d17, d18, d19}, [%[b]]!\n"
+ " vld1.16 {d20, d21, d22, d23}, [%[a]]!\n"
+ " subs %[len], %[len], #16\n"
+ " vmull.s16 q0, d16, d20\n"
+ " vmlal.s16 q0, d17, d21\n"
+ " vmlal.s16 q0, d18, d22\n"
+ " vmlal.s16 q0, d19, d2...

[LLVMdev] Unaligned vector memory access for ARM/NEON.

2012 Sep 05

0

...ssible that it's LLVM that's confused about the alignment requirements here. :)

I think I see, in general, where. I twiddled the IR to give it higher alignment (16 bytes) and get:

    extend: @ @extend
    @ BB#0:
        vldr d16, [r0]
        vmovl.s16 q8, d16
        vstmia r1, {d16, d17}
        vldr d16, [r0, #8]
        add r0, r1, #16
        vmovl.s16 q8, d16
        vstmia r0, {d16, d17}
        bx lr

Note that we're using a plain vldr instruction here to load the d register, not a vld1 instruction. Similarly for the stores. According to the ARM ARM (DDI 0406C), you're correct about the element size...
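For readers following the data flow rather than the alignment rules, here is a small Python model of what the listing computes. Each `vldr` reads 8 bytes (four 16-bit lanes), `vmovl.s16` sign-extends them to four 32-bit lanes, and `vstmia` writes the resulting 16 bytes, repeated for the second half. The function name and the little-endian layout are assumptions made for the sketch. `struct.unpack_from` accepts any byte offset, so the model says nothing about hardware alignment faults.

```python
import struct

def extend(src, offset=0):
    """Sign-extend eight little-endian int16 values to int32, four lanes at a time."""
    out = bytearray()
    for half in range(2):
        # vldr d16, [r0] / vldr d16, [r0, #8]: read four 16-bit lanes
        lanes = struct.unpack_from("<4h", src, offset + 8 * half)
        # vmovl.s16 q8, d16 then vstmia: widen to 32 bits and write 16 bytes
        out += struct.pack("<4i", *lanes)
    return bytes(out)

src = struct.pack("<8h", -1, 2, -3, 4, 5, -6, 7, -8)
print(struct.unpack("<8i", extend(src)))  # (-1, 2, -3, 4, 5, -6, 7, -8)
```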

[LLVMdev] RE : Question about LLVM NEON intrinsics

2012 Sep 21

2

...; ret void
> }
>
> declare <4 x float> @llvm.arm.neon.vmaxs.v4f32(<4 x float>, <4 x float>) nounwind readnone
>
> I've got following code generated:
>
> ...
> vmaxf32: @ @vmaxf32
> @ BB#0:
> vld1.64 {d16, d17}, [r2]
> vld1.64 {d18, d19}, [r1]
> vmax.f32 q8, q9, q8
> vst1.64 {d16, d17}, [r0]
> bx lr
> ...
>
> Now if use <16 x float> vectors instead of <4 x float>:
>
> define void @vmaxf32(<16 x float> *%C, <16 x f...

xen smp acpi failed

2008 May 15

2

In an HVM environment, ACPI failed. Why? CentOS 5.1

===================================================
[root@hvm001 ~]# xm dmesg
__ __ _____ _ ____ ___ ____ _ ____ \ \/ /___ _ __ |___ / / | |___ \ / _ \___ \ ___| | ___| \ // _ \ '_ \ |_ \ | | __) |_| (_) |__) | / _ \ |___ \ / \ __/ | | | ___) || |_ / __/|__\__, / __/ | __/ |___) | /_/\_\___|_| |_| |____(_)_(_)_____| /_/_____(_)___|_|____/

[LLVMdev] NEON intrinsics preventing redundant load optimization?

2015 Jan 05

2

On 4 Jan 2015, at 21:06, Tim Northover <t.p.northover at gmail.com> wrote:

>>> I’ve managed to replace the load/store intrinsics with pointer dereferences (along with a typedef to get the alignment correct). This generates 100% the same IR + asm as the auto-vectorized C version (both using -O3), and works with the toolchain in the latest XCode. Are there any concerns around doing

[LLVMdev] neon registers llvm using

2014 Mar 10

4

Hi everyone,

Can anyone tell me which NEON registers LLVM uses by default on ARMv7 devices? For example, are d10 and d11 treated as zero by default? I am using Xcode 5 + LLVM, and I have a case where the compiler generates the NEON code "vst.8 {d10, d11}, [r1]" from the C code "int aMV[4]; ...... aMV[0] = aMV[1] = aMV[2] = aMV[3] = 0;" and I

[LLVMdev] LLVM 3.0 release notes ARM Target

2011 Nov 16

0

What do you mean by "more optimal instructions"?

-omer

On Wed, Nov 16, 2011 at 1:28 AM, Joe Abbey <jabbey at arxan.com> wrote:
> I've done a first pass over the past 6 months of changes and some notable
> things stood out:
>
> * The ARM backend has reworked Set Jump Long Jump EH Lowering.
> * The ARM backend includes improved support for Cortex-M
> *

[LLVMdev] RE : Vector argument passing abi for ARM ?

2012 Jul 05

2

    ...v r0, r1, d16
        vmov r2, r3, d18
        bl zzz(PLT)
        pop {r11, pc}

With LLVM trunk, the assembly looks like:

    bar: @ @bar
    @ BB#0: @ %L.entry
        push {r11, lr}
        add r0, r1, #2
        vld1.32 {d16[0]}, [r1, :16]
        vld1.32 {d17[0]}, [r0, :16]
        vmovl.u8 q9, d16
        vmovl.u8 q8, d17
        vmovl.u16 q9, d18
        vmovl.u16 q8, d16
        vmov r0, r1, d18
        vmov r2, r3, d16
        bl zzz(PLT)
        pop {r11, pc}
    .Ltmp0:
        .size bar, .Ltmp0-bar

and the assembler complains with the following message:

    bugparam.s:19: Err...

[LLVMdev] Unaligned vector memory access for ARM/NEON.

2012 Sep 05

3

Hello Jim,

Thank you for the response. I may be confused about the alignment rules here. I had been looking at the ARM RVCT Assembler Guide, which seems to indicate vld1.16 operates on 16-bit aligned data, unless I am misinterpreting their table (Table 5-11 in ARM DUI 0204H, pg 5-70, 5-71). Prior to the table, it does mention that accesses need to be "element" aligned, where I took
