Displaying 20 results from an estimated 62 matches for "d17".
2010 Nov 12
2
[LLVMdev] Simple NEON optimization
...days, most as a matter of exercise, but it also simplifies (just
a bit) the code generated.
The case is simple:
uint32x2_t x, res;
res = vceq_u32(x, vcreate_u32(0));
This will generate the following code:
; zero d16
vmov.i32 d16, #0x0
; load a into d17
movw r0, :lower16:a
movt r0, :upper16:a
vld1.32 {d17}, [r0]
; compare two registers
vceq.i32 d17, d17, d16
But, because the vector is zero, and there is a NEON instruction to
compare against an immediate zero (VCEQZ), we could combine the two
in...
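A portable sketch of what this comparison computes, for readers without a NEON toolchain at hand: each 32-bit lane of the result is all ones when the corresponding input lane equals zero, and all zeros otherwise. The `u32x2` type and the `ceq_zero_u32x2` helper are hypothetical stand-ins written for illustration; they are not NEON intrinsics and do not come from the thread.

```c
#include <stdint.h>

/* Hypothetical stand-in for uint32x2_t: two 32-bit lanes. */
typedef struct { uint32_t lane[2]; } u32x2;

/* Lane-wise "equal to zero" compare. Models the result of
 * vceq_u32(x, vcreate_u32(0)): a lane becomes 0xFFFFFFFF when the
 * input lane is zero, and 0 otherwise. */
static u32x2 ceq_zero_u32x2(u32x2 x) {
    u32x2 res;
    for (int i = 0; i < 2; ++i)
        res.lane[i] = (x.lane[i] == 0) ? UINT32_MAX : 0;
    return res;
}
```

Since one operand is a known constant zero, the result depends on a single register, which is why a compare-against-immediate form can replace the two-instruction sequence.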
2013 Oct 15
0
[LLVMdev] MI scheduler produce badly code with inline function
On Oct 14, 2013, at 3:27 AM, Zakk <zakk0610 at gmail.com> wrote:
> Hi all,
> I hit this problem when compiling the STREAM benchmark (http://www.cs.virginia.edu/stream/FTP/Code/) with enable-misched
>
> The small function is scheduled well on its own, but if opt inlines this function, the inlined part is scheduled badly.
A bug for this is welcome. Pretty soon, I’ll
2013 Oct 14
2
[LLVMdev] MI scheduler produce badly code with inline function
Hi all,
I hit this problem when compiling the STREAM benchmark (
http://www.cs.virginia.edu/stream/FTP/Code/) with enable-misched.
The small function is scheduled well on its own, but if opt inlines this
function, the inlined part is scheduled badly.
So I wrote a simple test case (foo.c, at the attached link) and compiled it in two
different ways:
*method A:*
*$clang -O3 foo.c -static -S
2013 Oct 21
1
[LLVMdev] MI scheduler produce badly code with inline function
Hi Andy, I'm working on defining a new machine model for my target,
but I don't understand how to define an in-order machine (reservation
tables) in the new model.
For example, if the target has IF, ID, EX, and WB stages,
should I do:
let BufferSize=0 in {
def IF: ProcResource<1>; def ID: ProcResource<1>;
def EX: ProcResource<1>; def WB: ProcResource<1>;
}
def :
2014 Dec 07
3
[LLVMdev] NEON intrinsics preventing redundant load optimization?
...0; i < 4; ++i)
result.data[i] = a.data[i] * b.data[i];
return result;
}
void TestVec4Multiply(vec4& a, vec4& b, vec4& result)
{
result = a * b;
}
With -O3 the loop gets vectorized and the code generated looks optimal:
__Z16TestVec4MultiplyR4vec4S0_S0_:
@ BB#0:
vld1.32 {d16, d17}, [r1]
vld1.32 {d18, d19}, [r0]
vmul.f32 q8, q9, q8
vst1.32 {d16, d17}, [r2]
bx lr
However if I replace the operator* with a NEON intrinsic implementation (I know the vectorizer figured out optimal code in this case anyway, but that wasn't true for my real situation) then the temporary "...
2010 Nov 12
0
[LLVMdev] Simple NEON optimization
...ies (just
> a bit) the code generated.
>
> The case is simple:
>
> uint32x2_t x, res;
> res = vceq_u32(x, vcreate_u32(0));
>
> This will generate the following code:
>
> ; zero d16
> vmov.i32 d16, #0x0
> ; load a into d17
> movw r0, :lower16:a
> movt r0, :upper16:a
> vld1.32 {d17}, [r0]
> ; compare two registers
> vceq.i32 d17, d17, d16
>
> But, because the vector is zero, and there is a NEON instruction to
> compare against an immediate zero (...
2013 Oct 16
3
[LLVMdev] MI scheduler produce badly code with inline function
...ed -mllvm -scheditins=false
per-operand cost model :
Scale:
push {lr}
movw r12, :lower16:c
movw lr, :lower16:b
movw r3, #9216
movt r12, :upper16:c
mov r1, #0
vmov.f64 d16, #3.000000e+00
movt lr, :upper16:b
movt r3, #244
.LBB0_1:
add r0, r12, r1
add r2, lr, r1
*vldr d17, [r0]*
add r1, r1, #32
vmul.f64 d17, d17, d16
cmp r1, r3
vstr d17, [r2]
* vldr d17, [r0, #8]*
vmul.f64 d17, d17, d16
vstr d17, [r2, #8]
* vldr d17, [r0, #16]*
vmul.f64 d17, d17, d16
vstr d17, [r2, #16]
* vldr d17, [r0, #24]*
vmul.f64 d17, d17, d16
vstr d17, [r2,...
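For orientation, here is a C sketch of the loop this assembly appears to implement. The foo.c source is not shown in the snippet, so this is an assumption read back from the listing: values are loaded from `c`, multiplied by 3.0, and stored to `b`, with the index advancing 32 bytes (four doubles) per iteration. The function name `scale_unrolled4` and its parameters are hypothetical.

```c
#include <stddef.h>

/* STREAM-style Scale kernel, b[j] = scalar * c[j], unrolled by four
 * to mirror the four load/multiply/store groups in the listing.
 * Sketch only: assumes n is a multiple of 4. */
static void scale_unrolled4(double *b, const double *c,
                            double scalar, size_t n) {
    for (size_t j = 0; j < n; j += 4) {
        b[j]     = scalar * c[j];
        b[j + 1] = scalar * c[j + 1];
        b[j + 2] = scalar * c[j + 2];
        b[j + 3] = scalar * c[j + 3];
    }
}
```

The four statements in the body are independent of each other, so a scheduler is free to interleave their loads, multiplies, and stores. The listing above instead reuses d17 for every group, which serializes them.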
2010 Jul 05
2
nested for loops
...or your time and consideration.
for(d1 in 0:n){
for(d2 in 0:n){
for(d3 in 0:n){
for(d4 in 0:n){
for(d5 in 0:n){
for(d6 in 0:n){
for(d7 in 0:n){
for(d8 in 0:n){
for(d9 in 0:n){
for(d10 in 0:n){
for(d11 in 0:n){
for(d12 in 0:n){
for(d13 in 0:n){
for(d14 in 0:n){
for(d15 in 0:n){
for(d16 in 0:n){
for(d17 in 0:n){
for(d18 in 0:n){
for(d19 in 0:n){
for(d20 in 0:n){
list=c(d1,d2,d3,d4,d5,d6,d7,d8,d9,d10,d11,d12,d13,d14,d15,d16,d17,d18,d19,d20)
}}}}}}}}}}}}}}}}}}}}
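Twenty nested loops like these can be collapsed into a single loop that advances an array of counters like an odometer. The sketch below is in C rather than the poster's R, and the helper names `next_tuple` and `count_tuples` are hypothetical. It visits the tuples in the same order as the nested loops, with the last index varying fastest.

```c
#include <stddef.h>

/* Advance digits[0..k-1], each ranging over 0..n, to the next tuple.
 * Returns 1 if a next tuple exists, 0 once the last one is passed
 * (the digits then wrap back to all zeros). */
static int next_tuple(int *digits, size_t k, int n) {
    for (size_t i = k; i-- > 0; ) {
        if (digits[i] < n) {
            digits[i]++;
            return 1;
        }
        digits[i] = 0;  /* this position overflowed; carry leftward */
    }
    return 0;
}

/* Enumerate every tuple once and count them; the loop body is where
 * per-tuple work would go. Sketch assumes k <= 32. */
static unsigned long count_tuples(size_t k, int n) {
    int digits[32] = {0};
    unsigned long count = 0;
    do {
        ++count;
    } while (next_tuple(digits, k, n));
    return count;
}
```

The number of tuples is (n+1)^k, so with k = 20 the enumeration is only feasible for very small n.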
2012 Sep 21
5
[LLVMdev] Question about LLVM NEON intrinsics
...store <4 x float> %tmp3, <4 x float>* %C
ret void
}
declare <4 x float> @llvm.arm.neon.vmaxs.v4f32(<4 x float>, <4 x float>) nounwind readnone
I got the following code generated:
...
vmaxf32: @ @vmaxf32
@ BB#0:
vld1.64 {d16, d17}, [r2]
vld1.64 {d18, d19}, [r1]
vmax.f32 q8, q9, q8
vst1.64 {d16, d17}, [r0]
bx lr
...
Now if I use <16 x float> vectors instead of <4 x float>:
define void @vmaxf32(<16 x float> *%C, <16 x float>* %A, <16 x float>* %B) nounwind {
%tmp1 = load <16 x float>...
2012 Sep 06
1
[LLVMdev] Unaligned vector memory access for ARM/NEON.
...s confused about the alignment requirements here. :)
>
> I think I see, in general, where. I twiddled the IR to give it higher alignment (16 bytes) and get:
> extend: @ @extend
> @ BB#0:
> vldr d16, [r0]
> vmovl.s16 q8, d16
> vstmia r1, {d16, d17}
> vldr d16, [r0, #8]
> add r0, r1, #16
> vmovl.s16 q8, d16
> vstmia r0, {d16, d17}
> bx lr
>
> Note that we're using a plain vldr instruction here to load the d register, not a vld1 instruction. Similarly for the stores. According to the ARM ARM (DDI 0406C), you'...
2012 Sep 21
0
[LLVMdev] Question about LLVM NEON intrinsics
...> ret void
> }
>
> declare <4 x float> @llvm.arm.neon.vmaxs.v4f32(<4 x float>, <4 x float>) nounwind readnone
>
> I got the following code generated:
>
> ...
> vmaxf32: @ @vmaxf32
> @ BB#0:
> vld1.64 {d16, d17}, [r2]
> vld1.64 {d18, d19}, [r1]
> vmax.f32 q8, q9, q8
> vst1.64 {d16, d17}, [r0]
> bx lr
> ...
>
> Now if I use <16 x float> vectors instead of <4 x float>:
>
> define void @vmaxf32(<16 x float> *%C, <16 x f...
2011 Sep 01
0
[PATCH 3/5] resample: Add NEON optimized inner_product_single for fixed point
..."
+ " vld1.16 {d16}, [%[b]]!\n"
+ " vld1.16 {d20}, [%[a]]!\n"
+ " subs %[remainder], %[remainder], #4\n"
+ " vmull.s16 q0, d16, d20\n"
+ " beq 5f\n"
+ " b 4f\n"
+ "1:"
+ " vld1.16 {d16, d17, d18, d19}, [%[b]]!\n"
+ " vld1.16 {d20, d21, d22, d23}, [%[a]]!\n"
+ " subs %[len], %[len], #16\n"
+ " vmull.s16 q0, d16, d20\n"
+ " vmlal.s16 q0, d17, d21\n"
+ " vmlal.s16 q0, d18, d22\n"
+ " vmlal.s16 q0, d19, d2...
2012 Sep 05
0
[LLVMdev] Unaligned vector memory access for ARM/NEON.
...ssible that it's LLVM that's confused about the alignment requirements here. :)
I think I see, in general, where. I twiddled the IR to give it higher alignment (16 bytes) and get:
extend: @ @extend
@ BB#0:
vldr d16, [r0]
vmovl.s16 q8, d16
vstmia r1, {d16, d17}
vldr d16, [r0, #8]
add r0, r1, #16
vmovl.s16 q8, d16
vstmia r0, {d16, d17}
bx lr
Note that we're using a plain vldr instruction here to load the d register, not a vld1 instruction. Similarly for the stores. According to the ARM ARM (DDI 0406C), you're correct about the element size...
2012 Sep 21
2
[LLVMdev] RE : Question about LLVM NEON intrinsics
...> ret void
> }
>
> declare <4 x float> @llvm.arm.neon.vmaxs.v4f32(<4 x float>, <4 x float>) nounwind readnone
>
> I got the following code generated:
>
> ...
> vmaxf32: @ @vmaxf32
> @ BB#0:
> vld1.64 {d16, d17}, [r2]
> vld1.64 {d18, d19}, [r1]
> vmax.f32 q8, q9, q8
> vst1.64 {d16, d17}, [r0]
> bx lr
> ...
>
> Now if I use <16 x float> vectors instead of <4 x float>:
>
> define void @vmaxf32(<16 x float> *%C, <16 x f...
2008 May 15
2
xen smp acpi failed
In an HVM environment, ACPI failed. Why? CentOS 5.1
===================================================
[root@hvm001 ~]# xm dmesg
__ __ _____ _ ____ ___ ____ _ ____
\ \/ /___ _ __ |___ / / | |___ \ / _ \___ \ ___| | ___|
\ // _ \ '_ \ |_ \ | | __) |_| (_) |__) | / _ \ |___ \
/ \ __/ | | | ___) || |_ / __/|__\__, / __/ | __/ |___) |
/_/\_\___|_| |_| |____(_)_(_)_____| /_/_____(_)___|_|____/
2015 Jan 05
2
[LLVMdev] NEON intrinsics preventing redundant load optimization?
On 4 Jan 2015, at 21:06, Tim Northover <t.p.northover at gmail.com> wrote:
>>> I’ve managed to replace the load/store intrinsics with pointer dereferences (along with a typedef to get the alignment correct). This generates 100% the same IR + asm as the auto-vectorized C version (both using -O3), and works with the toolchain in the latest XCode. Are there any concerns around doing
2014 Mar 10
4
[LLVMdev] neon registers llvm using
Hi, Everyone:
Can anyone tell me which NEON registers LLVM uses by default on ARMv7 devices?
For example, are d10 and d11 treated as zero by default? I am using Xcode5 + llvm and I have a case where the compiler generates NEON code
" vst.8 {d10, d11}, [r1] "
from the C code:
"int aMV[4];
......
aMV[0] = aMV[1] = aMV[2] = aMV[3] = 0; "
and I
2011 Nov 16
0
[LLVMdev] LLVM 3.0 release notes ARM Target
what do you mean by "more optimal instructions" ?
-omer
On Wed, Nov 16, 2011 at 1:28 AM, Joe Abbey <jabbey at arxan.com> wrote:
> I've done a first pass over the past 6 months of changes and some notable
> things stood out:
>
> * The ARM backend has reworked Set Jump Long Jump EH Lowering.
> * The ARM backend includes improved support for Cortex-M
> *
2012 Jul 05
2
[LLVMdev] RE : Vector argument passing abi for ARM ?
...v r0, r1, d16
vmov r2, r3, d18
bl zzz(PLT)
pop {r11, pc}
with LLVM trunk, assembly looks like:
bar: @ @bar
@ BB#0: @ %L.entry
push {r11, lr}
add r0, r1, #2
vld1.32 {d16[0]}, [r1, :16]
vld1.32 {d17[0]}, [r0, :16]
vmovl.u8 q9, d16
vmovl.u8 q8, d17
vmovl.u16 q9, d18
vmovl.u16 q8, d16
vmov r0, r1, d18
vmov r2, r3, d16
bl zzz(PLT)
pop {r11, pc}
.Ltmp0:
.size bar, .Ltmp0-bar
and the assembler complains with the following message:
bugparam.s:19: Err...
2012 Sep 05
3
[LLVMdev] Unaligned vector memory access for ARM/NEON.
Hello Jim,
Thank you for the response. I may be confused about the alignment rules
here.
I had been looking at the ARM RVCT Assembler Guide, which seems to
indicate vld1.16 operates on 16-bit aligned data, unless I am
misinterpreting their table
(Table 5-11 in ARM DUI 0204H, pg 5-70,5-71).
Prior to the table, it does mention that the accesses need to be "element"
aligned, where I took