search for: vdup

Displaying 8 results from an estimated 8 matches for "vdup".

2013 May 21
0
regarding ARM NEON CELT filter optimizations
Hello Aurelien, + "vdup.s16 d8, %1;\n" //Duplicate num in d8 lane + "vdup.s16 q5, %4;\n" //Duplicate mem in q5 lane + + /* We try to process 16 samples at a time */ + "movs %5, %3, lsr #4;\n" + "beq .celt_fir1_process16_done_%=;\n" + + ".celt_fir1_process16_...
2013 May 21
0
[PATCH] 02-
....c are hard-coded with 1 and 4 order values + * + * TODO: Test one sample by one filtering + */ + +/* Order 1 NEON FIR filter implementation */ +static void celt_fir1(const opus_val16 *x, opus_val16 num, opus_val16 *y, + int N, opus_val16 mem) +{ + int i; + + __asm__ __volatile__( + "vdup.s16 d8, %1;\n" //Duplicate num in d8 lane + "vdup.s16 q5, %4;\n" //Duplicate mem in q5 lane + + /* We try to process 16 samples at a time */ + "movs %5, %3, lsr #4;\n" + "beq .celt_fir1_process16_done_%=;\n" + + ".celt_fir1_process16_...
2013 May 21
2
[PATCH] 02-Add CELT filter optimizations
....c are hard-coded with 1 and 4 order values + * + * TODO: Test one sample by one filtering + */ + +/* Order 1 NEON FIR filter implementation */ +static void celt_fir1(const opus_val16 *x, opus_val16 num, opus_val16 *y, + int N, opus_val16 mem) +{ + int i; + + __asm__ __volatile__( + "vdup.s16 d8, %1;\n" //Duplicate num in d8 lane + "vdup.s16 q5, %4;\n" //Duplicate mem in q5 lane + + /* We try to process 16 samples at a time */ + "movs %5, %3, lsr #4;\n" + "beq .celt_fir1_process16_done_%=;\n" + + ".celt_fir1_process16_...
2013 Feb 04
6
[LLVMdev] Vectorizer using Instruction, not opcodes
...is not stopping the vectorizer, but it does add a lot of costs where there are none... ** C code: int direct (int k) { int i; int a[256], b[256], c[256]; for (i=0; i<256; i++){ a[i] = b[i] * c[i]; } return a[k]; } ** ASM vectorized result: adr r5, .LCPI0_0 vdup.32 q9, r1 vld1.64 {d16, d17}, [r5, :128] add r1, r1, #4 vadd.i32 q8, q9, q8 cmp r3, r1 vmov.32 r5, d16[0] add r6, lr, r5, lsl #2 add r7, r2, r5, lsl #2 vld1.32 {d16, d17}, [r6] add r5, r4, r5, lsl #2...
2013 Feb 04
0
[LLVMdev] Vectorizer using Instruction, not opcodes
Hi all, My take on this is that, as you state below, at the IR level we are only roughly estimating cost, at best (or we would have to lower the code and then estimate cost - something we don't want to do). I would propose estimating the "worst case costs" and seeing how far we get with this. My rationale here is that we don't want vectorization to decrease performance relative
2013 Feb 04
0
[LLVMdev] Vectorizer using Instruction, not opcodes
...> > > ** C code: > > > int direct (int k) { > int i; > int a[256], b[256], c[256]; > > > for (i=0; i<256; i++){ > a[i] = b[i] * c[i]; > } > return a[k]; > } > > > ** ASM vectorized result: > > > > adr r5, .LCPI0_0 > vdup.32 q9, r1 > vld1.64 {d16, d17}, [r5, :128] > add r1, r1, #4 > vadd.i32 q8, q9, q8 > cmp r3, r1 > vmov.32 r5, d16[0] > add r6, lr, r5, lsl #2 > add r7, r2, r5, lsl #2 > vld1.32 {d16, d17}, [r6] > add r5, r4, r5, lsl #2 > vld1.32 {d18, d19}, [r7] > vmul.i32 q8, q9, q8...
2013 Feb 04
2
[LLVMdev] Vectorizer using Instruction, not opcodes
Hi folks, I've been thinking about how to implement some of the costs, and there are a lot of instructions whose cost depends on the surrounding instructions. Casts are one obvious case, since arithmetic and memory instructions can sometimes cast values for free. The cost model receives Opcodes, which lose the info on the history of the values being vectorized, and I thought we could pass the whole
2013 Feb 04
0
[LLVMdev] Vectorizer using Instruction, not opcodes
.... > > ** C code: > > int direct (int k) { > int i; > int a[256], b[256], c[256]; > > for (i=0; i<256; i++){ > a[i] = b[i] * c[i]; > } > return a[k]; > } > > ** ASM vectorized result: > > adr r5, .LCPI0_0 > vdup.32 q9, r1 > vld1.64 {d16, d17}, [r5, :128] > add r1, r1, #4 > vadd.i32 q8, q9, q8 > cmp r3, r1 > vmov.32 r5, d16[0] > add r6, lr, r5, lsl #2 > add r7, r2, r5, lsl #2 > vld1.32 {d16, d17},...