Displaying 20 results from an estimated 24 matches for "vget_high_s16".
2015 May 15
0
[RFC V3 5/8] aarch64: celt_pitch_xcorr: Fixed point intrinsics
...yi += 8;
+ YY[2] = vld1q_s16(yi);
+
+ XX[0] = vld1q_s16(xi);
+ xi += 8;
+ XX[1] = vld1q_s16(xi);
+ xi += 8;
+
+ /* Consume XX[0][0:3] */
+ SUMM = vmlal_lane_s16(SUMM, vget_low_s16(YY[0]), vget_low_s16(XX[0]), 0);
+
+ YEXT[0] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 1);
+ SUMM = vmlal_lane_s16(SUMM, YEXT[0], vget_low_s16(XX[0]), 1);
+
+ YEXT[1] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 2);
+ SUMM = vmlal_lane_s16(SUMM, YEXT[1], vget_low_s16(XX[0]), 2);
+
+ YEXT[2] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 3);...
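For reference, a scalar sketch of what this excerpt's inner step computes: four cross-correlation lags kept in one accumulator (the NEON code holds them in the 32x4 register SUMM and forms the shifted y windows with vext_s16). Function and variable names here are illustrative, not from the patch.

```c
#include <stdint.h>

/* Scalar reference for the NEON xcorr step above: four correlation
 * sums over a window, sum[j] += x[i] * y[i + j]. The patch keeps the
 * four sums in one int32x4 accumulator and uses vext_s16 to build the
 * four shifted y windows, consuming x four lanes at a time via
 * vmlal_lane_s16. Names here are illustrative. */
static void xcorr4_ref(const int16_t *x, const int16_t *y,
                       int32_t sum[4], int len)
{
    for (int j = 0; j < 4; j++) {
        sum[j] = 0;
        for (int i = 0; i < len; i++)
            sum[j] += (int32_t)x[i] * y[i + j];
    }
}
```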
2015 May 08
0
[[RFC PATCH v2]: Ne10 fft fixed and previous 5/8] aarch64: celt_pitch_xcorr: Fixed point intrinsics
...yi += 8;
+ YY[2] = vld1q_s16(yi);
+
+ XX[0] = vld1q_s16(xi);
+ xi += 8;
+ XX[1] = vld1q_s16(xi);
+ xi += 8;
+
+ /* Consume XX[0][0:3] */
+ SUMM = vmlal_lane_s16(SUMM, vget_low_s16(YY[0]), vget_low_s16(XX[0]), 0);
+
+ YEXT[0] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 1);
+ SUMM = vmlal_lane_s16(SUMM, YEXT[0], vget_low_s16(XX[0]), 1);
+
+ YEXT[1] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 2);
+ SUMM = vmlal_lane_s16(SUMM, YEXT[1], vget_low_s16(XX[0]), 2);
+
+ YEXT[2] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 3);...
2016 Jul 14
0
[PATCH 2/5] Optimize fixed-point celt_fir_c() for ARM NEON
...um[ord-i-1];
+ rnum[ord] = rnum[ord+1] = rnum[ord+2] = 0;
+ (void)arch;
+
+#ifdef SMALL_FOOTPRINT
+ for (i=0;i<N-7;i+=8)
+ {
+ int16x8_t x_s16x8 = vld1q_s16(_x+i);
+ int32x4_t sum0_s32x4 = vshll_n_s16(vget_low_s16 (x_s16x8), SIG_SHIFT);
+ int32x4_t sum1_s32x4 = vshll_n_s16(vget_high_s16(x_s16x8), SIG_SHIFT);
+ for (j=0;j<ord;j+=4)
+ {
+ const int16x4_t rnum_s16x4 = vld1_s16(rnum+j);
+ x_s16x8 = vld1q_s16(x+i+j+0);
+ sum0_s32x4 = vmlal_lane_s16(sum0_s32x4, vget_low_s16 (x_s16x8), rnum_s16x4, 0);
+ sum1_s32x4 = vmlal_lane_s16(sum1_s32x4,...
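For reference, a scalar sketch of the FIR recurrence this excerpt vectorizes. The patch reverses num[] into rnum[] so its inner loop can load the input history forward; the scalar form below uses num[] directly with backward indexing. SIG_SHIFT's value and all names are assumptions for illustration.

```c
#include <stdint.h>

#define SIG_SHIFT 12  /* assumed here; the fixed-point signal shift */

/* Scalar reference for the vectorized FIR above:
 * y[i] = (x[i] << SIG_SHIFT) + sum_j num[j] * x[i - j - 1].
 * The NEON code computes eight outputs per iteration, reading the
 * reversed coefficients rnum[] so history loads run forward.
 * x must have ord valid samples of history before x[0]. */
static void fir_ref(const int16_t *x, const int16_t *num,
                    int32_t *y, int N, int ord)
{
    for (int i = 0; i < N; i++) {
        int32_t sum = (int32_t)x[i] << SIG_SHIFT;
        for (int j = 0; j < ord; j++)
            sum += (int32_t)num[j] * x[i - j - 1];
        y[i] = sum;
    }
}
```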
2016 Jun 17
5
ARM NEON optimization -- celt_fir()
Hi all,
This is Linfeng Zhang from Google. I'll work on ARM NEON optimization in the
next few months.
I'm submitting 2 patches in the following couple of emails, which have the new
created celt_fir_neon().
I revised celt_fir_c() to not pass in argument "mem" in Patch 1. If there are
concerns to this change, please let me know.
Many thanks for your comments.
Linfeng Zhang
2016 Jul 28
0
[PATCH] Optimize silk_LPC_analysis_filter() for ARM NEON
...[d - ix - 1];
+ }
+ rB[d] = rB[d + 1] = rB[d + 2] = 0;
+
+ for (ix = d; ix < len - 7; ix += 8) {
+ int16x8_t in_s16x8 = vld1q_s16(in + ix);
+ int32x4_t out32_Q12_0_s32x4 = vshll_n_s16(vget_low_s16 (in_s16x8), 12);
+ int32x4_t out32_Q12_1_s32x4 = vshll_n_s16(vget_high_s16(in_s16x8), 12);
+ for (j = 0; j < d; j += 4) {
+ const int16x4_t rB_s16x4 = vld1_s16(rB + j);
+ in_s16x8 = vld1q_s16(in - d + ix + j + 0);
+ out32_Q12_0_s32x4 = vmlsl_lane_s16(out32_Q12_0_s32x4, vget_low_s16 (in_s16x8), rB_s16x4, 0);
+...
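For reference, a scalar sketch of the whitening filter this excerpt vectorizes; note the NEON loop uses vmlsl_lane_s16, so the products are subtracted from the accumulator. This sketch omits the final rounding/saturating narrow back to 16 bits; names are illustrative.

```c
#include <stdint.h>

/* Scalar reference for the NEON loop above:
 * out_Q12[ix] = (in[ix] << 12) - sum_j B[j] * in[ix - j - 1].
 * The patch reverses B[] into rB[] so history loads run forward, and
 * accumulates with vmlsl_lane_s16 (multiply-subtract-long). */
static void lpc_analysis_ref(const int16_t *in, const int16_t *B,
                             int32_t *out_Q12, int len, int d)
{
    for (int ix = d; ix < len; ix++) {
        int32_t out32 = (int32_t)in[ix] << 12;
        for (int j = 0; j < d; j++)
            out32 -= (int32_t)B[j] * in[ix - j - 1];
        out_Q12[ix] = out32;
    }
}
```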
2015 Mar 31
6
[RFC PATCH v1 0/5] aarch64: celt_pitch_xcorr: Fixed point series
Hi Timothy,
As I mentioned earlier [1], I have now fixed the compile issues
with fixed point and am resubmitting the patch.
I also have new patch that does intrinsics optimizations
for celt_pitch_xcorr targetting aarch64.
You can find my latest work-in-progress branch at [2]
For reference, you can use the Ne10 pre-built libraries
at [3]
Note that I am working with Phil at ARM to get my patch at [4]
2015 May 08
8
[RFC PATCH v2]: Ne10 fft fixed and previous 0/8]
Hi All,
As per Timothy's suggestion, disabling mdct_forward
for fixed point. Only affects
armv7,armv8: Extend fixed fft NE10 optimizations to mdct
Rest of patches are same as in [1]
For reference, latest wip code for opus is at [2]
Still working with the NE10 team at ARM on corner cases of
mdct_forward. Will update with another patch
when the issue in NE10 is fixed.
Regards,
Vish
[1]:
2015 May 15
11
[RFC V3 0/8] Ne10 fft fixed and previous
Hi All,
Changes from RFC v2 [1]
armv7,armv8: Extend fixed fft NE10 optimizations to mdct
- Overflow issue fixed by Phil at ARM. Ne10 wip at [2]. Should be upstream soon.
- So, re-enabled using fixed fft for mdct_forward which was disabled in RFCv2
armv7,armv8: Optimize fixed point fft using NE10 library
- Thanks to Jonathan Lennox, fixed some build issues on iOS and some copy-paste errors
Rest
2015 Nov 21
0
[Aarch64 v2 08/18] Add Neon fixed-point implementation of xcorr_kernel.
...q_s32(sum);
+ //Load y[0...3]
+ //This requires len>0 to always be valid (which we assert in the C code).
+ int16x4_t y0 = vld1_s16(y);
+ y += 4;
+
+ for (j = 0; j + 8 <= len; j += 8)
+ {
+ // Load x[0...7]
+ int16x8_t xx = vld1q_s16(x);
+ int16x4_t x0 = vget_low_s16(xx);
+ int16x4_t x4 = vget_high_s16(xx);
+ // Load y[4...11]
+ int16x8_t yy = vld1q_s16(y);
+ int16x4_t y4 = vget_low_s16(yy);
+ int16x4_t y8 = vget_high_s16(yy);
+ int32x4_t a0 = vmlal_lane_s16(a, y0, x0, 0);
+ int32x4_t a1 = vmlal_lane_s16(a0, y4, x4, 0);
+
+ int16x4_t y1 = vext_s16(y0, y4, 1);
+ int16x4_t y5 = vext_s16(y4,...
2015 Apr 28
10
[RFC PATCH v1 0/8] Ne10 fft fixed and previous
Hello Timothy / Jean-Marc / opus-dev,
This patch series is follow up on work I posted on [1].
In addition to what was posted on [1], this patch series mainly
integrates Fixed point FFT implementations in NE10 library into opus.
You can view my opus wip code at [2].
Note that while I found some issues both with the NE10 library (fixed fft)
and with the Linaro toolchain (armv8 intrinsics), the work
2015 Nov 20
2
[Aarch64 00/11] Patches to enable Aarch64
> On Nov 19, 2015, at 5:47 PM, John Ridges <jridges at masque.com> wrote:
>
> Any speedup from the intrinsics may just be swamped by the rest of the encode/decode process. But I think you really want SIG2WORD16 to be (vqmovns_s32(PSHR32((x), SIG_SHIFT)))
Yes, you're right. I forgot to run the vectors under qemu with my previous version (oh, the embarrassment!) Fix forthcoming.
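For reference, a scalar model of the SIG2WORD16 semantics suggested above: a rounding right shift (PSHR32) followed by a saturating narrow to int16, which is what vqmovns_s32 gives per lane. SIG_SHIFT's value is assumed for illustration.

```c
#include <stdint.h>

#define SIG_SHIFT 12  /* assumed here */

/* Scalar model of vqmovns_s32(PSHR32(x, SIG_SHIFT)): round, shift
 * right, then saturate to the int16 range instead of wrapping. */
static int16_t sig2word16_ref(int32_t x)
{
    x = (x + (1 << (SIG_SHIFT - 1))) >> SIG_SHIFT;  /* PSHR32 */
    if (x > 32767) return 32767;
    if (x < -32768) return -32768;
    return (int16_t)x;
}
```

The saturation is the point of the fix: a plain cast after the shift would wrap on loud signals, which is the kind of bug only running the test vectors catches.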
2016 Jul 14
6
Several patches of ARM NEON optimization
I rebased my previous 3 patches to the current master with minor changes.
Patches 1 to 3 replace all my previous submitted patches.
Patches 4 and 5 are new.
Thanks,
Linfeng Zhang
2015 Nov 23
1
[Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.
...<jridges at masque.com> wrote:
Hi Jonathan.
I really, really hate to bring this up this late in the game, but I just noticed that your NEON code doesn't use any of the "high" intrinsics for ARM64, e.g. instead of:
int32x4_t coef1 = vmovl_s16(vget_high_s16(coef16));
you could use:
int32x4_t coef1 = vmovl_high_s16(coef16);
and instead of:
int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0));
you could use:
int64x2_t b1 = vmlal_high_s32(b0, a0, coef0);
and instead of:
int64x1_t c = vadd_s64(vget_low_s64(b3), vget_high_s64(b3));...
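For reference, a scalar model of the equivalence described above. On AArch64, vmovl_high_s16(v) widens lanes 4..7 of a 16x8 vector directly, producing the same result as vmovl_s16(vget_high_s16(v)) but in one instruction, with no separate half-extract. Names below are illustrative; the vector is modeled as a plain array.

```c
#include <stdint.h>

/* Scalar model of vmovl_high_s16: sign-extend lanes 4..7 of an
 * int16x8 to int32. The two-step form (vget_high_s16 then vmovl_s16)
 * computes the same thing; the "high" intrinsic fuses it. */
static void movl_high_ref(const int16_t v[8], int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)v[4 + i];
}
```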
2016 Sep 13
4
[PATCH 12/15] Replace call of celt_inner_prod_c() (step 1)
Should call celt_inner_prod().
---
celt/bands.c | 7 ++++---
celt/bands.h | 2 +-
celt/celt_encoder.c | 6 +++---
celt/pitch.c | 2 +-
src/opus_multistream_encoder.c | 2 +-
5 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/celt/bands.c b/celt/bands.c
index bbe8a4c..1ab24aa 100644
--- a/celt/bands.c
+++ b/celt/bands.c
2016 Aug 23
0
[PATCH 8/8] Optimize silk_NSQ_del_dec() for ARM NEON
...= vld1q_s16( AR_shp_Q13 + 0 );
+ const int16x8_t t1_s16x8 = vld1q_s16( AR_shp_Q13 + 8 );
+ const int16x8_t t2_s16x8 = vld1q_s16( AR_shp_Q13 + 16 );
+ vst1q_s32( AR_shp_Q28 + 0, vshll_n_s16( vget_low_s16 ( t0_s16x8 ), 15 ) );
+ vst1q_s32( AR_shp_Q28 + 4, vshll_n_s16( vget_high_s16( t0_s16x8 ), 15 ) );
+ vst1q_s32( AR_shp_Q28 + 8, vshll_n_s16( vget_low_s16 ( t1_s16x8 ), 15 ) );
+ vst1q_s32( AR_shp_Q28 + 12, vshll_n_s16( vget_high_s16( t1_s16x8 ), 15 ) );
+ vst1q_s32( AR_shp_Q28 + 16, vshll_n_s16( vget_low_s16 ( t2_s16x8 ), 15 ) );
+ vst1q_s32( AR_...
2016 Aug 23
2
[PATCH 7/8] Update NSQ_LPC_BUF_LENGTH macro.
NSQ_LPC_BUF_LENGTH is independent of DECISION_DELAY.
---
silk/define.h | 4 ----
1 file changed, 4 deletions(-)
diff --git a/silk/define.h b/silk/define.h
index 781cfdc..1286048 100644
--- a/silk/define.h
+++ b/silk/define.h
@@ -173,11 +173,7 @@ extern "C"
#define MAX_MATRIX_SIZE MAX_LPC_ORDER /* Max of LPC Order and LTP order */
-#if( MAX_LPC_ORDER >
2015 Nov 23
0
[Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.
Hi Jonathan.
I really, really hate to bring this up this late in the game, but I just
noticed that your NEON code doesn't use any of the "high" intrinsics for
ARM64, e.g. instead of:
int32x4_t coef1 = vmovl_s16(vget_high_s16(coef16));
you could use:
int32x4_t coef1 = vmovl_high_s16(coef16);
and instead of:
int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0));
you could use:
int64x2_t b1 = vmlal_high_s32(b0, a0, coef0);
and instead of:
int64x1_t c = vadd_s64(vget_low_s64(b3), vget_high_s64(b3));...
2015 Aug 05
0
[PATCH 7/8] Add Neon intrinsics for Silk noise shape feedback loop.
...] ... [3]
+
+ int32x4_t a0 = vextq_s32 (a00, a01, 3); // data0[0] data1[0] ...[2]
+ int32x4_t a1 = vld1q_s32(data1 + 3); // data1[3] ... [6]
+
+ int16x8_t coef16 = vld1q_s16(coef);
+ int32x4_t coef0 = vmovl_s16(vget_low_s16(coef16));
+ int32x4_t coef1 = vmovl_s16(vget_high_s16(coef16));
+
+ int64x2_t b0 = vmull_s32(vget_low_s32(a0), vget_low_s32(coef0));
+ int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0));
+ int64x2_t b2 = vmlal_s32(b1, vget_low_s32(a1), vget_low_s32(coef1));
+ int64x2_t b3 = vmlal_s32(b2, vget_high_s32(a1)...
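For reference, a scalar sketch of the accumulation this excerpt vectorizes: an eight-tap 32x32-to-64-bit multiply-accumulate. The NEON code widens coef16 to two int32x4 vectors and chains vmull_s32/vmlal_s32, carrying two 64-bit partial sums per step. Names here are illustrative.

```c
#include <stdint.h>

/* Scalar reference for the feedback-loop MAC above: widen the eight
 * 16-bit coefficients and accumulate products into a 64-bit sum, so
 * no intermediate product or partial sum can overflow. */
static int64_t feedback_mac_ref(const int32_t *data,
                                const int16_t coef[8])
{
    int64_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (int64_t)data[i] * coef[i];
    return sum;
}
```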
2015 Nov 21
0
[Aarch64 v2 06/18] Add Neon intrinsics for Silk noise shape feedback loop.
...] ... [3]
+
+ int32x4_t a0 = vextq_s32 (a00, a01, 3); // data0[0] data1[0] ...[2]
+ int32x4_t a1 = vld1q_s32(data1 + 3); // data1[3] ... [6]
+
+ int16x8_t coef16 = vld1q_s16(coef);
+ int32x4_t coef0 = vmovl_s16(vget_low_s16(coef16));
+ int32x4_t coef1 = vmovl_s16(vget_high_s16(coef16));
+
+ int64x2_t b0 = vmull_s32(vget_low_s32(a0), vget_low_s32(coef0));
+ int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0));
+ int64x2_t b2 = vmlal_s32(b1, vget_low_s32(a1), vget_low_s32(coef1));
+ int64x2_t b3 = vmlal_s32(b2, vget_high_s32(a1)...
2016 Aug 26
2
[PATCH 9/9] Optimize silk_inner_prod_aligned_scale() for ARM NEON
...sum_s64x1;
+
+ for( i = 0; i < len - 7; i += 8 ) {
+ const int16x8_t in1 = vld1q_s16(&inVec1[i]);
+ const int16x8_t in2 = vld1q_s16(&inVec2[i]);
+ int32x4_t t0 = vmull_s16(vget_low_s16 (in1), vget_low_s16 (in2));
+ int32x4_t t1 = vmull_s16(vget_high_s16(in1), vget_high_s16(in2));
+ t0 = vshlq_s32(t0, scaleLeft_s32x4);
+ sum_s32x4 = vaddq_s32(sum_s32x4, t0);
+ t1 = vshlq_s32(t1, scaleLeft_s32x4);
+ sum_s32x4 = vaddq_s32(sum_s32x4, t1);
+ }
+ sum_s64x2 = vpaddlq_...
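For reference, a scalar sketch of what this excerpt computes: 16x16-to-32-bit products, each right-shifted by a scale factor, reduced to a 64-bit sum. The NEON code expresses the right shift as vshlq_s32 by a negative amount (scaleLeft) and does the final widening reduction with vpaddlq_s32; this sketch accumulates directly in 64 bits. Names are illustrative.

```c
#include <stdint.h>

/* Scalar reference for the scaled inner product above:
 * sum += (v1[i] * v2[i]) >> scale, accumulated in 64 bits.
 * The NEON version keeps four 32-bit partial sums and widens at the
 * end with vpaddlq_s32; scale >= 0 assumed here. */
static int64_t inner_prod_scaled_ref(const int16_t *v1, const int16_t *v2,
                                     int len, int scale)
{
    int64_t sum = 0;
    for (int i = 0; i < len; i++)
        sum += ((int32_t)v1[i] * v2[i]) >> scale;
    return sum;
}
```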