Displaying 20 results from an estimated 24 matches for "vget_high_s16".
2015 May 15
0
[RFC V3 5/8] aarch64: celt_pitch_xcorr: Fixed point intrinsics
...yi += 8;
+ YY[2] = vld1q_s16(yi);
+
+ XX[0] = vld1q_s16(xi);
+ xi += 8;
+ XX[1] = vld1q_s16(xi);
+ xi += 8;
+
+ /* Consume XX[0][0:3] */
+ SUMM = vmlal_lane_s16(SUMM, vget_low_s16(YY[0]), vget_low_s16(XX[0]), 0);
+
+ YEXT[0] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 1);
+ SUMM = vmlal_lane_s16(SUMM, YEXT[0], vget_low_s16(XX[0]), 1);
+
+ YEXT[1] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 2);
+ SUMM = vmlal_lane_s16(SUMM, YEXT[1], vget_low_s16(XX[0]), 2);
+
+ YEXT[2] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 3);...
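For reference, a scalar sketch of what this excerpt's inner step computes: four cross-correlation lags kept in one accumulator (the NEON code holds them in the 32x4 register SUMM and forms the shifted y windows with vext_s16). Function and variable names here are illustrative, not from the patch.

```c
#include <stdint.h>

/* Scalar reference for the NEON xcorr step above: four correlation
 * sums over a window, sum[j] += x[i] * y[i + j]. The patch keeps the
 * four sums in one int32x4 accumulator and uses vext_s16 to build the
 * four shifted y windows, consuming x four lanes at a time via
 * vmlal_lane_s16. Names here are illustrative. */
static void xcorr4_ref(const int16_t *x, const int16_t *y,
                       int32_t sum[4], int len)
{
    for (int j = 0; j < 4; j++) {
        sum[j] = 0;
        for (int i = 0; i < len; i++)
            sum[j] += (int32_t)x[i] * y[i + j];
    }
}
```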
2015 May 08
0
[[RFC PATCH v2]: Ne10 fft fixed and previous 5/8] aarch64: celt_pitch_xcorr: Fixed point intrinsics
...yi += 8;
+ YY[2] = vld1q_s16(yi);
+
+ XX[0] = vld1q_s16(xi);
+ xi += 8;
+ XX[1] = vld1q_s16(xi);
+ xi += 8;
+
+ /* Consume XX[0][0:3] */
+ SUMM = vmlal_lane_s16(SUMM, vget_low_s16(YY[0]), vget_low_s16(XX[0]), 0);
+
+ YEXT[0] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 1);
+ SUMM = vmlal_lane_s16(SUMM, YEXT[0], vget_low_s16(XX[0]), 1);
+
+ YEXT[1] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 2);
+ SUMM = vmlal_lane_s16(SUMM, YEXT[1], vget_low_s16(XX[0]), 2);
+
+ YEXT[2] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 3);...
2016 Jul 14
0
[PATCH 2/5] Optimize fixed-point celt_fir_c() for ARM NEON
...um[ord-i-1];
+ rnum[ord] = rnum[ord+1] = rnum[ord+2] = 0;
+ (void)arch;
+
+#ifdef SMALL_FOOTPRINT
+ for (i=0;i<N-7;i+=8)
+ {
+ int16x8_t x_s16x8 = vld1q_s16(_x+i);
+ int32x4_t sum0_s32x4 = vshll_n_s16(vget_low_s16 (x_s16x8), SIG_SHIFT);
+ int32x4_t sum1_s32x4 = vshll_n_s16(vget_high_s16(x_s16x8), SIG_SHIFT);
+ for (j=0;j<ord;j+=4)
+ {
+ const int16x4_t rnum_s16x4 = vld1_s16(rnum+j);
+ x_s16x8 = vld1q_s16(x+i+j+0);
+ sum0_s32x4 = vmlal_lane_s16(sum0_s32x4, vget_low_s16 (x_s16x8), rnum_s16x4, 0);
+ sum1_s32x4 = vmlal_lane_s16(sum1_s32x4,...
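For reference, a scalar sketch of the FIR recurrence this excerpt vectorizes. The patch reverses num[] into rnum[] so its inner loop can load the input history forward; the scalar form below uses num[] directly with backward indexing. SIG_SHIFT's value and all names are assumptions for illustration.

```c
#include <stdint.h>

#define SIG_SHIFT 12  /* assumed here; the fixed-point signal shift */

/* Scalar reference for the vectorized FIR above:
 * y[i] = (x[i] << SIG_SHIFT) + sum_j num[j] * x[i - j - 1].
 * The NEON code computes eight outputs per iteration, reading the
 * reversed coefficients rnum[] so history loads run forward.
 * x must have ord valid samples of history before x[0]. */
static void fir_ref(const int16_t *x, const int16_t *num,
                    int32_t *y, int N, int ord)
{
    for (int i = 0; i < N; i++) {
        int32_t sum = (int32_t)x[i] << SIG_SHIFT;
        for (int j = 0; j < ord; j++)
            sum += (int32_t)num[j] * x[i - j - 1];
        y[i] = sum;
    }
}
```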
2016 Jun 17
5
ARM NEON optimization -- celt_fir()
Hi all,
This is Linfeng Zhang from Google. I'll work on ARM NEON optimization in the
next few months.
I'm submitting 2 patches in the following couple of emails, which have the new
created celt_fir_neon().
I revised celt_fir_c() to not pass in argument "mem" in Patch 1. If there are
concerns to this change, please let me know.
Many thanks for your comments.
Linfeng Zhang
2016 Jul 28
0
[PATCH] Optimize silk_LPC_analysis_filter() for ARM NEON
...[d - ix - 1];
+ }
+ rB[d] = rB[d + 1] = rB[d + 2] = 0;
+
+ for (ix = d; ix < len - 7; ix += 8) {
+ int16x8_t in_s16x8 = vld1q_s16(in + ix);
+ int32x4_t out32_Q12_0_s32x4 = vshll_n_s16(vget_low_s16 (in_s16x8), 12);
+ int32x4_t out32_Q12_1_s32x4 = vshll_n_s16(vget_high_s16(in_s16x8), 12);
+ for (j = 0; j < d; j += 4) {
+ const int16x4_t rB_s16x4 = vld1_s16(rB + j);
+ in_s16x8 = vld1q_s16(in - d + ix + j + 0);
+ out32_Q12_0_s32x4 = vmlsl_lane_s16(out32_Q12_0_s32x4, vget_low_s16 (in_s16x8), rB_s16x4, 0);
+...
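For reference, a scalar sketch of the whitening filter this excerpt vectorizes; note the NEON loop uses vmlsl_lane_s16, so the products are subtracted from the accumulator. This sketch omits the final rounding/saturating narrow back to 16 bits; names are illustrative.

```c
#include <stdint.h>

/* Scalar reference for the NEON loop above:
 * out_Q12[ix] = (in[ix] << 12) - sum_j B[j] * in[ix - j - 1].
 * The patch reverses B[] into rB[] so history loads run forward, and
 * accumulates with vmlsl_lane_s16 (multiply-subtract-long). */
static void lpc_analysis_ref(const int16_t *in, const int16_t *B,
                             int32_t *out_Q12, int len, int d)
{
    for (int ix = d; ix < len; ix++) {
        int32_t out32 = (int32_t)in[ix] << 12;
        for (int j = 0; j < d; j++)
            out32 -= (int32_t)B[j] * in[ix - j - 1];
        out_Q12[ix] = out32;
    }
}
```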
2015 Mar 31
6
[RFC PATCH v1 0/5] aarch64: celt_pitch_xcorr: Fixed point series
Hi Timothy,
As I mentioned earlier [1], I have now fixed the compile issues
with fixed point and am resubmitting the patch.
I also have new patch that does intrinsics optimizations
for celt_pitch_xcorr targetting aarch64.
You can find my latest work-in-progress branch at [2]
For reference, you can use the Ne10 pre-built libraries
at [3]
Note that I am working with Phil at ARM to get my patch at [4]
2015 May 08
8
[RFC PATCH v2]: Ne10 fft fixed and previous 0/8]
Hi All,
As per Timothy's suggestion, disabling mdct_forward
for fixed point. Only affects
armv7,armv8: Extend fixed fft NE10 optimizations to mdct
Rest of patches are same as in [1]
For reference, latest wip code for opus is at [2]
Still working with the NE10 team at ARM on corner cases of
mdct_forward. Will update with another patch
when the issue in NE10 is fixed.
Regards,
Vish
[1]:
2015 May 15
11
[RFC V3 0/8] Ne10 fft fixed and previous
Hi All,
Changes from RFC v2 [1]
armv7,armv8: Extend fixed fft NE10 optimizations to mdct
- Overflow issue fixed by Phil at ARM. Ne10 wip at [2]. Should be upstream soon.
- So, re-enabled using fixed fft for mdct_forward which was disabled in RFCv2
armv7,armv8: Optimize fixed point fft using NE10 library
- Thanks to Jonathan Lennox, fixed some build issues on iOS and some copy-paste errors
Rest
2015 Nov 21
0
[Aarch64 v2 08/18] Add Neon fixed-point implementation of xcorr_kernel.
...q_s32(sum);
+ //Load y[0...3]
+ //This requires len>0 to always be valid (which we assert in the C code).
+ int16x4_t y0 = vld1_s16(y);
+ y += 4;
+
+ for (j = 0; j + 8 <= len; j += 8)
+ {
+ // Load x[0...7]
+ int16x8_t xx = vld1q_s16(x);
+ int16x4_t x0 = vget_low_s16(xx);
+ int16x4_t x4 = vget_high_s16(xx);
+ // Load y[4...11]
+ int16x8_t yy = vld1q_s16(y);
+ int16x4_t y4 = vget_low_s16(yy);
+ int16x4_t y8 = vget_high_s16(yy);
+ int32x4_t a0 = vmlal_lane_s16(a, y0, x0, 0);
+ int32x4_t a1 = vmlal_lane_s16(a0, y4, x4, 0);
+
+ int16x4_t y1 = vext_s16(y0, y4, 1);
+ int16x4_t y5 = vext_s16(y4,...
2015 Apr 28
10
[RFC PATCH v1 0/8] Ne10 fft fixed and previous
Hello Timothy / Jean-Marc / opus-dev,
This patch series is follow up on work I posted on [1].
In addition to what was posted on [1], this patch series mainly
integrates Fixed point FFT implementations in NE10 library into opus.
You can view my opus wip code at [2].
Note that while I found some issues both with the NE10 library (fixed fft)
and with the Linaro toolchain (armv8 intrinsics), the work
2015 Nov 20
2
[Aarch64 00/11] Patches to enable Aarch64
> On Nov 19, 2015, at 5:47 PM, John Ridges <jridges at masque.com> wrote:
>
> Any speedup from the intrinsics may just be swamped by the rest of the encode/decode process. But I think you really want SIG2WORD16 to be (vqmovns_s32(PSHR32((x), SIG_SHIFT)))
Yes, you're right. I forgot to run the vectors under qemu with my previous version (oh, the embarrassment!) Fix forthcoming.
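For reference, a scalar model of the SIG2WORD16 semantics suggested above: a rounding right shift (PSHR32) followed by a saturating narrow to int16, which is what vqmovns_s32 gives per lane. SIG_SHIFT's value is assumed for illustration.

```c
#include <stdint.h>

#define SIG_SHIFT 12  /* assumed here */

/* Scalar model of vqmovns_s32(PSHR32(x, SIG_SHIFT)): round, shift
 * right, then saturate to the int16 range instead of wrapping. */
static int16_t sig2word16_ref(int32_t x)
{
    x = (x + (1 << (SIG_SHIFT - 1))) >> SIG_SHIFT;  /* PSHR32 */
    if (x > 32767) return 32767;
    if (x < -32768) return -32768;
    return (int16_t)x;
}
```

The saturation is the point of the fix: a plain cast after the shift would wrap on loud signals, which is the kind of bug only running the test vectors catches.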
2016 Jul 14
6
Several patches of ARM NEON optimization
I rebased my previous 3 patches to the current master with minor changes.
Patches 1 to 3 replace all my previous submitted patches.
Patches 4 and 5 are new.
Thanks,
Linfeng Zhang
2015 Nov 23
1
[Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.
...<jridges at masque.com> wrote:
Hi Jonathan.
I really, really hate to bring this up this late in the game, but I just noticed that your NEON code doesn't use any of the "high" intrinsics for ARM64, e.g. instead of:
int32x4_t coef1 = vmovl_s16(vget_high_s16(coef16));
you could use:
int32x4_t coef1 = vmovl_high_s16(coef16);
and instead of:
int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0));
you could use:
int64x2_t b1 = vmlal_high_s32(b0, a0, coef0);
and instead of:
int64x1_t c = vadd_s64(vget_low_s64(b3), vget_high_s64(b3));...
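For reference, a scalar model of the equivalence described above. On AArch64, vmovl_high_s16(v) widens lanes 4..7 of a 16x8 vector directly, producing the same result as vmovl_s16(vget_high_s16(v)) but in one instruction, with no separate half-extract. Names below are illustrative; the vector is modeled as a plain array.

```c
#include <stdint.h>

/* Scalar model of vmovl_high_s16: sign-extend lanes 4..7 of an
 * int16x8 to int32. The two-step form (vget_high_s16 then vmovl_s16)
 * computes the same thing; the "high" intrinsic fuses it. */
static void movl_high_ref(const int16_t v[8], int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)v[4 + i];
}
```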
2016 Sep 13
4
[PATCH 12/15] Replace call of celt_inner_prod_c() (step 1)
Should call celt_inner_prod().
---
celt/bands.c | 7 ++++---
celt/bands.h | 2 +-
celt/celt_encoder.c | 6 +++---
celt/pitch.c | 2 +-
src/opus_multistream_encoder.c | 2 +-
5 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/celt/bands.c b/celt/bands.c
index bbe8a4c..1ab24aa 100644
--- a/celt/bands.c
+++ b/celt/bands.c
2016 Aug 23
0
[PATCH 8/8] Optimize silk_NSQ_del_dec() for ARM NEON
...= vld1q_s16( AR_shp_Q13 + 0 );
+ const int16x8_t t1_s16x8 = vld1q_s16( AR_shp_Q13 + 8 );
+ const int16x8_t t2_s16x8 = vld1q_s16( AR_shp_Q13 + 16 );
+ vst1q_s32( AR_shp_Q28 + 0, vshll_n_s16( vget_low_s16 ( t0_s16x8 ), 15 ) );
+ vst1q_s32( AR_shp_Q28 + 4, vshll_n_s16( vget_high_s16( t0_s16x8 ), 15 ) );
+ vst1q_s32( AR_shp_Q28 + 8, vshll_n_s16( vget_low_s16 ( t1_s16x8 ), 15 ) );
+ vst1q_s32( AR_shp_Q28 + 12, vshll_n_s16( vget_high_s16( t1_s16x8 ), 15 ) );
+ vst1q_s32( AR_shp_Q28 + 16, vshll_n_s16( vget_low_s16 ( t2_s16x8 ), 15 ) );
+ vst1q_s32( AR_...
2016 Aug 23
2
[PATCH 7/8] Update NSQ_LPC_BUF_LENGTH macro.
NSQ_LPC_BUF_LENGTH is independent of DECISION_DELAY.
---
silk/define.h | 4 ----
1 file changed, 4 deletions(-)
diff --git a/silk/define.h b/silk/define.h
index 781cfdc..1286048 100644
--- a/silk/define.h
+++ b/silk/define.h
@@ -173,11 +173,7 @@ extern "C"
#define MAX_MATRIX_SIZE MAX_LPC_ORDER /* Max of LPC Order and LTP order */
-#if( MAX_LPC_ORDER >
2015 Nov 23
0
[Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.
Hi Jonathan.
I really, really hate to bring this up this late in the game, but I just
noticed that your NEON code doesn't use any of the "high" intrinsics for
ARM64, e.g. instead of:
int32x4_t coef1 = vmovl_s16(vget_high_s16(coef16));
you could use:
int32x4_t coef1 = vmovl_high_s16(coef16);
and instead of:
int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0));
you could use:
int64x2_t b1 = vmlal_high_s32(b0, a0, coef0);
and instead of:
int64x1_t c = vadd_s64(vget_low_s64(b3), vget_high_s64(b3));...
2015 Aug 05
0
[PATCH 7/8] Add Neon intrinsics for Silk noise shape feedback loop.
...] ... [3]
+
+ int32x4_t a0 = vextq_s32 (a00, a01, 3); // data0[0] data1[0] ...[2]
+ int32x4_t a1 = vld1q_s32(data1 + 3); // data1[3] ... [6]
+
+ int16x8_t coef16 = vld1q_s16(coef);
+ int32x4_t coef0 = vmovl_s16(vget_low_s16(coef16));
+ int32x4_t coef1 = vmovl_s16(vget_high_s16(coef16));
+
+ int64x2_t b0 = vmull_s32(vget_low_s32(a0), vget_low_s32(coef0));
+ int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0));
+ int64x2_t b2 = vmlal_s32(b1, vget_low_s32(a1), vget_low_s32(coef1));
+ int64x2_t b3 = vmlal_s32(b2, vget_high_s32(a1)...
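For reference, a scalar sketch of the accumulation this excerpt vectorizes: an eight-tap 32x32-to-64-bit multiply-accumulate. The NEON code widens coef16 to two int32x4 vectors and chains vmull_s32/vmlal_s32, carrying two 64-bit partial sums per step. Names here are illustrative.

```c
#include <stdint.h>

/* Scalar reference for the feedback-loop MAC above: widen the eight
 * 16-bit coefficients and accumulate products into a 64-bit sum, so
 * no intermediate product or partial sum can overflow. */
static int64_t feedback_mac_ref(const int32_t *data,
                                const int16_t coef[8])
{
    int64_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (int64_t)data[i] * coef[i];
    return sum;
}
```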
2015 Nov 21
0
[Aarch64 v2 06/18] Add Neon intrinsics for Silk noise shape feedback loop.
...] ... [3]
+
+ int32x4_t a0 = vextq_s32 (a00, a01, 3); // data0[0] data1[0] ...[2]
+ int32x4_t a1 = vld1q_s32(data1 + 3); // data1[3] ... [6]
+
+ int16x8_t coef16 = vld1q_s16(coef);
+ int32x4_t coef0 = vmovl_s16(vget_low_s16(coef16));
+ int32x4_t coef1 = vmovl_s16(vget_high_s16(coef16));
+
+ int64x2_t b0 = vmull_s32(vget_low_s32(a0), vget_low_s32(coef0));
+ int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0));
+ int64x2_t b2 = vmlal_s32(b1, vget_low_s32(a1), vget_low_s32(coef1));
+ int64x2_t b3 = vmlal_s32(b2, vget_high_s32(a1)...
2016 Aug 26
2
[PATCH 9/9] Optimize silk_inner_prod_aligned_scale() for ARM NEON
...sum_s64x1;
+
+ for( i = 0; i < len - 7; i += 8 ) {
+ const int16x8_t in1 = vld1q_s16(&inVec1[i]);
+ const int16x8_t in2 = vld1q_s16(&inVec2[i]);
+ int32x4_t t0 = vmull_s16(vget_low_s16 (in1), vget_low_s16 (in2));
+ int32x4_t t1 = vmull_s16(vget_high_s16(in1), vget_high_s16(in2));
+ t0 = vshlq_s32(t0, scaleLeft_s32x4);
+ sum_s32x4 = vaddq_s32(sum_s32x4, t0);
+ t1 = vshlq_s32(t1, scaleLeft_s32x4);
+ sum_s32x4 = vaddq_s32(sum_s32x4, t1);
+ }
+ sum_s64x2 = vpaddlq_...
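For reference, a scalar sketch of what this excerpt computes: 16x16-to-32-bit products, each right-shifted by a scale factor, reduced to a 64-bit sum. The NEON code expresses the right shift as vshlq_s32 by a negative amount (scaleLeft) and does the final widening reduction with vpaddlq_s32; this sketch accumulates directly in 64 bits. Names are illustrative.

```c
#include <stdint.h>

/* Scalar reference for the scaled inner product above:
 * sum += (v1[i] * v2[i]) >> scale, accumulated in 64 bits.
 * The NEON version keeps four 32-bit partial sums and widens at the
 * end with vpaddlq_s32; scale >= 0 assumed here. */
static int64_t inner_prod_scaled_ref(const int16_t *v1, const int16_t *v2,
                                     int len, int scale)
{
    int64_t sum = 0;
    for (int i = 0; i < len; i++)
        sum += ((int32_t)v1[i] * v2[i]) >> scale;
    return sum;
}
```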