search for: vaddq_s32

Displaying 16 results from an estimated 16 matches for "vaddq_s32".

2016 Aug 23
0
[PATCH 8/8] Optimize silk_NSQ_del_dec() for ARM NEON
...Winner_ind = i; + } + } + psDelDec->RD_Q10[ Winner_ind ] -= ( silk_int32_MAX >> 4 ); + RD_Q10_s32x4 = vld1q_s32( psDelDec->RD_Q10 ); + RD_Q10_s32x4 = vaddq_s32( RD_Q10_s32x4, vdupq_n_s32( silk_int32_MAX >> 4 ) ); + vst1q_s32( psDelDec->RD_Q10, RD_Q10_s32x4 ); + + /* Copy final part of signals from winner state to output and long-term filter states */ + copy_winner_state( psDelDe...
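
The fragment above shows how this patch uses vaddq_s32: the winner's rate-distortion value is biased by subtracting (silk_int32_MAX >> 4), and the bias is then added back to all four RD_Q10 lanes in one vector operation. Below is a minimal sketch of that add-back step only; the 4-entry array, function name, and INT32_MAX are stand-ins for silk's types and constants, not the patch itself.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Sketch only: add the bias back to all four RD_Q10 entries at once. */
    static void add_back_bias(int32_t rd_q10[4])
    {
        int32x4_t rd = vld1q_s32(rd_q10);                   /* load RD_Q10[0..3]        */
        rd = vaddq_s32(rd, vdupq_n_s32(INT32_MAX >> 4));    /* add bias to every lane   */
        vst1q_s32(rd_q10, rd);                              /* store the adjusted values */
    }
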
2016 Aug 23
2
[PATCH 7/8] Update NSQ_LPC_BUF_LENGTH macro.
NSQ_LPC_BUF_LENGTH is independent of DECISION_DELAY. --- silk/define.h | 4 ---- 1 file changed, 4 deletions(-) diff --git a/silk/define.h b/silk/define.h index 781cfdc..1286048 100644 --- a/silk/define.h +++ b/silk/define.h @@ -173,11 +173,7 @@ extern "C" #define MAX_MATRIX_SIZE MAX_LPC_ORDER /* Max of LPC Order and LTP order */ -#if( MAX_LPC_ORDER >
2017 Apr 26
2
2 patches related to silk_biquad_alt() optimization
...32x4 = vcombine_s32(*out32_Q14_s32x2, *out32_Q14_s32x2); /* out32_Q14_{0,1,0,1} */ t_s32x4 = vqrdmulhq_s32(out32_Q14_s32x4, A_Q28_s32x4); /* silk_RSHIFT_ROUND( (opus_int64)out32_Q14[ {0,1,0,1} ] * (-A_Q28[ {0,0,1,1} ]), 30 ) */ *S_s32x4 = vaddq_s32(*S_s32x4, t_s32x4); /* S[ {0,1,2,3} ] = {S[ {2,3} ],0,0} + silk_RSHIFT_ROUND( ); */ t_s32x4 = vqdmulhq_s32(inval_s32x4, B_Q28_s32x4); /* silk_SMULWB(B_Q28[ {1,1,2,2} ], in[ k * 2 + {0,1,0,1} ] ) */ *S_s32x4 = vaddq_s32(*S_s3...
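
The snippets in this thread replace 64-bit multiply-and-shift macros with 32-bit saturating doubling multiplies. The following standalone check is my own sketch, not the patch: it illustrates why vqrdmulhq_s32 can stand in for silk_RSHIFT_ROUND((opus_int64)x * a, 30). The intrinsic computes a rounded 31-bit shift of the product, so pre-doubling the Q28 coefficient recovers the 30-bit shift; that pre-doubling is inferred from the quoted comments and not confirmed by the patch text shown here.

    #include <arm_neon.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Reference: silk_RSHIFT_ROUND((int64)x * a_q28, 30), written out directly. */
    static int32_t ref_rshift_round_30(int32_t x, int32_t a_q28)
    {
        return (int32_t)((((int64_t)x * a_q28) + (1 << 29)) >> 30);
    }

    int main(void)
    {
        int32_t x = 123456789, a_q28 = -52428800;           /* arbitrary test values   */
        int32x4_t xv = vdupq_n_s32(x);
        int32x4_t av = vdupq_n_s32(2 * a_q28);              /* pre-doubled coefficient */
        /* vqrdmulh: sat((2*x*(2a) + 2^31) >> 32) == (x*a + 2^29) >> 30 here (no saturation) */
        int32_t neon = vgetq_lane_s32(vqrdmulhq_s32(xv, av), 0);
        printf("ref=%d neon=%d\n", ref_rshift_round_30(x, a_q28), neon);
        return 0;
    }
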
2016 Jul 01
1
silk_warped_autocorrelation_FIX() NEON optimization
Hi all, I'm sending patch "Optimize silk_warped_autocorrelation_FIX() for ARM NEON" in a separate email. It is based on Tim’s aarch64v8 branch https://git.xiph.org/?p=users/tterribe/opus.git;a=shortlog;h=refs/heads/aarch64v8 Thanks for your comments. Linfeng
2017 May 15
2
2 patches related to silk_biquad_alt() optimization
...> */ > t_s32x4 = vqrdmulhq_s32(out32_Q14_s32x4, A_Q28_s32x4); > /* silk_RSHIFT_ROUND( (opus_int64)out32_Q14[ {0,1,0,1} ] * > (-A_Q28[ {0,0,1,1} ]), 30 ) */ > *S_s32x4 = vaddq_s32(*S_s32x4, t_s32x4); > /* S[ {0,1,2,3} ] = {S[ {2,3} ],0,0} + silk_RSHIFT_ROUND( ); > */ > t_s32x4 = vqdmulhq_s32(inval_s32x4, B_Q28_s32x4); > /* silk_SMULWB(B_Q28[ {1,1,2,2} ], in[ k *...
2016 Jul 14
6
Several patches of ARM NEON optimization
I rebased my previous 3 patches to the current master with minor changes. Patches 1 to 3 replace all my previously submitted patches. Patches 4 and 5 are new. Thanks, Linfeng Zhang
2013 Jun 07
2
Bug fix in celt_lpc.c and some xcorr_kernel optimizations
...s32(0); for (j = 0; j < len-1; j += 2) { xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x++),vld1_s16(y++)); xsum2 = vmlal_s16(xsum2,vdup_n_s16(*x++),vld1_s16(y++)); } if (j < len) { xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x),vld1_s16(y)); } vst1q_s32(sum,vaddq_s32(xsum1,xsum2)); } Cheers, John Ridges
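
The quoted loop is the core of John Ridges' NEON xcorr_kernel: each x[j] is broadcast with vdup_n_s16 and multiply-accumulated against y[j..j+3] with vmlal_s16, alternating between two accumulators that are folded together with vaddq_s32 at the end. Below is a minimal self-contained sketch under those assumptions; the function name, parameter types, and the xsum1 initialization from sum[] are mine, and, as noted later in the thread, the odd-length tail still loads four y values, so y needs at least len+3 readable elements.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Sketch of the quoted approach; not the committed Opus code. */
    static void xcorr_kernel_neon_sketch(const int16_t *x, const int16_t *y,
                                         int32_t sum[4], int len)
    {
        int32x4_t xsum1 = vld1q_s32(sum);          /* carry in existing partial sums  */
        int32x4_t xsum2 = vdupq_n_s32(0);
        int j;
        for (j = 0; j < len - 1; j += 2) {
            xsum1 = vmlal_s16(xsum1, vdup_n_s16(*x++), vld1_s16(y++));
            xsum2 = vmlal_s16(xsum2, vdup_n_s16(*x++), vld1_s16(y++));
        }
        if (j < len)                                /* odd tail: one more MAC          */
            xsum1 = vmlal_s16(xsum1, vdup_n_s16(*x), vld1_s16(y));
        vst1q_s32(sum, vaddq_s32(xsum1, xsum2));    /* combine the two accumulators    */
    }
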
2017 May 08
0
2 patches related to silk_biquad_alt() optimization
...s32x2, *out32_Q14_s32x2); > /* out32_Q14_{0,1,0,1} > */ > t_s32x4 = vqrdmulhq_s32(out32_Q14_s32x4, A_Q28_s32x4); > /* silk_RSHIFT_ROUND( (opus_int64)out32_Q14[ {0,1,0,1} ] * (-A_Q28[ > {0,0,1,1} ]), 30 ) */ > *S_s32x4 = vaddq_s32(*S_s32x4, t_s32x4); > /* S[ {0,1,2,3} ] = {S[ {2,3} ],0,0} + silk_RSHIFT_ROUND( ); > */ > t_s32x4 = vqdmulhq_s32(inval_s32x4, B_Q28_s32x4); > /* silk_SMULWB(B_Q28[ {1,1,2,2} ], in[ k * 2 + {0,1,0,1} ] ) > */ > *S_s...
2017 May 17
0
2 patches related to silk_biquad_alt() optimization
...*/ > > t_s32x4 = vqrdmulhq_s32(out32_Q14_s32x4, A_Q28_s32x4); > > /* silk_RSHIFT_ROUND( (opus_int64)out32_Q14[ {0,1,0,1} ] * > > (-A_Q28[ {0,0,1,1} ]), 30 ) */ > > *S_s32x4 = vaddq_s32(*S_s32x4, t_s32x4); > > /* S[ {0,1,2,3} ] = {S[ {2,3} ],0,0} + silk_RSHIFT_ROUND( ); > > */ > > t_s32x4 = vqdmulhq_s32(inval_s32x4, B_Q28_s32x4); > > /* silk_SMULWB(B_Q28[ {1,1,2,2} ], in[ k * 2...
2017 Apr 25
2
2 patches related to silk_biquad_alt() optimization
On Mon, Apr 24, 2017 at 5:52 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote: > On 24/04/17 08:03 PM, Linfeng Zhang wrote: > > Tested on my chromebook, when stride (channel) == 1, the optimization > > has no gain compared with C function. > > You mean that the Neon code is the same speed as the C code for > stride==1? This is not terribly surprising for an IIRC
2013 Jun 07
0
Bug fix in celt_lpc.c and some xcorr_kernel optimizations
...n-1; j += 2) { > xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x++),vld1_s16(y++)); > xsum2 = vmlal_s16(xsum2,vdup_n_s16(*x++),vld1_s16(y++)); > } > if (j < len) { > xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x),vld1_s16(y)); > } > vst1q_s32(sum,vaddq_s32(xsum1,xsum2)); > } > > > Cheers, > John Ridges
2013 Jun 07
1
Bug fix in celt_lpc.c and some xcorr_kernel optimizations
...m1,vdup_n_s16(*(x+j)),vld1_s16(y+j)); if (++j < len) { xsum2 = vmlal_s16(xsum2,vdup_n_s16(*(x+j)),vld1_s16(y+j)); if (++j < len) { xsum1 = vmlal_s16(xsum1,vdup_n_s16(*(x+j)),vld1_s16(y+j)); } } } vst1q_s32(sum,vaddq_s32(xsum1,xsum2)); } Whether or not this version is faster than the first version I submitted probably depends a lot on how fast unaligned memory vector accesses are on an ARM processor. Of course hand-coded assembly would be even faster than using intrinsics (for instance the "vdup_lane_s16"...
2013 Jun 07
2
Bug fix in celt_lpc.c and some xcorr_kernel optimizations
Hi JM, I have no doubt that Mr. Zanelli's NEON code is faster, since hand tuned assembly is bound to be faster than using intrinsics. However I notice that his code can also read past the y buffer. Cheers, --John On 6/6/2013 9:22 PM, Jean-Marc Valin wrote: > Hi John, > > Thanks for the two fixes. They're in git now. Your SSE version seems to > also be slightly faster than
2016 Aug 26
2
[PATCH 9/9] Optimize silk_inner_prod_aligned_scale() for ARM NEON
...nt16x8_t in2 = vld1q_s16(&inVec2[i]); + int32x4_t t0 = vmull_s16(vget_low_s16 (in1), vget_low_s16 (in2)); + int32x4_t t1 = vmull_s16(vget_high_s16(in1), vget_high_s16(in2)); + t0 = vshlq_s32(t0, scaleLeft_s32x4); + sum_s32x4 = vaddq_s32(sum_s32x4, t0); + t1 = vshlq_s32(t1, scaleLeft_s32x4); + sum_s32x4 = vaddq_s32(sum_s32x4, t1); + } + sum_s64x2 = vpaddlq_s32(sum_s32x4); + sum_s64x1 = vadd_s64(vget_low_s64(sum_s64x2), vget_high_s64(sum_s64x2)); + sum = vget_lane_s64(sum_s6...
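
The fragment above is the inner loop of the silk_inner_prod_aligned_scale() optimization: eight 16-bit products per iteration via vmull_s16 on the low and high halves, each product right-shifted by the scale factor (expressed as a negative per-lane shift for vshlq_s32), accumulated with vaddq_s32, and finally reduced with vpaddlq_s32 / vadd_s64. Below is a minimal sketch along those lines; the signature, the len-multiple-of-8 restriction, and the missing scalar remainder handling are my own assumptions, not the patch.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Sketch of the quoted scaled inner product; assumes len is a multiple of 8. */
    static int64_t inner_prod_scaled_neon_sketch(const int16_t *vec1,
                                                 const int16_t *vec2,
                                                 int scale, int len)
    {
        int32x4_t sum_s32x4 = vdupq_n_s32(0);
        const int32x4_t scaleLeft_s32x4 = vdupq_n_s32(-scale); /* negative => right shift */
        int i;
        for (i = 0; i + 8 <= len; i += 8) {
            int16x8_t in1 = vld1q_s16(&vec1[i]);
            int16x8_t in2 = vld1q_s16(&vec2[i]);
            int32x4_t t0 = vmull_s16(vget_low_s16 (in1), vget_low_s16 (in2));
            int32x4_t t1 = vmull_s16(vget_high_s16(in1), vget_high_s16(in2));
            t0 = vshlq_s32(t0, scaleLeft_s32x4);               /* product >> scale        */
            sum_s32x4 = vaddq_s32(sum_s32x4, t0);
            t1 = vshlq_s32(t1, scaleLeft_s32x4);
            sum_s32x4 = vaddq_s32(sum_s32x4, t1);
        }
        int64x2_t sum_s64x2 = vpaddlq_s32(sum_s32x4);          /* pairwise widen to 64-bit */
        int64x1_t sum_s64x1 = vadd_s64(vget_low_s64(sum_s64x2),
                                       vget_high_s64(sum_s64x2));
        return vget_lane_s64(sum_s64x1, 0);
    }
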
2015 Dec 23
6
[AArch64 neon intrinsics v4 0/5] Rework Neon intrinsic code for Aarch64 patchset
Following Tim's comments, here are my reworked patches for the Neon intrinsic function patches of my Aarch64 patchset, i.e. replacing patches 5-8 of the v2 series. Patches 1-4 and 9-18 of the old series still apply unmodified. The one new (as opposed to changed) patch is the first one in this series, to add named constants for the ARM architecture variants. There are also some minor code
2013 Jun 10
0
opus Digest, Vol 53, Issue 2
...m1,vdup_n_s16(*(x+j)),vld1_s16(y+j)); if (++j < len) { xsum2 = vmlal_s16(xsum2,vdup_n_s16(*(x+j)),vld1_s16(y+j)); if (++j < len) { xsum1 = vmlal_s16(xsum1,vdup_n_s16(*(x+j)),vld1_s16(y+j)); } } } vst1q_s32(sum,vaddq_s32(xsum1,xsum2)); } Whether or not this version is faster than the first version I submitted probably depends a lot on how fast unaligned memory vector accesses are on an ARM processor. Of course hand-coded assembly would be even faster than using intrinsics (for instance the "vdup_lane_s16"...