thr3ads.net - search: "vadd

2017 Apr 26

2

2 patches related to silk_biquad_alt() optimization

..._lane_s32() to do the multiplication and rounding, where A_Q28_s32x{2,4} stores doubled -A_Q28[]: static inline void silk_biquad_alt_stride1_kernel(const int32x2_t A_Q28_s32x2, const int32x4_t t_s32x4, int32x2_t *S_s32x2, int32x2_t *out32_Q14_s32x2) { int32x2_t t_s32x2; *out32_Q14_s32x2 = vadd_s32(*S_s32x2, vget_low_s32(t_s32x4)); /* silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k ] ) */ *S_s32x2 = vreinterpret_s32_u64(vshr_n_u64(vreinterpret_u64_s32(*S_s32x2), 32)); /* S[ 0 ] = S[ 1 ]; S[ 1 ] = 0; */ *out...
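
The kernel quoted in this snippet updates the filter state with a single 64-bit shift: the two 32-bit lanes are reinterpreted as one 64-bit value and shifted right by 32, which the snippet's own comment describes as S[ 0 ] = S[ 1 ]; S[ 1 ] = 0. A minimal scalar sketch of that trick in portable C follows, assuming lane 0 sits in the low 32 bits of the 64-bit value, as that comment implies. The names vec2_s32 and shift_state_down are hypothetical and do not appear in the patch.

```c
#include <stdint.h>

/* Hypothetical scalar stand-in for an int32x2_t: two 32-bit lanes. */
typedef struct {
    int32_t lane[2];
} vec2_s32;

/* Models vreinterpret_s32_u64(vshr_n_u64(vreinterpret_u64_s32(s), 32)).
 * Assumption: lane 0 is the low half of the 64-bit view. */
static vec2_s32 shift_state_down(vec2_s32 s)
{
    /* Pack both lanes into one 64-bit value, lane 1 in the high half. */
    uint64_t bits = ((uint64_t)(uint32_t)s.lane[1] << 32)
                  | (uint64_t)(uint32_t)s.lane[0];

    /* Logical shift right by 32: old lane 1 lands in lane 0, zeros fill lane 1. */
    bits >>= 32;

    vec2_s32 r;
    r.lane[0] = (int32_t)(uint32_t)(bits & 0xFFFFFFFFu);
    r.lane[1] = (int32_t)(uint32_t)(bits >> 32);
    return r;
}
```

Calling shift_state_down on a state holding { 7, -3 } yields { -3, 0 }. The shift is done on an unsigned view so that a negative lane 1 does not smear sign bits into the upper lane.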

2017 May 15

2

2 patches related to silk_biquad_alt() optimization

...stores doubled > -A_Q28[]: > > static inline void silk_biquad_alt_stride1_kernel(const int32x2_t > A_Q28_s32x2, const int32x4_t t_s32x4, int32x2_t *S_s32x2, int32x2_t > *out32_Q14_s32x2) > { > int32x2_t t_s32x2; > > *out32_Q14_s32x2 = vadd_s32(*S_s32x2, vget_low_s32(t_s32x4)); > /* silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k ] > ) */ > *S_s32x2 = > vreinterpret_s32_u64(vshr_n_u64(vreinterpret_u64_s32(*S_s32x2), > 32)); /* S[ 0 ] = S[ 1 ];...

2017 May 08

0

2 patches related to silk_biquad_alt() optimization

...nd rounding, where A_Q28_s32x{2,4} stores doubled -A_Q28[]: > > static inline void silk_biquad_alt_stride1_kernel(const int32x2_t > A_Q28_s32x2, const int32x4_t t_s32x4, int32x2_t *S_s32x2, int32x2_t > *out32_Q14_s32x2) > { > int32x2_t t_s32x2; > > *out32_Q14_s32x2 = vadd_s32(*S_s32x2, vget_low_s32(t_s32x4)); > /* silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k ] ) > */ > *S_s32x2 = vreinterpret_s32_u64(vshr_n_u64(vreinterpret_u64_s32(*S_s32x2), 32)); /* S[ 0 ] = S[ 1 ]; S[ 1 ] = 0; >...

2017 May 17

0

2 patches related to silk_biquad_alt() optimization

...> > > static inline void silk_biquad_alt_stride1_kernel(const int32x2_t > > A_Q28_s32x2, const int32x4_t t_s32x4, int32x2_t *S_s32x2, int32x2_t > > *out32_Q14_s32x2) > > { > > int32x2_t t_s32x2; > > > > *out32_Q14_s32x2 = vadd_s32(*S_s32x2, vget_low_s32(t_s32x4)); > > /* silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k ] > > ) */ > > *S_s32x2 = > > vreinterpret_s32_u64(vshr_n_u64(vreinterpret_u64_s32(*S_s32x2), > > 32)...

2017 Apr 25

2

2 patches related to silk_biquad_alt() optimization

On Mon, Apr 24, 2017 at 5:52 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote: > On 24/04/17 08:03 PM, Linfeng Zhang wrote: > > Tested on my chromebook, when stride (channel) == 1, the optimization > > has no gain compared with C function. > > You mean that the Neon code is the same speed as the C code for > stride==1? This is not terribly surprising for an IIRC

Several patches of ARM NEON optimization

2016 Jul 14

6

Several patches of ARM NEON optimization

I rebased my previous 3 patches to the current master with minor changes. Patches 1 to 3 replace all my previous submitted patches. Patches 4 and 5 are new. Thanks, Linfeng Zhang

silk_warped_autocorrelation_FIX() NEON optimization

2016 Jul 01

1

silk_warped_autocorrelation_FIX() NEON optimization

Hi all, I'm sending patch "Optimize silk_warped_autocorrelation_FIX() for ARM NEON" in a separate email. It is based on Tim’s aarch64v8 branch https://git.xiph.org/?p=users/tterribe/opus.git;a=shortlog;h=refs/heads/aarch64v8 Thanks for your comments. Linfeng

[RFC V3 5/8] aarch64: celt_pitch_xcorr: Fixed point intrinsics

2015 May 15

0

[RFC V3 5/8] aarch64: celt_pitch_xcorr: Fixed point intrinsics

..._s16(SUMM, vget_high_s16(YY[0]), vget_high_s16(XX[0])); + len -= 8; + } + + /* Work on 4 values */ + if (len >= 4) { + XX_2 = vld1_s16(xi); + xi += 4; + YY_2 = vld1_s16(yi); + yi += 4; + SUMM = vmlal_s16(SUMM, YY_2, XX_2); + len -= 4; + } + + SUMM_2 = vadd_s32(vget_high_s32(SUMM), vget_low_s32(SUMM)); + SUMM_2 = vpadd_s32(SUMM_2, SUMM_2); + SUMM = vcombine_s32(SUMM_2, SUMM_2); + + while (len > 0) { + XX_2 = vld1_dup_s16(xi++); + YY_2 = vld1_dup_s16(yi++); + SUMM = vmlal_s16(SUMM, XX_2, YY_2); + len--; + } + vst1q_lane_s32...
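
The vadd_s32 / vpadd_s32 pair in this snippet folds the four 32-bit accumulator lanes of SUMM into a single total: the high and low halves are added lane by lane, then the two remaining lanes are added pairwise. A minimal scalar sketch of that reduction in portable C, under the assumption that lane additions wrap modulo 2^32, is given below. The helper names add_wrap and horizontal_sum4 are hypothetical and are not part of the patch.

```c
#include <stdint.h>

/* Wrapping 32-bit add, done on unsigned values to avoid
 * signed-overflow undefined behaviour in C. */
static int32_t add_wrap(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* Scalar model of:
 *   SUMM_2 = vadd_s32(vget_high_s32(SUMM), vget_low_s32(SUMM));
 *   SUMM_2 = vpadd_s32(SUMM_2, SUMM_2);
 * Returns the value that ends up in every lane of SUMM_2. */
static int32_t horizontal_sum4(const int32_t summ[4])
{
    /* Step 1: high half (lanes 2, 3) plus low half (lanes 0, 1). */
    int32_t pair[2];
    pair[0] = add_wrap(summ[2], summ[0]);
    pair[1] = add_wrap(summ[3], summ[1]);

    /* Step 2: pairwise add of the two remaining lanes. */
    return add_wrap(pair[0], pair[1]);
}
```

For an accumulator holding { 1, 2, 3, 4 } the function returns 10. In the snippet the reduced value is then copied back into all lanes with vcombine_s32 so that the scalar tail loop can keep accumulating into the same register.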

[[RFC PATCH v2]: Ne10 fft fixed and previous 5/8] aarch64: celt_pitch_xcorr: Fixed point intrinsics

2015 May 08

0

[[RFC PATCH v2]: Ne10 fft fixed and previous 5/8] aarch64: celt_pitch_xcorr: Fixed point intrinsics

..._s16(SUMM, vget_high_s16(YY[0]), vget_high_s16(XX[0])); + len -= 8; + } + + /* Work on 4 values */ + if (len >= 4) { + XX_2 = vld1_s16(xi); + xi += 4; + YY_2 = vld1_s16(yi); + yi += 4; + SUMM = vmlal_s16(SUMM, YY_2, XX_2); + len -= 4; + } + + SUMM_2 = vadd_s32(vget_high_s32(SUMM), vget_low_s32(SUMM)); + SUMM_2 = vpadd_s32(SUMM_2, SUMM_2); + SUMM = vcombine_s32(SUMM_2, SUMM_2); + + while (len > 0) { + XX_2 = vld1_dup_s16(xi++); + YY_2 = vld1_dup_s16(yi++); + SUMM = vmlal_s16(SUMM, XX_2, YY_2); + len--; + } + vst1q_lane_s32...

[RFC PATCH v1 0/5] aarch64: celt_pitch_xcorr: Fixed point series

2015 Mar 31

6

[RFC PATCH v1 0/5] aarch64: celt_pitch_xcorr: Fixed point series

Hi Timothy, As I mentioned earlier [1], I have now fixed the compile issues with fixed point and am resubmitting the patch. I also have a new patch that does intrinsics optimizations for celt_pitch_xcorr targeting aarch64. You can find my latest work-in-progress branch at [2] For reference, you can use the Ne10 pre-built libraries at [3] Note that I am working with Phil at ARM to get my patch at [4]

[RFC PATCH v2]: Ne10 fft fixed and previous 0/8]

2015 May 08

8

[RFC PATCH v2]: Ne10 fft fixed and previous 0/8]

Hi All, As per Timothy's suggestion, disabling mdct_forward for fixed point. Only affects armv7,armv8: Extend fixed fft NE10 optimizations to mdct Rest of patches are same as in [1] For reference, latest wip code for opus is at [2] Still working with the NE10 team at ARM to sort out corner cases of mdct_forward. Will update with another patch when the issue in NE10 gets fixed. Regards, Vish [1]:

[RFC V3 0/8] Ne10 fft fixed and previous

2015 May 15

11

[RFC V3 0/8] Ne10 fft fixed and previous

Hi All, Changes from RFC v2 [1] armv7,armv8: Extend fixed fft NE10 optimizations to mdct - Overflow issue fixed by Phil at ARM. Ne10 wip at [2]. Should be upstream soon. - So, re-enabled using fixed fft for mdct_forward which was disabled in RFCv2 armv7,armv8: Optimize fixed point fft using NE10 library - Thanks to Jonathan Lennox, fixed some build issues on iOS and some copy-paste errors Rest

[RFC PATCH v1 0/8] Ne10 fft fixed and previous

2015 Apr 28

10

[RFC PATCH v1 0/8] Ne10 fft fixed and previous

Hello Timothy / Jean-Marc / opus-dev, This patch series is a follow-up to the work I posted in [1]. In addition to what was posted in [1], this patch series mainly integrates the fixed-point FFT implementations in the NE10 library into opus. You can view my opus wip code at [2]. Note that while I found some issues both with the NE10 library (fixed fft) and with the Linaro toolchain (armv8 intrinsics), the work

search for: vadd_s32