Displaying 13 results from an estimated 13 matches for "vadd_s32".
2017 Apr 26
2
2 patches related to silk_biquad_alt() optimization
..._lane_s32() to do the
multiplication and rounding, where A_Q28_s32x{2,4} stores doubled -A_Q28[]:
static inline void silk_biquad_alt_stride1_kernel(const int32x2_t
A_Q28_s32x2, const int32x4_t t_s32x4, int32x2_t *S_s32x2, int32x2_t
*out32_Q14_s32x2)
{
int32x2_t t_s32x2;
*out32_Q14_s32x2 = vadd_s32(*S_s32x2, vget_low_s32(t_s32x4));
/* silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k ] )
*/
*S_s32x2 = vreinterpret_s32_u64(vshr_n_u64(vreinterpret_u64_s32(*S_s32x2), 32));
/* S[ 0 ] = S[ 1 ]; S[ 1 ] = 0; */
*out...
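The two visible steps of the kernel above (a lane-wise add, then a 64-bit right shift that moves S[ 1 ] into S[ 0 ] and clears S[ 1 ]) can be checked on any host with a scalar model. The sketch below is not the patch's code: `model_s32x2`, `model_vadd_s32` and `model_shift_state` are hypothetical stand-ins for `int32x2_t`, `vadd_s32()` and the `vreinterpret`/`vshr_n_u64` combination, and it assumes lane 0 occupies the low 32 bits of the 64-bit view.

```c
#include <stdint.h>

/* Hypothetical scalar stand-in for a NEON int32x2_t (two 32-bit lanes). */
typedef struct { int32_t lane[2]; } model_s32x2;

/* Models vadd_s32(): lane-wise add with wraparound on overflow. */
static model_s32x2 model_vadd_s32(model_s32x2 a, model_s32x2 b)
{
    model_s32x2 r;
    for (int i = 0; i < 2; i++) {
        r.lane[i] = (int32_t)((uint32_t)a.lane[i] + (uint32_t)b.lane[i]);
    }
    return r;
}

/* Models vreinterpret_s32_u64(vshr_n_u64(vreinterpret_u64_s32(s), 32)).
 * Assumption: lane 0 is the low half of the 64-bit value, so shifting
 * right by 32 bits gives S[0] = S[1] and S[1] = 0. */
static model_s32x2 model_shift_state(model_s32x2 s)
{
    uint64_t bits = ((uint64_t)(uint32_t)s.lane[1] << 32) | (uint32_t)s.lane[0];
    model_s32x2 r;

    bits >>= 32; /* unsigned shift: the vacated upper half is zero-filled */
    r.lane[0] = (int32_t)(uint32_t)(bits & 0xFFFFFFFFu);
    r.lane[1] = (int32_t)(uint32_t)(bits >> 32);
    return r;
}
```

Because the shift is done on the unsigned 64-bit view, the upper lane is zero-filled even when S[ 1 ] is negative, which is presumably why the snippet reinterprets to `u64` rather than shifting a signed type.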
2017 May 15
2
2 patches related to silk_biquad_alt() optimization
...stores doubled
> -A_Q28[]:
>
> static inline void silk_biquad_alt_stride1_kernel(const int32x2_t
> A_Q28_s32x2, const int32x4_t t_s32x4, int32x2_t *S_s32x2, int32x2_t
> *out32_Q14_s32x2)
> {
> int32x2_t t_s32x2;
>
> *out32_Q14_s32x2 = vadd_s32(*S_s32x2, vget_low_s32(t_s32x4));
> /* silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k ]
> ) */
> *S_s32x2 =
> vreinterpret_s32_u64(vshr_n_u64(vreinterpret_u64_s32(*S_s32x2),
> 32)); /* S[ 0 ] = S[ 1 ];...
2017 May 08
0
2 patches related to silk_biquad_alt() optimization
...nd rounding, where A_Q28_s32x{2,4} stores doubled -A_Q28[]:
>
> static inline void silk_biquad_alt_stride1_kernel(const int32x2_t
> A_Q28_s32x2, const int32x4_t t_s32x4, int32x2_t *S_s32x2, int32x2_t
> *out32_Q14_s32x2)
> {
> int32x2_t t_s32x2;
>
> *out32_Q14_s32x2 = vadd_s32(*S_s32x2, vget_low_s32(t_s32x4));
> /* silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k ] )
> */
> *S_s32x2 = vreinterpret_s32_u64(vshr_n_u64(vreinterpret_u64_s32(*S_s32x2), 32));
> /* S[ 0 ] = S[ 1 ]; S[ 1 ] = 0;
>...
2017 May 17
0
2 patches related to silk_biquad_alt() optimization
...>
> > static inline void silk_biquad_alt_stride1_kernel(const int32x2_t
> > A_Q28_s32x2, const int32x4_t t_s32x4, int32x2_t *S_s32x2, int32x2_t
> > *out32_Q14_s32x2)
> > {
> > int32x2_t t_s32x2;
> >
> > *out32_Q14_s32x2 = vadd_s32(*S_s32x2, vget_low_s32(t_s32x4));
> > /* silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k ]
> > ) */
> > *S_s32x2 =
> > vreinterpret_s32_u64(vshr_n_u64(vreinterpret_u64_s32(*S_s32x2),
> > 32)...
2017 Apr 25
2
2 patches related to silk_biquad_alt() optimization
On Mon, Apr 24, 2017 at 5:52 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> On 24/04/17 08:03 PM, Linfeng Zhang wrote:
> > Tested on my chromebook, when stride (channel) == 1, the optimization
> > has no gain compared with C function.
>
> You mean that the Neon code is the same speed as the C code for
> stride==1? This is not terribly surprising for an IIRC
2016 Jul 14
6
Several patches of ARM NEON optimization
I rebased my previous 3 patches onto the current master with minor changes.
Patches 1 to 3 replace all my previously submitted patches.
Patches 4 and 5 are new.
Thanks,
Linfeng Zhang
2016 Jul 01
1
silk_warped_autocorrelation_FIX() NEON optimization
Hi all,
I'm sending the patch "Optimize silk_warped_autocorrelation_FIX() for ARM NEON" in a separate email.
It is based on Tim’s aarch64v8 branch https://git.xiph.org/?p=users/tterribe/opus.git;a=shortlog;h=refs/heads/aarch64v8
Thanks for your comments.
Linfeng
2015 May 15
0
[RFC V3 5/8] aarch64: celt_pitch_xcorr: Fixed point intrinsics
..._s16(SUMM, vget_high_s16(YY[0]), vget_high_s16(XX[0]));
+ len -= 8;
+ }
+
+ /* Work on 4 values */
+ if (len >= 4) {
+ XX_2 = vld1_s16(xi);
+ xi += 4;
+ YY_2 = vld1_s16(yi);
+ yi += 4;
+ SUMM = vmlal_s16(SUMM, YY_2, XX_2);
+ len -= 4;
+ }
+
+ SUMM_2 = vadd_s32(vget_high_s32(SUMM), vget_low_s32(SUMM));
+ SUMM_2 = vpadd_s32(SUMM_2, SUMM_2);
+ SUMM = vcombine_s32(SUMM_2, SUMM_2);
+
+ while (len > 0) {
+ XX_2 = vld1_dup_s16(xi++);
+ YY_2 = vld1_dup_s16(yi++);
+ SUMM = vmlal_s16(SUMM, XX_2, YY_2);
+ len--;
+ }
+ vst1q_lane_s32...
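The reduction step in the snippet above (`vadd_s32` of the high and low halves, followed by `vpadd_s32`) collapses the four 32-bit partial sums in SUMM into one total. A scalar model of that sequence, offered only as a sketch: the `hsum_*` types and functions are hypothetical names that mimic the lane behaviour of the intrinsics so it can be checked on any host.

```c
#include <stdint.h>

/* Hypothetical scalar stand-ins for int32x4_t and int32x2_t. */
typedef struct { int32_t lane[4]; } hsum_s32x4;
typedef struct { int32_t lane[2]; } hsum_s32x2;

/* 32-bit add with wraparound, matching non-saturating NEON adds. */
static int32_t hsum_wrap_add(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* Models vadd_s32(vget_high_s32(v), vget_low_s32(v)):
 * lane 0 = v[2] + v[0], lane 1 = v[3] + v[1]. */
static hsum_s32x2 hsum_fold_halves(hsum_s32x4 v)
{
    hsum_s32x2 r;
    r.lane[0] = hsum_wrap_add(v.lane[2], v.lane[0]);
    r.lane[1] = hsum_wrap_add(v.lane[3], v.lane[1]);
    return r;
}

/* Models vpadd_s32(a, b): adds adjacent pairs,
 * lane 0 = a[0] + a[1], lane 1 = b[0] + b[1]. */
static hsum_s32x2 hsum_vpadd_s32(hsum_s32x2 a, hsum_s32x2 b)
{
    hsum_s32x2 r;
    r.lane[0] = hsum_wrap_add(a.lane[0], a.lane[1]);
    r.lane[1] = hsum_wrap_add(b.lane[0], b.lane[1]);
    return r;
}

/* Full reduction: after the pairwise add, both lanes hold the total. */
static int32_t hsum_horizontal_sum(hsum_s32x4 v)
{
    hsum_s32x2 folded = hsum_fold_halves(v);
    hsum_s32x2 total = hsum_vpadd_s32(folded, folded);
    return total.lane[0];
}
```

Passing the same vector as both arguments of the pairwise add leaves the total in both lanes, so the following `vcombine_s32(SUMM_2, SUMM_2)` in the snippet yields a four-lane vector with the sum in every lane before the scalar tail loop continues accumulating.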
2015 May 08
0
[[RFC PATCH v2]: Ne10 fft fixed and previous 5/8] aarch64: celt_pitch_xcorr: Fixed point intrinsics
..._s16(SUMM, vget_high_s16(YY[0]), vget_high_s16(XX[0]));
+ len -= 8;
+ }
+
+ /* Work on 4 values */
+ if (len >= 4) {
+ XX_2 = vld1_s16(xi);
+ xi += 4;
+ YY_2 = vld1_s16(yi);
+ yi += 4;
+ SUMM = vmlal_s16(SUMM, YY_2, XX_2);
+ len -= 4;
+ }
+
+ SUMM_2 = vadd_s32(vget_high_s32(SUMM), vget_low_s32(SUMM));
+ SUMM_2 = vpadd_s32(SUMM_2, SUMM_2);
+ SUMM = vcombine_s32(SUMM_2, SUMM_2);
+
+ while (len > 0) {
+ XX_2 = vld1_dup_s16(xi++);
+ YY_2 = vld1_dup_s16(yi++);
+ SUMM = vmlal_s16(SUMM, XX_2, YY_2);
+ len--;
+ }
+ vst1q_lane_s32...
2015 Mar 31
6
[RFC PATCH v1 0/5] aarch64: celt_pitch_xcorr: Fixed point series
Hi Timothy,
As I mentioned earlier [1], I have now fixed the compile issues
with fixed point and am resubmitting the patch.
I also have a new patch that does intrinsics optimizations
for celt_pitch_xcorr targeting aarch64.
You can find my latest work-in-progress branch at [2]
For reference, you can use the Ne10 pre-built libraries
at [3]
Note that I am working with Phil at ARM to get my patch at [4]
2015 May 08
8
[RFC PATCH v2]: Ne10 fft fixed and previous 0/8]
Hi All,
As per Timothy's suggestion, I am disabling mdct_forward
for fixed point. This only affects
armv7,armv8: Extend fixed fft NE10 optimizations to mdct
Rest of patches are same as in [1]
For reference, latest wip code for opus is at [2]
I am still working with the NE10 team at ARM on corner cases of
mdct_forward. I will follow up with another patch
when the issue in NE10 gets fixed.
Regards,
Vish
[1]:
2015 May 15
11
[RFC V3 0/8] Ne10 fft fixed and previous
Hi All,
Changes from RFC v2 [1]
armv7,armv8: Extend fixed fft NE10 optimizations to mdct
- Overflow issue fixed by Phil at ARM. Ne10 wip at [2]. Should be upstream soon.
- So, re-enabled using fixed fft for mdct_forward which was disabled in RFCv2
armv7,armv8: Optimize fixed point fft using NE10 library
- Thanks to Jonathan Lennox, fixed some build issues on iOS and some copy-paste errors
Rest
2015 Apr 28
10
[RFC PATCH v1 0/8] Ne10 fft fixed and previous
Hello Timothy / Jean-Marc / opus-dev,
This patch series is a follow-up to the work I posted at [1].
In addition to what was posted at [1], this patch series mainly
integrates the fixed-point FFT implementations from the NE10 library into opus.
You can view my opus wip code at [2].
Note that while I found some issues both with the NE10 library (fixed fft)
and with the Linaro toolchain (armv8 intrinsics), the work