Linfeng Zhang
2017-Apr-25 17:37 UTC
[opus] 2 patches related to silk_biquad_alt() optimization
On Mon, Apr 24, 2017 at 5:52 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:

> On 24/04/17 08:03 PM, Linfeng Zhang wrote:
> > Tested on my chromebook, when stride (channel) == 1, the optimization
> > has no gain compared with the C function.
>
> You mean that the Neon code is the same speed as the C code for
> stride==1? This is not terribly surprising for an IIR filter.

Yes.

> > When stride (channel) == 2, the optimization is 1.2%-1.8% faster (1.6%
> > at Complexity 8) compared with the C function.
>
> Is that gain due to Neon or simply due to computing two channels in
> parallel? For example, if you make a special case in the C code to
> handle both channels in the same loop, what kind of performance do
> you get?

Tested at Complexity 8, it's half and half, i.e., 0.8% faster if handling
both channels in the same loop in C, and then an additional 0.8% faster
using NEON.

> > Please let me know and I can remove the optimization of the stride 1 case.
>
> Yeah, if there's Neon code that provides no improvement over C, let's
> stick with C. And if you manage to write C code that has the same
> performance as the Neon code, then that would also be better (both
> easier to maintain and more portable).

Will do.

> > If it's allowed to skip the split of A_Q28 and replace it by a 32-bit
> > multiplication (with a 64-bit result), it could probably be faster on
> > NEON. This may change the encoder results because of the different
> > order of adding, shifting and rounding.
>
> I'm not sure what you mean by that.

    /* Negate A_Q28 values and split in two parts */
    A0_L_Q28 = ( -A_Q28[ 0 ] ) & 0x00003FFF;        /* lower part */
    A0_U_Q28 = silk_RSHIFT( -A_Q28[ 0 ], 14 );      /* upper part */
    A1_L_Q28 = ( -A_Q28[ 1 ] ) & 0x00003FFF;        /* lower part */
    A1_U_Q28 = silk_RSHIFT( -A_Q28[ 1 ], 14 );      /* upper part */
    ...
    S[ 0 ] = S[ 1 ] + silk_RSHIFT_ROUND( silk_SMULWB( out32_Q14, A0_L_Q28 ), 14 );
    S[ 0 ] = silk_SMLAWB( S[ 0 ], out32_Q14, A0_U_Q28 );
    S[ 0 ] = silk_SMLAWB( S[ 0 ], B_Q28[ 1 ], inval );
    S[ 1 ] = silk_RSHIFT_ROUND( silk_SMULWB( out32_Q14, A1_L_Q28 ), 14 );
    S[ 1 ] = silk_SMLAWB( S[ 1 ], out32_Q14, A1_U_Q28 );
    S[ 1 ] = silk_SMLAWB( S[ 1 ], B_Q28[ 2 ], inval );

A_Q28 is split into two 14-bit (or 16-bit, whatever) integers to keep each
multiplication within 32 bits. NEON can do 32-bit x 32-bit = 64-bit using
'int64x2_t vmull_s32(int32x2_t a, int32x2_t b)', so it could possibly be
faster and have fewer rounding/shifting errors than the C code above. But it
may increase difficulties for other CPUs that don't have a 32-bit x 32-bit =
64-bit multiplication.

Thanks,
Linfeng
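A minimal sketch of the single 32-bit x 32-bit = 64-bit multiply described
above, assuming the negated A_Q28 coefficients are already in a vector; the
function name and vector layout are illustrative, not from the patch:

    #include <arm_neon.h>

    /* Sketch: replace the split lower/upper-part multiplies for each A_Q28
     * coefficient with one widening multiply per lane.  out32_Q14_s32x2 holds
     * out32_Q14 in both lanes, A_Q28_neg_s32x2 holds { -A_Q28[ 0 ], -A_Q28[ 1 ] }
     * (names are illustrative). */
    static inline int32x2_t silk_mul_neg_A_Q28_round30(int32x2_t out32_Q14_s32x2, int32x2_t A_Q28_neg_s32x2)
    {
        /* 32-bit x 32-bit = 64-bit products: out32_Q14 * ( -A_Q28[ {0,1} ] ) */
        int64x2_t prod_s64x2 = vmull_s32(out32_Q14_s32x2, A_Q28_neg_s32x2);
        /* Rounded shift right by 30 and narrow back to 32 bits, like
         * silk_RSHIFT_ROUND( (opus_int64)out32_Q14 * (-A_Q28[ {0,1} ]), 30 ). */
        return vrshrn_n_s64(prod_s64x2, 30);
    }

This rounds the full 64-bit product once, whereas the split C version above
truncates its partial products, so the results would not be bit-exact with
the current C code.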
Jean-Marc Valin
2017-Apr-26 05:31 UTC
[opus] 2 patches related to silk_biquad_alt() optimization
On 25/04/17 01:37 PM, Linfeng Zhang wrote:
>> Is that gain due to Neon or simply due to computing two channels in
>> parallel? For example, if you make a special case in the C code to
>> handle both channels in the same loop, what kind of performance do
>> you get?
>
> Tested at Complexity 8, it's half and half, i.e., 0.8% faster if handling
> both channels in the same loop in C, and then an additional 0.8% faster
> using NEON.

Considering that the function isn't huge, I'm OK in principle with adding
some Neon to gain 0.8%. It would just be good to check that the 0.8% indeed
comes from Neon as opposed to just unrolling the channels.

> A_Q28 is split into two 14-bit (or 16-bit, whatever) integers to keep each
> multiplication within 32 bits. NEON can do 32-bit x 32-bit = 64-bit using
> 'int64x2_t vmull_s32(int32x2_t a, int32x2_t b)', so it could possibly be
> faster and have fewer rounding/shifting errors than the C code above. But
> it may increase difficulties for other CPUs that don't have a 32-bit x
> 32-bit = 64-bit multiplication.

OK, so I'm not totally opposed to that, but it increases the
testing/maintenance cost, so it needs to be worth it. So the question is how
much speedup you can get and how close you can make the result to the
original function. If you can make the output always be within one or two
LSBs of the C version, then the asm check can simply be a little bit more
lax than usual. Otherwise it becomes more complicated. This isn't a function
that scares me too much about going non-bitexact, but it's also not one of
the big complexity costs either. In any case, let me know what you find.

Cheers,

Jean-Marc
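As a rough illustration of the "more lax" check being described, a sketch
only; the function, names and tolerance are placeholders, not the actual
opus test code:

    #include <stdlib.h>
    #include "opus_types.h"

    /* Sketch: accept the vectorized output if every sample is within 'tol'
     * LSBs of the C reference (out_c, out_neon, len and tol are placeholders). */
    static int silk_biquad_alt_outputs_close(const opus_int16 *out_c, const opus_int16 *out_neon,
                                             opus_int len, opus_int tol)
    {
        opus_int k;
        for( k = 0; k < len; k++ ) {
            if( abs( out_c[ k ] - out_neon[ k ] ) > tol ) {
                return 0;   /* differs by more than tol LSBs */
            }
        }
        return 1;
    }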
Linfeng Zhang
2017-Apr-26 21:15 UTC
[opus] 2 patches related to silk_biquad_alt() optimization
On Tue, Apr 25, 2017 at 10:31 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:

> > A_Q28 is split into two 14-bit (or 16-bit, whatever) integers to keep each
> > multiplication within 32 bits. NEON can do 32-bit x 32-bit = 64-bit using
> > 'int64x2_t vmull_s32(int32x2_t a, int32x2_t b)', so it could possibly be
> > faster and have fewer rounding/shifting errors than the C code above. But
> > it may increase difficulties for other CPUs that don't have a 32-bit x
> > 32-bit = 64-bit multiplication.
>
> OK, so I'm not totally opposed to that, but it increases the
> testing/maintenance cost, so it needs to be worth it. So the question is
> how much speedup you can get and how close you can make the result to the
> original function. If you can make the output always be within one or two
> LSBs of the C version, then the asm check can simply be a little bit more
> lax than usual. Otherwise it becomes more complicated. This isn't a
> function that scares me too much about going non-bitexact, but it's also
> not one of the big complexity costs either. In any case, let me know what
> you find.

I tested the proposed NEON optimization change. It increases the whole
encoder speed by about 1% at Complexity 8 and 10 for stride = 2 (compared
with the NEON optimization in the previous patch), and by 0.7% at Complexity
8 and 1.1% at Complexity 10 for stride = 1 (compared with the original C
code).

Unfortunately, the truncation difference accumulates in S[] and cannot be
easily bounded. The difference in out[] stayed within 5 in my quick testing,
but that is not guaranteed for other inputs. So maybe comparing bit
exactness against the following silk_biquad_alt_c_MulSingleAQ28() is better.

Please let me know the decision (whether to keep the original NEON (stride 2
only) or choose the new NEON (both stride 1 and 2), which optimizes the
following silk_biquad_alt_c_MulSingleAQ28()), and I'll wrap up the patch.

Attached is the corresponding C code silk_biquad_alt_c_MulSingleAQ28() for
your information. It uses 64-bit multiplications and is about 0.6% - 0.9%
slower (for the whole encoder) than the original C code using 32-bit
multiplications (for both strides, with the same stride 2 unrolling).
void silk_biquad_alt_c_MulSingleAQ28(
    const opus_int16        *in,                /* I     input signal                                               */
    const opus_int32        *B_Q28,             /* I     MA coefficients [3]                                        */
    const opus_int32        *A_Q28,             /* I     AR coefficients [2]                                        */
    opus_int32              *S,                 /* I/O   State vector [2*stride]                                    */
    opus_int16              *out,               /* O     output signal                                              */
    const opus_int32        len,                /* I     signal length (must be even)                               */
    opus_int                stride              /* I     Operate on interleaved signal if > 1                       */
)
{
    /* DIRECT FORM II TRANSPOSED (uses 2 element state vector) */
    opus_int k;

    silk_assert( ( stride == 1 ) || ( stride == 2 ) );

    if( stride == 1 ) {
        opus_int32 out32_Q14;
        for( k = 0; k < len; k++ ) {
            /* S[ 0 ], S[ 1 ]: Q12 */
            out32_Q14 = silk_LSHIFT( silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k ] ), 2 );

            S[ 0 ] = S[ 1 ] + silk_RSHIFT_ROUND( (opus_int64)out32_Q14 * (-A_Q28[ 0 ]), 30 );
            S[ 0 ] = silk_SMLAWB( S[ 0 ], B_Q28[ 1 ], in[ k ] );

            S[ 1 ] = silk_RSHIFT_ROUND( (opus_int64)out32_Q14 * (-A_Q28[ 1 ]), 30 );
            S[ 1 ] = silk_SMLAWB( S[ 1 ], B_Q28[ 2 ], in[ k ] );

            /* Scale back to Q0 and saturate */
            out[ k ] = (opus_int16)silk_SAT16( silk_RSHIFT( out32_Q14 + (1<<14) - 1, 14 ) );
        }
    } else {
        opus_int32 out32_Q14[ 2 ];
        for( k = 0; k < len; k++ ) {
            /* S[ 0 ], S[ 1 ]: Q12 */
            out32_Q14[ 0 ] = silk_LSHIFT( silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k * 2 + 0 ] ), 2 );
            out32_Q14[ 1 ] = silk_LSHIFT( silk_SMLAWB( S[ 2 ], B_Q28[ 0 ], in[ k * 2 + 1 ] ), 2 );

            S[ 0 ] = S[ 1 ] + silk_RSHIFT_ROUND( (opus_int64)out32_Q14[ 0 ] * (-A_Q28[ 0 ]), 30 );
            S[ 2 ] = S[ 3 ] + silk_RSHIFT_ROUND( (opus_int64)out32_Q14[ 1 ] * (-A_Q28[ 0 ]), 30 );
            S[ 0 ] = silk_SMLAWB( S[ 0 ], B_Q28[ 1 ], in[ k * 2 + 0 ] );
            S[ 2 ] = silk_SMLAWB( S[ 2 ], B_Q28[ 1 ], in[ k * 2 + 1 ] );

            S[ 1 ] = silk_RSHIFT_ROUND( (opus_int64)out32_Q14[ 0 ] * (-A_Q28[ 1 ]), 30 );
            S[ 3 ] = silk_RSHIFT_ROUND( (opus_int64)out32_Q14[ 1 ] * (-A_Q28[ 1 ]), 30 );
            S[ 1 ] = silk_SMLAWB( S[ 1 ], B_Q28[ 2 ], in[ k * 2 + 0 ] );
            S[ 3 ] = silk_SMLAWB( S[ 3 ], B_Q28[ 2 ], in[ k * 2 + 1 ] );

            /* Scale back to Q0 and saturate */
            out[ k * 2 + 0 ] = (opus_int16)silk_SAT16( silk_RSHIFT( out32_Q14[ 0 ] + (1<<14) - 1, 14 ) );
            out[ k * 2 + 1 ] = (opus_int16)silk_SAT16( silk_RSHIFT( out32_Q14[ 1 ] + (1<<14) - 1, 14 ) );
        }
    }
}

Here are the NEON kernels, which use vqrdmulh_lane_s32() / vqrdmulhq_s32() to do the multiplication and
rounding, where A_Q28_s32x{2,4} stores the doubled -A_Q28[]:

static inline void silk_biquad_alt_stride1_kernel(const int32x2_t A_Q28_s32x2, const int32x4_t t_s32x4,
                                                  int32x2_t *S_s32x2, int32x2_t *out32_Q14_s32x2)
{
    int32x2_t t_s32x2;

    *out32_Q14_s32x2 = vadd_s32(*S_s32x2, vget_low_s32(t_s32x4));                            /* silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k ] )                                 */
    *S_s32x2         = vreinterpret_s32_u64(vshr_n_u64(vreinterpret_u64_s32(*S_s32x2), 32)); /* S[ 0 ] = S[ 1 ]; S[ 1 ] = 0;                                               */
    *out32_Q14_s32x2 = vshl_n_s32(*out32_Q14_s32x2, 2);                                      /* out32_Q14 = silk_LSHIFT( silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k ] ), 2 );  */
    t_s32x2          = vqrdmulh_lane_s32(A_Q28_s32x2, *out32_Q14_s32x2, 0);                  /* silk_RSHIFT_ROUND( (opus_int64)out32_Q14 * (-A_Q28[ {0,1} ]), 30 )         */
    *S_s32x2         = vadd_s32(*S_s32x2, t_s32x2);                                          /* S[ {0,1} ] = {S[ 1 ],0} + silk_RSHIFT_ROUND( );                            */
    *S_s32x2         = vadd_s32(*S_s32x2, vget_high_s32(t_s32x4));                           /* S[ {0,1} ] = silk_SMLAWB( S[ {0,1} ], B_Q28[ {1,2} ], in[ k ] );           */
}

static inline void silk_biquad_alt_stride2_kernel(const int32x4_t A_Q28_s32x4, const int32x4_t B_Q28_s32x4,
                                                  const int32x2_t t_s32x2, const int32x4_t inval_s32x4,
                                                  int32x4_t *S_s32x4, int32x2_t *out32_Q14_s32x2)
{
    int32x4_t t_s32x4, out32_Q14_s32x4;

    *out32_Q14_s32x2 = vadd_s32(vget_low_s32(*S_s32x4), t_s32x2);               /* silk_SMLAWB( S{0,1}, B_Q28[ 0 ], in[ k * 2 + {0,1} ] )                                       */
    *S_s32x4         = vcombine_s32(vget_high_s32(*S_s32x4), vdup_n_s32(0));    /* S{0,1} = S{2,3}; S{2,3} = 0;                                                                 */
    *out32_Q14_s32x2 = vshl_n_s32(*out32_Q14_s32x2, 2);                         /* out32_Q14_{0,1} = silk_LSHIFT( silk_SMLAWB( S{0,1}, B_Q28[ 0 ], in[ k * 2 + {0,1} ] ), 2 );  */
    out32_Q14_s32x4  = vcombine_s32(*out32_Q14_s32x2, *out32_Q14_s32x2);        /* out32_Q14_{0,1,0,1}                                                                          */
    t_s32x4          = vqrdmulhq_s32(out32_Q14_s32x4, A_Q28_s32x4);             /* silk_RSHIFT_ROUND( (opus_int64)out32_Q14[ {0,1,0,1} ] * (-A_Q28[ {0,0,1,1} ]), 30 )          */
    *S_s32x4         = vaddq_s32(*S_s32x4, t_s32x4);                            /* S[ {0,1,2,3} ] = {S[ {2,3} ],0,0} + silk_RSHIFT_ROUND( );                                    */
    t_s32x4          = vqdmulhq_s32(inval_s32x4, B_Q28_s32x4);                  /* silk_SMULWB( B_Q28[ {1,1,2,2} ], in[ k * 2 + {0,1,0,1} ] )                                   */
    *S_s32x4         = vaddq_s32(*S_s32x4, t_s32x4);                            /* S[ {0,1,2,3} ] = silk_SMLAWB( S[ {0,1,2,3} ], B_Q28[ {1,1,2,2} ], in[ k * 2 + {0,1,0,1} ] ); */
}

Thanks,
Linfeng
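For reference, a scalar sketch (an illustration, not part of the patch) of
why the kernels store the doubled -A_Q28[]: with a pre-doubled coefficient,
the rounding-doubling high multiply used by vqrdmulh reproduces the
silk_RSHIFT_ROUND( (opus_int64)out32_Q14 * (-A_Q28[ i ]), 30 ) term from the
C code, apart from the saturation it applies at the extremes.

    #include "opus_types.h"

    /* Scalar model of one VQRDMULH lane (names x and c2 are illustrative).
     * With c2 = 2 * ( -A_Q28[ i ] ):
     *   (2 * x * c2 + (1 << 31)) >> 32  ==  (x * ( -A_Q28[ i ] ) + (1 << 29)) >> 30
     * which equals silk_RSHIFT_ROUND( (opus_int64)x * ( -A_Q28[ i ] ), 30 ).
     * Assumes the 64-bit intermediate does not overflow; the real VQRDMULH saturates. */
    static opus_int32 rdmulh_model( opus_int32 x, opus_int32 c2 )
    {
        return (opus_int32)( ( (opus_int64)x * c2 * 2 + ( (opus_int64)1 << 31 ) ) >> 32 );
    }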