Jean-Marc Valin
2017-May-15 16:36 UTC
[opus] 2 patches related to silk_biquad_alt() optimization
Hi Linfeng,

Sorry for the delay -- I was actually trying to think of the best option
here. For now, my preference would be to keep things bit-exact, but should
there be more similar optimizations relying on 64-bit multiplication
results, then we could consider having a special option to enable those
(even in C).

Cheers,

Jean-Marc

On 08/05/17 12:12 PM, Linfeng Zhang wrote:
> Ping for comments.
>
> Thanks,
> Linfeng
>
> On Wed, Apr 26, 2017 at 2:15 PM, Linfeng Zhang <linfengz at google.com> wrote:
> > On Tue, Apr 25, 2017 at 10:31 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> > > > A_Q28 is split into 2 14-bit (or 16-bit, whatever) integers, to keep
> > > > the multiplication operation within 32 bits. NEON can do
> > > > 32-bit x 32-bit = 64-bit using 'int64x2_t vmull_s32(int32x2_t a,
> > > > int32x2_t b)', and it could possibly be faster with fewer
> > > > rounding/shifting errors than the above C code. But it may increase
> > > > difficulties for other CPUs not supporting 32-bit multiplication.
> > >
> > > OK, so I'm not totally opposed to that, but it increases the
> > > testing/maintenance cost so it needs to be worth it. So the question
> > > is how much speedup you can get and how close you can make the result
> > > to the original function. If you can make the output always be within
> > > one or two LSBs of the C version, then the asm check can simply be a
> > > little bit more lax than usual. Otherwise it becomes more complicated.
> > > This isn't a function that scares me too much about going
> > > non-bitexact, but it's also not one of the big complexity costs
> > > either. In any case, let me know what you find.
> >
> > I tested the proposed NEON optimization change: it increases the whole
> > encoder speed by about 1% at Complexity 8 and 10 for stride 2 (compared
> > to the NEON optimization in the previous patch), and by 0.7% at
> > Complexity 8 and 1.1% at Complexity 10 for stride 1 (compared to the
> > original C code).
> >
> > Unfortunately, the truncation error accumulates in S[] and its
> > difference cannot be easily bounded. The difference in out[] appears to
> > be bounded by 5 in my quick testing, but that is not guaranteed for
> > other inputs. So maybe comparing for bit exactness against the
> > following silk_biquad_alt_c_MulSingleAQ28() is better.
> >
> > Please let me know the decision (whether to keep the original NEON
> > (stride 2 only) or to choose the new NEON (both stride 1 and 2), which
> > optimizes the following silk_biquad_alt_c_MulSingleAQ28()), and I'll
> > wrap up the patch.
> >
> > Attached is the corresponding C code of
> > silk_biquad_alt_c_MulSingleAQ28() for your information. It uses 64-bit
> > multiplications and is about 0.6% - 0.9% slower (for the whole encoder)
> > than the original C code using 32-bit multiplications (for both strides
> > and the same stride 2 unrolling).
> > void silk_biquad_alt_c_MulSingleAQ28(
> >     const opus_int16 *in,      /* I    Input signal                         */
> >     const opus_int32 *B_Q28,   /* I    MA coefficients [3]                  */
> >     const opus_int32 *A_Q28,   /* I    AR coefficients [2]                  */
> >     opus_int32       *S,       /* I/O  State vector [2*stride]              */
> >     opus_int16       *out,     /* O    Output signal                        */
> >     const opus_int32 len,      /* I    Signal length (must be even)         */
> >     opus_int         stride    /* I    Operate on interleaved signal if > 1 */
> > )
> > {
> >     /* DIRECT FORM II TRANSPOSED (uses 2 element state vector) */
> >     opus_int k;
> >
> >     silk_assert( ( stride == 1 ) || ( stride == 2 ) );
> >
> >     if( stride == 1 ) {
> >         opus_int32 out32_Q14;
> >         for( k = 0; k < len; k++ ) {
> >             /* S[ 0 ], S[ 1 ]: Q12 */
> >             out32_Q14 = silk_LSHIFT( silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k ] ), 2 );
> >
> >             S[ 0 ] = S[ 1 ] + silk_RSHIFT_ROUND( (opus_int64)out32_Q14 * (-A_Q28[ 0 ]), 30 );
> >             S[ 0 ] = silk_SMLAWB( S[ 0 ], B_Q28[ 1 ], in[ k ] );
> >
> >             S[ 1 ] = silk_RSHIFT_ROUND( (opus_int64)out32_Q14 * (-A_Q28[ 1 ]), 30 );
> >             S[ 1 ] = silk_SMLAWB( S[ 1 ], B_Q28[ 2 ], in[ k ] );
> >
> >             /* Scale back to Q0 and saturate */
> >             out[ k ] = (opus_int16)silk_SAT16( silk_RSHIFT( out32_Q14 + (1<<14) - 1, 14 ) );
> >         }
> >     } else {
> >         opus_int32 out32_Q14[ 2 ];
> >         for( k = 0; k < len; k++ ) {
> >             /* S[ 0 ], S[ 1 ]: Q12 */
> >             out32_Q14[ 0 ] = silk_LSHIFT( silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k * 2 + 0 ] ), 2 );
> >             out32_Q14[ 1 ] = silk_LSHIFT( silk_SMLAWB( S[ 2 ], B_Q28[ 0 ], in[ k * 2 + 1 ] ), 2 );
> >
> >             S[ 0 ] = S[ 1 ] + silk_RSHIFT_ROUND( (opus_int64)out32_Q14[ 0 ] * (-A_Q28[ 0 ]), 30 );
> >             S[ 2 ] = S[ 3 ] + silk_RSHIFT_ROUND( (opus_int64)out32_Q14[ 1 ] * (-A_Q28[ 0 ]), 30 );
> >             S[ 0 ] = silk_SMLAWB( S[ 0 ], B_Q28[ 1 ], in[ k * 2 + 0 ] );
> >             S[ 2 ] = silk_SMLAWB( S[ 2 ], B_Q28[ 1 ], in[ k * 2 + 1 ] );
> >
> >             S[ 1 ] = silk_RSHIFT_ROUND( (opus_int64)out32_Q14[ 0 ] * (-A_Q28[ 1 ]), 30 );
> >             S[ 3 ] = silk_RSHIFT_ROUND( (opus_int64)out32_Q14[ 1 ] * (-A_Q28[ 1 ]), 30 );
> >             S[ 1 ] = silk_SMLAWB( S[ 1 ], B_Q28[ 2 ], in[ k * 2 + 0 ] );
> >             S[ 3 ] = silk_SMLAWB( S[ 3 ], B_Q28[ 2 ], in[ k * 2 + 1 ] );
> >
> >             /* Scale back to Q0 and saturate */
> >             out[ k * 2 + 0 ] = (opus_int16)silk_SAT16( silk_RSHIFT( out32_Q14[ 0 ] + (1<<14) - 1, 14 ) );
> >             out[ k * 2 + 1 ] = (opus_int16)silk_SAT16( silk_RSHIFT( out32_Q14[ 1 ] + (1<<14) - 1, 14 ) );
> >         }
> >     }
> > }
> >
> > Here are the NEON kernels, which use vqrdmulh_lane_s32() and
> > vqrdmulhq_s32() to do the multiplication and rounding, where
> > A_Q28_s32x{2,4} stores the doubled -A_Q28[]:
> >
> > static inline void silk_biquad_alt_stride1_kernel(const int32x2_t A_Q28_s32x2,
> >     const int32x4_t t_s32x4, int32x2_t *S_s32x2, int32x2_t *out32_Q14_s32x2)
> > {
> >     int32x2_t t_s32x2;
> >
> >     /* silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k ] ) */
> >     *out32_Q14_s32x2 = vadd_s32(*S_s32x2, vget_low_s32(t_s32x4));
> >     /* S[ 0 ] = S[ 1 ]; S[ 1 ] = 0; */
> >     *S_s32x2 = vreinterpret_s32_u64(vshr_n_u64(vreinterpret_u64_s32(*S_s32x2), 32));
> >     /* out32_Q14 = silk_LSHIFT( silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k ] ), 2 ); */
> >     *out32_Q14_s32x2 = vshl_n_s32(*out32_Q14_s32x2, 2);
> >     /* silk_RSHIFT_ROUND( (opus_int64)out32_Q14 * (-A_Q28[ {0,1} ]), 30 ) */
> >     t_s32x2 = vqrdmulh_lane_s32(A_Q28_s32x2, *out32_Q14_s32x2, 0);
> >     /* S[ {0,1} ] = {S[ 1 ],0} + silk_RSHIFT_ROUND( ); */
> >     *S_s32x2 = vadd_s32(*S_s32x2, t_s32x2);
> >     /* S[ {0,1} ] = silk_SMLAWB( S[ {0,1} ], B_Q28[ {1,2} ], in[ k ] ); */
> >     *S_s32x2 = vadd_s32(*S_s32x2, vget_high_s32(t_s32x4));
> > }
> >
> > static inline void silk_biquad_alt_stride2_kernel(const int32x4_t A_Q28_s32x4,
> >     const int32x4_t B_Q28_s32x4, const int32x2_t t_s32x2,
> >     const int32x4_t inval_s32x4, int32x4_t *S_s32x4, int32x2_t *out32_Q14_s32x2)
> > {
> >     int32x4_t t_s32x4, out32_Q14_s32x4;
> >
> >     /* silk_SMLAWB( S{0,1}, B_Q28[ 0 ], in[ k * 2 + {0,1} ] ) */
> >     *out32_Q14_s32x2 = vadd_s32(vget_low_s32(*S_s32x4), t_s32x2);
> >     /* S{0,1} = S{2,3}; S{2,3} = 0; */
> >     *S_s32x4 = vcombine_s32(vget_high_s32(*S_s32x4), vdup_n_s32(0));
> >     /* out32_Q14_{0,1} = silk_LSHIFT( silk_SMLAWB( S{0,1}, B_Q28[ 0 ], in[ k * 2 + {0,1} ] ), 2 ); */
> >     *out32_Q14_s32x2 = vshl_n_s32(*out32_Q14_s32x2, 2);
> >     /* out32_Q14_{0,1,0,1} */
> >     out32_Q14_s32x4 = vcombine_s32(*out32_Q14_s32x2, *out32_Q14_s32x2);
> >     /* silk_RSHIFT_ROUND( (opus_int64)out32_Q14[ {0,1,0,1} ] * (-A_Q28[ {0,0,1,1} ]), 30 ) */
> >     t_s32x4 = vqrdmulhq_s32(out32_Q14_s32x4, A_Q28_s32x4);
> >     /* S[ {0,1,2,3} ] = {S[ {2,3} ],0,0} + silk_RSHIFT_ROUND( ); */
> >     *S_s32x4 = vaddq_s32(*S_s32x4, t_s32x4);
> >     /* silk_SMULWB( B_Q28[ {1,1,2,2} ], in[ k * 2 + {0,1,0,1} ] ) */
> >     t_s32x4 = vqdmulhq_s32(inval_s32x4, B_Q28_s32x4);
> >     /* S[ {0,1,2,3} ] = silk_SMLAWB( S[ {0,1,2,3} ], B_Q28[ {1,1,2,2} ], in[ k * 2 + {0,1,0,1} ] ); */
> >     *S_s32x4 = vaddq_s32(*S_s32x4, t_s32x4);
> > }
> >
> > Thanks,
> > Linfeng
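The detail that makes these kernels possible is the rounding-doubling
semantics of vqrdmulh: it returns the saturated high 32 bits of
(2*a*b + 2^31), so feeding it the doubled coefficient 2*(-A_Q28[i])
reproduces silk_RSHIFT_ROUND( (opus_int64)out32_Q14 * (-A_Q28[i]), 30 )
exactly, apart from the instruction's saturation corner case and provided
the doubled coefficient still fits in 32 bits. A minimal scalar sketch of
that identity -- standalone C written for illustration here, not code from
the patches:

#include <assert.h>
#include <stdint.h>

/* Scalar model of NEON vqrdmulh_s32: saturating rounding doubling
   multiply, returning the high 32 bits of (2*a*b + 2^31). Ignores the
   a == b == INT32_MIN overflow corner, which the real instruction
   saturates to INT32_MAX. */
static int32_t scalar_vqrdmulh_s32(int32_t a, int32_t b)
{
    int64_t prod = (int64_t)2 * a * b + ((int64_t)1 << 31);
    prod >>= 32;
    return prod > INT32_MAX ? INT32_MAX : (int32_t)prod;
}

/* silk_RSHIFT_ROUND( (opus_int64)x * c, 30 ), written out in plain C */
static int32_t rshift_round_30(int32_t x, int32_t c)
{
    return (int32_t)(((int64_t)x * c + ((int64_t)1 << 29)) >> 30);
}

int main(void)
{
    int32_t out32_Q14 = 123456789;  /* arbitrary Q14 sample value */
    int32_t neg_A_Q28 = -87654321;  /* arbitrary -A_Q28[i] coefficient */

    /* Doubling the coefficient turns the instruction's ">> 32 with
       rounding" into the ">> 30 with rounding" that the C code computes:
       (2*x*(2*c) + 2^31) >> 32 == (x*c + 2^29) >> 30. */
    assert(scalar_vqrdmulh_s32(out32_Q14, 2 * neg_A_Q28) ==
           rshift_round_30(out32_Q14, neg_A_Q28));
    return 0;
}

The vqdmulhq_s32() call in the stride 2 kernel leans on a similar identity
for the B taps; a matching sketch follows the last message in this thread.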
Linfeng Zhang
2017-May-17 16:17 UTC
[opus] 2 patches related to silk_biquad_alt() optimization
Hi Jean-Marc,

Thanks!

Please find the 2 updated patches, which only optimize the stride 2 case
and keep the bit exactness. They have passed our internal tests as usual.

Thanks,
Linfeng

On Mon, May 15, 2017 at 9:36 AM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> Hi Linfeng,
>
> Sorry for the delay -- I was actually trying to think of the best option
> here. For now, my preference would be to keep things bit-exact, but
> should there be more similar optimizations relying on 64-bit
> multiplication results, then we could consider having a special option
> to enable those (even in C).
>
> Cheers,
>
> Jean-Marc
>
> [... the rest of the quoted thread, identical to the messages above, is
> trimmed ...]

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Update-silk_biquad_alt.patch
Type: text/x-patch
Size: 8961 bytes
URL: <http://lists.xiph.org/pipermail/opus/attachments/20170517/8959da63/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-Optimize-silk_biquad_alt_stride2-for-ARM-NEON.patch
Type: text/x-patch
Size: 24704 bytes
URL: <http://lists.xiph.org/pipermail/opus/attachments/20170517/8959da63/attachment-0003.bin>
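Since the updated patches keep the NEON path bit-exact, the check can be a
straight output comparison against the C code. A rough sketch of such a
test -- the prototypes, the stride 2 entry-point name, and the coefficient
values below are stand-ins assumed for illustration, not necessarily the
signatures the patches actually add:

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef int16_t opus_int16;   /* matching opus_types.h on common platforms */
typedef int32_t opus_int32;
typedef int     opus_int;

/* Assumed prototypes: the C reference and a NEON stride 2 entry point. */
void silk_biquad_alt(const opus_int16 *in, const opus_int32 *B_Q28,
                     const opus_int32 *A_Q28, opus_int32 *S, opus_int16 *out,
                     const opus_int32 len, opus_int stride);
void silk_biquad_alt_stride2_neon(const opus_int16 *in, const opus_int32 *B_Q28,
                                  const opus_int32 *A_Q28, opus_int32 *S,
                                  opus_int16 *out, const opus_int32 len);

int main(void)
{
    enum { LEN = 480 };  /* samples per channel; interleaved stereo input */
    opus_int16 in[2 * LEN], out_c[2 * LEN], out_neon[2 * LEN];
    opus_int32 S_c[4] = { 0 }, S_neon[4] = { 0 };  /* 2 * stride states */
    const opus_int32 B_Q28[3] = {  67108864, -134217728,  67108864 };  /* example Q28 taps */
    const opus_int32 A_Q28[2] = { -259845894,  120054785 };            /* example Q28 taps */
    int i;

    srand(0);
    for (i = 0; i < 2 * LEN; i++)
        in[i] = (opus_int16)((rand() & 0xFFFF) - 0x8000);

    silk_biquad_alt(in, B_Q28, A_Q28, S_c, out_c, LEN, 2);
    silk_biquad_alt_stride2_neon(in, B_Q28, A_Q28, S_neon, out_neon, LEN);

    /* Bit-exact means outputs and final states match to the last bit. */
    assert(!memcmp(out_c, out_neon, sizeof(out_c)));
    assert(!memcmp(S_c, S_neon, sizeof(S_c)));
    return 0;
}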
Jean-Marc Valin
2017-May-17 21:19 UTC
[opus] 2 patches related to silk_biquad_alt() optimization
Hi Linfeng,

Thanks for the updated patches. They're now in master.

Cheers,

Jean-Marc

On 17/05/17 12:17 PM, Linfeng Zhang wrote:
> Hi Jean-Marc,
>
> Thanks!
>
> Please find the 2 updated patches, which only optimize the stride 2 case
> and keep the bit exactness. They have passed our internal tests as usual.
>
> Thanks,
> Linfeng
>
> [... the rest of the quoted thread, identical to the messages above, is
> trimmed ...]
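One more step in the stride 2 kernel worth spelling out: the vqdmulhq_s32()
line reproduces silk_SMULWB() only if inval_s32x4 holds the input samples
pre-shifted left by 15 bits; that pre-shift is an inference from the
arithmetic, presumably done once when loading the samples, and is not shown
in the kernels above. A scalar sketch of the identity, again standalone
illustrative C rather than code from the patches:

#include <assert.h>
#include <stdint.h>

/* Scalar model of NEON vqdmulh_s32: doubling multiply returning the
   (truncated) high 32 bits of 2*a*b. Saturation only triggers for
   a == b == INT32_MIN, which this sketch ignores. */
static int32_t scalar_vqdmulh_s32(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 31);  /* == (2*a*b) >> 32 */
}

/* silk_SMULWB(a, b): multiply a by the bottom 16 bits of b, then >> 16 */
static int32_t smulwb(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int16_t)b) >> 16);
}

int main(void)
{
    int32_t B_Q28  = 78312465;                /* arbitrary Q28 coefficient */
    int16_t sample = -12345;                  /* arbitrary input sample */
    int32_t inval  = (int32_t)sample * 32768; /* assumed in[] << 15 pre-shift */

    /* (2 * (in << 15) * B) >> 32 == (in * B) >> 16, with no precision
       loss because the low bits shifted out are all zero. */
    assert(scalar_vqdmulh_s32(inval, B_Q28) == smulwb(B_Q28, sample));
    return 0;
}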