thr3ads.net - opus - [opus] 2 patches related to silk_biquad

Linfeng Zhang

2017-Apr-25 00:03 UTC

[opus] 2 patches related to silk_biquad_alt() optimization

Hi Jean-Marc,

Tested on my chromebook, when stride (channel) == 1, the optimization has
no gain compared with C function.
When stride (channel) == 2, the optimization is 1.2%-1.8% faster (1.6% at
Complexity 8) compared with C function.

Please let me know and I can remove the optimization of stride 1 case.

If it's allowed to skip the split of A_Q28 and replace by 32-bit
multiplication (result is 64-bit), probably it could be faster on NEON.
This may change the encoder results because of different order of adding,
shifting and rounding.

Thanks,
Linfeng

On Wed, Apr 19, 2017 at 10:23 PM, Jean-Marc Valin <jmvalin at jmvalin.ca>
wrote:
> Hi Linfeng,
>
> Thanks for the patches. I'll have a look and get back to you. What kind
> of speedup are you getting for these functions? On what command line?
>
> Cheers,
>
>         Jean-Marc
>
> On 19/04/17 12:29 PM, Linfeng Zhang wrote:
> > Hi,
> >
> > Attached are 2 patches related to silk_biquad_alt() optimization. Please
> > review.
> >
> > Thanks,
> > Linfeng Zhang
> >
> >
> >
> > _______________________________________________
> > opus mailing list
> > opus at xiph.org
> > http://lists.xiph.org/mailman/listinfo/opus
> >

Jean-Marc Valin

2017-Apr-25 00:52 UTC

On 24/04/17 08:03 PM, Linfeng Zhang wrote:
> Tested on my chromebook, when stride (channel) == 1, the optimization
> has no gain compared with C function.
You mean that the Neon code is the same speed as the C code for
stride==1? This is not terribly surprising for an IIR filter.
> When stride (channel) == 2, the optimization is 1.2%-1.8% faster (1.6%
> at Complexity 8) compared with C function.
Is that gain due to Neon or simply due to computing two channels in
parallel? For example, if you make a special case in the C code to
handle both channels in the same loop, what kind of performance do you get?
> Please let me know and I can remove the optimization of stride 1 case.
Yeah, if there's Neon code that provides no improvement over C, let's
stick with C. And if you manage to write C code that has the same
performance as the Neon code, then that would also be better (both
easier to maintain and more portable).
> If it's allowed to skip the split of A_Q28 and replace by 32-bit
> multiplication (result is 64-bit), probably it could be faster on NEON.
> This may change the encoder results because of different order of
> adding, shifting and rounding.
I'm not sure what you mean by that.

	Jean-Marc

Linfeng Zhang

2017-Apr-25 17:37 UTC

On Mon, Apr 24, 2017 at 5:52 PM, Jean-Marc Valin <jmvalin at jmvalin.ca>
wrote:
> On 24/04/17 08:03 PM, Linfeng Zhang wrote:
> > Tested on my chromebook, when stride (channel) == 1, the optimization
> > has no gain compared with C function.
>
> You mean that the Neon code is the same speed as the C code for
> stride==1? This is not terribly surprising for an IIR filter.
>
Yes

>
> > When stride (channel) == 2, the optimization is 1.2%-1.8% faster (1.6%
> > at Complexity 8) compared with C function.
>
> Is that gain due to Neon or simply due to computing two channels in
> parallel? For example, if you make a special case in the C code to
> handle both channels in the same loop, what kind of performance do you get?
>
Tested at Complexity 8, it's about half and half: 0.8% faster when handling
both channels in the same loop in C, and an additional 0.8% faster using NEON.
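
For readers following along, handling both channels in the same C loop could look roughly like the sketch below. It is not the patch from this thread: the names biquad_step, biquad_strided and biquad_stereo are hypothetical, the state layout (two words per channel) is an assumption, and a plain 64-bit multiply stands in for the split-coefficient arithmetic of the real function to keep the sketch short.

```c
#include <assert.h>
#include <stdint.h>

/* (a * b) >> shift with a 64-bit intermediate. Assumes arithmetic right
 * shift of negative values, which common compilers provide. */
static int32_t mul_shift(int32_t a, int32_t b, int shift)
{
    return (int32_t)(((int64_t)a * b) >> shift);
}

/* One sample of a transposed direct form II biquad (hypothetical helper).
 * S holds the two state words of one channel. */
static int16_t biquad_step(int32_t S[2], const int32_t B_Q28[3],
                           const int32_t A_Q28[2], int16_t inval)
{
    int32_t out32_Q14 = (S[0] + mul_shift(B_Q28[0], inval, 16)) * 4;
    int32_t out;

    S[0] = S[1] + mul_shift(out32_Q14, -A_Q28[0], 30)
                + mul_shift(B_Q28[1], inval, 16);
    S[1] =        mul_shift(out32_Q14, -A_Q28[1], 30)
                + mul_shift(B_Q28[2], inval, 16);

    /* Round to Q0 and saturate to 16 bits */
    out = (out32_Q14 + (1 << 14) - 1) >> 14;
    if (out > 32767) out = 32767;
    if (out < -32768) out = -32768;
    return (int16_t)out;
}

/* One channel at a time: the caller invokes this once per channel. */
void biquad_strided(const int16_t *in, const int32_t B_Q28[3],
                    const int32_t A_Q28[2], int32_t S[2],
                    int16_t *out, int len, int stride)
{
    assert(stride >= 1);
    for (int k = 0; k < len; k++)
        out[k * stride] = biquad_step(S, B_Q28, A_Q28, in[k * stride]);
}

/* Both interleaved channels in one pass; S[0..1] is the left state and
 * S[2..3] the right state (assumed layout). */
void biquad_stereo(const int16_t *in, const int32_t B_Q28[3],
                   const int32_t A_Q28[2], int32_t S[4],
                   int16_t *out, int len)
{
    for (int k = 0; k < len; k++) {
        out[2 * k]     = biquad_step(&S[0], B_Q28, A_Q28, in[2 * k]);
        out[2 * k + 1] = biquad_step(&S[2], B_Q28, A_Q28, in[2 * k + 1]);
    }
}
```

The two channels are independent, so the single-pass version gives the same output as two strided calls while walking the interleaved buffer only once.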

>
> > Please let me know and I can remove the optimization of stride 1 case.
>
> Yeah, if there's Neon code that provides no improvement over C, let's
> stick with C. And if you manage to write C code that has the same
> performance as the Neon code, then that would also be better (both
> easier to maintain and more portable).
>
Will do.

>
> > If it's allowed to skip the split of A_Q28 and replace by 32-bit
> > multiplication (result is 64-bit), probably it could be faster on NEON.
> > This may change the encoder results because of different order of
> > adding, shifting and rounding.
>
> I'm not sure what you mean by that.
>
    /* Negate A_Q28 values and split in two parts */
    A0_L_Q28 = ( -A_Q28[ 0 ] ) & 0x00003FFF;        /* lower part */
    A0_U_Q28 = silk_RSHIFT( -A_Q28[ 0 ], 14 );      /* upper part */
    A1_L_Q28 = ( -A_Q28[ 1 ] ) & 0x00003FFF;        /* lower part */
    A1_U_Q28 = silk_RSHIFT( -A_Q28[ 1 ], 14 );      /* upper part */

    ...

        S[ 0 ] = S[ 1 ] + silk_RSHIFT_ROUND( silk_SMULWB( out32_Q14, A0_L_Q28 ), 14 );
        S[ 0 ] = silk_SMLAWB( S[ 0 ], out32_Q14, A0_U_Q28 );
        S[ 0 ] = silk_SMLAWB( S[ 0 ], B_Q28[ 1 ], inval );

        S[ 1 ] = silk_RSHIFT_ROUND( silk_SMULWB( out32_Q14, A1_L_Q28 ), 14 );
        S[ 1 ] = silk_SMLAWB( S[ 1 ], out32_Q14, A1_U_Q28 );
        S[ 1 ] = silk_SMLAWB( S[ 1 ], B_Q28[ 2 ], inval );

A_Q28 is split into two 14-bit (or 16-bit, whatever) integers to keep each
multiplication within 32 bits. NEON can do a 32-bit x 32-bit -> 64-bit
multiplication using 'int64x2_t vmull_s32(int32x2_t a, int32x2_t b)', which
could be faster and introduce fewer rounding/shifting errors than the C code
above. But it may make things harder for other CPUs that lack such a 32-bit
multiplication.
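
To make the rounding difference concrete, here is a small scalar sketch comparing the two ways of computing the feedback term. The helpers smulwb and rshift_round are simplified stand-ins written for this note, assumed to behave like the corresponding SILK macros. feedback_split and feedback_wide are hypothetical names, and the wide version is only a scalar analogue of what a vmull_s32-based path might compute, not code from the patches.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for silk_SMULWB (assumed semantics): a32 times the signed low
 * 16 bits of b32, shifted right by 16. Assumes arithmetic right shift of
 * negative values. */
static int32_t smulwb(int32_t a32, int32_t b32)
{
    return (int32_t)(((int64_t)a32 * (int16_t)b32) >> 16);
}

/* Stand-in for silk_RSHIFT_ROUND (assumed semantics), for shift > 1 */
static int32_t rshift_round(int32_t a, int shift)
{
    assert(shift > 1);
    return ((a >> (shift - 1)) + 1) >> 1;
}

/* Feedback term as in the quoted C code: -A_Q28 is split into a lower
 * 14-bit part and an upper part, and two narrow products are summed. */
int32_t feedback_split(int32_t out32_Q14, int32_t A_Q28)
{
    int32_t neg_A = -A_Q28;
    int32_t A_L = neg_A & 0x00003FFF;   /* lower part */
    int32_t A_U = neg_A >> 14;          /* upper part */
    return rshift_round(smulwb(out32_Q14, A_L), 14) + smulwb(out32_Q14, A_U);
}

/* Same term with one 32-bit x 32-bit -> 64-bit multiply and one shift */
int32_t feedback_wide(int32_t out32_Q14, int32_t A_Q28)
{
    return (int32_t)(((int64_t)out32_Q14 * -(int64_t)A_Q28) >> 30);
}
```

Both compute out32_Q14 * ( -A_Q28 ) / 2^30, but the split version truncates and rounds at intermediate steps, so the two results can differ by one in the last bit. That is the kind of encoder output change mentioned above.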

Thanks,
Linfeng