Linfeng Zhang
2017-Apr-25 00:03 UTC
[opus] 2 patches related to silk_biquad_alt() optimization
Hi Jean-Marc,

Tested on my Chromebook: when stride (channel) == 1, the optimization has no gain compared with the C function. When stride (channel) == 2, the optimization is 1.2%-1.8% faster (1.6% at Complexity 8) compared with the C function.

Please let me know and I can remove the optimization of the stride 1 case.

If it's allowed to skip the split of A_Q28 and replace it with a 32-bit multiplication (the result is 64-bit), it could probably be faster on NEON. This may change the encoder results because of the different order of adding, shifting and rounding.

Thanks,
Linfeng

On Wed, Apr 19, 2017 at 10:23 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> Hi Linfeng,
>
> Thanks for the patches. I'll have a look and get back to you. What kind
> of speedup are you getting for these functions? On what command line?
>
> Cheers,
>
> Jean-Marc
>
> On 19/04/17 12:29 PM, Linfeng Zhang wrote:
> > Hi,
> >
> > Attached are 2 patches related to silk_biquad_alt() optimization. Please
> > review.
> >
> > Thanks,
> > Linfeng Zhang
Jean-Marc Valin
2017-Apr-25 00:52 UTC
[opus] 2 patches related to silk_biquad_alt() optimization
On 24/04/17 08:03 PM, Linfeng Zhang wrote:
> Tested on my chromebook, when stride (channel) == 1, the optimization
> has no gain compared with C function.

You mean that the Neon code is the same speed as the C code for stride==1? This is not terribly surprising for an IIR filter.

> When stride (channel) == 2, the optimization is 1.2%-1.8% faster (1.6%
> at Complexity 8) compared with C function.

Is that gain due to Neon or simply due to computing two channels in parallel? For example, if you make a special case in the C code to handle both channels in the same loop, what kind of performance do you get?

> Please let me know and I can remove the optimization of stride 1 case.

Yeah, if there's Neon code that provides no improvement over C, let's stick with C. And if you manage to write C code that has the same performance as the Neon code, then that would also be better (both easier to maintain and more portable).

> If it's allowed to skip the split of A_Q28 and replace by 32-bit
> multiplication (result is 64-bit), probably it could be faster on NEON.
> This may change the encoder results because of different order of
> adding, shifting and rounding.

I'm not sure what you mean by that.

Jean-Marc
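The "special case in the C code to handle both channels in the same loop" mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses a generic floating-point biquad in transposed direct form II rather than the fixed-point SILK code, and the names `biquad_coefs`, `biquad_mono` and `biquad_stereo` are hypothetical, not taken from Opus.

```c
#include <stddef.h>

/* Hypothetical coefficient layout for illustration; not the SILK Q28 layout. */
typedef struct { double b0, b1, b2, a1, a2; } biquad_coefs;

/* One channel at a time; samples of that channel are `stride` apart.
 * s[0], s[1] hold the filter state (transposed direct form II). */
static void biquad_mono(const biquad_coefs *c, double s[2],
                        const double *in, double *out,
                        size_t n, size_t stride)
{
    for (size_t k = 0; k < n; k++) {
        double x = in[k * stride];
        double y = c->b0 * x + s[0];
        s[0] = c->b1 * x - c->a1 * y + s[1];
        s[1] = c->b2 * x - c->a2 * y;
        out[k * stride] = y;
    }
}

/* Both interleaved channels in the same loop.
 * s[0], s[1] are the left state; s[2], s[3] are the right state.
 * The two recurrences are independent, so the CPU (or a vectorizer)
 * can overlap them even though each one is serial in k. */
static void biquad_stereo(const biquad_coefs *c, double s[4],
                          const double *in, double *out, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        double xl = in[2 * k];
        double xr = in[2 * k + 1];
        double yl = c->b0 * xl + s[0];
        double yr = c->b0 * xr + s[2];
        s[0] = c->b1 * xl - c->a1 * yl + s[1];
        s[2] = c->b1 * xr - c->a1 * yr + s[3];
        s[1] = c->b2 * xl - c->a2 * yl;
        s[3] = c->b2 * xr - c->a2 * yr;
        out[2 * k] = yl;
        out[2 * k + 1] = yr;
    }
}
```

Calling `biquad_stereo` once should give the same output as calling `biquad_mono` twice with stride 2 (once per channel), which makes it easy to time the two variants against each other.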
Linfeng Zhang
2017-Apr-25 17:37 UTC
[opus] 2 patches related to silk_biquad_alt() optimization
On Mon, Apr 24, 2017 at 5:52 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> On 24/04/17 08:03 PM, Linfeng Zhang wrote:
> > Tested on my chromebook, when stride (channel) == 1, the optimization
> > has no gain compared with C function.
>
> You mean that the Neon code is the same speed as the C code for
> stride==1? This is not terribly surprising for an IIR filter.

Yes.

> > When stride (channel) == 2, the optimization is 1.2%-1.8% faster (1.6%
> > at Complexity 8) compared with C function.
>
> Is that gain due to Neon or simply due to computing two channels in
> parallel? For example, if you make a special case in the C code to
> handle both channels in the same loop, what kind of performance do you get?

Tested at Complexity 8, it's half and half, i.e., 0.8% faster when handling both channels in the same loop in C, and then an additional 0.8% faster using NEON.

> > Please let me know and I can remove the optimization of stride 1 case.
>
> Yeah, if there's Neon code that provides no improvement over C, let's
> stick with C. And if you manage to write C code that has the same
> performance as the Neon code, then that would also be better (both
> easier to maintain and more portable).

Will do.

> > If it's allowed to skip the split of A_Q28 and replace by 32-bit
> > multiplication (result is 64-bit), probably it could be faster on NEON.
> > This may change the encoder results because of different order of
> > adding, shifting and rounding.
>
> I'm not sure what you mean by that.

    /* Negate A_Q28 values and split in two parts */
    A0_L_Q28 = ( -A_Q28[ 0 ] ) & 0x00003FFF;        /* lower part */
    A0_U_Q28 = silk_RSHIFT( -A_Q28[ 0 ], 14 );      /* upper part */
    A1_L_Q28 = ( -A_Q28[ 1 ] ) & 0x00003FFF;        /* lower part */
    A1_U_Q28 = silk_RSHIFT( -A_Q28[ 1 ], 14 );      /* upper part */
    ...
    S[ 0 ] = S[1] + silk_RSHIFT_ROUND( silk_SMULWB( out32_Q14, A0_L_Q28 ), 14 );
    S[ 0 ] = silk_SMLAWB( S[ 0 ], out32_Q14, A0_U_Q28 );
    S[ 0 ] = silk_SMLAWB( S[ 0 ], B_Q28[ 1 ], inval);
    S[ 1 ] = silk_RSHIFT_ROUND( silk_SMULWB( out32_Q14, A1_L_Q28 ), 14 );
    S[ 1 ] = silk_SMLAWB( S[ 1 ], out32_Q14, A1_U_Q28 );
    S[ 1 ] = silk_SMLAWB( S[ 1 ], B_Q28[ 2 ], inval );

A_Q28 is split into two 14-bit (or 16-bit, whatever) integers to keep each multiplication within 32 bits. NEON can do a 32-bit x 32-bit -> 64-bit multiplication using 'int64x2_t vmull_s32(int32x2_t a, int32x2_t b)', which could possibly be faster and have fewer rounding/shifting errors than the C code above. But it may make things harder for other CPUs that do not support 32-bit multiplication.

Thanks,
Linfeng
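To make the rounding difference discussed above concrete, here is a minimal scalar C sketch comparing the split multiplication with a single 64-bit product. It is an illustration under assumptions, not the Opus source: `smulwb` and `rshift_round` are my own restatements of what `silk_SMULWB` and `silk_RSHIFT_ROUND` are understood to compute, `feedback_split` and `feedback_wide` are hypothetical helper names, and only the A_Q28 term is modelled (the S[1] and B_Q28 terms are left out).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed semantics of silk_SMULWB: (a32 * (int16_t)b32) >> 16.
 * Computed here through a 64-bit product for clarity. Right shifts of
 * negative values are assumed to be arithmetic, as on common compilers. */
static int32_t smulwb(int32_t a32, int32_t b32)
{
    return (int32_t)(((int64_t)a32 * (int16_t)b32) >> 16);
}

/* Assumed semantics of silk_RSHIFT_ROUND for shift >= 2. */
static int32_t rshift_round(int32_t a, int shift)
{
    return ((a >> (shift - 1)) + 1) >> 1;
}

/* Split path, following the structure quoted in the email:
 * -a_Q28 is cut into a 14-bit lower part and an upper part that must
 * fit in 16 bits, so each multiply is 32 x 16 bits. */
static int32_t feedback_split(int32_t out32_Q14, int32_t a_Q28)
{
    int32_t neg = -a_Q28;
    int32_t a_l = neg & 0x00003FFF;  /* lower 14 bits, always >= 0 */
    int32_t a_u = neg >> 14;         /* upper part */
    assert(a_u >= INT16_MIN && a_u <= INT16_MAX);
    return rshift_round(smulwb(out32_Q14, a_l), 14)
         + smulwb(out32_Q14, a_u);
}

/* Wide path: one 32 x 32 -> 64-bit product and one shift by 30
 * (16 + 14), which is what a vmull_s32-based version would compute
 * per lane before shifting. */
static int32_t feedback_wide(int32_t out32_Q14, int32_t a_Q28)
{
    return (int32_t)(((int64_t)out32_Q14 * -(int64_t)a_Q28) >> 30);
}
```

Both functions approximate out32_Q14 * (-a_Q28) / 2^30. Because the split path truncates and rounds in two places while the wide path truncates once, the two results can differ by one least significant bit, which is why switching to the 64-bit product may not be bit-exact with the existing encoder output.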