Linfeng Zhang
2017-Apr-11 23:07 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
Hi Jean-Marc, Thanks for your suggestions! I attached the new patch, with inlined reply below. Thanks, Linfeng On Thu, Apr 6, 2017 at 12:55 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:> I did some profiling on a Cortex A57 and I've been seeing slightly less > improvement than you're reporting, more like 3.5% at complexity 8. It > appears that the warped autocorrelation function itself is only faster > by a factor of about 1.35. That's a bit surprising considering I see > nothing obviously wrong with the code. >Speed test the new patch, and got about 7.8% whole encoder speed gain with complexity 8 on my Acre Chromebook. Here is my configure: ./configure --build x86_64-unknown-linux-gnu --host arm-linux-gnueabihf --disable-assertions --enable-fixed-point --enable-intrinsics CFLAGS=-O3 --disable-shared The testing speech file may also change the speed results.> 1) In calc_state(), rather than splitting the multiply in two > instructions, you may be able to simply shift the warping left 16 bits, > then use the Neon instruction that does a*b>>32 (i.e. the one that > computes the top bits of a 32x32 multiply) >Done.> 2) If the problem is with the movs at the end of each iteration, then > you should be able to get rid of them by unrolling by a factor of two. >We did this previously and get some gains, but the code size is much bigger. So we abandoned. Tested again on the new code and got no speed gains.> 3) It seems likely that you have significant register spill going on due > to processing up to 24 "taps" at the same time. If that's causing a > slowdown, then it should be possible to do the processing in "sections". > By that, I mean that you can implement (e.g.) an order-8 "kernel" that > computes the correlations and also outputs the last element of > state_QS_s32x4[0][0] back to input_QS[], so that it can be used to > compute a new secion. >Done. The speed is almost identical (slightly slower), however the extra bonus is code size saving. 4) It's a minor detail, but the last element of corr_QC[] that's not> currently vectorized could simply be vectorized independently outside > the loop (and it's the same for all orders). >Done. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xiph.org/pipermail/opus/attachments/20170411/300b590e/attachment-0001.html> -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-Optimize-silk_warped_autocorrelation_FIX-for-ARM-NEO.patch Type: text/x-patch Size: 29664 bytes Desc: not available URL: <http://lists.xiph.org/pipermail/opus/attachments/20170411/300b590e/attachment-0001.bin>
Linfeng Zhang
2017-Apr-13 01:17 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
Attached a new patch, which fixes a compiling error. Thanks, Linfeng On Tue, Apr 11, 2017 at 4:07 PM, Linfeng Zhang <linfengz at google.com> wrote:> Hi Jean-Marc, > > Thanks for your suggestions! > > I attached the new patch, with inlined reply below. > > Thanks, > Linfeng > > On Thu, Apr 6, 2017 at 12:55 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> > wrote: > >> I did some profiling on a Cortex A57 and I've been seeing slightly less >> improvement than you're reporting, more like 3.5% at complexity 8. It >> appears that the warped autocorrelation function itself is only faster >> by a factor of about 1.35. That's a bit surprising considering I see >> nothing obviously wrong with the code. >> > > Speed test the new patch, and got about 7.8% whole encoder speed gain with > complexity 8 on my Acre Chromebook. > Here is my configure: > ./configure --build x86_64-unknown-linux-gnu --host arm-linux-gnueabihf > --disable-assertions --enable-fixed-point --enable-intrinsics CFLAGS=-O3 > --disable-shared > > The testing speech file may also change the speed results. > > >> 1) In calc_state(), rather than splitting the multiply in two >> instructions, you may be able to simply shift the warping left 16 bits, >> then use the Neon instruction that does a*b>>32 (i.e. the one that >> computes the top bits of a 32x32 multiply) >> > > Done. > > >> 2) If the problem is with the movs at the end of each iteration, then >> you should be able to get rid of them by unrolling by a factor of two. >> > > We did this previously and get some gains, but the code size is much > bigger. So we abandoned. Tested again on the new code and got no speed > gains. > > >> 3) It seems likely that you have significant register spill going on due >> to processing up to 24 "taps" at the same time. If that's causing a >> slowdown, then it should be possible to do the processing in "sections". >> By that, I mean that you can implement (e.g.) an order-8 "kernel" that >> computes the correlations and also outputs the last element of >> state_QS_s32x4[0][0] back to input_QS[], so that it can be used to >> compute a new secion. >> > > Done. The speed is almost identical (slightly slower), however the extra > bonus is code size saving. > > 4) It's a minor detail, but the last element of corr_QC[] that's not >> currently vectorized could simply be vectorized independently outside >> the loop (and it's the same for all orders). >> > > Done. >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xiph.org/pipermail/opus/attachments/20170412/c3ac1cd0/attachment-0001.html> -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-Optimize-silk_warped_autocorrelation_FIX-for-ARM-NEO.patch Type: text/x-patch Size: 29705 bytes Desc: not available URL: <http://lists.xiph.org/pipermail/opus/attachments/20170412/c3ac1cd0/attachment-0001.bin>
Jean-Marc Valin
2017-Apr-13 22:33 UTC
[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
Hi Linfeng, Just pushed your patch to master. Turns out the performance improvement is pretty good on the A53 (RP3). It just looks like the A57 managed to do a really good job on the non-Neon code. Thanks for taking to time to address all the issues. Cheers, Jean-Marc On 12/04/17 09:17 PM, Linfeng Zhang wrote:> Attached a new patch, which fixes a compiling error. > > Thanks, > Linfeng > > On Tue, Apr 11, 2017 at 4:07 PM, Linfeng Zhang <linfengz at google.com > <mailto:linfengz at google.com>> wrote: > > Hi Jean-Marc, > > Thanks for your suggestions! > > I attached the new patch, with inlined reply below. > > Thanks, > Linfeng > > On Thu, Apr 6, 2017 at 12:55 PM, Jean-Marc Valin <jmvalin at jmvalin.ca > <mailto:jmvalin at jmvalin.ca>> wrote: > > I did some profiling on a Cortex A57 and I've been seeing > slightly less > improvement than you're reporting, more like 3.5% at complexity > 8. It > appears that the warped autocorrelation function itself is only > faster > by a factor of about 1.35. That's a bit surprising considering I see > nothing obviously wrong with the code. > > > Speed test the new patch, and got about 7.8% whole encoder speed > gain with complexity 8 on my Acre Chromebook. > Here is my configure: > ./configure --build x86_64-unknown-linux-gnu --host > arm-linux-gnueabihf --disable-assertions --enable-fixed-point > --enable-intrinsics CFLAGS=-O3 --disable-shared > > The testing speech file may also change the speed results. > > > 1) In calc_state(), rather than splitting the multiply in two > instructions, you may be able to simply shift the warping left > 16 bits, > then use the Neon instruction that does a*b>>32 (i.e. the one that > computes the top bits of a 32x32 multiply) > > > Done. > > > 2) If the problem is with the movs at the end of each iteration, > then > you should be able to get rid of them by unrolling by a factor > of two. > > > We did this previously and get some gains, but the code size is much > bigger. So we abandoned. Tested again on the new code and got no > speed gains. > > > 3) It seems likely that you have significant register spill > going on due > to processing up to 24 "taps" at the same time. If that's causing a > slowdown, then it should be possible to do the processing in > "sections". > By that, I mean that you can implement (e.g.) an order-8 > "kernel" that > computes the correlations and also outputs the last element of > state_QS_s32x4[0][0] back to input_QS[], so that it can be used to > compute a new secion. > > > Done. The speed is almost identical (slightly slower), however the > extra bonus is code size saving. > > 4) It's a minor detail, but the last element of corr_QC[] that's not > currently vectorized could simply be vectorized independently > outside > the loop (and it's the same for all orders). > > > Done. > >
Reasonably Related Threads
- [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
- [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
- [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
- [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
- [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON