Viswanath Puttagunta
2014-Nov-24 20:53 UTC
[opus] [RFC PATCHv1] cover: celt_pitch_xcorr: Introduce ARM neon intrinsics
On 21 November 2014 at 18:06, Timothy B. Terriberry <tterribe at xiph.org> wrote:
> Viswanath Puttagunta wrote:
>> a. Simplest use case to validate this optimization for correctness.
>> b. Simplest use case to validate this optimization for performance.
>>
>> Would prefer something like opusdec that can be executed on command
>> line.
>
> The easiest thing to use is probably opus_demo (opusdec does a bunch of extra things, plus for interactive use we care about both the encoder and decoder, and celt_pitch_xcorr gets used vastly more by the encoder than the decoder... I think the decoder only uses it for PLC).
>
> Something like
> ./opus_demo restricted-lowdelay 48000 2 96000 comp48-stereo.sw /dev/null
>
> comp48-stereo.sw can be found here: https://people.xiph.org/~tterribe/opus/comp48-stereo.sw
>
> celt_pitch_xcorr also gets used by the SILK encoder (more in fixed-point than float, but the float one uses it, too). So it may be worth doing a run with the application set to voip instead of restricted-lowdelay and a lower bitrate (e.g., 24000 instead of 96000).

Thanks for your feedback. I have verified both of the above cases. I used

./opus_demo restricted-lowdelay 48000 2 96000 comp48-stereo.sw out.wav
./opus_demo voip 48000 2 24000 comp48-stereo.sw out.wav

to make sure the output out.wav is clearly audible, and the commands below (encode only) for performance benchmarking:

./opus_demo -e restricted-lowdelay 48000 2 96000 comp48-stereo.sw opus_raw.out
./opus_demo -e voip 48000 2 96000 comp48-stereo.sw opus_raw.out

I saw a much larger performance improvement (16.16%) in the overall encode use case for "restricted-lowdelay 48000 2 96000" (CELT encoding). As you suspected, this is because celt_pitch_xcorr gets used much more there.

I observed a smaller performance improvement (3.42%) in the overall encode use case for "voip 48000 2 24000". This is somewhat expected, as celt_pitch_xcorr_c was not the main contributor to run time in this SILK encoder use case.
For detailed information on how I measured performance on my Beaglebone Black (Cortex-A8), please see the "celt_pitch_xcorr (float) Neon Optimization" section of [1].

[1]: https://docs.google.com/document/d/1L6csATjSsXtzg_sa1iHZta8hOsoVWA4UjHXEakpTrNk/edit?usp=sharing

> Even though this primarily affects the encoder, as a sanity check, it's always good to make sure the test vectors still decode correctly. Get them from <http://opus-codec.org/testvectors/opus_testvectors.tar.gz> and use
> tests/run_vectors.sh <build path> <test vectors path> 48000
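
For readers following the thread, here is a minimal scalar sketch of the kind of computation celt_pitch_xcorr performs in the float build: a dot product of x against y at each candidate lag. The name pitch_xcorr_ref and the plain float types are assumptions made for illustration. This is neither the Opus source nor the patch under review.

```c
/*
 * Scalar reference sketch (hypothetical name, not the Opus implementation).
 * For each lag i in [0, max_pitch), computes
 *     xcorr[i] = sum over j in [0, len) of x[j] * y[i + j]
 * The caller must provide at least len + max_pitch - 1 samples in y.
 */
static void pitch_xcorr_ref(const float *x, const float *y,
                            float *xcorr, int len, int max_pitch)
{
    for (int i = 0; i < max_pitch; i++) {
        float sum = 0.0f;
        /* Inner multiply-accumulate loop: the part a SIMD version
           would process several samples at a time. */
        for (int j = 0; j < len; j++)
            sum += x[j] * y[i + j];
        xcorr[i] = sum;
    }
}
```

The inner loop is a plain multiply-accumulate over contiguous floats, which is why this kind of routine is a natural candidate for NEON. A vectorised version may sum in a different order, so its float results can differ slightly from the scalar ones.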
Viswanath Puttagunta
2014-Nov-24 23:37 UTC
[opus] [RFC PATCHv1] cover: celt_pitch_xcorr: Introduce ARM neon intrinsics
On 24 November 2014 at 14:53, Viswanath Puttagunta <viswanath.puttagunta at linaro.org> wrote:
> On 21 November 2014 at 18:06, Timothy B. Terriberry <tterribe at xiph.org> wrote:
> > Even though this primarily affects the encoder, as a sanity check, it's always good to make sure the test vectors still decode correctly. Get them from <http://opus-codec.org/testvectors/opus_testvectors.tar.gz> and use
> > tests/run_vectors.sh <build path> <test vectors path> 48000

OK, this took about 2 hours.. but all tests passed successfully.
Please let me know what the next steps are.
Peter Robinson
2014-Nov-24 23:48 UTC
[opus] [RFC PATCHv1] cover: celt_pitch_xcorr: Introduce ARM neon intrinsics
> OK, this took about 2 hours.. but all tests passed successfully.
> Please let me know what the next steps are.

Are there plans to support ARMv8/aarch64 NEON intrinsics too?

Also, are there plans to make the NEON optimisations on ARMv7 run-time detectable, as is done in cairo/pixman? For generic distributions it would be nice to be able to enable them, since they offer decent performance improvements, while having the code fall back on devices that don't support NEON.

Peter
Timothy B. Terriberry
2014-Nov-25 02:51 UTC
[opus] [RFC PATCHv1] cover: celt_pitch_xcorr: Introduce ARM neon intrinsics
Viswanath Puttagunta wrote:
> OK, this took about 2 hours.. but all tests passed successfully.
> Please let me know what the next steps are.

I just need to finish reviewing the patches. I'll try to get that finished this week.
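
On Peter's question about run-time detection, the following is a rough sketch of one way a library could pick a NEON or scalar routine at run time on Linux. It assumes glibc's getauxval() and the kernel's HWCAP_NEON bit on 32-bit ARM. All function names are hypothetical, the "NEON" slot is only a stand-in that forwards to the scalar code, and none of this is taken from Opus, cairo, or pixman.

```c
#if defined(__linux__) && defined(__arm__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_NEON (32-bit ARM only) */
#endif

typedef void (*xcorr_fn)(const float *x, const float *y,
                         float *xcorr, int len, int max_pitch);

/* Portable fallback: one dot product per lag. */
static void xcorr_scalar(const float *x, const float *y,
                         float *xcorr, int len, int max_pitch)
{
    for (int i = 0; i < max_pitch; i++) {
        float sum = 0.0f;
        for (int j = 0; j < len; j++)
            sum += x[j] * y[i + j];
        xcorr[i] = sum;
    }
}

/* Stand-in for a real NEON routine (assumption: a real build would
   compile the intrinsics version separately with NEON enabled). */
static void xcorr_neon_placeholder(const float *x, const float *y,
                                   float *xcorr, int len, int max_pitch)
{
    xcorr_scalar(x, y, xcorr, len, max_pitch);
}

/* Returns nonzero if NEON may be used on the running CPU. */
static int cpu_has_neon(void)
{
#if defined(__aarch64__)
    return 1;  /* assumption: Advanced SIMD is available on AArch64 targets */
#elif defined(__linux__) && defined(__arm__)
    return (getauxval(AT_HWCAP) & HWCAP_NEON) != 0;
#else
    return 0;  /* unknown platform: stay on the scalar path */
#endif
}

/* Choose the implementation once; callers go through the pointer. */
static xcorr_fn select_xcorr(void)
{
    return cpu_has_neon() ? xcorr_neon_placeholder : xcorr_scalar;
}
```

A caller would typically run select_xcorr() once at initialisation and keep the returned pointer. That way a generic distribution build can ship both paths and still run on ARMv7 devices without NEON.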