Peter Robinson
2014-Nov-24 23:48 UTC
[opus] [RFC PATCHv1] cover: celt_pitch_xcorr: Introduce ARM neon intrinsics
>> >> a. Simplest use case to validate this optimization for correctness.
>> >> b. Simplest use case to validate this optimization for performance.
>> >>
>> >> Would prefer something like opusdec that can be executed on command
>> >> line.
>> >
>> > The easiest thing to use is probably opus_demo (opusdec does a bunch
>> > of extra things, plus for interactive use we care about both the
>> > encoder and decoder, and celt_pitch_xcorr gets used vastly more by
>> > the encoder than the decoder... I think the decoder only uses it for
>> > PLC).
>> >
>> > Something like
>> > ./opus_demo restricted-lowdelay 48000 2 96000 comp48-stereo.sw /dev/null
>> >
>> > comp48-stereo.sw can be found here:
>> > https://people.xiph.org/~tterribe/opus/comp48-stereo.sw
>> >
>> > celt_pitch_xcorr also gets used by the SILK encoder (more in
>> > fixed-point than float, but the float one uses it, too). So it may be
>> > worth doing a run with the application set to voip instead of
>> > restricted-lowdelay and a lower bitrate (e.g., 24000 instead of 96000).
>>
>> Thanks for your feedback. I have verified both above cases. While I used
>> ./opus_demo restricted-lowdelay 48000 2 96000 comp48-stereo.sw out.wav,
>> ./opus_demo voip 48000 2 24000 comp48-stereo.sw out.wav
>>
>> to make sure the output out.wav is clearly audible, I used below
>> command (encode only) for performance benchmarking.
>>
>> ./opus_demo -e restricted-lowdelay 48000 2 96000 comp48-stereo.sw opus_raw.out
>> ./opus_demo -e voip 48000 2 96000 comp48-stereo.sw opus_raw.out
>>
>> I saw much better improvement in performance (16.16%) for overall
>> encode use case for "restricted-lowdelay 48000 2 96000" for CELT
>> encoding as you suspected as celt_pitch_xcorr function gets used much
>> more.
>>
>> I observed lesser improvement in performance (3.42%) for overall
>> encode use case for "voip 48000 2 24000". This is somewhat expected as
>> celt_pitch_xcorr_c was not the main contributor for performance in this
>> SILK encoder use case.
>>
>> For detailed information on how I measured performance on my
>> Beaglebone Black (Cortex-A8), please see "celt_pitch_xcorr (float)
>> Neon Optimization" section of [1]
>>
>> [1]: https://docs.google.com/document/d/1L6csATjSsXtzg_sa1iHZta8hOsoVWA4UjHXEakpTrNk/edit?usp=sharing
>>
>> > Even though this primarily affects the encoder, as a sanity check,
>> > it's always good to make sure the test vectors still decode
>> > correctly. Get them from
>> > <http://opus-codec.org/testvectors/opus_testvectors.tar.gz> and use
>> > tests/run_vectors.sh <build path> <test vectors path> 48000
>
> OK, this took about 2 hours.. but all tests passed successfully.
> Please let me know what the next steps are.

Are there plans to support ARMv8/aarch64 NEON intrinsics too?

Also, are there plans to make the NEON optimisations on ARMv7 run-time detectable, like they are in cairo/pixman? For generic distributions it would be nice to be able to enable them, as they offer decent performance improvements, but have the code fall back on devices that don't support NEON.

Peter
Viswanath Puttagunta
2014-Nov-25 15:07 UTC
[opus] [RFC PATCHv1] cover: celt_pitch_xcorr: Introduce ARM neon intrinsics
On 24 November 2014 at 17:48, Peter Robinson <pbrobinson at gmail.com> wrote:
>> OK, this took about 2 hours.. but all tests passed successfully.
>> Please let me know what the next steps are.
>
> Are there plans to support ARMv8/aarch64 NEON intrinsics too?
>
> Also, are there plans to make the NEON optimisations on ARMv7 run-time
> detectable, like they are in cairo/pixman? For generic distributions
> it would be nice to be able to enable them, as they offer decent
> performance improvements, but have the code fall back on devices
> that don't support NEON.

Yep, adding support for ARMv8 is the final objective. I did not want to introduce too many changes in the first shot, and hence only introduced it for ARMv7. In theory, most of the code (the NEON intrinsic code) in this patch should remain unchanged for ARMv8. Only the mechanism by which NEON/ASIMD presence is detected at run time and the flags used at compile time should change. I will work on this once this patch gets reviewed and accepted. I made sure these changes are fairly localized.

And yes, this patch also supports run-time detection of NEON. Actually, most of the code to do run-time detection of NEON was already there in the project before this patch. I just re-used the infrastructure.
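To illustrate the point that NEON intrinsic code can stay the same across ARMv7 and ARMv8, here is a minimal sketch of a float cross-correlation with a NEON path and a scalar fallback. It is not the code from the patch: the function names (`xcorr_scalar_sketch`, `xcorr_neon_sketch`, `xcorr_sketch`) and the `SKETCH_HAVE_NEON` macro are hypothetical, and the sketch assumes the kernel computes `xcorr[i] = sum_j x[j] * y[i + j]` and that the compiler defines `__ARM_NEON` or `__ARM_NEON__` when NEON is enabled.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: these predefined macros signal that NEON code may be emitted. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
#endif

/* Scalar reference: xcorr[i] = sum over j of x[j] * y[i + j].
 * y must hold at least len + max_pitch - 1 samples. */
void xcorr_scalar_sketch(const float *x, const float *y,
                         float *xcorr, int len, int max_pitch)
{
    assert(len >= 0 && max_pitch >= 0);
    for (int i = 0; i < max_pitch; i++) {
        float sum = 0.0f;
        for (int j = 0; j < len; j++)
            sum += x[j] * y[i + j];
        xcorr[i] = sum;
    }
}

#if SKETCH_HAVE_NEON
/* Same computation, accumulating four products per step. */
void xcorr_neon_sketch(const float *x, const float *y,
                       float *xcorr, int len, int max_pitch)
{
    assert(len >= 0 && max_pitch >= 0);
    for (int i = 0; i < max_pitch; i++) {
        float32x4_t acc = vdupq_n_f32(0.0f);
        int j = 0;
        for (; j + 4 <= len; j += 4)
            acc = vmlaq_f32(acc, vld1q_f32(x + j), vld1q_f32(y + i + j));
        /* Horizontal add of the four accumulator lanes. */
        float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
        half = vpadd_f32(half, half);
        float sum = vget_lane_f32(half, 0);
        for (; j < len; j++)    /* leftover samples when len % 4 != 0 */
            sum += x[j] * y[i + j];
        xcorr[i] = sum;
    }
}
#endif

/* Compile-time selection only; run-time detection is a separate concern. */
void xcorr_sketch(const float *x, const float *y,
                  float *xcorr, int len, int max_pitch)
{
#if SKETCH_HAVE_NEON
    xcorr_neon_sketch(x, y, xcorr, len, max_pitch);
#else
    xcorr_scalar_sketch(x, y, xcorr, len, max_pitch);
#endif
}
```

The NEON path sums in a different order than the scalar loop, so the two can differ in the last bits for general float input; comparisons between them would normally use a tolerance.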
Jonathan Lennox
2014-Nov-25 15:39 UTC
[opus] [RFC PATCHv1] cover: celt_pitch_xcorr: Introduce ARM neon intrinsics
On Nov 25, 2014, at 10:07 AM, Viswanath Puttagunta <viswanath.puttagunta at linaro.org> wrote:
> > Also, are there plans to make the NEON optimisations on ARMv7 run-time
> > detectable, like they are in cairo/pixman? For generic distributions
> > it would be nice to be able to enable them, as they offer decent
> > performance improvements, but have the code fall back on devices
> > that don't support NEON.
>
> Yep, adding support for ARMv8 is the final objective. I did not want to
> introduce too many changes in the first shot, and hence only introduced
> it for ARMv7. In theory, most of the code (the NEON intrinsic code) in
> this patch should remain unchanged for ARMv8. Only the mechanism by
> which NEON/ASIMD presence is detected at run time and the flags used at
> compile time should change. I will work on this once this patch gets
> reviewed and accepted. I made sure these changes are fairly localized.
>
> And yes, this patch also supports run-time detection of NEON. Actually,
> most of the code to do run-time detection of NEON was already there in
> the project before this patch. I just re-used the infrastructure.

ARMv8 shouldn't need Neon detection at all: Neon is a mandatory part of the ARMv8 architecture, unlike ARMv7, where it's optional. It looks like this is what the configure script is already doing: arm64 sets rtcd_support to no.

I believe iOS, Windows RT/Windows Phone 8, and Blackberry 10 all require CPU support for Neon when running on ARMv7+ platforms, so detection shouldn't be necessary there either. The configure script should probably default rtcd_support accordingly, but configuring with --disable-rtcd should be sufficient to build on these platforms, regardless.
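The distinction drawn in this thread (no detection needed on ARMv8, run-time detection on ARMv7, plain C elsewhere) can be sketched as a function-pointer dispatch. This is only an illustration under stated assumptions, not how Opus's existing run-time detection is implemented: all names (`dot_fn`, `dot_c_sketch`, `dot_neon_sketch`, `have_neon_sketch`, `select_dot_sketch`) are hypothetical, and the ARMv7 branch assumes a Linux system whose C library provides `getauxval()` and whose kernel headers define `HWCAP_NEON`.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__linux__) && defined(__arm__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_NEON */
#endif

typedef float (*dot_fn)(const float *a, const float *b, int n);

/* Portable C fallback. */
float dot_c_sketch(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Placeholder for an optimized routine; it simply calls the C version
 * here so that the sketch builds on any platform. */
float dot_neon_sketch(const float *a, const float *b, int n)
{
    return dot_c_sketch(a, b, n);
}

/* Returns 1 if the NEON variant may be used, 0 otherwise. */
int have_neon_sketch(void)
{
#if defined(__aarch64__)
    /* Per the discussion above, NEON is treated as always present. */
    return 1;
#elif defined(__linux__) && defined(__arm__)
    /* ARMv7 on Linux: ask the kernel at run time. */
    return (getauxval(AT_HWCAP) & HWCAP_NEON) != 0;
#else
    return 0;
#endif
}

/* Choose an implementation once, e.g. at codec initialization. */
dot_fn select_dot_sketch(void)
{
    return have_neon_sketch() ? dot_neon_sketch : dot_c_sketch;
}
```

A caller would store the result of `select_dot_sketch()` once and call through the pointer afterwards. On platforms where NEON is guaranteed, the check collapses to a constant, which matches the effect described for building with --disable-rtcd.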