Linfeng Zhang
2017-Jun-01 22:33 UTC
[opus] [OPUS] celt_inner_prod() and dual_inner_prod() NEON intrinsics
Hi,

Attached are 5 patches related to NEON intrinsics optimization of celt_inner_prod() and dual_inner_prod().

In 0004-Optimize-floating-point-celt_inner_prod-and-dual_inn.patch, the optimization changes the order in which the floating-point inner products are accumulated, which changes the results. I created celt_inner_prod_neon_float_c_simulation() and dual_inner_prod_neon_float_c_simulation() to simulate the order of floating-point operations in the NEON optimization and compare the results. Sorry that I cannot bound the distance between the original C function and the NEON function by any reasonably small number or ratio. It's easy to create an input for which 0 and 1,000 are both correct results, just by manipulating the order of the inner product.

The total speed gain is about 1.0% for the fixed-point encoder and 1.8% for the floating-point encoder, at Complexity 8, tested on my Chromebook.

Thanks,
Linfeng

Attachments:
0001-Replace-call-of-celt_inner_prod_c-step-1.patch <http://lists.xiph.org/pipermail/opus/attachments/20170601/92c39072/attachment-0009.bin>
0002-Replace-call-of-celt_inner_prod_c-step-2.patch <http://lists.xiph.org/pipermail/opus/attachments/20170601/92c39072/attachment-0008.bin>
0003-Optimize-fixed-point-celt_inner_prod-and-dual_inner_.patch <http://lists.xiph.org/pipermail/opus/attachments/20170601/92c39072/attachment-0007.bin>
0004-Optimize-floating-point-celt_inner_prod-and-dual_inn.patch <http://lists.xiph.org/pipermail/opus/attachments/20170601/92c39072/attachment-0006.bin>
0005-Clean-celt_pitch_xcorr_float_neon.patch <http://lists.xiph.org/pipermail/opus/attachments/20170601/92c39072/attachment-0005.bin>
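The ordering effect described in this message can be reproduced in a few lines of plain C. The sketch below is illustrative only and is not taken from the patches: inner_prod_sequential() and inner_prod_four_lanes() are hypothetical names, and the four-accumulator layout is an assumption about how a 4-lane SIMD loop would typically group the sums. It assumes IEEE-754 single precision with no excess intermediate precision.

```c
#include <stddef.h>

/* Reference order: a single accumulator, elements added left to right. */
float inner_prod_sequential(const float *x, const float *y, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Hypothetical model of a 4-lane vector loop: element i is accumulated
 * into lane (i % 4), and the four lanes are added together at the end.
 * Leftover elements (n not a multiple of 4) are added one by one. */
float inner_prod_four_lanes(const float *x, const float *y, size_t n)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;

    for (; i + 3 < n; i += 4)
        for (size_t k = 0; k < 4; k++)
            lane[k] += x[i + k] * y[i + k];

    /* Horizontal reduction of the four partial sums. */
    float sum = (lane[0] + lane[1]) + (lane[2] + lane[3]);

    /* Scalar tail. */
    for (; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

For example, with x = {1e8, 1, 1, 1, -1e8, 1, 1, 1} and y all ones, the exact inner product is 6. Under the assumptions above, the sequential order loses each 1 that is added to 1e8 before the cancellation and returns 3, while the four-lane order cancels 1e8 against -1e8 inside one lane and returns 6. Neither function is wrong; they simply round at different points, which is why a fixed error bound between the two orders is hard to state.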
Jean-Marc Valin
2017-Jun-02 18:26 UTC
[opus] [OPUS] celt_inner_prod() and dual_inner_prod() NEON intrinsics
Hi Linfeng,

I'll look into your patches. Can you let me know the expected effect on performance (if any) of each of your patches? Also, are these all the patches you intend to merge for 1.2, or are there more upcoming ones?

Cheers,

Jean-Marc
Linfeng Zhang
2017-Jun-05 19:28 UTC
[opus] [OPUS] celt_inner_prod() and dual_inner_prod() NEON intrinsics
Hi Jean-Marc,

I attached the new version in inner_prod_5patches_v2.zip, which is synced to the current master.

For fixed-point ARM, only 0003-Optimize-fixed-point-celt_inner_prod-and-dual_inner_.patch changes the performance.

For floating-point ARM, only 0004-Optimize-floating-point-celt_inner_prod-and-dual_inn.patch changes the performance.

Patches 1 and 2 are code clean-up and can only affect x86 performance.

Patch 5 has a negligible effect on floating-point ARM performance.

Thanks,
Linfeng

Attachment:
inner_prod_5patches_v2.zip <http://lists.xiph.org/pipermail/opus/attachments/20170605/c8d5d402/attachment-0001.zip>
Jonathan Lennox
2017-Jun-06 20:09 UTC
[opus] [OPUS] celt_inner_prod() and dual_inner_prod() NEON intrinsics
Two comments on the various infrastructure for RTCD etc.

1. The 0002- patch changes the ABI of the celt_pitch_xcorr functions, but doesn’t change the assembly in celt/arm/celt_pitch_xcorr_arm.s correspondingly. I suspect the ‘arch’ parameter can just be ignored by the assembly functions, but at least the comments in that file should be updated to indicate the register that’s used to pass it in, and that it’s ignored.

2. In the 0003- patch, you shouldn’t use the MAY_HAVE_NEON macro in your new arm_celt_map tables, for the same reason we didn’t want it in the arm_silk_map tables.

Out of curiosity, what’s the CPU in the Chromebook you’re using to test?
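For readers unfamiliar with the run-time CPU detection (RTCD) tables mentioned above, the general pattern is a per-function array of implementations indexed by a detected architecture level. The following is a generic sketch of that pattern only, not the Opus code: the enum values, the table name, and the two implementations are all hypothetical, and the "NEON" entry is a plain C stand-in so the example builds on any machine.

```c
/* Hypothetical architecture levels, lowest capability first. */
enum { ARCH_C = 0, ARCH_NEON = 1, ARCH_COUNT = 2 };

typedef int (*inner_prod_fn)(const short *x, const short *y, int n);

/* Portable reference implementation. */
int inner_prod_c(const short *x, const short *y, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Stand-in for an optimized version; a real one would use intrinsics.
 * Integer arithmetic gives the same result in any summation order,
 * as long as nothing overflows. */
int inner_prod_neon_stub(const short *x, const short *y, int n)
{
    int sum = 0;
    for (int i = n - 1; i >= 0; i--)
        sum += x[i] * y[i];
    return sum;
}

/* One table entry per architecture level, each naming its function
 * explicitly. */
static const inner_prod_fn INNER_PROD_IMPL[ARCH_COUNT] = {
    inner_prod_c,         /* ARCH_C    */
    inner_prod_neon_stub  /* ARCH_NEON */
};

/* Dispatcher: the caller passes the architecture level detected at
 * start-up, and the table selects the implementation. */
int inner_prod(const short *x, const short *y, int n, int arch)
{
    return INNER_PROD_IMPL[arch](x, y, n);
}
```

In this arrangement every caller threads an arch argument through to the dispatcher, which is why adding that argument to a function with a hand-written assembly version changes the calling interface the assembly has to match or document.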
Jean-Marc Valin
2017-Jun-06 20:15 UTC
[opus] [OPUS] celt_inner_prod() and dual_inner_prod() NEON intrinsics
Hi Linfeng,

I have no further issues with your patches, so once you address the two issues Jonathan pointed out, I'll be able to merge them.

Cheers,

Jean-Marc