Linfeng Zhang
2017-Feb-15 21:05 UTC
[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON
Hi Jean-Marc, The original celt_fir() is a little bit messy. It has 2 branches chosen by #ifdef SMALL_FOOTPRINT. For floating-point, the 2 branches are identical (except the operation sequence of accumulating x[i] to sum, which is not a big deal). For fixed-point, the 2 branches are different. I separate them into 2 functions: the new celt_fir(), and celt_fir_permit_overflow() which is the SMALL_FOOTPRINT branch. The only difference for fixed-point is: celt_fir(): the sum is truncated first and then accumulated to x[i] and saturated. celt_fir_permit_overflow(): x[i] is accumulated to the sum first and then truncated saturated. Maybe this is the reason why silk_LPC_analysis_filter() switched the FIR from celt_fir() to celt_fir_permit_overflow() half a year ago. Because of silk_LPC_analysis_filter(), celt_fir_permit_overflow() must behave the same for both floating-point and fixed-point, and this is why we defined ADD32_FIXED(), ..., PSHR32_FIXED() etc. It's still a messy. For the NEON optimization part, the previous celt_fir() optimization calls xcorr_kernel(). We tested and found that calling the xcorr_kernel() optimization didn't help too much here. The optimization in the patch is about 1% faster than simply calling xcorr_kernel() for the whole encoder. Considering the really small size of the new optimization, it's better to not call xcorr_kernel() to get 1% faster. Thanks, Linfeng On Wed, Feb 15, 2017 at 12:06 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:> Hi Linfeng, > > Can you give me a bit more details about the purpose of this patchset. > It seems to me like it's mostly duplicating the celt_fir() > optimizations? Did I miss anything? > > Cheers, > > Jean-Marc > > On 15/02/17 02:22 PM, Linfeng Zhang wrote: > > Hi, > > > > Attached are two patches. Patch 1 refactors silk_LPC_analysis_filter(). > > And Patch 2 optimizes the new function celt_fir_permit_overflow() for > > ARM NEON. > > > > Please recommend a better function name. > > > > We did the same internal code review and testing already. > > > > Thanks, > > Linfeng > > > > > > > > _______________________________________________ > > opus mailing list > > opus at xiph.org > > http://lists.xiph.org/mailman/listinfo/opus > > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xiph.org/pipermail/opus/attachments/20170215/99a85dd6/attachment.html>
Jean-Marc Valin
2017-Feb-18 00:30 UTC
[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON
Hi Linfeng, On 15/02/17 04:05 PM, Linfeng Zhang wrote:> The original celt_fir() is a little bit messy. It has 2 branches chosen > by #ifdef SMALL_FOOTPRINT.Yeah, I agree that the #ifdef SMALL_FOOTPRINT in celt_fir() is a bit of overkill since it's not saving much code space. I just pushed a commit that gets rid of it, also refactoring the #else case a bit (see below).> For floating-point, the 2 branches are identical (except the operation > sequence of accumulating x[i] to sum, which is not a big deal). > For fixed-point, the 2 branches are different. I separate them into 2 > functions: the new celt_fir(), and celt_fir_permit_overflow() which is > the SMALL_FOOTPRINT branch. > The only difference for fixed-point is: > celt_fir(): the sum is truncated first and then accumulated to x[i] and > saturated. > celt_fir_permit_overflow(): x[i] is accumulated to the sum first and > then truncated saturated.Actually, the two branches are bit-exact for fixed-point. There is indeed a difference in where x[i] gets accumulated, but because in the SMALL_FOOTPRINT case it first gets shifted up by SIG_SHIFT, the result of the downshift (also by SIG_SHIFT) is the same no matter when it gets added. That being said, I thought adding at the beginning was nicer so I changed the remaining code to do that.> Maybe this is the reason why silk_LPC_analysis_filter() switched the FIR > from celt_fir() to celt_fir_permit_overflow() half a year ago.No, it's another issue. silk_LPC_analysis_filter() was always bit-exact with celt_fir(), with or without SMALL_FOOTPRINT. The only difference lies with the signed integer overflow suppression. silk_LPC_analysis_filter() relies on the knowledge that signed integer overflows can occur during the accumulation, but they are guaranteed to cancel each other (i.e. equal wrap-arounds in each direction). For that reason, we use silk_SMLABB_ovflw() which casts to unsigned to avoid the undefined behaviour and thus the ubsan warnings. In celt_fir() and the pitch correlation code it uses, we know there should not be signed overflows, so we would like to detect any problem when using ubsan. I believe the solution would be to always have either: 1) USE_CELT_FIR=1 and use ovflw() macros in the xcorr code; or 2) USE_CELT_FIR=0 and no ovflw() in the xcorr code We can then switch between the two cases at compile time. I think 2) is appropriate when we have --enable-assertions without --enable-check-asm, wheras 1) is appropriate the rest of the time. That way we can test for overflows in the CELT code, without preventing optimization of the SILK code.> Because of silk_LPC_analysis_filter(), celt_fir_permit_overflow() must > behave the same for both floating-point and fixed-point, and this is why > we defined ADD32_FIXED(), ..., PSHR32_FIXED() etc.I don't think you will need these anymore, but if you ever need fixed-point macros that remain integer for float compilation, then you should use the silk_*() fixed-point macros (and the code should be in silk/).> For the NEON optimization part, the previous celt_fir() optimization > calls xcorr_kernel(). We tested and found that calling > the xcorr_kernel() optimization didn't help too much here. The > optimization in the patch is about 1% faster than simply > calling xcorr_kernel() for the whole encoder. Considering the really > small size of the new optimization, it's better to not call > xcorr_kernel() to get 1% faster.It seems like the new Neon code does essentially the same thing as the existing xcorr_kernel(), so the two should remain shared to reduce maintenance burden and maximize speed. If you're able to come up with something faster than xcorr_kernel() (which apparently you have), then it seems like it should be usable by the other functions that currently use xcorr_kernel(). Or did I miss anything here? Cheers, Jean-Marc
Linfeng Zhang
2017-Mar-01 18:49 UTC
[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON
> > I believe the solution would be to always have either: > 1) USE_CELT_FIR=1 and use ovflw() macros in the xcorr code; or > 2) USE_CELT_FIR=0 and no ovflw() in the xcorr code >I prefer to create a function named silk_fir() with optimization to do the calculation when USE_CELT_FIR=0. xcorr_kernel() itself is great and provides many gains. The only issue is that calling it in a for loop makes it less efficient. xcorr_kernel() is called in several functions including celt_fir(), celt_pitch_xcorr() and celt_iir(). All these functions are not heavy hitters. silk_LPC_analysis_filter()'s CPU cycles are 6.8% with complexity 8 and 8.9% with complexity 5 out of the whole encoder. It probably makes sense to have a specific optimization to not calling xcorr_kernel() too many times to save 1% to 1.5% CPU cycles here. How do you think? Thanks, Linfeng> We can then switch between the two cases at compile time. I think 2) is > appropriate when we have --enable-assertions without --enable-check-asm, > wheras 1) is appropriate the rest of the time. That way we can test for > overflows in the CELT code, without preventing optimization of the SILK > code. > > > Because of silk_LPC_analysis_filter(), celt_fir_permit_overflow() must > > behave the same for both floating-point and fixed-point, and this is why > > we defined ADD32_FIXED(), ..., PSHR32_FIXED() etc. > > I don't think you will need these anymore, but if you ever need > fixed-point macros that remain integer for float compilation, then you > should use the silk_*() fixed-point macros (and the code should be in > silk/). > > > For the NEON optimization part, the previous celt_fir() optimization > > calls xcorr_kernel(). We tested and found that calling > > the xcorr_kernel() optimization didn't help too much here. The > > optimization in the patch is about 1% faster than simply > > calling xcorr_kernel() for the whole encoder. Considering the really > > small size of the new optimization, it's better to not call > > xcorr_kernel() to get 1% faster. > > It seems like the new Neon code does essentially the same thing as the > existing xcorr_kernel(), so the two should remain shared to reduce > maintenance burden and maximize speed. If you're able to come up with > something faster than xcorr_kernel() (which apparently you have), then > it seems like it should be usable by the other functions that currently > use xcorr_kernel(). Or did I miss anything here? > > Cheers, > > Jean-Marc >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xiph.org/pipermail/opus/attachments/20170301/89ed0208/attachment.html>
Reasonably Related Threads
- [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON
- [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON
- [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON
- [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON
- [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON