Linfeng Zhang
2017-Mar-01 18:49 UTC
[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON
> > I believe the solution would be to always have either: > 1) USE_CELT_FIR=1 and use ovflw() macros in the xcorr code; or > 2) USE_CELT_FIR=0 and no ovflw() in the xcorr code >I prefer to create a function named silk_fir() with optimization to do the calculation when USE_CELT_FIR=0. xcorr_kernel() itself is great and provides many gains. The only issue is that calling it in a for loop makes it less efficient. xcorr_kernel() is called in several functions including celt_fir(), celt_pitch_xcorr() and celt_iir(). All these functions are not heavy hitters. silk_LPC_analysis_filter()'s CPU cycles are 6.8% with complexity 8 and 8.9% with complexity 5 out of the whole encoder. It probably makes sense to have a specific optimization to not calling xcorr_kernel() too many times to save 1% to 1.5% CPU cycles here. How do you think? Thanks, Linfeng> We can then switch between the two cases at compile time. I think 2) is > appropriate when we have --enable-assertions without --enable-check-asm, > wheras 1) is appropriate the rest of the time. That way we can test for > overflows in the CELT code, without preventing optimization of the SILK > code. > > > Because of silk_LPC_analysis_filter(), celt_fir_permit_overflow() must > > behave the same for both floating-point and fixed-point, and this is why > > we defined ADD32_FIXED(), ..., PSHR32_FIXED() etc. > > I don't think you will need these anymore, but if you ever need > fixed-point macros that remain integer for float compilation, then you > should use the silk_*() fixed-point macros (and the code should be in > silk/). > > > For the NEON optimization part, the previous celt_fir() optimization > > calls xcorr_kernel(). We tested and found that calling > > the xcorr_kernel() optimization didn't help too much here. The > > optimization in the patch is about 1% faster than simply > > calling xcorr_kernel() for the whole encoder. Considering the really > > small size of the new optimization, it's better to not call > > xcorr_kernel() to get 1% faster. > > It seems like the new Neon code does essentially the same thing as the > existing xcorr_kernel(), so the two should remain shared to reduce > maintenance burden and maximize speed. If you're able to come up with > something faster than xcorr_kernel() (which apparently you have), then > it seems like it should be usable by the other functions that currently > use xcorr_kernel(). Or did I miss anything here? > > Cheers, > > Jean-Marc >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xiph.org/pipermail/opus/attachments/20170301/89ed0208/attachment.html>
Timothy B. Terriberry
2017-Mar-01 18:58 UTC
[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON
Linfeng Zhang wrote:> xcorr_kernel() itself is great and provides many gains. The only issue > is that calling it in a for loop makes it less efficient.Do you think it would be possible to improve the API of xcorr_kernel() so that calling it in a loop is more efficient? I haven't looked at an instruction-level profile, but I find it hard to believe that the function prologue/epilogue is really responsible for 1% to 1.5% of the whole decoder cost. Perhaps it is just bouncing the values in and out of memory from the NEON pipeline or something like that which is expensive? Otherwise it seems to be doing exactly the same things as your celt_fir() (unless I've missed something, which is certainly possible). The other advantage to wiring up xcorr_kernel() is that it applies in more places than your intrinsics-only celt_fir() implementation.
Linfeng Zhang
2017-Mar-01 19:30 UTC
[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON
Hi Timothy, Do you think it would be possible to improve the API of xcorr_kernel() so> that calling it in a loop is more efficient? >If it could be inlined, it will be more efficient. Besides memory bouncing, frequent function call is expensive. The other advantage to wiring up xcorr_kernel() is that it applies in more> places than your intrinsics-only celt_fir() implementation. >I agree. One solution is to put the outer for(N) loop inside xcorr_kernel() to let it return N results instead of 4 (similar to the celt_fir() NEON intrinsics did). This will make it efficient plus universal. Thanks, -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xiph.org/pipermail/opus/attachments/20170301/7bb0970c/attachment.html>
Maybe Matching Threads
- [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON
- [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON
- Antw: Re: [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON
- [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON
- [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON