thr3ads.net - opus - [opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit

If this information is useful, please help other people find it:
Share via:

Linfeng Zhang

2017-Mar-01 18:49 UTC

[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

>
> I believe the solution would be to always have either:
> 1) USE_CELT_FIR=1 and use ovflw() macros in the xcorr code; or
> 2) USE_CELT_FIR=0 and no ovflw() in the xcorr code
>
I prefer to create a function named silk_fir() with optimization to do the
calculation when USE_CELT_FIR=0.

xcorr_kernel() itself is great and provides many gains. The only issue is
that calling it in a for loop makes it less efficient.

xcorr_kernel() is called in several functions including
celt_fir(), celt_pitch_xcorr() and celt_iir(). All these functions are not
heavy hitters.
silk_LPC_analysis_filter()'s CPU cycles are 6.8% with complexity 8 and 8.9%
with complexity 5 out of the whole encoder. It probably makes sense to have
a specific optimization to not calling xcorr_kernel() too many times to
save 1% to 1.5% CPU cycles here.

How do you think?

Thanks,
Linfeng

> We can then switch between the two cases at compile time. I think 2) is
> appropriate when we have --enable-assertions without --enable-check-asm,
> wheras 1) is appropriate the rest of the time. That way we can test for
> overflows in the CELT code, without preventing optimization of the SILK
> code.
>
> > Because of silk_LPC_analysis_filter(), celt_fir_permit_overflow() must
> > behave the same for both floating-point and fixed-point, and this is
why
> > we defined ADD32_FIXED(), ..., PSHR32_FIXED() etc.
>
> I don't think you will need these anymore, but if you ever need
> fixed-point macros that remain integer for float compilation, then you
> should use the silk_*() fixed-point macros (and the code should be in
> silk/).
>
> > For the NEON optimization part, the previous celt_fir() optimization
> > calls xcorr_kernel(). We tested and found that calling
> > the xcorr_kernel() optimization didn't help too much here. The
> > optimization in the patch is about 1% faster than simply
> > calling xcorr_kernel() for the whole encoder. Considering the really
> > small size of the new optimization, it's better to not call
> > xcorr_kernel() to get 1% faster.
>
> It seems like the new Neon code does essentially the same thing as the
> existing xcorr_kernel(), so the two should remain shared to reduce
> maintenance burden and maximize speed. If you're able to come up with
> something faster than xcorr_kernel() (which apparently you have), then
> it seems like it should be usable by the other functions that currently
> use xcorr_kernel(). Or did I miss anything here?
>
> Cheers,
>
>         Jean-Marc
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xiph.org/pipermail/opus/attachments/20170301/89ed0208/attachment.html>

Timothy B. Terriberry

2017-Mar-01 18:58 UTC

head link

[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

Linfeng Zhang wrote:> xcorr_kernel() itself is great and provides many gains. The only issue
> is that calling it in a for loop makes it less efficient.
Do you think it would be possible to improve the API of xcorr_kernel() 
so that calling it in a loop is more efficient?

I haven't looked at an instruction-level profile, but I find it hard to 
believe that the function prologue/epilogue is really responsible for 1% 
to 1.5% of the whole decoder cost. Perhaps it is just bouncing the 
values in and out of memory from the NEON pipeline or something like 
that which is expensive? Otherwise it seems to be doing exactly the same 
things as your celt_fir() (unless I've missed something, which is 
certainly possible).

The other advantage to wiring up xcorr_kernel() is that it applies in 
more places than your intrinsics-only celt_fir() implementation.

Linfeng Zhang

2017-Mar-01 19:30 UTC

head link

[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

Hi Timothy,

Do you think it would be possible to improve the API of xcorr_kernel()
so> that calling it in a loop is more efficient?
>
If it could be inlined, it will be more efficient. Besides memory bouncing,
frequent function call is expensive.

The other advantage to wiring up xcorr_kernel() is that it applies in
more> places than your intrinsics-only celt_fir() implementation.
>
I agree.

One solution is to put the outer for(N) loop inside xcorr_kernel() to let
it return N results instead of 4 (similar to the celt_fir() NEON intrinsics
did). This will make it efficient plus universal.

Thanks,
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xiph.org/pipermail/opus/attachments/20170301/7bb0970c/attachment.html>

Maybe Matching Threads

Search for more possibly parallel threads

opus - Mar 2017 - [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

Maybe Matching Threads