Linfeng Zhang

2017-Mar-01 19:30 UTC

[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

Hi Timothy,

> Do you think it would be possible to improve the API of xcorr_kernel() so
> that calling it in a loop is more efficient?

If it could be inlined, it would be more efficient. Besides the memory
bouncing, frequent function calls are expensive.

> The other advantage to wiring up xcorr_kernel() is that it applies in more
> places than your intrinsics-only celt_fir() implementation.

I agree.

One solution is to put the outer for(N) loop inside xcorr_kernel() so that
it returns N results instead of 4 (similar to what the celt_fir() NEON
intrinsics did). This would make it both efficient and universal.
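
The idea above can be sketched in plain C as follows. This is only an illustration under stated assumptions, not the Opus code: the names xcorr4_sketch() and xcorr_batch_sketch(), the float sample type, and the argument order are all hypothetical. It shows a 4-lag kernel and a wrapper that keeps the loop over output lags inside a single call.

```c
#include <assert.h>

/* Hypothetical 4-lag kernel (not the Opus function):
   sum[k] += dot(x[0..len), y[k..k+len)) for k = 0..3.
   The caller must provide len + 3 readable samples in y. */
static void xcorr4_sketch(const float *x, const float *y, float sum[4], int len)
{
    int j, k;
    assert(len > 0);
    for (j = 0; j < len; j++)
        for (k = 0; k < 4; k++)
            sum[k] += x[j] * y[j + k];
}

/* Hypothetical batched variant: the loop over the n output lags lives
   inside the function, so one call produces all n results.
   y must hold len + n - 1 samples; out must hold n results. */
void xcorr_batch_sketch(const float *x, const float *y, float *out,
                        int len, int n)
{
    int i, j, k;
    assert(len > 0 && n > 0);
    /* Full groups of four lags go through the 4-lag kernel. */
    for (i = 0; i + 4 <= n; i += 4) {
        float sum[4] = {0, 0, 0, 0};
        xcorr4_sketch(x, y + i, sum, len);
        for (k = 0; k < 4; k++)
            out[i + k] = sum[k];
    }
    /* The remaining 0..3 lags are computed one at a time. */
    for (; i < n; i++) {
        float s = 0;
        for (j = 0; j < len; j++)
            s += x[j] * y[i + j];
        out[i] = s;
    }
}
```

Because the 4-lag kernel is static and called from the same file, the compiler is free to inline it into the batched loop, and a platform-specific version would only need to replace the one outer function.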

Thanks,

Ulrich Windl

2017-Mar-02 07:27 UTC

[opus] Antw: Re: [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

Hi!

I'm not deep in the code, but in my experience even older gcc (4.3.4) does
function inlining at -O2, and at -O3 it inlines almost any function inside one
module. Once I even let it inline across modules (-combine). I'm not talking
about explicit inline functions, just about automatic optimization.
So did you check that frequent function calls actually happen? I'm a bit
afraid that after all the suggested optimizations the code may be rather hard
to understand. I think compilers should do the dirty work (i.e., optimizing and
inlining). Sometimes "static" and "const" qualifiers help the compiler to
optimize...
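
As a small illustration of the "static" and "const" remark, here is a minimal sketch with hypothetical names (dot_sketch() and energy_sketch() are not taken from the Opus sources). Whether the helper is actually inlined depends on the compiler and the optimization level, so treat this as an assumption to check in the generated assembly rather than a guarantee.

```c
#include <assert.h>

/* File-local helper. `static` gives it internal linkage, so the compiler
   can see every caller in this translation unit and may inline it and
   drop the standalone copy. `const` documents that the inputs are only
   read. Hypothetical example, not Opus code. */
static int dot_sketch(const int *a, const int *b, int len)
{
    int i, sum = 0;
    assert(len >= 0);
    for (i = 0; i < len; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Public entry point that calls the helper directly, so the call is a
   candidate for automatic inlining. */
int energy_sketch(const int *x, int len)
{
    return dot_sketch(x, x, len);
}
```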

Regards,
Ulrich
>>> Linfeng Zhang <linfengz at google.com> wrote on 01.03.2017 at 20:30 in
message <CAKoqLCANyWDPpy4rccL3TJ37gbhWxRWkCrqR9GCATGhTFoaDyA at mail.gmail.com>:
> Hi Timothy,
> 
>> Do you think it would be possible to improve the API of xcorr_kernel() so
>> that calling it in a loop is more efficient?
>>
>
> If it could be inlined, it will be more efficient. Besides memory bouncing,
> frequent function call is expensive.
>
>> The other advantage to wiring up xcorr_kernel() is that it applies in more
>> places than your intrinsics-only celt_fir() implementation.
>>
> 
> I agree.
> 
> One solution is to put the outer for(N) loop inside xcorr_kernel() to let
> it return N results instead of 4 (similar to the celt_fir() NEON intrinsics
> did). This will make it efficient plus universal.
> 
> Thanks,

Linfeng Zhang

2017-Mar-02 21:12 UTC

[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

Thanks, Ulrich!

Yes, but when the jump table is active, the platform-specific optimized
functions cannot be inlined.
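
To make the jump-table point concrete, here is a minimal sketch of run-time dispatch through a table of function pointers. All names (dot_fn, DOT_TABLE_SKETCH, dot_dispatch_sketch(), the arch index) are hypothetical and chosen for illustration only; dot_opt_sketch() is a plain-C stand-in for a platform-specific version. Because the callee is selected at run time through a pointer, the compiler generally cannot inline it at the call site.

```c
#include <assert.h>

typedef int (*dot_fn)(const int *a, const int *b, int len);

/* Generic reference version. */
static int dot_c_sketch(const int *a, const int *b, int len)
{
    int i, sum = 0;
    for (i = 0; i < len; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Stand-in for a platform-specific version (unrolled by two here). */
static int dot_opt_sketch(const int *a, const int *b, int len)
{
    int i, s0 = 0, s1 = 0;
    for (i = 0; i + 2 <= len; i += 2) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    if (i < len)
        s0 += a[i] * b[i];
    return s0 + s1;
}

/* Hypothetical jump table indexed by a CPU level detected at run time. */
static const dot_fn DOT_TABLE_SKETCH[2] = { dot_c_sketch, dot_opt_sketch };

/* The call goes through a pointer chosen at run time, so each call in a
   hot loop typically remains a real (indirect) function call. */
int dot_dispatch_sketch(const int *a, const int *b, int len, int arch)
{
    assert(arch >= 0 && arch < 2);
    return DOT_TABLE_SKETCH[arch](a, b, len);
}
```

This is why moving the hot loop inside the dispatched function helps: the indirect call is then paid once per batch instead of once per group of results.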

On Wed, Mar 1, 2017 at 11:27 PM, Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> Hi!
>
> I'm not deep in the code, but from my experience even older gcc (4.3.4)
> does function inlining at -O2, and at -O3 it inlines almost any function
> inside one module. Once I even let it inline across modules (-combine). I'm
> not talking about explicit inline functions; just about automatic
> optimization.
> So did you check that frequent function calls actually happen? I'm a bit
> afraid that after all those optimizations suggested the code may be rather
> hard to understand. I think compilers should do the dirty work (i.e.:
> optimizing and inlining). Sometimes "static" and "const" attributes help
> the compiler to optimize...
>
> Regards,
> Ulrich

Timothy B. Terriberry

2017-Mar-21 22:56 UTC

[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

Linfeng Zhang wrote:
> One solution is to put the outer for(N) loop inside xcorr_kernel() to
> let it return N results instead of 4 (similar to the celt_fir() NEON
> intrinsics did). This will make it efficient plus universal.

Sorry for not replying to this earlier, but isn't this what
celt_pitch_xcorr() does? Or am I missing something?
