thr3ads.net - opus - [opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit

If this information is useful, please help other people find it:
Share via:

Linfeng Zhang

2017-Feb-15 21:05 UTC

[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

Hi Jean-Marc,

The original celt_fir() is a little bit messy. It has 2 branches chosen by
#ifdef SMALL_FOOTPRINT.
For floating-point, the 2 branches are identical (except the operation
sequence of accumulating x[i] to sum, which is not a big deal).
For fixed-point, the 2 branches are different. I separate them into 2
functions: the new celt_fir(), and celt_fir_permit_overflow() which is the
SMALL_FOOTPRINT branch.
The only difference for fixed-point is:
celt_fir(): the sum is truncated first and then accumulated to x[i] and
saturated.
celt_fir_permit_overflow(): x[i] is accumulated to the sum first and then
truncated saturated.

Maybe this is the reason why silk_LPC_analysis_filter() switched the FIR
from celt_fir() to celt_fir_permit_overflow() half a year ago.

Because of silk_LPC_analysis_filter(), celt_fir_permit_overflow() must
behave the same for both floating-point and fixed-point, and this is why we
defined ADD32_FIXED(), ..., PSHR32_FIXED() etc.

It's still a messy.

For the NEON optimization part, the previous celt_fir() optimization calls
xcorr_kernel(). We tested and found that calling the xcorr_kernel()
optimization didn't help too much here. The optimization in the patch is
about 1% faster than simply calling xcorr_kernel() for the whole encoder.
Considering the really small size of the new optimization, it's better to
not call xcorr_kernel() to get 1% faster.

Thanks,
Linfeng

On Wed, Feb 15, 2017 at 12:06 PM, Jean-Marc Valin <jmvalin at jmvalin.ca>
wrote:
> Hi Linfeng,
>
> Can you give me a bit more details about the purpose of this patchset.
> It seems to me like it's mostly duplicating the celt_fir()
> optimizations? Did I miss anything?
>
> Cheers,
>
>         Jean-Marc
>
> On 15/02/17 02:22 PM, Linfeng Zhang wrote:
> > Hi,
> >
> > Attached are two patches. Patch 1 refactors
silk_LPC_analysis_filter().
> > And Patch 2 optimizes the new function celt_fir_permit_overflow() for
> > ARM NEON.
> >
> > Please recommend a better function name.
> >
> > We did the same internal code review and testing already.
> >
> > Thanks,
> > Linfeng
> >
> >
> >
> > _______________________________________________
> > opus mailing list
> > opus at xiph.org
> > http://lists.xiph.org/mailman/listinfo/opus
> >
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xiph.org/pipermail/opus/attachments/20170215/99a85dd6/attachment.html>

Jean-Marc Valin

2017-Feb-18 00:30 UTC

head link

[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

Hi Linfeng,

On 15/02/17 04:05 PM, Linfeng Zhang wrote:> The original celt_fir() is a little bit messy. It has 2 branches chosen
> by #ifdef SMALL_FOOTPRINT.
Yeah, I agree that the #ifdef SMALL_FOOTPRINT in celt_fir() is a bit of
overkill since it's not saving much code space. I just pushed a commit
that gets rid of it, also refactoring the #else case a bit (see below).
> For floating-point, the 2 branches are identical (except the operation
> sequence of accumulating x[i] to sum, which is not a big deal).
> For fixed-point, the 2 branches are different. I separate them into 2
> functions: the new celt_fir(), and celt_fir_permit_overflow() which is
> the SMALL_FOOTPRINT branch.
> The only difference for fixed-point is:
> celt_fir(): the sum is truncated first and then accumulated to x[i] and
> saturated.
> celt_fir_permit_overflow(): x[i] is accumulated to the sum first and
> then truncated saturated.
Actually, the two branches are bit-exact for fixed-point. There is
indeed a difference in where x[i] gets accumulated, but because in the
SMALL_FOOTPRINT case it first gets shifted up by SIG_SHIFT, the result
of the downshift (also by SIG_SHIFT) is the same no matter when it gets
added. That being said, I thought adding at the beginning was nicer so I
changed the remaining code to do that.
> Maybe this is the reason why silk_LPC_analysis_filter() switched the FIR
> from celt_fir() to celt_fir_permit_overflow() half a year ago.
No, it's another issue. silk_LPC_analysis_filter() was always bit-exact
with celt_fir(), with or without SMALL_FOOTPRINT. The only difference
lies with the signed integer overflow suppression.
silk_LPC_analysis_filter() relies on the knowledge that signed integer
overflows can occur during the accumulation, but they are guaranteed to
cancel each other (i.e. equal wrap-arounds in each direction). For that
reason, we use silk_SMLABB_ovflw() which casts to unsigned to avoid the
undefined behaviour and thus the ubsan warnings. In celt_fir() and the
pitch correlation code it uses, we know there should not be signed
overflows, so we would like to detect any problem when using ubsan.

I believe the solution would be to always have either:
1) USE_CELT_FIR=1 and use ovflw() macros in the xcorr code; or
2) USE_CELT_FIR=0 and no ovflw() in the xcorr code

We can then switch between the two cases at compile time. I think 2) is
appropriate when we have --enable-assertions without --enable-check-asm,
wheras 1) is appropriate the rest of the time. That way we can test for
overflows in the CELT code, without preventing optimization of the SILK
code.
> Because of silk_LPC_analysis_filter(), celt_fir_permit_overflow() must
> behave the same for both floating-point and fixed-point, and this is why
> we defined ADD32_FIXED(), ..., PSHR32_FIXED() etc.
I don't think you will need these anymore, but if you ever need
fixed-point macros that remain integer for float compilation, then you
should use the silk_*() fixed-point macros (and the code should be in
silk/).
> For the NEON optimization part, the previous celt_fir() optimization
> calls xcorr_kernel(). We tested and found that calling
> the xcorr_kernel() optimization didn't help too much here. The
> optimization in the patch is about 1% faster than simply
> calling xcorr_kernel() for the whole encoder. Considering the really
> small size of the new optimization, it's better to not call
> xcorr_kernel() to get 1% faster.
It seems like the new Neon code does essentially the same thing as the
existing xcorr_kernel(), so the two should remain shared to reduce
maintenance burden and maximize speed. If you're able to come up with
something faster than xcorr_kernel() (which apparently you have), then
it seems like it should be usable by the other functions that currently
use xcorr_kernel(). Or did I miss anything here?

Cheers,

	Jean-Marc

Linfeng Zhang

2017-Mar-01 18:49 UTC

head link

[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

>
> I believe the solution would be to always have either:
> 1) USE_CELT_FIR=1 and use ovflw() macros in the xcorr code; or
> 2) USE_CELT_FIR=0 and no ovflw() in the xcorr code
>
I prefer to create a function named silk_fir() with optimization to do the
calculation when USE_CELT_FIR=0.

xcorr_kernel() itself is great and provides many gains. The only issue is
that calling it in a for loop makes it less efficient.

xcorr_kernel() is called in several functions including
celt_fir(), celt_pitch_xcorr() and celt_iir(). All these functions are not
heavy hitters.
silk_LPC_analysis_filter()'s CPU cycles are 6.8% with complexity 8 and 8.9%
with complexity 5 out of the whole encoder. It probably makes sense to have
a specific optimization to not calling xcorr_kernel() too many times to
save 1% to 1.5% CPU cycles here.

How do you think?

Thanks,
Linfeng

> We can then switch between the two cases at compile time. I think 2) is
> appropriate when we have --enable-assertions without --enable-check-asm,
> wheras 1) is appropriate the rest of the time. That way we can test for
> overflows in the CELT code, without preventing optimization of the SILK
> code.
>
> > Because of silk_LPC_analysis_filter(), celt_fir_permit_overflow() must
> > behave the same for both floating-point and fixed-point, and this is
why
> > we defined ADD32_FIXED(), ..., PSHR32_FIXED() etc.
>
> I don't think you will need these anymore, but if you ever need
> fixed-point macros that remain integer for float compilation, then you
> should use the silk_*() fixed-point macros (and the code should be in
> silk/).
>
> > For the NEON optimization part, the previous celt_fir() optimization
> > calls xcorr_kernel(). We tested and found that calling
> > the xcorr_kernel() optimization didn't help too much here. The
> > optimization in the patch is about 1% faster than simply
> > calling xcorr_kernel() for the whole encoder. Considering the really
> > small size of the new optimization, it's better to not call
> > xcorr_kernel() to get 1% faster.
>
> It seems like the new Neon code does essentially the same thing as the
> existing xcorr_kernel(), so the two should remain shared to reduce
> maintenance burden and maximize speed. If you're able to come up with
> something faster than xcorr_kernel() (which apparently you have), then
> it seems like it should be usable by the other functions that currently
> use xcorr_kernel(). Or did I miss anything here?
>
> Cheers,
>
>         Jean-Marc
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xiph.org/pipermail/opus/attachments/20170301/89ed0208/attachment.html>

Maybe Matching Threads

Search for more maybe matching threads

opus - Mar 2017 - [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

Maybe Matching Threads