thr3ads.net - search: "celt

[PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

2017 Feb 15

2

Hi Jean-Marc, The original celt_fir() is a little bit messy. It has 2 branches chosen by #ifdef SMALL_FOOTPRINT. For floating-point, the 2 branches are identical (except the operation sequence of accumulating x[i] to sum, which is not a big deal). For fixed-point, the 2 branches are different. I separate them into 2 functions: the ne...

ASM runtime detection and optimizations

2013 May 23

2

...ndex); @@ -496,7 +499,7 @@ static void celt_decode_lost(CELTDecoder * OPUS_RESTRICT st, opus_val16 * OPUS_R ROUND16(buf[DECODE_BUFFER_SIZE-exc_length-1-i], SIG_SHIFT); } /* Compute the excitation for exc_length samples before the loss. */ - celt_fir(exc+MAX_PERIOD-exc_length, lpc+c*LPC_ORDER, + celt_fir[st->arch&OPUS_ARCHMASK](exc+MAX_PERIOD-exc_length, lpc+c*LPC_ORDER, exc+MAX_PERIOD-exc_length, exc_length, LPC_ORDER, lpc_mem); } diff --git a/celt/celt_encoder.c b/celt/celt_encoder.c index 26e6...
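The diff in this snippet replaces a direct call to celt_fir() with a lookup in a table of function pointers indexed by `st->arch&OPUS_ARCHMASK`. A minimal sketch of that run-time dispatch pattern follows. All names here (`demo_dot`, `DEMO_DOT_IMPL`, `DEMO_ARCHMASK`) are hypothetical and the dot product is only a stand-in for the real filter; none of this is taken from the Opus sources.

```c
#include <assert.h>

#define DEMO_ARCHMASK 3  /* hypothetical: the table has DEMO_ARCHMASK + 1 slots */

typedef int (*demo_dot_fn)(const short *a, const short *b, int n);

/* Portable reference implementation. */
static int demo_dot_c(const short *a, const short *b, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Stand-in for an optimized variant: unrolled by two, same result. */
static int demo_dot_unrolled(const short *a, const short *b, int n)
{
    int i, s0 = 0, s1 = 0;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    if (i < n)
        s0 += a[i] * b[i];
    return s0 + s1;
}

/* One entry per architecture level; the lower levels fall back to C. */
static const demo_dot_fn DEMO_DOT_IMPL[DEMO_ARCHMASK + 1] = {
    demo_dot_c, demo_dot_c, demo_dot_unrolled, demo_dot_unrolled
};

/* Call sites mask the arch value so the index cannot leave the table. */
static int demo_dot(const short *a, const short *b, int n, int arch)
{
    assert(n >= 0);
    return DEMO_DOT_IMPL[arch & DEMO_ARCHMASK](a, b, n);
}
```

Because every slot has the same signature and must give the same result, a benchmark can swap the table entries to compare implementations without touching the call sites.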

[PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

2017 Feb 18

0

Hi Linfeng, On 15/02/17 04:05 PM, Linfeng Zhang wrote: > The original celt_fir() is a little bit messy. It has 2 branches chosen > by #ifdef SMALL_FOOTPRINT. Yeah, I agree that the #ifdef SMALL_FOOTPRINT in celt_fir() is a bit of overkill since it's not saving much code space. I just pushed a commit that gets rid of it, also refactoring the #else case a bit (see below...

ARM NEON optimization -- celt_fir()

2016 Jun 17

0

...l hasn’t gotten around to reviewing). As they used Neon intrinsics, several of these actually applied to both armv7 and aarch64 Neon. In particular, note http://lists.xiph.org/pipermail/opus/2015-December/003339.html , which added a Neon-optimized version of xcorr_kernel. xcorr_kernel is used in celt_fir, celt_iir, and celt_pitch_xcorr. > On Jun 17, 2016, at 5:09 PM, Linfeng Zhang <linfengz at google.com> wrote: > > Hi all, > > This is Linfeng Zhang from Google. I'll work on ARM NEON optimization in the > next few months. > > I'm submitting 2 patches in the...

[PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

2017 Feb 15

4

Hi, Attached are two patches. Patch 1 refactors silk_LPC_analysis_filter(). And Patch 2 optimizes the new function celt_fir_permit_overflow() for ARM NEON. Please recommend a better function name. We did the same internal code review and testing already. Thanks, Linfeng

ARM NEON optimization -- celt_fir()

2016 Jun 17

5

Hi all, This is Linfeng Zhang from Google. I'll work on ARM NEON optimization in the next few months. I'm submitting 2 patches in the following couple of emails, which contain the newly created celt_fir_neon(). I revised celt_fir_c() to not pass in argument "mem" in Patch 1. If there are concerns about this change, please let me know. Many thanks for your comments. Linfeng Zhang

[PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

2017 Mar 01

2

> > I believe the solution would be to always have either: > 1) USE_CELT_FIR=1 and use ovflw() macros in the xcorr code; or > 2) USE_CELT_FIR=0 and no ovflw() in the xcorr code > I prefer to create a function named silk_fir() with optimization to do the calculation when USE_CELT_FIR=0. xcorr_kernel() itself is great and provides many gains. The only issue is that ca...

[PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

2017 Mar 01

3

...corr_kernel() so > that calling it in a loop is more efficient? > If it could be inlined, it would be more efficient. Besides memory bouncing, frequent function calls are expensive. The other advantage to wiring up xcorr_kernel() is that it applies in more > places than your intrinsics-only celt_fir() implementation. > I agree. One solution is to put the outer for(N) loop inside xcorr_kernel() so that it returns N results instead of 4 (similar to what the celt_fir() NEON intrinsics did). This would make it both efficient and universal. Thanks,
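To make the trade-off in this message concrete, here is a rough sketch of an FIR filter built on a kernel that produces four outputs per call, next to a direct version that loops over all N outputs itself. The names (`demo_xcorr_kernel4`, `demo_fir_kernel4`, `demo_fir_direct`) are hypothetical, plain `int` arithmetic stands in for the codec's fixed-point types, and the layout is an assumption for illustration rather than the actual Opus code.

```c
#include <assert.h>

/* Hypothetical 4-output kernel: sum[k] += x[j] * y[j + k] for k = 0..3.
   Reads y[0 .. len + 2]. */
static void demo_xcorr_kernel4(const int *x, const int *y, int sum[4], int len)
{
    int j, k;
    assert(len >= 0);
    for (j = 0; j < len; j++)
        for (k = 0; k < 4; k++)
            sum[k] += x[j] * y[j + k];
}

/* FIR built on the kernel: y[i] = x[i] + sum_j num[j] * x[i - 1 - j].
   x must have ord valid history samples before x[0].
   rnum is num in reverse order: rnum[j] = num[ord - 1 - j]. */
static void demo_fir_kernel4(const int *x, const int *rnum, int *y, int N, int ord)
{
    int i, k;
    for (i = 0; i + 3 < N; i += 4) {
        int sum[4] = { x[i], x[i + 1], x[i + 2], x[i + 3] };
        demo_xcorr_kernel4(rnum, x + i - ord, sum, ord);  /* one call per 4 outputs */
        for (k = 0; k < 4; k++)
            y[i + k] = sum[k];
    }
    for (; i < N; i++) {  /* scalar tail for the last N % 4 outputs */
        int j, s = x[i];
        for (j = 0; j < ord; j++)
            s += rnum[j] * x[i + j - ord];
        y[i] = s;
    }
}

/* Direct version: the loop over all N outputs lives inside one function. */
static void demo_fir_direct(const int *x, const int *num, int *y, int N, int ord)
{
    int i, j;
    for (i = 0; i < N; i++) {
        int s = x[i];
        for (j = 0; j < ord; j++)
            s += num[j] * x[i - 1 - j];
        y[i] = s;
    }
}
```

The first version pays one function call, and any register spills that go with it, for every four outputs; moving the outer loop into the kernel, as suggested above, would remove that per-call cost while keeping a single shared routine.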

AVX Optimizations

2015 Nov 05

2

Yes, Thank you. I'll follow up with the AVX code and tests for pitch code. Radu -----Original Message----- From: opus-bounces at xiph.org [mailto:opus-bounces at xiph.org] On Behalf Of Timothy B. Terriberry Sent: Thursday, November 5, 2015 10:31 AM To: opus at xiph.org Subject: Re: [opus] AVX Optimizations Velea, Radu wrote: > I've created a pull request[1] to enable configuration

Bug fix in celt_lpc.c and some xcorr_kernel optimizations

2013 Jun 07

2

...n git now. Your SSE version seems to > also be slightly faster than mine -- probably due to the partial sums. > As for the NEON code, it would be good to compare the performance with > the code Aurélien Zanelli posted at > http://darkosphere.fr/public/0002-Add-optimized-NEON-version-of-celt_fir-celt_iir-and-.patch > > Cheers, > > Jean-Marc > >

[PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

2017 Mar 01

0

...elieve that the function prologue/epilogue is really responsible for 1% to 1.5% of the whole decoder cost. Perhaps it is just bouncing the values in and out of memory from the NEON pipeline or something like that which is expensive? Otherwise it seems to be doing exactly the same things as your celt_fir() (unless I've missed something, which is certainly possible). The other advantage to wiring up xcorr_kernel() is that it applies in more places than your intrinsics-only celt_fir() implementation.

Antw: Re: [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

2017 Mar 02

0

...g it in a loop is more efficient? >> > > If it could be inlined, it would be more efficient. Besides memory bouncing, > frequent function calls are expensive. > > The other advantage to wiring up xcorr_kernel() is that it applies in more >> places than your intrinsics-only celt_fir() implementation. >> > > I agree. > > One solution is to put the outer for(N) loop inside xcorr_kernel() so that > it returns N results instead of 4 (similar to what the celt_fir() NEON intrinsics > did). This would make it both efficient and universal. > > Thanks,

[PATCH 2/5] Optimize fixed-point celt_fir_c() for ARM NEON

2016 Jul 14

0

Create the fixed-point intrinsics optimization celt_fir_neon() for ARM NEON. Create test tests/test_unit_optimization to unit test the optimization.
---
 .gitignore | 1 +
 Makefile.am | 39 ++++-
 celt/arm/arm_celt_map.c | 17 +++
 celt/arm/celt_lpc_arm.h | 65 ++...

AVX Optimizations

2015 Nov 05

0

...arch(void); #else #define OPUS_ARCHMASK 0 static OPUS_INLINE int opus_select_arch(void) { return 0; diff --git a/celt/x86/x86_celt_map.c b/celt/x86/x86_celt_map.c index 1ed2acb..8e5e449 100644 --- a/celt/x86/x86_celt_map.c +++ b/celt/x86/x86_celt_map.c @@ -48,44 +48,47 @@ void (*const CELT_FIR_IMPL[OPUS_ARCHMASK + 1])( int ord, opus_val16 *mem, int arch ) = { celt_fir_c, /* non-sse */ celt_fir_c, celt_fir_c, MAY_HAVE_SSE4_1(celt_fir), /* sse4.1 */ + MAY_HAVE_SSE4_1(celt_fir) /* avx */...

[PATCH 2/5] Optimize fixed-point celt_fir_c() for ARM NEON

2016 Sep 28

2

...e existing EDSP asm. Testing on comp48-stereo.sw encoded to 64 kbps and decoded with a 15% loss rate on a Novena using opus_demo (by using RTCD and changing the function pointers to the version of the code to test), optimizing xcorr_kernel gives almost as much speed-up as intrinsics for all of celt_fir:
celt_fir_c, xcorr_kernel_c: 1753 ms (stddev 9) [1730 1740 {1740 1740 1740 1750 1750 1750 1750 1750 1750 1750 1750 1750 1750 1750 1760 1760 1760 1760 1770 1770} 1780 1860]
celt_fir_c, xcorr_kernel_neon: 1710 ms (stddev 12) [1680 1690 {1690 1690 1700 1700 1700 1700 1710 1710 1710 1710 1710 1710...

Bug fix in celt_lpc.c and some xcorr_kernel optimizations

2013 Jun 07

1

...s to >>> also be slightly faster than mine -- probably due to the partial sums. >>> As for the NEON code, it would be good to compare the performance with >>> the code Aurélien Zanelli posted at >>> http://darkosphere.fr/public/0002-Add-optimized-NEON-version-of-celt_fir-celt_iir-and-.patch >>> >>> >>> Cheers, >>> >>> Jean-Marc >>> >>> >> >> >

AVX Optimizations

2015 Nov 05

2

...arch(void); #else #define OPUS_ARCHMASK 0 static OPUS_INLINE int opus_select_arch(void) { return 0; diff --git a/celt/x86/x86_celt_map.c b/celt/x86/x86_celt_map.c index 1ed2acb..8e5e449 100644 --- a/celt/x86/x86_celt_map.c +++ b/celt/x86/x86_celt_map.c @@ -48,44 +48,47 @@ void (*const CELT_FIR_IMPL[OPUS_ARCHMASK + 1])( int ord, opus_val16 *mem, int arch ) = { celt_fir_c, /* non-sse */ celt_fir_c, celt_fir_c, MAY_HAVE_SSE4_1(celt_fir), /* sse4.1 */ + MAY_HAVE_SSE4_1(celt_fir) /* avx */...

Bug fix in celt_lpc.c and some xcorr_kernel optimizations

2013 Jun 07

2

Hi JM, At line 221 in celt_lpc.c (the celt_iir function) I think you really want the RESTORE_STACK statement to be before the #endif instead of after it. Also, I couldn't help notice that your SSE code for xcorr_kernel reads more than "len" elements of "_x". I don't know if that's really a problem when running the codec, but a tool like valgrind will have a

[PATCH] 02-

2013 May 21

0

..." +#ifdef ARM_HAVE_NEON +#include "celt_lpc_neon.h" +#endif + void _celt_lpc( opus_val16 *_lpc, /* out: [0...p-1] LPC coefficients */ const opus_val32 *ac, /* in: [0...p] autocorrelation values */ @@ -87,6 +91,7 @@ int p #endif } +#ifndef OVERRIDE_CELT_FIR void celt_fir(const opus_val16 *x, const opus_val16 *num, opus_val16 *y, @@ -101,7 +106,7 @@ void celt_fir(const opus_val16 *x, opus_val32 sum = SHL32(EXTEND32(x[i]), SIG_SHIFT); for (j=0;j<ord;j++) { - sum += MULT16_16(num[j],mem[j]); +...

[PATCH] 02-Add CELT filter optimizations

2013 May 21

2

..." +#ifdef ARM_HAVE_NEON +#include "celt_lpc_neon.h" +#endif + void _celt_lpc( opus_val16 *_lpc, /* out: [0...p-1] LPC coefficients */ const opus_val32 *ac, /* in: [0...p] autocorrelation values */ @@ -87,6 +91,7 @@ int p #endif } +#ifndef OVERRIDE_CELT_FIR void celt_fir(const opus_val16 *x, const opus_val16 *num, opus_val16 *y, @@ -101,7 +106,7 @@ void celt_fir(const opus_val16 *x, opus_val32 sum = SHL32(EXTEND32(x[i]), SIG_SHIFT); for (j=0;j<ord;j++) { - sum += MULT16_16(num[j],mem[j]); +...
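The inner loop visible in this diff accumulates `num[j] * mem[j]` onto the input sample scaled up by SIG_SHIFT. The sketch below shows one way such a fixed-point direct-form FIR with a history buffer can be written in plain C. The name `demo_fir_fixed`, the value `DEMO_SIG_SHIFT = 12`, and the way the history is shifted and the result rounded are assumptions made for illustration, since the snippet is cut off before those steps.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SIG_SHIFT 12  /* assumed: coefficients are in Q12 for this sketch */

/* y[i] = x[i] + sum_j num[j] * mem[j], where mem[] holds the previous
   ord input samples, most recent first. mem is updated in place. */
static void demo_fir_fixed(const int16_t *x, const int16_t *num, int16_t *y,
                           int N, int ord, int16_t *mem)
{
    int i, j;
    assert(N >= 0 && ord >= 0);
    for (i = 0; i < N; i++) {
        /* Scale the input up so it lines up with the Q12 products. */
        int32_t sum = (int32_t)x[i] * (1 << DEMO_SIG_SHIFT);
        for (j = 0; j < ord; j++)
            sum += (int32_t)num[j] * mem[j];
        /* Shift the history by one sample and store the newest input. */
        for (j = ord - 1; j >= 1; j--)
            mem[j] = mem[j - 1];
        if (ord > 0)
            mem[0] = x[i];
        /* Round back to 16 bits; assumes an arithmetic right shift
           for negative sums, as on common compilers. */
        y[i] = (int16_t)((sum + (1 << (DEMO_SIG_SHIFT - 1))) >> DEMO_SIG_SHIFT);
    }
}
```

Shifting `mem` on every sample is simple but costs `ord` moves per output, which is one reason a filter like this is a natural target for vectorization or for reading the history straight from the input buffer.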

search for: celt_fir