thr3ads.net - search: "xcorr

Displaying 20 results from an estimated 34 matches for "xcorr_kernel".

[PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

2017 Mar 01

...> I believe the solution would be to always have either: > 1) USE_CELT_FIR=1 and use ovflw() macros in the xcorr code; or > 2) USE_CELT_FIR=0 and no ovflw() in the xcorr code > I prefer to create a function named silk_fir() with optimization to do the calculation when USE_CELT_FIR=0. xcorr_kernel() itself is great and provides many gains. The only issue is that calling it in a for loop makes it less efficient. xcorr_kernel() is called in several functions including celt_fir(), celt_pitch_xcorr() and celt_iir(). All these functions are not heavy hitters. silk_LPC_analysis_filter()'s CPU...

[PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

2017 Feb 15

...o. Because of silk_LPC_analysis_filter(), celt_fir_permit_overflow() must behave the same for both floating-point and fixed-point, and this is why we defined ADD32_FIXED(), ..., PSHR32_FIXED() etc. It's still messy. For the NEON optimization part, the previous celt_fir() optimization calls xcorr_kernel(). We tested and found that calling the xcorr_kernel() optimization didn't help too much here. The optimization in the patch is about 1% faster than simply calling xcorr_kernel() for the whole encoder. Considering the really small size of the new optimization, it's better to not call xcorr_...

Bug fix in celt_lpc.c and some xcorr_kernel optimizations

2013 Jun 07

Hi JM, At line 221 in celt_lpc.c (the celt_iir function) I think you really want the RESTORE_STACK statement to be before the #endif instead of after it. Also, I couldn't help notice that your SSE code for xcorr_kernel reads more than "len" elements of "_x". I don't know if that's really a problem when running the codec, but a tool like valgrind will have a fit if it's accessing uninitialized memory. Here's a version I wrote a few days ago you're welcome to use that does...
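The over-read concern raised in this message can be illustrated with a plain scalar kernel. The following is a minimal sketch, not the Opus source: the name `xcorr4_ref` and the use of `int` samples are assumptions made for illustration. It accumulates four adjacent correlation lags and touches exactly `len` elements of `x` and `len + 3` elements of `y`, which is the bound a SIMD version would have to respect to keep tools like valgrind quiet.

```c
#include <stddef.h>

/* Hypothetical reference kernel (not taken from Opus).
   Accumulates four adjacent cross-correlation lags:
       sum[k] += x[j] * y[j + k]   for j in [0, len), k in [0, 4)
   Reads exactly len elements of x and len + 3 elements of y. */
static void xcorr4_ref(const int *x, const int *y, int sum[4], int len)
{
    int j, k;
    for (j = 0; j < len; j++) {
        for (k = 0; k < 4; k++) {
            sum[k] += x[j] * y[j + k];
        }
    }
}
```

A vectorized variant that loads `x` four elements at a time would need a scalar tail loop (or padded buffers) whenever `len` is not a multiple of four to stay within these bounds.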

Bug fix in celt_lpc.c and some xcorr_kernel optimizations

2013 Jun 07

...Jean-Marc On 06/06/2013 08:07 PM, John Ridges wrote: > Hi JM, > > At line 221 in celt_lpc.c (the celt_iir function) I think you really > want the RESTORE_STACK statement to be before the #endif instead of > after it. Also, I couldn't help notice that your SSE code for > xcorr_kernel reads more than "len" elements of "_x". I don't know if > that's really a problem when running the codec, but a tool like valgrind > will have a fit if it's accessing uninitialized memory. Here's a version > I wrote a few days ago you're welcome t...

AVX Optimizations

2015 Nov 05

Yes, Thank you. I'll follow up with the AVX code and tests for pitch code. Radu -----Original Message----- From: opus-bounces at xiph.org [mailto:opus-bounces at xiph.org] On Behalf Of Timothy B. Terriberry Sent: Thursday, November 5, 2015 10:31 AM To: opus at xiph.org Subject: Re: [opus] AVX Optimizations Velea, Radu wrote: > I've created a pull request[1] to enable configuration

[PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

2017 Mar 01

Hi Timothy, Do you think it would be possible to improve the API of xcorr_kernel() so > that calling it in a loop is more efficient? > If it could be inlined, it will be more efficient. Besides memory bouncing, frequent function call is expensive. The other advantage to wiring up xcorr_kernel() is that it applies in more > places than your intrinsics-only celt_fir()...

[Aarch64 v2 08/18] Add Neon fixed-point implementation of xcorr_kernel.

2015 Nov 21

...+58,23 @@ void (*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *, # endif # endif /* FIXED_POINT */ +#if defined(FIXED_POINT) && defined(OPUS_HAVE_RTCD) && \ + defined(OPUS_ARM_MAY_HAVE_NEON_INTR) && !defined(OPUS_ARM_PRESUME_NEON_INTR) + +void (*const XCORR_KERNEL_IMPL[OPUS_ARCHMASK + 1])( + const opus_val16 *x, + const opus_val16 *y, + opus_val32 sum[4], + int len +) = { + xcorr_kernel_c, /* ARMv4 */ + xcorr_kernel_c, /* EDSP */ + xcorr_kernel_c, /* Media */ +...
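The table in this patch excerpt follows a run-time dispatch pattern: an array of function pointers indexed by a detected CPU level. The following is a minimal sketch of that pattern under assumed names; `ARCH_MASK`, `xcorr_generic`, `xcorr_fast`, `XCORR_TABLE` and `xcorr_dispatch` are hypothetical and do not come from the patch, and `xcorr_fast` is only a stand-in for a SIMD implementation.

```c
#include <stddef.h>

#define ARCH_MASK 3 /* hypothetical: four CPU feature levels, 0..3 */

typedef void (*xcorr_fn)(const short *x, const short *y, int sum[4], int len);

/* Portable fallback: sum[k] += x[j] * y[j + k]. */
static void xcorr_generic(const short *x, const short *y, int sum[4], int len)
{
    int j, k;
    for (j = 0; j < len; j++)
        for (k = 0; k < 4; k++)
            sum[k] += x[j] * y[j + k];
}

/* Stand-in for an optimized version; it must produce identical results. */
static void xcorr_fast(const short *x, const short *y, int sum[4], int len)
{
    xcorr_generic(x, y, sum, len);
}

/* One entry per CPU level; lower levels fall back to the portable code. */
static const xcorr_fn XCORR_TABLE[ARCH_MASK + 1] = {
    xcorr_generic, /* level 0 */
    xcorr_generic, /* level 1 */
    xcorr_generic, /* level 2 */
    xcorr_fast     /* level 3: optimized path */
};

/* Masking the index keeps it inside the table for any arch value. */
static void xcorr_dispatch(const short *x, const short *y, int sum[4],
                           int len, int arch)
{
    XCORR_TABLE[arch & ARCH_MASK](x, y, sum, len);
}
```

The indirect call cannot be inlined, which is one reason a table like this costs more when the kernel is invoked many times from a tight loop.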

[PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

2017 Feb 18

...on't think you will need these anymore, but if you ever need fixed-point macros that remain integer for float compilation, then you should use the silk_*() fixed-point macros (and the code should be in silk/). > For the NEON optimization part, the previous celt_fir() optimization > calls xcorr_kernel(). We tested and found that calling > the xcorr_kernel() optimization didn't help too much here. The > optimization in the patch is about 1% faster than simply > calling xcorr_kernel() for the whole encoder. Considering the really > small size of the new optimization, it's bette...

Bug fix in celt_lpc.c and some xcorr_kernel, optimizations

2013 Jun 11

...ctly in ASM since typically neither compilers do what you want. > > Cliff On 6/11/2013 1:00 PM, opus-request at xiph.org wrote: > Date: Tue, 11 Jun 2013 09:31:31 +0200 > From: Aurélien Zanelli<aurelien.zanelli at parrot.com> > Subject: Re: [opus] Bug fix in celt_lpc.c and some xcorr_kernel > optimizations > To:<opus at xiph.org> > Message-ID:<51B6D253.9030505 at parrot.com> > Content-Type: text/plain; charset="ISO-8859-1"; format=flowed > > Hi, > > I compared C version, John's versions and azanelli's version. > > I encoded...

AVX Optimizations

2015 Nov 05

...+ 1])( int ord, opus_val16 *mem, int arch ) = { celt_fir_c, /* non-sse */ celt_fir_c, celt_fir_c, MAY_HAVE_SSE4_1(celt_fir), /* sse4.1 */ + MAY_HAVE_SSE4_1(celt_fir) /* avx */ }; void (*const XCORR_KERNEL_IMPL[OPUS_ARCHMASK + 1])( const opus_val16 *x, const opus_val16 *y, opus_val32 sum[4], int len ) = { xcorr_kernel_c, /* non-sse */ xcorr_kernel_c, xcorr_kernel_c, MAY_HAVE_SSE4_1(xcorr_kernel), /* sse4.1...

Bug fix in celt_lpc.c and some xcorr_kernel optimizations

2013 Jun 07

Unfortunately I don't have a setup that lets me easily profile ARM code, so I really can't tell which method is faster (though I suspect Mr. Zanelli's code is). Let me offer up another intrinsic version of the NEON xcorr_kernel that is almost identical to the SSE version, and more in line with Mr. Zanelli's code: static inline void xcorr_kernel_neon(const opus_val16 *x, const opus_val16 *y, opus_val32 sum[4], int len) { int j; int32x4_t xsum1 = vld1q_s32(sum); int32x4_t xsum2 = vdupq_n_s32(0);...

Bug fix in celt_lpc.c and some xcorr_kernel optimizations

2013 Jun 07

On 06/07/2013 02:33 PM, John Ridges wrote: > I have no doubt that Mr. Zanelli's NEON code is faster, since hand tuned > assembly is bound to be faster than using intrinsics. I was mostly curious about comparing vectorization approaches (assuming the two are different) than exact code. > However I notice > that his code can also read past the y buffer. Yeah we'd need to

AVX Optimizations

2015 Nov 05

Bug fix in celt_lpc.c and some xcorr_kernel optimizations

2013 Jun 07

Hi JM, I have no doubt that Mr. Zanelli's NEON code is faster, since hand tuned assembly is bound to be faster than using intrinsics. However I notice that his code can also read past the y buffer. Cheers, --John On 6/6/2013 9:22 PM, Jean-Marc Valin wrote: > Hi John, > > Thanks for the two fixes. They're in git now. Your SSE version seems to > also be slightly faster than

[PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

2017 Feb 15

Hi, Attached are two patches. Patch 1 refactors silk_LPC_analysis_filter(). And Patch 2 optimizes the new function celt_fir_permit_overflow() for ARM NEON. Please recommend a better function name. We did the same internal code review and testing already. Thanks, Linfeng -------------- next part -------------- An HTML attachment was scrubbed... URL:

[PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

2017 Mar 01

Linfeng Zhang wrote: > xcorr_kernel() itself is great and provides many gains. The only issue > is that calling it in a for loop makes it less efficient. Do you think it would be possible to improve the API of xcorr_kernel() so that calling it in a loop is more efficient? I haven't looked at an instruction-level profile, bu...
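The "kernel called in a for loop" structure discussed in this thread can be sketched as follows. This is an illustration under stated assumptions, not the Opus celt_fir() code: `kernel4` and `fir_by_kernel` are hypothetical names, samples are plain `int`, the coefficients are assumed to be stored in the order the correlation needs, and the output length is assumed to be a multiple of four.

```c
#include <stddef.h>

/* Hypothetical 4-lag kernel: sum[k] += x[j] * y[j + k]. */
static void kernel4(const int *x, const int *y, int sum[4], int len)
{
    int j, k;
    for (j = 0; j < len; j++)
        for (k = 0; k < 4; k++)
            sum[k] += x[j] * y[j + k];
}

/* Hypothetical FIR built on the kernel:
       out[i] = sum over j in [0, ord) of coef[j] * in[i + j]
   'in' must hold n + ord - 1 samples; n is assumed to be a multiple of 4.
   One kernel call produces four consecutive outputs. */
static void fir_by_kernel(const int *coef, int ord,
                          const int *in, int *out, int n)
{
    int i, k;
    for (i = 0; i < n; i += 4) {
        int sum[4] = {0, 0, 0, 0};
        kernel4(coef, in + i, sum, ord);
        for (k = 0; k < 4; k++)
            out[i + k] = sum[k];
    }
}
```

Each iteration pays for one call plus a store and reload of the `sum` array, so when `ord` is small that fixed overhead is a larger share of the work. This is consistent with the point made in the thread that inlining the kernel would make the loop more efficient.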

Antw: Re: [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

2017 Mar 02

...r to optimize... Regards, Ulrich >>> Linfeng Zhang <linfengz at google.com> schrieb am 01.03.2017 um 20:30 in Nachricht <CAKoqLCANyWDPpy4rccL3TJ37gbhWxRWkCrqR9GCATGhTFoaDyA at mail.gmail.com>: > Hi Timothy, > > Do you think it would be possible to improve the API of xcorr_kernel() so >> that calling it in a loop is more efficient? >> > > If it could be inlined, it will be more efficient. Besides memory bouncing, > frequent function call is expensive. > > The other advantage to wiring up xcorr_kernel() is that it applies in more >> places...

ARM NEON optimization -- celt_fir()

2016 Jun 17

...n November and December (which Tim still hasn’t gotten around to reviewing). As they used Neon intrinsics, several of these actually applied to both armv7 and aarch64 Neon. In particular, note http://lists.xiph.org/pipermail/opus/2015-December/003339.html , which added a Neon-optimized version of xcorr_kernel. xcorr_kernel is used in celt_fir, celt_iir, and celt_pitch_xcorr. > On Jun 17, 2016, at 5:09 PM, Linfeng Zhang <linfengz at google.com> wrote: > > Hi all, > > This is Linfeng Zhang from Google. I'll work on ARM NEON optimization in the > next few months. > >...

opus Digest, Vol 53, Issue 2

2013 Jun 10

...body 'help' to opus-request at xiph.org You can reach the person managing the list at opus-owner at xiph.org When replying, please edit your Subject line so it is more specific than "Re: Contents of opus digest..." Today's Topics: 1. Re: Bug fix in celt_lpc.c and some xcorr_kernel optimizations (John Ridges) 2. Invitation to connect on LinkedIn (casey guan) 3. Invitation to connect on LinkedIn (casey guan) ---------------------------------------------------------------------- Message: 1 Date: Fri, 07 Jun 2013 16:50:48 -0600 From: John Ridges <jridges at mas...

[PATCH 2/5] Optimize fixed-point celt_fir_c() for ARM NEON

2016 Sep 28

...ver 80 lines of complicated NEON intrinsics code] > + } > +#else So, one of the points of SMALL_FOOTPRINT is to reduce the code size on targets where this matters (even if it means running slower), but this is an awful lot of code. I think it makes much more sense to expose the existing xcorr_kernel asm and use that. I wrote a simple patch demonstrating this (attached... it applies on top of your full series, so it'd be a little work to rebase it into place here). It adds one 16-byte table and 16 instructions, and even gives speed-ups on non-NEON CPUs by reusing the existing EDSP asm....