thr3ads.net - search: "mult16_32

2009 Oct 26

1

[PATCH] Fix miscompile of SSE resampler

...cum[3]; #else - sum = inner_product_double(sinc, iptr, N); + inner_product_double(&sum, sinc, iptr, N); #endif out[out_stride * out_sample++] = PSHR32(sum, 15); @@ -472,7 +472,7 @@ static int resampler_basic_interpolate_single(SpeexResamplerState *st, spx_uint3 sum = MULT16_32_Q15(interp[0],SHR32(accum[0], 1)) + MULT16_32_Q15(interp[1],SHR32(accum[1], 1)) + MULT16_32_Q15(interp[2],SHR32(accum[2], 1)) + MULT16_32_Q15(interp[3],SHR32(accum[3], 1)); #else cubic_coef(frac, interp); - sum = interpolate_product_single(iptr, st->sinc_table + st->oversample + 4 -...

[Aarch64 00/11] Patches to enable Aarch64

2015 Nov 13

2

[Aarch64 00/11] Patches to enable Aarch64

...o beat a dead horse, but I was very surprised by your benchmarks so I took a little closer look. I think what's happening is that it's a little unfair to compare the ARM64 inline assembly to the C code, because looking at the C macros in "fixed_generic.h" for MULT16_32_Q16 and MULT16_32_Q15 you find they are implemented with two multiplies, two shifts, an AND and an ADD. It's not hard for me to believe that your inline assembly is faster than that mess. But on a 64-bit machine, there's no reason to go through all that when a simple 64-bit multiply will do. The SILK macros...

[Aarch64 00/11] Patches to enable Aarch64

2015 Nov 13

2

[Aarch64 00/11] Patches to enable Aarch64

...horse, but I was very surprised by your benchmarks so I took a little closer look. >> >> I think what's happening is that it's a little unfair to compare the ARM64 inline assembly to the C code, because looking at the C macros in "fixed_generic.h" for MULT16_32_Q16 and MULT16_32_Q15 you find they are implemented with two multiplies, two shifts, an AND and an ADD. It's not hard for me to believe that your inline assembly is faster than that mess. But on a 64-bit machine, there's no reason to go through all that when a simple 64-bit multiply will do. The SILK macros in &...

high-pass filter issues

2007 Aug 29

2

high-pass filter issues

...=4; den = Pcoef[filtID]; num = Zcoef[filtID]; /*return;*/ for (i=0;i<len;i++) { spx_word16_t yi; spx_word32_t vout = ADD32(MULT16_16(num[0], x[i]),mem[0]); yi = EXTRACT16(SATURATE(PSHR32(vout,14),32767)); mem[0] = ADD32(MAC16_16(mem[1], num[1],x[i]), SHL32(MULT16_32_Q15(-den[1],vout),1)); mem[1] = ADD32(MULT16_16(num[2],x[i]), SHL32(MULT16_32_Q15(-den[2],vout),1)); y[i] = yi; } } I can step into the function just fine, but when I run it, even just from the initial variable declarations to the top of that "for" loop I vector off. I am...

[Aarch64 00/11] Patches to enable Aarch64

2015 Nov 12

2

[Aarch64 00/11] Patches to enable Aarch64

One other minor thing: I notice that in the inline assembly the result (rd) is constrained as an earlyclobber operand. What was the reason for that?

[PATCH 3/5] resample: Add NEON optimized inner_product_single for fixed point

2011 Sep 01

0

[PATCH 3/5] resample: Add NEON optimized inner_product_single for fixed point

...[out_stride * out_sample++] = sum; last_sample += int_advance; samp_frac_num += frac_advance; if (samp_frac_num >= den_rate) @@ -470,12 +475,13 @@ static int resampler_basic_interpolate_single(SpeexResamplerState *st, spx_uint3 cubic_coef(frac, interp); sum = MULT16_32_Q15(interp[0],SHR32(accum[0], 1)) + MULT16_32_Q15(interp[1],SHR32(accum[1], 1)) + MULT16_32_Q15(interp[2],SHR32(accum[2], 1)) + MULT16_32_Q15(interp[3],SHR32(accum[3], 1)); + sum = SATURATE32PSHR(sum, 15, 32767); #else cubic_coef(frac, interp); sum = interpolate_product_single(iptr,...

[Aarch64 00/11] Patches to enable Aarch64

2015 Nov 13

0

[Aarch64 00/11] Patches to enable Aarch64

...a dead horse, but I was very surprised by your benchmarks so I took a little closer look. > > I think what's happening is that it's a little unfair to compare the ARM64 inline assembly to the C code, because looking at the C macros in "fixed_generic.h" for MULT16_32_Q16 and MULT16_32_Q15 you find they are implemented with two multiplies, two shifts, an AND and an ADD. It's not hard for me to believe that your inline assembly is faster than that mess. But on a 64-bit machine, there's no reason to go through all that when a simple 64-bit multiply will do. The SILK macros in &...

[PATCH 4/8] Arm64 assembly for Celt fixed-point math.

2015 Aug 05

0

[PATCH 4/8] Arm64 assembly for Celt fixed-point math.

...w1, %w2\n\t" + : "=&r"(rd) + : "%r"(b), "r"(a<<16) + ); + return (rd >> 32); +} +#define MULT16_32_Q16(a, b) (MULT16_32_Q16_arm64(a, b)) + + +/** 16x32 multiplication, followed by a 15-bit shift right. Results fits in 32 bits */ +#undef MULT16_32_Q15 +static OPUS_INLINE opus_val32 MULT16_32_Q15_arm64(opus_val16 a, opus_val32 b) +{ + opus_int64 rd; + __asm__( + "smull %x0, %w1, %w2\n\t" + : "=&r"(rd) + : "%r"(b), "r"(a<<16) + ); + return ((rd >> 32) << 1); +} +#defi...

[Aarch64 06/11] Add aarch64 assembly for Celt fixed-point math.

2015 Nov 07

0

[Aarch64 06/11] Add aarch64 assembly for Celt fixed-point math.

...w1, %w2\n\t" + : "=&r"(rd) + : "%r"(b), "r"(a<<16) + ); + return (rd >> 32); +} +#define MULT16_32_Q16(a, b) (MULT16_32_Q16_arm64(a, b)) + + +/** 16x32 multiplication, followed by a 15-bit shift right. Results fits in 32 bits */ +#undef MULT16_32_Q15 +static OPUS_INLINE opus_val32 MULT16_32_Q15_arm64(opus_val16 a, opus_val32 b) +{ + opus_int64 rd; + __asm__( + "smull %x0, %w1, %w2\n\t" + : "=&r"(rd) + : "%r"(b), "r"(a<<16) + ); + return ((rd >> 32) << 1); +} +#defi...

[Aarch64 00/11] Patches to enable Aarch64

2015 Nov 16

0

[Aarch64 00/11] Patches to enable Aarch64

...t to beat a dead horse, but I was very surprised by your benchmarks so I took a little closer look. I think what's happening is that it's a little unfair to compare the ARM64 inline assembly to the C code, because looking at the C macros in "fixed_generic.h" for MULT16_32_Q16 and MULT16_32_Q15 you find they are implemented with two multiplies, two shifts, an AND and an ADD. It's not hard for me to believe that your inline assembly is faster than that mess. But on a 64-bit machine, there's no reason to go through all that when a simple 64-bit multiply will do. The SILK macros in &...

Speex echo canceller on TI C55 DSP

2006 May 08

1

Speex echo canceller on TI C55 DSP

...px_uint16_t) a.m)+(1<<(-a.e-1)))>>-a.e; > > in FLOAT_EXTRACT16. This changes the returned value from 0xfc00 to 0x400. > Now it runs on for a while, then hits another infinite loop at mdf.c line > 641: > > st->power_1[i] = > FLOAT_SHL(FLOAT_DIV32_FLOAT(MULT16_32_Q15(M_1,r),FLOAT_MUL32U(e,st->power[i]+10)),WEIGHT_SHIFT+16); > > I have not had time to trace this, but it looks like a similar problem, > where the result of MULT16_32_Q15(M_1,r) is negative, and > FLOAT_DIV32_FLOAT bombs. Maybe the best thing to do next is to instrument > the r...

Resampler (no api)

2008 May 03

2

Resampler (no api)

...j+1)*st->oversample-offset-1]); - accum[2] += MULT16_16(curr_in,st->sinc_table[4+(j+1)*st->oversample-offset]); - accum[3] += MULT16_16(curr_in,st->sinc_table[4+(j+1)*st->oversample-offset+1]); - } - } + cubic_coef(frac, interp); sum = MULT16_32_Q15(interp[0],accum[0]) + MULT16_32_Q15(interp[1],accum[1]) + MULT16_32_Q15(interp[2],accum[2]) + MULT16_32_Q15(interp[3],accum[3]); - - *out = PSHR32(sum,15); - out += st->out_stride; - out_sample++; - last_sample += st->int_advance; - samp_frac_num += st->frac_adv...

Re: speex echo cancellation limitations

2006 May 02

3

Re: speex echo cancellation limitations

Hi Ted, Thanks a lot for this analysis. > In FLOAT_DIVU() it hangs at the following: > while (a.m >= b.m) > { > e++; > a.m >>= 1; > } > for the case where a and b are both zero (yes, division by zero). > This happens from mdf.c: True, that needs to be fixed even after I fix the rest. > leak_estimate =

Resampler, memory only variant

2008 May 03

0

Resampler, memory only variant

...j+1)*st->oversample-offset-1]); - accum[2] += MULT16_16(curr_in,st->sinc_table[4+(j+1)*st->oversample-offset]); - accum[3] += MULT16_16(curr_in,st->sinc_table[4+(j+1)*st->oversample-offset+1]); - } - } + cubic_coef(frac, interp); sum = MULT16_32_Q15(interp[0],accum[0]) + MULT16_32_Q15(interp[1],accum[1]) + MULT16_32_Q15(interp[2],accum[2]) + MULT16_32_Q15(interp[3],accum[3]); - - *out = PSHR32(sum,15); - out += st->out_stride; - out_sample++; - last_sample += st->int_advance; - samp_frac_num += st->frac_adv...

[Fast Int64 2/4] Add OPUS_FAST_INT64 flavors of celt/fixed_generic.h macros.

2015 Nov 16

0

[Fast Int64 2/4] Add OPUS_FAST_INT64 flavors of celt/fixed_generic.h macros.

...opus_val32)PSHR((opus_int64)((opus_val16)(a))*(b),16)) +#else #define MULT16_32_P16(a,b) ADD32(MULT16_16((a),SHR((b),16)), PSHR(MULT16_16SU((a),((b)&0x0000ffff)),16)) +#endif /** 16x32 multiplication, followed by a 15-bit shift right. Results fits in 32 bits */ +#if OPUS_FAST_INT64 +#define MULT16_32_Q15(a,b) ((opus_val32)SHR((opus_int64)((opus_val16)(a))*(b),15)) +#else #define MULT16_32_Q15(a,b) ADD32(SHL(MULT16_16((a),SHR((b),16)),1), SHR(MULT16_16SU((a),((b)&0x0000ffff)),15)) +#endif /** 32x32 multiplication, followed by a 31-bit shift right. Results fits in 32 bits */ +#if OPUS_FAST_IN...

Speex echo canceller on TI C55 DSP

2006 May 09

2

Speex echo canceller on TI C55 DSP

...overflows at this place? > > The rounding takes the value to exactly 0x8000, and it is followed by a > right shift, so you just need to avoid the sign extension. > > >> I have not had time to trace this, but it looks like a similar problem, > >> where the result of MULT16_32_Q15(M_1,r) is negative, and > >> FLOAT_DIV32_FLOAT > >> bombs. Maybe the best thing to do next is to instrument the routines in > >> pseudofloat.h which have loops, but I will not get to that for a day or > >> two. > > > > Yeah, r is never supposed to...

Speex echo canceller on TI C55 DSP

2006 May 09

2

Speex echo canceller on TI C55 DSP

...ring the > second of two brief speech bursts). So, my problem must again be related to > the 16-bit processing on the C5X DSPs. Good. At least we've narrowed it down a bit. > Also, the line where it is hanging is: > st->power_1[i] = > FLOAT_SHL(FLOAT_DIV32_FLOAT(MULT16_32_Q15(M_1,r),FLOAT_MUL32U(e,st->power[i]+10)),WEIGHT_SHIFT+16); Actually, I just found a 16-bit bug in FLOAT_DIV32_FLOAT. Could you update svn and let me know if it works? > and it is e that is in the denominator, not r (sorry for the confusion). I > can now run the simulations side-by-side...

Speex echo canceller on TI C55 DSP

2006 May 08

0

Speex echo canceller on TI C55 DSP

...t;>-a.e; to return (((spx_uint16_t) a.m)+(1<<(-a.e-1)))>>-a.e; in FLOAT_EXTRACT16. This changes the returned value from 0xfc00 to 0x400. Now it runs on for a while, then hits another infinite loop at mdf.c line 641: st->power_1[i] = FLOAT_SHL(FLOAT_DIV32_FLOAT(MULT16_32_Q15(M_1,r),FLOAT_MUL32U(e,st->power[i]+10)),WEIGHT_SHIFT+16); I have not had time to trace this, but it looks like a similar problem, where the result of MULT16_32_Q15(M_1,r) is negative, and FLOAT_DIV32_FLOAT bombs. Maybe the best thing to do next is to instrument the routines in pseudofloat.h...

[PATCH 0/5] ARM NEON optimization for samplerate converter

2011 Sep 01

6

[PATCH 0/5] ARM NEON optimization for samplerate converter

From: Jyri Sarha <jsarha at ti.com> I optimized Speex resampler for NEON capable ARM CPUs. The first patch should speed up resampling on any platform that can spare the increased memory usage. It would be nice to have these merged to the master branch. Please let me know if there is anything I can do to help the the merge. The patches have been rebased on top of master branch in

[RFC PATCH v1] armv7(float): Optimize decode usecase using NE10 library

2015 Mar 04

0

[RFC PATCH v1] armv7(float): Optimize decode usecase using NE10 library

..._scalar * OPUS_RESTRICT yp1 = out; + const opus_val16 * OPUS_RESTRICT wp1 = window; + const opus_val16 * OPUS_RESTRICT wp2 = window+overlap-1; + + for(i = 0; i < overlap/2; i++) + { + kiss_fft_scalar x1, x2; + x1 = *xp1; + x2 = *yp1; + *yp1++ = MULT16_32_Q15(*wp2, x2) - MULT16_32_Q15(*wp1, x1); + *xp1-- = MULT16_32_Q15(*wp1, x2) + MULT16_32_Q15(*wp2, x1); + wp1++; + wp2--; + } + } + RESTORE_STACK; +} diff --git a/celt/arm/fft_arm.h b/celt/arm/fft_arm.h index e7a30d6..e57b0aa 100644 --- a/celt/arm/fft_arm.h +++ b/celt/ar...

search for: mult16_32_q15