thr3ads.net - search: "int32x2_t"

2017 Apr 26

2

2 patches related to silk_biquad_alt() optimization

...Tue, Apr 25, 2017 at 10:31 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote: > > > A_Q28 is split to 2 14-bit (or 16-bit, whatever) integers, to make the > > multiplication operation within 32-bits. NEON can do 32-bit x 32-bit = > > 64-bit using 'int64x2_t vmull_s32(int32x2_t a, int32x2_t b)', and it > > could possibly be faster and less rounding/shifting errors than above C > > code. But it may increase difficulties for other CPUs not supporting > > 32-bit multiplication. > > OK, so I'm not totally opposed to that, but it increases the ...

2017 May 15

2

2 patches related to silk_biquad_alt() optimization

...in.ca <mailto:jmvalin at jmvalin.ca>> wrote: > > > > A_Q28 is split to 2 14-bit (or 16-bit, whatever) integers, to make the > > multiplication operation within 32-bits. NEON can do 32-bit x 32-bit = > > 64-bit using 'int64x2_t vmull_s32(int32x2_t a, int32x2_t b)', and it > > could possibly be faster and less rounding/shifting errors than above C > > code. But it may increase difficulties for other CPUs not supporting > > 32-bit multiplication. > > OK, so I'm not totally oppose...

2017 Apr 25

2

2 patches related to silk_biquad_alt() optimization

...S[ 1 ] = silk_SMLAWB( S[ 1 ], out32_Q14, A1_U_Q28 ); S[ 1 ] = silk_SMLAWB( S[ 1 ], B_Q28[ 2 ], inval ); A_Q28 is split to 2 14-bit (or 16-bit, whatever) integers, to make the multiplication operation within 32-bits. NEON can do 32-bit x 32-bit = 64-bit using 'int64x2_t vmull_s32(int32x2_t a, int32x2_t b)', and it could possibly be faster and less rounding/shifting errors than above C code. But it may increase difficulties for other CPUs not supporting 32-bit multiplication. Thanks, Linfeng -------------- next part -------------- An HTML attachment was scrubbed... URL: <http:...
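The trade-off described in this message can be sketched in portable C. This is only a scalar model, not the Opus code: `smulwb_model`, `rshift_round_model`, `mul_split` and `mul_wide` are hypothetical names, and the Q14 input and Q28 coefficient formats are assumed from the quoted snippet. The split path multiplies by a 14-bit low part and a 16-bit upper part of the coefficient; the wide path forms one 32-bit x 32-bit = 64-bit product, which is what each lane of `vmull_s32` computes.

```c
#include <stdint.h>

/* Scalar model of a 32x16 multiply that keeps the top bits:
 * (a * (int16_t)b) >> 16. Assumes b fits in 16 bits and that >> on a
 * negative value is an arithmetic shift (true on common compilers). */
static int32_t smulwb_model(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int16_t)b) >> 16);
}

/* Right shift with rounding to nearest, for shift > 1. */
static int32_t rshift_round_model(int32_t a, int shift)
{
    return ((a >> (shift - 1)) + 1) >> 1;
}

/* Split path: the Q28 coefficient is cut into a 14-bit low part and an
 * upper part so that every multiply is 32x16. */
static int32_t mul_split(int32_t x_q14, int32_t a_q28)
{
    int32_t lo = a_q28 & 0x3FFF;   /* low 14 bits, always non-negative */
    int32_t hi = a_q28 >> 14;      /* upper part, must fit in int16_t  */
    return rshift_round_model(smulwb_model(x_q14, lo), 14)
         + smulwb_model(x_q14, hi);
}

/* Wide path: one full 64-bit product followed by a single shift. */
static int32_t mul_wide(int32_t x_q14, int32_t a_q28)
{
    return (int32_t)(((int64_t)x_q14 * a_q28) >> 30);
}
```

Under these assumptions the two paths differ by at most one least significant bit, because the split path truncates once and rounds once while the wide path truncates once. That small difference is why replacing one with the other is not bit-exact.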

2017 May 08

0

2 patches related to silk_biquad_alt() optimization

...1 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> > wrote: > >> >> > A_Q28 is split to 2 14-bit (or 16-bit, whatever) integers, to make the >> > multiplication operation within 32-bits. NEON can do 32-bit x 32-bit = >> > 64-bit using 'int64x2_t vmull_s32(int32x2_t a, int32x2_t b)', and it >> > could possibly be faster and less rounding/shifting errors than above C >> > code. But it may increase difficulties for other CPUs not supporting >> > 32-bit multiplication. >> >> OK, so I'm not totally opposed to that, bu...

2017 May 17

0

2 patches related to silk_biquad_alt() optimization

...in.ca>> wrote: > > > > > > > A_Q28 is split to 2 14-bit (or 16-bit, whatever) integers, to > make the > > > multiplication operation within 32-bits. NEON can do 32-bit x > 32-bit = > > > 64-bit using 'int64x2_t vmull_s32(int32x2_t a, int32x2_t b)', > and it > > > could possibly be faster and less rounding/shifting errors > than above C > > > code. But it may increase difficulties for other CPUs not > supporting > > > 32-bit multiplication. > > > >...

[LLVMdev] Vectors in structures

2010 Sep 28

0

[LLVMdev] Vectors in structures

...what compatibility problems you had with GCC? And that > by using structures in Clang you made it work with armcc? > > Is it just a source code compatibility issue? Yes, there are multiple issues but they all involve source compatibility. Here is an example: #include <arm_neon.h> uint32x2_t test(int32x2_t x) { return vadd_u32(x, x); } This works fine with GCC because int32x2_t and uint32x2_t are built-in vector types and can be implicitly converted. It is not valid if those types are defined as structs, because C/C++ do not allow distinct struct types to be implicitly converted just...
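The source-compatibility point can be illustrated without NEON hardware. The sketch below uses hypothetical struct-wrapped stand-ins (`s32x2_model`, `u32x2_model`) for `int32x2_t` and `uint32x2_t`; they are not the real `arm_neon.h` definitions. With distinct struct types, passing the signed vector to the unsigned add is a compile error, so an explicit reinterpretation is needed. In real NEON code that role is played by `vreinterpret_u32_s32`.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical struct-based vector types, for illustration only. */
typedef struct { int32_t  val[2]; } s32x2_model;
typedef struct { uint32_t val[2]; } u32x2_model;

/* Explicit bit-for-bit reinterpretation between the two struct types. */
static u32x2_model reinterpret_u32_s32_model(s32x2_model x)
{
    u32x2_model r;
    memcpy(&r, &x, sizeof r);
    return r;
}

/* Lane-wise unsigned add, wrapping modulo 2^32. */
static u32x2_model add_u32_model(u32x2_model a, u32x2_model b)
{
    u32x2_model r;
    for (int i = 0; i < 2; ++i)
        r.val[i] = a.val[i] + b.val[i];
    return r;
}

static u32x2_model test_model(s32x2_model x)
{
    /* add_u32_model(x, x) would not compile here: s32x2_model and
     * u32x2_model are distinct struct types with no implicit conversion. */
    u32x2_model ux = reinterpret_u32_s32_model(x);
    return add_u32_model(ux, ux);
}
```

Code written with the explicit reinterpretation is valid under both kinds of implementation, while code relying on the implicit conversion only compiles where the vector types are built-in.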

[PATCH 8/8] Optimize silk_NSQ_del_dec() for ARM NEON

2016 Aug 23

0

[PATCH 8/8] Optimize silk_NSQ_del_dec() for ARM NEON

..., 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39 +}; + +static OPUS_INLINE void copy_winner_state_kernel( + const NSQ_del_decs_struct *psDelDec, + const opus_int offset, + const opus_int last_smple_idx, + const opus_int Winner_ind, + const int32x2_t gain_lo_s32x2, + const int32x2_t gain_hi_s32x2, + const int32x4_t shift_s32x4, + int32x4_t t0_s32x4, + int32x4_t t1_s32x4, + opus_int8 *pulses, + opus_int16 *pxq, + silk_nsq_state...

[PATCH 7/8] Update NSQ_LPC_BUF_LENGTH macro.

2016 Aug 23

2

[PATCH 7/8] Update NSQ_LPC_BUF_LENGTH macro.

NSQ_LPC_BUF_LENGTH is independent of DECISION_DELAY. --- silk/define.h | 4 ---- 1 file changed, 4 deletions(-) diff --git a/silk/define.h b/silk/define.h index 781cfdc..1286048 100644 --- a/silk/define.h +++ b/silk/define.h @@ -173,11 +173,7 @@ extern "C" #define MAX_MATRIX_SIZE MAX_LPC_ORDER /* Max of LPC Order and LTP order */ -#if( MAX_LPC_ORDER >

2017 Apr 26

0

2 patches related to silk_biquad_alt() optimization

...d just be good to check that the 0.8% indeed comes from Neon as opposed to just unrolling the channels. > A_Q28 is split to 2 14-bit (or 16-bit, whatever) integers, to make the > multiplication operation within 32-bits. NEON can do 32-bit x 32-bit = > 64-bit using 'int64x2_t vmull_s32(int32x2_t a, int32x2_t b)', and it > could possibly be faster and less rounding/shifting errors than above C > code. But it may increase difficulties for other CPUs not supporting > 32-bit multiplication. OK, so I'm not totally opposed to that, but it increases the testing/maintenance cost...

[LLVMdev] Vectors in structures

2010 Sep 28

2

[LLVMdev] Vectors in structures

On 27 September 2010 23:45, Bob Wilson <bob.wilson at apple.com> wrote: > An implementation, such as in GCC, that does not use structures is compatible with ARM's specification in only one direction. GCC will accept any code written for RVCT, but not the other way around. And, as Al pointed out, there are also compatibility issues with how you can initialize vectors. (In fact, if

[Aarch64 00/11] Patches to enable Aarch64

2015 Nov 20

2

[Aarch64 00/11] Patches to enable Aarch64

> On Nov 19, 2015, at 5:47 PM, John Ridges <jridges at masque.com> wrote: > > Any speedup from the intrinsics may just be swamped by the rest of the encode/decode process. But I think you really want SIG2WORD16 to be (vqmovns_s32(PSHR32((x), SIG_SHIFT))) Yes, you're right. I forgot to run the vectors under qemu with my previous version (oh, the embarrassment!) Fixed forthcoming
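As a rough scalar model of what the suggested macro computes: `PSHR32` is a right shift with rounding, and `vqmovns_s32` narrows a 32-bit value to 16 bits with saturation. The helper name `sig2word16_model` is hypothetical, and the shift of 12 is an assumption about `SIG_SHIFT`, not something stated in the snippet.

```c
#include <stdint.h>

/* Assumed value of SIG_SHIFT; the snippet does not state it. */
#define SIG_SHIFT_ASSUMED 12

/* Rounding right shift followed by saturation to the int16_t range.
 * Assumes x + 2^(shift-1) does not overflow int32_t and that >> on a
 * negative value is an arithmetic shift. */
static int16_t sig2word16_model(int32_t x)
{
    int32_t r = (x + (1 << (SIG_SHIFT_ASSUMED - 1))) >> SIG_SHIFT_ASSUMED;
    if (r > INT16_MAX) r = INT16_MAX;   /* saturate high */
    if (r < INT16_MIN) r = INT16_MIN;   /* saturate low  */
    return (int16_t)r;
}
```

The saturation matters: a plain truncating narrow would wrap large signal values around instead of clipping them.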

[LLVMdev] arm neon intrinsics cross compile error on windows system

2011 Nov 23

4

[LLVMdev] arm neon intrinsics cross compile error on windows system

...owings are error codes. Thanks and regards, Seung-yeon. In file included from helloneon.c:4: d:/llvm_projects/llvm-3.0rc4/bin/../lib/clang/3.0/include\arm_neon.h:41:24: error: invalid vector element type 'int32_t' (aka 'long') typedef __attribute__((neon_vector_type(2))) int32_t int32x2_t; ^ d:/llvm_projects/llvm-3.0rc4/bin/../lib/clang/3.0/include\arm_neon.h:42:24: error: invalid vector element type 'int32_t' (aka 'long') typedef __attribute__((neon_vector_type(4))) int32_t int32x4_t; ^ d:/llvm_projects/llvm-3.0rc4/bin/...

[Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.

2015 Nov 23

1

[Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.

..._t coef1 = vmovl_high_s16(coef16); and instead of: int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0)); you could use: int64x2_t b1 = vmlal_high_s32(b0, a0, coef0); and instead of: int64x1_t c = vadd_s64(vget_low_s64(b3), vget_high_s64(b3)); int64x1_t cS = vshr_n_s64(c, 16); int32x2_t d = vreinterpret_s32_s64(cS); out = vget_lane_s32(d, 0); you could use: out = (opus_int32)(vaddvq_s64(b3) >> 16); I understand that ARM added these intrinsics because "vget_high_xxx" generates an instruction in ARM64, and isn't just free the way it was in ARMv7 ("vget_lo...
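The arithmetic behind both the longer and the shorter intrinsic sequence can be modelled in plain C. The sketch below is an assumption-laden scalar stand-in, with the hypothetical name `dot_shift16_model`: two 64-bit accumulators play the role of the two lanes of an `int64x2_t` fed by `vmlal_s32`, the lane sum stands for `vadd_s64(vget_low_s64(b), vget_high_s64(b))` or `vaddvq_s64(b)`, and the final step is the shift by 16 and narrowing to 32 bits.

```c
#include <stdint.h>

/* Scalar model of a widening multiply-accumulate dot product followed by
 * a horizontal add and >> 16. n is assumed to be even, and the shifted
 * sum is assumed to fit in int32_t. */
static int32_t dot_shift16_model(const int32_t *a, const int32_t *coef, int n)
{
    int64_t acc[2] = { 0, 0 };               /* two 64-bit lanes */
    for (int i = 0; i < n; i += 2) {
        acc[0] += (int64_t)a[i]     * coef[i];      /* lane 0 */
        acc[1] += (int64_t)a[i + 1] * coef[i + 1];  /* lane 1 */
    }
    int64_t sum = acc[0] + acc[1];           /* horizontal add of the lanes */
    return (int32_t)(sum >> 16);             /* shift, then keep 32 bits */
}
```

Both intrinsic sequences quoted above reduce to the same sum and shift, so the choice between them affects instruction count on ARM64 rather than the result.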

2017 Apr 25

2

2 patches related to silk_biquad_alt() optimization

Hi Jean-Marc, Tested on my chromebook, when stride (channel) == 1, the optimization has no gain compared with C function. When stride (channel) == 2, the optimization is 1.2%-1.8% faster (1.6% at Complexity 8) compared with C function. Please let me know and I can remove the optimization of stride 1 case. If it's allowed to skip the split of A_Q28 and replace by 32-bit multiplication

[Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.

2015 Dec 20

2

[Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.

...oef2)); > + int64x2_t b6 = vmlal_s32(b5, vget_low_s32(a3), vget_low_s32(coef3)); > + int64x2_t b7 = vmlal_s32(b6, vget_high_s32(a3), vget_high_s32(coef3)); > + > + int64x1_t c = vadd_s64(vget_low_s64(b7), vget_high_s64(b7)); > + int64x1_t cS = vshr_n_s64(c, 16); > + int32x2_t d = vreinterpret_s32_s64(cS); > + opus_int32 out = vget_lane_s32(d, 0); > + return out; > +} So, this is not bit-exact in a portion of the code where I am personally wary of the problems that might cause, since (like most speech codecs) we can use slightly unstable filters. If the...

[Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.

2015 Nov 23

0

[Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.

..._t coef1 = vmovl_high_s16(coef16); and instead of: int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0)); you could use: int64x2_t b1 = vmlal_high_s32(b0, a0, coef0); and instead of: int64x1_t c = vadd_s64(vget_low_s64(b3), vget_high_s64(b3)); int64x1_t cS = vshr_n_s64(c, 16); int32x2_t d = vreinterpret_s32_s64(cS); out = vget_lane_s32(d, 0); you could use: out = (opus_int32)(vaddvq_s64(b3) >> 16); I understand that ARM added these intrinsics because "vget_high_xxx" generates an instruction in ARM64, and isn't just free the way it was in ARMv7 ("vget_...

[Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.

2015 Dec 21

0

[Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.

...t64x2_t b6 = vmlal_s32(b5, vget_low_s32(a3), vget_low_s32(coef3)); >> + int64x2_t b7 = vmlal_s32(b6, vget_high_s32(a3), vget_high_s32(coef3)); >> + >> + int64x1_t c = vadd_s64(vget_low_s64(b7), vget_high_s64(b7)); >> + int64x1_t cS = vshr_n_s64(c, 16); >> + int32x2_t d = vreinterpret_s32_s64(cS); >> + opus_int32 out = vget_lane_s32(d, 0); >> + return out; >> +} > > So, this is not bit-exact in a portion of the code where I am personally > wary of the problems that might cause, since (like most speech codecs) > we can use s...

Several patches of ARM NEON optimization

2016 Jul 14

6

Several patches of ARM NEON optimization

I rebased my previous 3 patches to the current master with minor changes. Patches 1 to 3 replace all my previous submitted patches. Patches 4 and 5 are new. Thanks, Linfeng Zhang

[GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

2017 Nov 17

2

[GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

...Kristof Beyls > Subject: RE: [llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try! > > Hi Quentin, > > It seems that we also get the calling convention wrong for vector types on big-endian: > #include <arm_neon.h> > int32x2_t load_vector(int32x2_t *p) { > return *p; > } > > Global-isel generates this: > // armclang --target=aarch64-arm-none-eabi -march=armv8-a -c callees.cpp -O0 -Wall -std=c++11 -mllvm -global-isel=true -mllvm -global-isel-abort=0 -mbig-endian -o - -S > _Z11load_vectorP11__...

[PATCH 7/8] Add Neon intrinsics for Silk noise shape feedback loop.

2015 Aug 05

0

[PATCH 7/8] Add Neon intrinsics for Silk noise shape feedback loop.

...high_s32(coef0)); + int64x2_t b2 = vmlal_s32(b1, vget_low_s32(a1), vget_low_s32(coef1)); + int64x2_t b3 = vmlal_s32(b2, vget_high_s32(a1), vget_high_s32(coef1)); + + int64x1_t c = vadd_s64(vget_low_s64(b3), vget_high_s64(b3)); + int64x1_t cS = vshr_n_s64(c, 16); + int32x2_t d = vreinterpret_s32_s64(cS); + + out = vget_lane_s32(d, 0); + vst1q_s32(data1, a0); + vst1q_s32(data1 + 4, a1); + } + else + { + opus_int32 tmp1, tmp2; + opus_int j; + + tmp2 = data0[0]; + tmp1 = data1[0]; + data1[0] = tmp2; + +...
