Timothy B. Terriberry
2015-Dec-20 03:07 UTC
[opus] [Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.
Jonathan Lennox wrote:
> +opus_int32 silk_noise_shape_quantizer_short_prediction_neon(const opus_int32 *buf32, const opus_int32 *coef32)
> +{
> +    int32x4_t coef0 = vld1q_s32(coef32);
> +    int32x4_t coef1 = vld1q_s32(coef32 + 4);
> +    int32x4_t coef2 = vld1q_s32(coef32 + 8);
> +    int32x4_t coef3 = vld1q_s32(coef32 + 12);
> +
> +    int32x4_t a0 = vld1q_s32(buf32 - 15);
> +    int32x4_t a1 = vld1q_s32(buf32 - 11);
> +    int32x4_t a2 = vld1q_s32(buf32 - 7);
> +    int32x4_t a3 = vld1q_s32(buf32 - 3);
> +
> +    int64x2_t b0 = vmull_s32(vget_low_s32(a0), vget_low_s32(coef0));
> +    int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0));
> +    int64x2_t b2 = vmlal_s32(b1, vget_low_s32(a1), vget_low_s32(coef1));
> +    int64x2_t b3 = vmlal_s32(b2, vget_high_s32(a1), vget_high_s32(coef1));
> +    int64x2_t b4 = vmlal_s32(b3, vget_low_s32(a2), vget_low_s32(coef2));
> +    int64x2_t b5 = vmlal_s32(b4, vget_high_s32(a2), vget_high_s32(coef2));
> +    int64x2_t b6 = vmlal_s32(b5, vget_low_s32(a3), vget_low_s32(coef3));
> +    int64x2_t b7 = vmlal_s32(b6, vget_high_s32(a3), vget_high_s32(coef3));
> +
> +    int64x1_t c = vadd_s64(vget_low_s64(b7), vget_high_s64(b7));
> +    int64x1_t cS = vshr_n_s64(c, 16);
> +    int32x2_t d = vreinterpret_s32_s64(cS);
> +    opus_int32 out = vget_lane_s32(d, 0);
> +    return out;
> +}

So, this is not bit-exact in a portion of the code where I am personally
wary of the problems that might cause, since (like most speech codecs)
we can use slightly unstable filters. If there was a big speed advantage
it might be worth the testing to make sure nothing diverges here
significantly (and it's _probably_ fine), but I think you can actually
do this faster while remaining bit-exact.

If you shift up the contents of coef32 by 15 bits (which you can do,
since you are already transforming them specially for this platform),
you can use vqdmulhq_s32() to emulate SMULWB. You then have to do the
addition in a separate instruction, but because you can keep all of the
results in 32-bit, you get double the parallelism and only need half as
many multiplies (which have much higher latency than addition). Overall
it should be faster, and match the C code exactly.

> +#define optional_coef_reversal(out, in, order) do { if (arch == 3) { optional_coef_reversal_neon(out, in, order); } } while (0)
> +
> +#endif
> +
> +opus_int32 silk_noise_shape_quantizer_short_prediction_neon(const opus_int32 *buf32, const opus_int32 *coef32);
> +
> +#if OPUS_ARM_PRESUME_NEON_INTR
> +#undef silk_noise_shape_quantizer_short_prediction
> +#define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch) ((void)arch,silk_noise_shape_quantizer_short_prediction_neon(in, coefRev))
> +
> +#elif OPUS_HAVE_RTCD
> +
> +/* silk_noise_shape_quantizer_short_prediction implementations take different parameters based on arch
> +   (coef vs. coefRev) so can't use the usual IMPL table implementation */
> +#undef silk_noise_shape_quantizer_short_prediction
> +#define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch) (arch == 3 ? silk_noise_shape_quantizer_short_prediction_neon(in, coefRev) : silk_noise_shape_quantizer_short_prediction_c(in, coef, order))

I'm also not wild about these hard-coded 3's. Right now what arch maps
to what number is confined to arm_celt_map.c, which does not use the
indices directly (only sorts its table entries by them). So we never got
named constants for them. But if we have to re-organize what arch
configurations we support, these might change. Random 3's scattered
across the codebase are going to be hard to track down and update.

(also, I realize libopus doesn't have a line-length restriction, but a
few newlines in here might be a mercy to those of us who work in
80-column terminals)
Jonathan Lennox
2015-Dec-21 15:37 UTC
[opus] [Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.
> On Dec 19, 2015, at 10:07 PM, Timothy B. Terriberry <tterribe at xiph.org> wrote:
>
> Jonathan Lennox wrote:
>> +opus_int32 silk_noise_shape_quantizer_short_prediction_neon(const opus_int32 *buf32, const opus_int32 *coef32)
>> +{
>> +    int32x4_t coef0 = vld1q_s32(coef32);
>> +    int32x4_t coef1 = vld1q_s32(coef32 + 4);
>> +    int32x4_t coef2 = vld1q_s32(coef32 + 8);
>> +    int32x4_t coef3 = vld1q_s32(coef32 + 12);
>> +
>> +    int32x4_t a0 = vld1q_s32(buf32 - 15);
>> +    int32x4_t a1 = vld1q_s32(buf32 - 11);
>> +    int32x4_t a2 = vld1q_s32(buf32 - 7);
>> +    int32x4_t a3 = vld1q_s32(buf32 - 3);
>> +
>> +    int64x2_t b0 = vmull_s32(vget_low_s32(a0), vget_low_s32(coef0));
>> +    int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0));
>> +    int64x2_t b2 = vmlal_s32(b1, vget_low_s32(a1), vget_low_s32(coef1));
>> +    int64x2_t b3 = vmlal_s32(b2, vget_high_s32(a1), vget_high_s32(coef1));
>> +    int64x2_t b4 = vmlal_s32(b3, vget_low_s32(a2), vget_low_s32(coef2));
>> +    int64x2_t b5 = vmlal_s32(b4, vget_high_s32(a2), vget_high_s32(coef2));
>> +    int64x2_t b6 = vmlal_s32(b5, vget_low_s32(a3), vget_low_s32(coef3));
>> +    int64x2_t b7 = vmlal_s32(b6, vget_high_s32(a3), vget_high_s32(coef3));
>> +
>> +    int64x1_t c = vadd_s64(vget_low_s64(b7), vget_high_s64(b7));
>> +    int64x1_t cS = vshr_n_s64(c, 16);
>> +    int32x2_t d = vreinterpret_s32_s64(cS);
>> +    opus_int32 out = vget_lane_s32(d, 0);
>> +    return out;
>> +}
>
> So, this is not bit-exact in a portion of the code where I am personally
> wary of the problems that might cause, since (like most speech codecs)
> we can use slightly unstable filters. If there was a big speed advantage
> it might be worth the testing to make sure nothing diverges here
> significantly (and it's _probably_ fine), but I think you can actually
> do this faster while remaining bit-exact.
>
> If you shift up the contents of coef32 by 15 bits (which you can do,
> since you are already transforming them specially for this platform),
> you can use vqdmulhq_s32() to emulate SMULWB. You then have to do the
> addition in a separate instruction, but because you can keep all of the
> results in 32-bit, you get double the parallelism and only need half as
> many multiplies (which have much higher latency than addition). Overall
> it should be faster, and match the C code exactly.

Okay -- I'll consult with the guy who wrote the code originally, and try
it out.

>> +#define optional_coef_reversal(out, in, order) do { if (arch == 3) { optional_coef_reversal_neon(out, in, order); } } while (0)
>> +
>> +#endif
>> +
>> +opus_int32 silk_noise_shape_quantizer_short_prediction_neon(const opus_int32 *buf32, const opus_int32 *coef32);
>> +
>> +#if OPUS_ARM_PRESUME_NEON_INTR
>> +#undef silk_noise_shape_quantizer_short_prediction
>> +#define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch) ((void)arch,silk_noise_shape_quantizer_short_prediction_neon(in, coefRev))
>> +
>> +#elif OPUS_HAVE_RTCD
>> +
>> +/* silk_noise_shape_quantizer_short_prediction implementations take different parameters based on arch
>> +   (coef vs. coefRev) so can't use the usual IMPL table implementation */
>> +#undef silk_noise_shape_quantizer_short_prediction
>> +#define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch) (arch == 3 ? silk_noise_shape_quantizer_short_prediction_neon(in, coefRev) : silk_noise_shape_quantizer_short_prediction_c(in, coef, order))
>
> I'm also not wild about these hard-coded 3's. Right now what arch maps
> to what number is confined to arm_celt_map.c, which does not use the
> indices directly (only sorts its table entries by them). So we never got
> named constants for them. But if we have to re-organize what arch
> configurations we support, these might change. Random 3's scattered
> across the codebase are going to be hard to track down and update.

I see your point -- what do you suggest instead? As the comment mentions,
this can't use the usual IMPL table implementation, because the parameters
are different (due to the transformed coefficients.) Would something like
an OPUS_ARCH_ARM_NEON #define in celt/arm/armcpu.h be okay? It'd be a bit
confusing with OPUS_CPU_ARM_NEON in armcpu.c, but I could probably
organize it to be sensible.

> (also, I realize libopus doesn't have a line-length restriction, but a
> few newlines in here might be a mercy to those of us who work in
> 80-column terminals)

Okay.
Timothy B. Terriberry
2015-Dec-21 15:44 UTC
[opus] [Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.
Jonathan Lennox wrote:
> Would something like an OPUS_ARCH_ARM_NEON #define in celt/arm/armcpu.h
> be okay? It'd be a bit confusing with OPUS_CPU_ARM_NEON in armcpu.c, but
> I could probably organize it to be sensible.

Yes, this is what I had in mind. I would have suggested just replacing
the existing constants in armcpu.c (either just do the shifts manually,
or have some sort of _FLAG variant that encapsulates the shift).