thr3ads.net - opus - [opus] [Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization. [Nov 2015]

Jonathan Lennox

2015-Nov-20 00:18 UTC

[opus] [Aarch64 00/11] Patches to enable Aarch64

> On Nov 19, 2015, at 5:47 PM, John Ridges <jridges at masque.com> wrote:
> 
> Any speedup from the intrinsics may just be swamped by the rest of the
> encode/decode process. But I think you really want SIG2WORD16 to be
> (vqmovns_s32(PSHR32((x), SIG_SHIFT)))

Yes, you're right. I forgot to run the vectors under qemu with my previous
version (oh, the embarrassment!). A fix is forthcoming once the tests actually run.
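
To make the suggested SIG2WORD16 concrete, here is a minimal sketch, not the code from the patch. It assumes SIG_SHIFT is 12 and that PSHR32 is a rounding arithmetic right shift; the helper names pshr32, sig2word16_scalar and sig2word16 are hypothetical. On AArch64 the saturating narrow is done by vqmovns_s32; elsewhere a scalar clamp stands in so the snippet still builds.

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

#define SIG_SHIFT 12 /* assumption: the shift used by the fixed-point build */

/* Rounding right shift modelled on PSHR32 (hypothetical helper).
 * Assumes >> on a negative int32_t is an arithmetic shift and that
 * the rounding add does not overflow. */
static int32_t pshr32(int32_t a, int shift)
{
    return (a + ((1 << shift) >> 1)) >> shift;
}

/* Scalar reference: round, then clamp to the int16_t range. */
static int16_t sig2word16_scalar(int32_t x)
{
    x = pshr32(x, SIG_SHIFT);
    if (x > 32767) x = 32767;
    if (x < -32768) x = -32768;
    return (int16_t)x;
}

/* Same result, using the saturating narrow intrinsic where available. */
static int16_t sig2word16(int32_t x)
{
#if defined(__aarch64__)
    return vqmovns_s32(pshr32(x, SIG_SHIFT));
#else
    return sig2word16_scalar(x);
#endif
}
```

The saturation is the point: a plain cast after the shift would wrap for out-of-range values, whereas both paths above clamp.
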
> On 11/19/2015 2:52 PM, Jonathan Lennox wrote:
>>> On Nov 16, 2015, at 4:42 PM, Jonathan Lennox <jonathan at vidyo.com> wrote:
>>> 
>>> I haven't yet tried replacing SIG2WORD16 (or
>>> silk_ADD_SAT32/silk_SUB_SAT32) with Neon intrinsics.  That's an obvious next
>>> step.
>> This doesn't show any appreciable speed difference in my tests, but the
>> code is obviously better by inspection (all three of these map directly to a
>> single Aarch64 instruction and a single Neon intrinsic) so my code paths may
>> just not exercise them.
>> 
>> Patches follow.
>> 
>

John Ridges

2015-Nov-23 17:04 UTC


[opus] [Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.

Hi Jonathan.

I really, really hate to bring this up this late in the game, but I just
noticed that your NEON code doesn't use any of the "high" intrinsics for
ARM64, e.g. instead of:

int32x4_t coef1 = vmovl_s16(vget_high_s16(coef16));

you could use:

int32x4_t coef1 = vmovl_high_s16(coef16);

and instead of:

int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0));

you could use:

int64x2_t b1 = vmlal_high_s32(b0, a0, coef0);

and instead of:

int64x1_t c = vadd_s64(vget_low_s64(b3), vget_high_s64(b3));
int64x1_t cS = vshr_n_s64(c, 16);
int32x2_t d = vreinterpret_s32_s64(cS);
out = vget_lane_s32(d, 0);

you could use:

out = (opus_int32)(vaddvq_s64(b3) >> 16);

I understand that ARM added these intrinsics because "vget_high_xxx" 
generates an instruction in ARM64, and isn't just free the way it was in 
ARMv7 ("vget_low_xxx" is of course still free on both platforms).
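
As a rough illustration of the last replacement above, the sketch below puts the two forms of the reduction side by side with a scalar model. The function names reduce_scalar and reduce, and the HAVE_NEON_SKETCH macro, are hypothetical; the array argument stands in for the b3 vector. It assumes a little-endian target, so lane 0 of the reinterpreted vector holds the low 32 bits.

```c
#include <stdint.h>
#if defined(__aarch64__) || defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON_SKETCH 1 /* hypothetical macro, local to this sketch */
#endif

/* Scalar model: add the two 64-bit lanes, shift right by 16,
 * keep the low 32 bits. */
static int32_t reduce_scalar(const int64_t b3[2])
{
    return (int32_t)((b3[0] + b3[1]) >> 16);
}

static int32_t reduce(const int64_t b3[2])
{
#if defined(__aarch64__)
    /* ARM64 only: one across-vector add replaces the lane shuffling. */
    return (int32_t)(vaddvq_s64(vld1q_s64(b3)) >> 16);
#elif defined(HAVE_NEON_SKETCH)
    /* Form that also compiles on ARMv7: split, add, shift, extract. */
    int64x2_t v = vld1q_s64(b3);
    int64x1_t c = vadd_s64(vget_low_s64(v), vget_high_s64(v));
    int64x1_t cS = vshr_n_s64(c, 16);
    return vget_lane_s32(vreinterpret_s32_s64(cS), 0);
#else
    return reduce_scalar(b3);
#endif
}
```

All three branches are meant to return the same value; only the number of instructions the compiler has to emit differs.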

Regards,

John Ridges

Jonathan Lennox

2015-Nov-23 18:11 UTC


[opus] [Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.

> On Nov 23, 2015, at 12:04 PM, John Ridges <jridges at masque.com> wrote:
> 
> Hi Jonathan.
> 
> I really, really hate to bring this up this late in the game, but I just noticed
> that your NEON code doesn't use any of the "high" intrinsics for
> ARM64, e.g. instead of:
> 
> int32x4_t coef1 = vmovl_s16(vget_high_s16(coef16));
> 
> you could use:
> 
> int32x4_t coef1 = vmovl_high_s16(coef16);
> 
> and instead of:
> 
> int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0));
> 
> you could use:
> 
> int64x2_t b1 = vmlal_high_s32(b0, a0, coef0);
> 
> and instead of:
> 
> int64x1_t c = vadd_s64(vget_low_s64(b3), vget_high_s64(b3));
> int64x1_t cS = vshr_n_s64(c, 16);
> int32x2_t d = vreinterpret_s32_s64(cS);
> out = vget_lane_s32(d, 0);
> 
> you could use:
> 
> out = (opus_int32)(vaddvq_s64(b3) >> 16);
> 
> I understand that ARM added these intrinsics because "vget_high_xxx"
> generates an instruction in ARM64, and isn't just free the way it was in
> ARMv7 ("vget_low_xxx" is of course still free on both platforms).

Other than the one-intrinsic optimizations, I'd rather keep the Neon intrinsics
code compilable on ARMv7 as well as ARM64. The Neon code is a performance boost
for both platforms, and I'd rather not litter it with #ifdefs unless there's a
large difference between the platforms.
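
For illustration only, the trade-off can be sketched as follows: if the ARM64-only form were wanted, the conditional could be confined to a single helper rather than spread through the code. This is not what the patch does. The name sketch_mlal_high_s32 and the HAVE_NEON_SKETCH macro are hypothetical, and the plain-C branch exists only so the snippet builds off ARM.

```c
#include <stdint.h>
#if defined(__aarch64__) || defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON_SKETCH 1 /* hypothetical macro, local to this sketch */
#endif

/* Widening multiply-accumulate on the upper halves of a and c:
 * acc[i] += (int64_t)a[2 + i] * c[2 + i] for i = 0, 1. */
static void sketch_mlal_high_s32(int64_t acc[2], const int32_t a[4],
                                 const int32_t c[4])
{
#if defined(__aarch64__)
    /* ARM64-only intrinsic; corresponds to a single smlal2. */
    int64x2_t r = vmlal_high_s32(vld1q_s64(acc), vld1q_s32(a), vld1q_s32(c));
    vst1q_s64(acc, r);
#elif defined(HAVE_NEON_SKETCH)
    /* Form that compiles on both ARMv7 and ARM64. */
    int64x2_t r = vmlal_s32(vld1q_s64(acc),
                            vget_high_s32(vld1q_s32(a)),
                            vget_high_s32(vld1q_s32(c)));
    vst1q_s64(acc, r);
#else
    /* Plain C model of the same arithmetic. */
    acc[0] += (int64_t)a[2] * c[2];
    acc[1] += (int64_t)a[3] * c[3];
#endif
}
```

If the compiler already turns the portable form into the "high" instruction, the first branch buys nothing and the middle branch alone is enough.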

It looks like Clang (the version in Xcode 7.1.1, at least) is smart enough to
optimize the first two operations you mention, figuring out sshll2 and smlal2
properly, though the third causes a gratuitous extra "ext.16b" to be generated.
I've filed a missed-optimization bug on Clang for the latter.

Here's the code it generates:

_silk_NSQ_noise_shape_feedback_loop_neon:
000000000000004c        ldr        w9, [x0]
0000000000000050        cmp        w3, #8
0000000000000054        b.ne       0x9c
0000000000000058        dup.4s     v0, w9
000000000000005c        ldr        q1, [x1]
0000000000000060        ext.16b    v0, v0, v1, #12
0000000000000064        ldur       q1, [x1, #12]
0000000000000068        ldr        q2, [x2]
000000000000006c        sshll.4s   v3, v2, #0
0000000000000070        sshll2.4s  v2, v2, #0
0000000000000074        smull.2d   v4, v0, v3
0000000000000078        smlal2.2d  v4, v0, v3
000000000000007c        smlal.2d   v4, v1, v2
0000000000000080        smlal2.2d  v4, v1, v2
0000000000000084        ext.16b    v2, v4, v4, #8
0000000000000088        add        d2, d4, d2
000000000000008c        sshr       d2, d2, #16
0000000000000090        fmov       w0, s2
0000000000000094        stp        q0, q1, [x1]
0000000000000098        ret

(Non-vectorized code for non-order-8 omitted.)