thr3ads.net - opus - [opus] [Aarch64 00/11] Patches to enable Aarch64 [Nov 2015]

John Ridges

2015-Nov-10 20:45 UTC

[opus] [Aarch64 00/11] Patches to enable Aarch64

Since you're already set up for benchmarks, I would ask if you could 
benchmark the difference between using and not using the ARM64 inline 
assembly. I believe the original justification on ARMv7 for the assembly 
was the processor's panoply of multiply instructions and their long 
cycle times. It seems to me that the ARM64 processor is much more like 
an x86 one, where using a simpleminded C multiply gives just as good of 
results. Inline assembly tends to hobble the compiler's optimizer, and 
in ARM64's case, may actually be counterproductive.

The NEON code of course is valuable on all the ARM processors.


On 11/10/2015 1:00 PM, opus-request at xiph.org wrote:
> Date: Tue, 10 Nov 2015 19:32:35 +0000
> From: Jonathan Lennox <jonathan at vidyo.com>
> Subject: Re: [opus] [Aarch64 00/11] Patches to enable Aarch64	(arm64)
> 	optimizations, rebased to current master.
> To: "opus at xiph.org" <opus at xiph.org>
> Message-ID: <A0373653-FF01-472A-AC31-A68348384BF2 at vidyo.com>
> Content-Type: text/plain; charset="utf-8"
>
>
>> On Nov 6, 2015, at 9:05 PM, Jonathan Lennox <jonathan at vidyo.com> wrote:
>>
>> These have been tested for correctness under qemu (including running
>> the test vectors), but not yet performance tested on a live aarch64
>> CPU (which will probably be an iPhone).  I should be able to do this
>> Monday or Tuesday.
> I've now done this, on an iPhone 5S.  (Building with clang from Xcode 7.1)
>
> In fixed-point mode, relative to current HEAD of master, in my tests
> aarch64 gets a 10-12% encode boost, and a 6-7% decode boost, without Ne10.
> With Ne10, it's an 11-13% encode boost, and a 14-15% decode boost. (Current
> HEAD of master doesn't use Ne10 on aarch64 at all.)
>
> There's also about a 5-6% boost to aarch64 floating-point mode, since some
> of the optimizations apply to both fixed and float code.
>
> Fixed-point mode is still substantially faster than floating-point (about
> 20% faster for encode, about 10% faster for decode.)
>
> These patches also speed armv7 up substantially, since a number of the Neon
> intrinsics apply to armv7 as well.
>
> Any questions, feel free to ask me or ping me on #opus.
>

Jonathan Lennox

2015-Nov-10 21:37 UTC

> On Nov 10, 2015, at 3:45 PM, John Ridges <jridges at masque.com> wrote:
> 
> Since you're already set up for benchmarks, I would ask if you could 
> benchmark the difference between using and not using the ARM64 inline 
> assembly. I believe the original justification on ARMv7 for the assembly 
> was the processor's panoply of multiply instructions and their long 
> cycle times. It seems to me that the ARM64 processor is much more like 
> an x86 one, where using a simpleminded C multiply gives just as good of 
> results. Inline assembly tends to hobble the compiler's optimizer, and 
> in ARM64's case, may actually be counterproductive.
> 
> The NEON code of course is valuable on all the ARM processors.

No, configuring my patchset with --disable-asm (which disables both my celt and
silk inline assembly, patch 06/11 and 07/11) slows down encode by 2-3% and
decode by 5-6% on fixed-point arm64 (without Ne10).

Note that my submission has many *fewer* inline assembly snippets for ARM64 than
the ARMv7 code does.  The guy here at Vidyo who actually did this optimization
work (Johnny Lee, whose work I'm just massaging into submittable form) found
that many of the multiplies were indeed better as C, especially with (what's
now) the OPUS_FAST_INT64 test.
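
To illustrate the kind of multiply being discussed, here is a minimal sketch of a 16x32-bit fixed-point multiply in Q16, written two ways in plain C. The function names are hypothetical and are not taken from the Opus source or from the patch set; the sketch only assumes that a fast 64-bit integer multiply is available for the first variant, and that right-shifting a negative value is an arithmetic shift (implementation-defined in C, but true of the usual ARM and x86 compilers).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper names; not taken from the Opus source. */

/* Q16 multiply using a single 64-bit product: (a * b) >> 16.
 * On a 64-bit core this typically compiles to one multiply and one shift. */
static int32_t mult16_32_q16_wide(int16_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 16);
}

/* The same value computed with 32-bit arithmetic only, by splitting b
 * into a signed high half and an unsigned low half. */
static int32_t mult16_32_q16_split(int16_t a, int32_t b)
{
    int32_t hi = b >> 16;     /* signed high 16 bits (arithmetic shift assumed) */
    int32_t lo = b & 0xffff;  /* low 16 bits, in the range 0..65535 */
    return a * hi + ((a * lo) >> 16);
}

/* Sanity check: both variants must agree for a given input pair. */
static void check_same(int16_t a, int32_t b)
{
    assert(mult16_32_q16_wide(a, b) == mult16_32_q16_split(a, b));
}
```

Calling check_same() over a range of inputs is a cheap way to confirm the two forms agree before benchmarking them against an inline-assembly version.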

John Ridges

2015-Nov-10 21:49 UTC

Good to know. Thank you for the test.



John Ridges

2015-Nov-12 17:23 UTC

One other minor thing: I notice that in the inline assembly the result 
(rd) is constrained as an earlyclobber operand. What was the reason for 
that?
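
For readers unfamiliar with the constraint: in GCC-style extended asm, "=&r" marks an output as earlyclobber, meaning the output register may be written before all inputs have been read, so the compiler must not place it in the same register as any input. The sketch below is a hypothetical two-instruction example, not code from the patch set, in which the modifier is actually required. When a single instruction reads all of its inputs before writing its result, the modifier is not needed and only narrows the register allocator's choices. The function name and the plain-C fallback for non-AArch64 builds are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical example, not code from the patch set: returns a * b + a. */
static uint32_t mul_add_self(uint32_t a, uint32_t b)
{
#if defined(__GNUC__) && defined(__aarch64__)
    uint32_t rd;
    __asm__(
        "mul %w0, %w1, %w2\n\t"  /* rd is written here ...                   */
        "add %w0, %w0, %w1"      /* ... but input %1 (a) is read afterwards  */
        : "=&r"(rd)              /* '&' keeps rd out of the registers of a, b */
        : "r"(a), "r"(b));
    return rd;
#else
    /* Portable fallback so the sketch builds on other targets. */
    return a * b + a;
#endif
}

/* Usage example. */
static void demo(void)
{
    assert(mul_add_self(3u, 4u) == 15u);
}
```

Without the '&', the compiler would be free to assign rd and a to the same register, and the add would then read the product instead of the original a.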
