thr3ads.net - opus - [opus] [Aarch64 00/11] Patches to enable Aarch64 [Nov 2015]

If this information is useful, please help other people find it:
Share via:

John Ridges

2015-Nov-12 17:23 UTC

[opus] [Aarch64 00/11] Patches to enable Aarch64

One other minor thing: I notice that in the inline assembly the result 
(rd) is constrained as an earlyclobber operand. What was the reason for 
that?

Jonathan Lennox

2015-Nov-13 16:30 UTC

head link

[opus] [Aarch64 00/11] Patches to enable Aarch64

> On Nov 12, 2015, at 12:23 PM, John Ridges <jridges at masque.com>
wrote:
> 
> One other minor thing: I notice that in the inline assembly the result (rd)
is constrained as an earlyclobber operand. What was the reason for that?
Possibly an error?  Probably from modeling it on macros_armv4.h, which I guess
does require earlyclobber.  I?ll ask the developer.

John Ridges

2015-Nov-13 18:51 UTC

head link

[opus] [Aarch64 00/11] Patches to enable Aarch64

Hi Jonathan,

I'm sorry to bring this up again, and I don't want to beat a dead horse,
but I was very surprised by your benchmarks so I took a little closer look.

I think what's happening is that it's a little unfair to compare the 
ARM64 inline assembly to the C code, because looking at the C macros in 
"fixed_generic.h" for MULT16_32_Q16 and MULT16_32_Q15 you find they
are
implemented with two multiplies, two shifts, an AND and an ADD. It's not 
hard for me to believe that your inline assembly is faster than that 
mess. But on a 64-bit machine, there's no reason to go through all that 
when a simple 64-bit multiply will do. The SILK macros in "macros.h" 
already make this distinction but you don't see it in your benchmarks 
because the OPUS_FAST_INT64 macro only looks for 64-bit x86, and doesn't 
check for ARM64.
I don't think it's a coincidence that the macros you didn't replace
only
performed one multiply while the ones you did replace performed two.

I think it would be very interesting to try the benchmarks again after 
adding detection for __aarch64__ to OPUS_FAST_INT64, and replacing 
MULT16_32_Q15 with something like:
#define MULT16_32_Q15(a,b) 
((opus_val32)SHR((opus_int64)((opus_val16)(a))*(b),15))
and MULT16_32_Q16 with:
#define MULT16_32_Q16(a,b) 
((opus_val32)SHR((opus_int64)((opus_val16)(a))*(b),16))
There isn't an "opus_val64" or I would have used it (it would be
defined
as "float" for floating point).

I think the problem here is that "fixed_generic.h" needs to be updated
for 64-bit machines the way "macros.h" was. x86 64-bit machines will 
perform just as poorly using the current macros as ARM64 does.

Also, I think you need to use the "fixed_generic.h" code as the 
reference for any replacements. I noticed that MULT16_32_Q15_arm64 went 
out of its way to clear the LSB of the result, which matches what the 
32-bit ARM inline assembly returns, but *doesn't* match what 
"fixed_generic.h" returns. As I understand it, losing the LSB in
32-bit
ARM was a compromise between speed and accuracy, but there shouldn't be 
any reason to do that in ARM64.

Anyway I'll stop talking now. I'm not saying that the inline assembly 
isn't faster, but I don't think it's giving you as much of a gain
over C
as you think.

--John Ridges

On 11/13/2015 9:30 AM, Jonathan Lennox wrote:>> On Nov 12, 2015, at 12:23 PM, John Ridges <jridges at masque.com>
wrote:
>>
>> One other minor thing: I notice that in the inline assembly the result
(rd) is constrained as an earlyclobber operand. What was the reason for that?
> Possibly an error?  Probably from modeling it on macros_armv4.h, which I
guess does require earlyclobber.  I?ll ask the developer.

Seemingly Similar Threads

Search for more maybe matching threads

opus - Nov 2015 - [Aarch64 00/11] Patches to enable Aarch64

[opus] [Aarch64 00/11] Patches to enable Aarch64

[opus] [Aarch64 00/11] Patches to enable Aarch64

[opus] [Aarch64 00/11] Patches to enable Aarch64

Seemingly Similar Threads