Hi Jonathan,
I'm sorry to bring this up again, and I don't want to beat a dead horse,
but I was very surprised by your benchmarks so I took a little closer look.
I think what's happening is that it's a little unfair to compare the
ARM64 inline assembly to the C code, because looking at the C macros in
"fixed_generic.h" for MULT16_32_Q16 and MULT16_32_Q15 you find they
are
implemented with two multiplies, two shifts, an AND and an ADD. It's not
hard for me to believe that your inline assembly is faster than that
mess. But on a 64-bit machine, there's no reason to go through all that
when a simple 64-bit multiply will do. The SILK macros in "macros.h"
already make this distinction but you don't see it in your benchmarks
because the OPUS_FAST_INT64 macro only looks for 64-bit x86, and doesn't
check for ARM64.
I don't think it's a coincidence that the macros you didn't replace
only
performed one multiply while the ones you did replace performed two.
I think it would be very interesting to try the benchmarks again after
adding detection for __aarch64__ to OPUS_FAST_INT64, and replacing
MULT16_32_Q15 with something like:
#define MULT16_32_Q15(a,b)
((opus_val32)SHR((opus_int64)((opus_val16)(a))*(b),15))
and MULT16_32_Q16 with:
#define MULT16_32_Q16(a,b)
((opus_val32)SHR((opus_int64)((opus_val16)(a))*(b),16))
There isn't an "opus_val64" or I would have used it (it would be
defined
as "float" for floating point).
I think the problem here is that "fixed_generic.h" needs to be updated
for 64-bit machines the way "macros.h" was. x86 64-bit machines will
perform just as poorly using the current macros as ARM64 does.
Also, I think you need to use the "fixed_generic.h" code as the
reference for any replacements. I noticed that MULT16_32_Q15_arm64 went
out of its way to clear the LSB of the result, which matches what the
32-bit ARM inline assembly returns, but *doesn't* match what
"fixed_generic.h" returns. As I understand it, losing the LSB in
32-bit
ARM was a compromise between speed and accuracy, but there shouldn't be
any reason to do that in ARM64.
Anyway I'll stop talking now. I'm not saying that the inline assembly
isn't faster, but I don't think it's giving you as much of a gain
over C
as you think.
--John Ridges
On 11/13/2015 9:30 AM, Jonathan Lennox wrote:>> On Nov 12, 2015, at 12:23 PM, John Ridges <jridges at masque.com>
wrote:
>>
>> One other minor thing: I notice that in the inline assembly the result
(rd) is constrained as an earlyclobber operand. What was the reason for that?
> Possibly an error? Probably from modeling it on macros_armv4.h, which I
guess does require earlyclobber. I?ll ask the developer.