Thanks, I look forward to seeing what you find out. BTW, I was wondering if you tried replacing the SIG2WORD16 macro using the vqmovns_s32 intrinsic? I'm sure it would be faster than the C code, but in the grand scheme of things it might not make much difference. On 11/13/2015 12:15 PM, Jonathan Lennox wrote:>> On Nov 13, 2015, at 1:51 PM, John Ridges <jridges at masque.com> wrote: >> >> Hi Jonathan, >> >> I'm sorry to bring this up again, and I don't want to beat a dead horse, but I was very surprised by your benchmarks so I took a little closer look. >> >> I think what's happening is that it's a little unfair to compare the ARM64 inline assembly to the C code, because looking at the C macros in "fixed_generic.h" for MULT16_32_Q16 and MULT16_32_Q15 you find they are implemented with two multiplies, two shifts, an AND and an ADD. It's not hard for me to believe that your inline assembly is faster than that mess. But on a 64-bit machine, there's no reason to go through all that when a simple 64-bit multiply will do. The SILK macros in "macros.h" already make this distinction but you don't see it in your benchmarks because the OPUS_FAST_INT64 macro only looks for 64-bit x86, and doesn't check for ARM64. > No, __LP64__ is set on arm64, both on iOS and on Linux. (64-bit long and pointer.) This is included in the OPUS_FAST_INT64 test. > > The tests for __x86_64__ and for _WIN64 are in OPUS_FAST_INT64 because those are the two common platforms with fast native int64 types which *aren?t* LP64 ? x86_64 can be x32, and Win64 is always LLP64 (32-bit long, 64-bit long long and pointer). > >> I don't think it's a coincidence that the macros you didn't replace only performed one multiply while the ones you did replace performed two. >> >> I think it would be very interesting to try the benchmarks again after adding detection for __aarch64__ to OPUS_FAST_INT64, and replacing MULT16_32_Q15 with something like: >> #define MULT16_32_Q15(a,b) ((opus_val32)SHR((opus_int64)((opus_val16)(a))*(b),15)) >> and MULT16_32_Q16 with: >> #define MULT16_32_Q16(a,b) ((opus_val32)SHR((opus_int64)((opus_val16)(a))*(b),16)) >> There isn't an "opus_val64" or I would have used it (it would be defined as "float" for floating point). >> >> I think the problem here is that "fixed_generic.h" needs to be updated for 64-bit machines the way "macros.h" was. x86 64-bit machines will perform just as poorly using the current macros as ARM64 does. > This, however, seems reasonable to me. I?ll see if it?s possible to pick up the OPUS_FAST_INT64 macro in fixed_generic.h, which may require moving its definition to avoid #include tangles. > >> Also, I think you need to use the "fixed_generic.h" code as the reference for any replacements. I noticed that MULT16_32_Q15_arm64 went out of its way to clear the LSB of the result, which matches what the 32-bit ARM inline assembly returns, but *doesn't* match what "fixed_generic.h" returns. As I understand it, losing the LSB in 32-bit ARM was a compromise between speed and accuracy, but there shouldn't be any reason to do that in ARM64. > Okay, I?ll try that if the int64 multiply doesn?t help. > >> Anyway I'll stop talking now. I'm not saying that the inline assembly isn't faster, but I don't think it's giving you as much of a gain over C as you think. >> >> --John Ridges >> >> >> On 11/13/2015 9:30 AM, Jonathan Lennox wrote: >>>> On Nov 12, 2015, at 12:23 PM, John Ridges <jridges at masque.com> wrote: >>>> >>>> One other minor thing: I notice that in the inline assembly the result (rd) is constrained as an earlyclobber operand. What was the reason for that? >>> Possibly an error? Probably from modeling it on macros_armv4.h, which I guess does require earlyclobber. I?ll ask the developer.-------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.xiph.org/pipermail/opus/attachments/20151113/ea13efaf/attachment.htm
I?ve tried adding support for OPUS_FAST_INT64 to celt/arch.h, and I?ve found that this is indeed comparable in speed, if not a touch faster, than my inline assembly. I?ll submit patches for this. The inline assembly parts of my aarch64 patch set can thus be considered withdrawn. I haven?t yet tried replacing SIG2WORD16 (or silk_ADD_SAT32/silk_SUB_SAT32) with Neon intrinsics. That?s an obvious next step. On Nov 13, 2015, at 2:47 PM, John Ridges <jridges at masque.com<mailto:jridges at masque.com>> wrote: Thanks, I look forward to seeing what you find out. BTW, I was wondering if you tried replacing the SIG2WORD16 macro using the vqmovns_s32 intrinsic? I'm sure it would be faster than the C code, but in the grand scheme of things it might not make much difference. On 11/13/2015 12:15 PM, Jonathan Lennox wrote: On Nov 13, 2015, at 1:51 PM, John Ridges <jridges at masque.com><mailto:jridges at masque.com> wrote: Hi Jonathan, I'm sorry to bring this up again, and I don't want to beat a dead horse, but I was very surprised by your benchmarks so I took a little closer look. I think what's happening is that it's a little unfair to compare the ARM64 inline assembly to the C code, because looking at the C macros in "fixed_generic.h" for MULT16_32_Q16 and MULT16_32_Q15 you find they are implemented with two multiplies, two shifts, an AND and an ADD. It's not hard for me to believe that your inline assembly is faster than that mess. But on a 64-bit machine, there's no reason to go through all that when a simple 64-bit multiply will do. The SILK macros in "macros.h" already make this distinction but you don't see it in your benchmarks because the OPUS_FAST_INT64 macro only looks for 64-bit x86, and doesn't check for ARM64. No, __LP64__ is set on arm64, both on iOS and on Linux. (64-bit long and pointer.) This is included in the OPUS_FAST_INT64 test. The tests for __x86_64__ and for _WIN64 are in OPUS_FAST_INT64 because those are the two common platforms with fast native int64 types which *aren?t* LP64 ? x86_64 can be x32, and Win64 is always LLP64 (32-bit long, 64-bit long long and pointer). I don't think it's a coincidence that the macros you didn't replace only performed one multiply while the ones you did replace performed two. I think it would be very interesting to try the benchmarks again after adding detection for __aarch64__ to OPUS_FAST_INT64, and replacing MULT16_32_Q15 with something like: #define MULT16_32_Q15(a,b) ((opus_val32)SHR((opus_int64)((opus_val16)(a))*(b),15)) and MULT16_32_Q16 with: #define MULT16_32_Q16(a,b) ((opus_val32)SHR((opus_int64)((opus_val16)(a))*(b),16)) There isn't an "opus_val64" or I would have used it (it would be defined as "float" for floating point). I think the problem here is that "fixed_generic.h" needs to be updated for 64-bit machines the way "macros.h" was. x86 64-bit machines will perform just as poorly using the current macros as ARM64 does. This, however, seems reasonable to me. I?ll see if it?s possible to pick up the OPUS_FAST_INT64 macro in fixed_generic.h, which may require moving its definition to avoid #include tangles. Also, I think you need to use the "fixed_generic.h" code as the reference for any replacements. I noticed that MULT16_32_Q15_arm64 went out of its way to clear the LSB of the result, which matches what the 32-bit ARM inline assembly returns, but *doesn't* match what "fixed_generic.h" returns. As I understand it, losing the LSB in 32-bit ARM was a compromise between speed and accuracy, but there shouldn't be any reason to do that in ARM64. Okay, I?ll try that if the int64 multiply doesn?t help. Anyway I'll stop talking now. I'm not saying that the inline assembly isn't faster, but I don't think it's giving you as much of a gain over C as you think. --John Ridges On 11/13/2015 9:30 AM, Jonathan Lennox wrote: On Nov 12, 2015, at 12:23 PM, John Ridges <jridges at masque.com><mailto:jridges at masque.com> wrote: One other minor thing: I notice that in the inline assembly the result (rd) is constrained as an earlyclobber operand. What was the reason for that? Possibly an error? Probably from modeling it on macros_armv4.h, which I guess does require earlyclobber. I?ll ask the developer. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.xiph.org/pipermail/opus/attachments/20151116/8710f6f9/attachment.htm
> On Nov 16, 2015, at 4:42 PM, Jonathan Lennox <jonathan at vidyo.com> wrote: > > I haven?t yet tried replacing SIG2WORD16 (or silk_ADD_SAT32/silk_SUB_SAT32) with Neon intrinsics. That?s an obvious next step.This doesn?t show any appreciable speed difference in my tests, but the code is obviously better by inspection (all three of these map directly to a single Aarch64 instruction and a single Neon intrinsic) so my code paths may just not exercise them. Patches follow.
Reasonably Related Threads
- [Aarch64 00/11] Patches to enable Aarch64
- [PATCH 1/3] Add configure check for Aarch64-specific Neon intrinsics.
- [Aarch64 v2 10/18] Clean up some intrinsics-related wording in configure.
- [Aarch64 00/11] Patches to enable Aarch64
- [Aarch64 00/11] Patches to enable Aarch64