Timothy B. Terriberry
2014-Dec-18 05:59 UTC
[opus] [RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Almost there... just a few nits left. Viswanath Puttagunta wrote:> +if OPUS_ARM_NEON_INTR > +CELT_SOURCES += $(CELT_SOURCES_ARM_NEON_INTR) > +OPUS_ARM_NEON_INTR_CPPFLAGS = -mfpu=neon -O3I'll repeat: I don't think you should change the optimization level here.> + /* Just unroll the rest of the loop */I saw you decided to keep this unrolled, but you didn't actually answer my question.> -#elif defined(OPUS_ARM_ASM) && defined(FIXED_POINT) > +#endif > + > +#if defined(OPUS_ARM_NEON_INTR) > +#include "arm/celt_neon_intr.c" > +#endif > + > +#if ((defined(OPUS_ARM_ASM) && defined(FIXED_POINT)) \ > + || defined(OPUS_ARM_NEON_INTR)) > #include "arm/arm_celt_map.c" > #endifYou should keep the #elif (the intent was to _guarantee_ that only one of x86_celt_map.c or arm_celt_map.c would be included). Instead, just move the #if defined(OPUS_ARM_NEON_INTR) block inside that #elif (both in test_unit_mathops.c and test_unit_rotation.c).
Viswanath Puttagunta
2014-Dec-18 15:38 UTC
[opus] [RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi Timothy, I responded to your feedback before I started on RFCv3.. and took your silence as approval :).. I guess that email got lost in your inbox sea some where.. so re-posting the responses. So, this time, I will wait for your ack before I proceed to RFCv4.. Although.. after all these reviews.. I may just submit it as PATCHv1.. since the patch is almost there and there are no major objections any more.. Regards, Vish On 17 December 2014 at 23:59, Timothy B. Terriberry <tterribe at xiph.org> wrote:> > Almost there... just a few nits left. > > Viswanath Puttagunta wrote: > > +if OPUS_ARM_NEON_INTR > > +CELT_SOURCES += $(CELT_SOURCES_ARM_NEON_INTR) > > +OPUS_ARM_NEON_INTR_CPPFLAGS = -mfpu=neon -O3 > > I'll repeat: I don't think you should change the optimization level here.After you feedback, I moved this part from configure.ac to Makefile.am.. If it doesn't belong here either, then I need some guidance. I wouldn't know where else to put this. Without "-mfpu=neon", the intrinsic code wouldn't even compile. And without -O3.. the instructions that are generated are just painful. Please advise.> > > + /* Just unroll the rest of the loop */ > > I saw you decided to keep this unrolled, but you didn't actually answer > my question.Copy paste from my previous email. I got the idea of unrolling this from your earlier comment "This load is always redundant in the first iteration, which is a bit unfortunate." Now, that I unrolled it and tested it to make sure it works, is there a specific reason you think this is will be any slower than a simple loop and want to go back? I know it saves one load as compared to a simple loop. It would be hard to prove that it is measurably better than the simple loop. But I would sincerely prefer it this way.. It is fairly straight forward unroll in my opinion.> > > -#elif defined(OPUS_ARM_ASM) && defined(FIXED_POINT) > > +#endif > > + > > +#if defined(OPUS_ARM_NEON_INTR) > > +#include "arm/celt_neon_intr.c" > > +#endif > > + > > +#if ((defined(OPUS_ARM_ASM) && defined(FIXED_POINT)) \ > > + || defined(OPUS_ARM_NEON_INTR)) > > #include "arm/arm_celt_map.c" > > #endif > > You should keep the #elif (the intent was to _guarantee_ that only one > of x86_celt_map.c or arm_celt_map.c would be included). Instead, just > move the #if defined(OPUS_ARM_NEON_INTR) block inside that #elif (both > in test_unit_mathops.c and test_unit_rotation.c).Will do> _______________________________________________ > opus mailing list > opus at xiph.org > http://lists.xiph.org/mailman/listinfo/opus
Timothy B. Terriberry
2014-Dec-19 11:26 UTC
[opus] [RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Viswanath Puttagunta wrote:> I responded to your feedback before I started on RFCv3.. and took your > silence as approval :).. I guess that email got lost in your inbox sea > some where.. so re-posting the responses.Sorry, I did see it but I guess I read it rather more quickly than I thought. Apologies for that.> guidance. I wouldn't know where else to put this. WithoutSorry for not being clear. I meant you shouldn't put the -O3 anywhere at all.> "-mfpu=neon", the intrinsic code wouldn't even compile. And without > -O3.. the instructions that are generated are just painful. Please > advise.With what compiler? Using gcc 4.2 on my rather outdated Cortex A8 setup (Greg is working on getting a more modern one configured for us), the assembly output with -O3 and (the default) -O2 is identical. However, if someone actually wants to make an unoptimized build for some reason (e.g., testing or debugging), you've broken that.> Now, that I unrolled it and tested it to make sure it works, is there > a specific reason you think this is will be any slower than a simple > loop and want to go back? I know it saves one load as compared to a > simple loop. It would be hard to prove that it is measurably better > than the simple loop. But I would sincerely prefer it this way.. It is > fairly straight forward unroll in my opinion.It makes sense to optimize for executed cycles or readability/code size, but this doesn't really do either. You'll see the code in celt_pitch_xcorr_arm.s processes the last samples in groups of 2, 1, 1. It costs one extra compare (only one, because keep in mind the switch has an extra compare since gcc is not smart enough to prove that len==0 is impossible, despite our assert), but avoids an indirect jump. That's all probably irrelevant, as the bottlenecks here are going to be the stalls waiting for the loads and multiplies on the NEON side, and this has one fewer of both when len is 3 or 4. If you were going to fully unroll, that's kind of what I expected. I agree that doing the extra load is suboptimal, but you can still loop the first three iterations, which means half the code. That's good for code cache size and readability (i.e., Don't Repeat Yourself). If there were a measurable performance advantage to the switch statement, that would change my opinion, but if there isn't, then either of the other two approaches seem better. One more nit I hadn't noticed before:> + float *xi = x; > + float *yi = y;These need to be const float32_t (in both xcorr_kernel_neon_float and xcorr_kernel_neon_float_process1). They're currently causing a ton of warning spew. float32_t appears to not be considered equivalent to float, which means you'll also need casts here:> + vst1q_f32(sum, SUMM);and here:> + vst1_lane_f32(sum, SUMM_2[0], 0);
Possibly Parallel Threads
- [RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
- [RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
- [RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
- [PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
- [RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics