Simon Taylor
2014-Dec-08 09:05 UTC
[LLVMdev] NEON intrinsics preventing redundant load optimization?
On 8 Dec 2014, at 00:13, Renato Golin <renato.golin at linaro.org> wrote:> On 7 December 2014 at 19:15, Simon Taylor <simontaylor1 at ntlworld.com> wrote: >> Is there something about the use of intrinsics that prevents the compiler optimizing out the redundant store on the stack? Is there any hope for this improving in the future, or anything I can do now to improve the generated code? > > If I had to guess, I'd say the intrinsic got in the way of recognising > the pattern. vmulq_f32 got correctly lowered to IR as "fmul", but > vld1q_f32 is still kept as an intrinsic, so register allocators and > schedulers get confused and, when lowering to assembly, you're left > with garbage around it. > > Creating a bug for this is probably the best thing to do, since this > is a common pattern that needs looking into to produce optimal code.Thanks for the responses. I’ve filed bug #21778 for this: http://llvm.org/bugs/show_bug.cgi?id=21778 I’ve also tried replacing the vst1.32 with setting the data[i] elements individually with vgetq_lane, which gets at least the single multiply case back to optimal code. There’s still an unneeded temporary when doing res = a * b * c though. Anyway, let’s continue this on the bug tracker :) Simon
Jim Grosbach
2014-Dec-09 02:20 UTC
[LLVMdev] NEON intrinsics preventing redundant load optimization?
> On Dec 8, 2014, at 1:05 AM, Simon Taylor <simontaylor1 at ntlworld.com> wrote: > > On 8 Dec 2014, at 00:13, Renato Golin <renato.golin at linaro.org> wrote: > >> On 7 December 2014 at 19:15, Simon Taylor <simontaylor1 at ntlworld.com> wrote: >>> Is there something about the use of intrinsics that prevents the compiler optimizing out the redundant store on the stack? Is there any hope for this improving in the future, or anything I can do now to improve the generated code? >> >> If I had to guess, I'd say the intrinsic got in the way of recognising >> the pattern. vmulq_f32 got correctly lowered to IR as "fmul", but >> vld1q_f32 is still kept as an intrinsic, so register allocators and >> schedulers get confused and, when lowering to assembly, you're left >> with garbage around it. >> >> Creating a bug for this is probably the best thing to do, since this >> is a common pattern that needs looking into to produce optimal code. > > Thanks for the responses. I’ve filed bug #21778 for this: > http://llvm.org/bugs/show_bug.cgi?id=21778 > > I’ve also tried replacing the vst1.32 with setting the data[i] elements individually with vgetq_lane, which gets at least the single multiply case back to optimal code. There’s still an unneeded temporary when doing res = a * b * c though. Anyway, let’s continue this on the bug tracker :) >FWIW, with top of tree clang, I get the same (good) code for both of the implementations of operator* in the original email. That appears to be a fairly recent improvement, though I haven’t bisected it down or anything. -Jim> Simon > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Simon Taylor
2014-Dec-10 09:19 UTC
[LLVMdev] NEON intrinsics preventing redundant load optimization?
On 9 Dec 2014, at 02:20, Jim Grosbach <grosbach at apple.com> wrote:>> On Dec 8, 2014, at 1:05 AM, Simon Taylor <simontaylor1 at ntlworld.com> wrote: >> >> On 8 Dec 2014, at 00:13, Renato Golin <renato.golin at linaro.org> wrote: >> >>> On 7 December 2014 at 19:15, Simon Taylor <simontaylor1 at ntlworld.com> wrote: >>>> Is there something about the use of intrinsics that prevents the compiler optimizing out the redundant store on the stack? Is there any hope for this improving in the future, or anything I can do now to improve the generated code? >>> >>> If I had to guess, I'd say the intrinsic got in the way of recognising >>> the pattern. vmulq_f32 got correctly lowered to IR as "fmul", but >>> vld1q_f32 is still kept as an intrinsic, so register allocators and >>> schedulers get confused and, when lowering to assembly, you're left >>> with garbage around it. > > FWIW, with top of tree clang, I get the same (good) code for both of the implementations of operator* in the original email. That appears to be a fairly recent improvement, though I haven’t bisected it down or anything.Thanks for the note. I’m building a recent checkout now to see if I see the same behaviour. Looking at the -emit-llvm output from the XCode build shows "load <4 x float>* %1, align 4, !tbaa !3” for the C implementation where redundant temporaries are successfully eliminated. With the intrinsics the IR still contains "tail call <4 x float> @llvm.arm.neon.vld1.v4f32(i8* %1, i32 4)”. Perhaps this is due to NEON supporting interleaved loads for loading arrays of structs into vector registers of each element. I suspect that isn’t very common across architectures, so it wouldn’t surprise me if there was no IR instruction for the interleaved cases (vld[234].*). It seems the vld1.* and vst1.* do have those direct IR representations though. It’s great news if this is fixed in the current tip, but in the short term (for app store builds using the official toolchain) are there any LLVM-specific extensions to initialise a float32x4_t that will get lowered to the "load <4 x float>* %1” form? Or is that more a question for the clang folks? Simon