Simon Taylor
2014-Dec-07 19:15 UTC
[LLVMdev] NEON intrinsics preventing redundant load optimization?
Hi all, I’m not sure if this is the right list, so apologies if not. Doing some profiling I noticed some of my hand-tuned matrix multiply code with NEON intrinsics was much slower through a C++ template wrapper vs calling the intrinsics function directly. It turned out clang/LLVM was unable to eliminate a temporary even though the case seemed quite straightforward. Unfortunately any loads directly after NEON stores seem to be bad news on many arm cores (the wrapped version that stores to a temporary, then loads and stores back to the final location was almost 4x slower than the direct version without the temporary). I'm using the clang in the latest XCode + iOS SDK: Apple LLVM version 6.0 (clang-600.0.56) (based on LLVM 3.5svn) Here's a simplified test case: struct vec4 { float data[4]; }; vec4 operator* (vec4& a, vec4& b) { vec4 result; for(int i = 0; i < 4; ++i) result.data[i] = a.data[i] * b.data[i]; return result; } void TestVec4Multiply(vec4& a, vec4& b, vec4& result) { result = a * b; } With -O3 the loop gets vectorized and the code generated looks optimal: __Z16TestVec4MultiplyR4vec4S0_S0_: @ BB#0: vld1.32 {d16, d17}, [r1] vld1.32 {d18, d19}, [r0] vmul.f32 q8, q9, q8 vst1.32 {d16, d17}, [r2] bx lr However if I replace the operator* with a NEON intrinsic implementation (I know the vectorizer figured out optimal code in this case anyway, but that wasn't true for my real situation) then the temporary "result" seems to be kept in the generated code for the test function, and triggers the bad penalty of a load after a NEON store. vec4 operator* (vec4& a, vec4& b) { vec4 result; float32x4_t result_data = vmulq_f32(vld1q_f32(a.data), vld1q_f32(b.data)); vst1q_f32(result.data, result_data); return result; } __Z16TestVec4MultiplyR4vec4S0_S0_: @ BB#0: sub sp, #16 vld1.32 {d16, d17}, [r1] vld1.32 {d18, d19}, [r0] mov r0, sp vmul.f32 q8, q9, q8 vst1.32 {d16, d17}, [r0] vld1.32 {d16, d17}, [r0] vst1.32 {d16, d17}, [r2] add sp, #16 bx lr Is there something about the use of intrinsics that prevents the compiler optimizing out the redundant store on the stack? Is there any hope for this improving in the future, or anything I can do now to improve the generated code? Thanks, Simon
Hal Finkel
2014-Dec-07 23:57 UTC
[LLVMdev] NEON intrinsics preventing redundant load optimization?
----- Original Message -----> From: "Simon Taylor" <simontaylor1 at ntlworld.com> > To: llvmdev at cs.uiuc.edu > Sent: Sunday, December 7, 2014 1:15:51 PM > Subject: [LLVMdev] NEON intrinsics preventing redundant load optimization? > > Hi all, > > I’m not sure if this is the right list, so apologies if not.This is not a bad place ;)> > Doing some profiling I noticed some of my hand-tuned matrix multiply > code with NEON intrinsics was much slower through a C++ template > wrapper vs calling the intrinsics function directly. It turned out > clang/LLVM was unable to eliminate a temporary even though the case > seemed quite straightforward. Unfortunately any loads directly after > NEON stores seem to be bad news on many arm cores (the wrapped > version that stores to a temporary, then loads and stores back to > the final location was almost 4x slower than the direct version > without the temporary). > > I'm using the clang in the latest XCode + iOS SDK: Apple LLVM version > 6.0 (clang-600.0.56) (based on LLVM 3.5svn) > > Here's a simplified test case: > > struct vec4 > { > float data[4]; > }; > > vec4 operator* (vec4& a, vec4& b) > { > vec4 result; > for(int i = 0; i < 4; ++i) > result.data[i] = a.data[i] * b.data[i]; > > return result; > } > > void TestVec4Multiply(vec4& a, vec4& b, vec4& result) > { > result = a * b; > } > > With -O3 the loop gets vectorized and the code generated looks > optimal: > > __Z16TestVec4MultiplyR4vec4S0_S0_: > @ BB#0: > vld1.32 {d16, d17}, [r1] > vld1.32 {d18, d19}, [r0] > vmul.f32 q8, q9, q8 > vst1.32 {d16, d17}, [r2] > bx lr > > However if I replace the operator* with a NEON intrinsic > implementation (I know the vectorizer figured out optimal code in > this case anyway, but that wasn't true for my real situation) then > the temporary "result" seems to be kept in the generated code for > the test function, and triggers the bad penalty of a load after a > NEON store. > > vec4 operator* (vec4& a, vec4& b) > { > vec4 result; > > float32x4_t result_data = vmulq_f32(vld1q_f32(a.data), > vld1q_f32(b.data)); > vst1q_f32(result.data, result_data); > > return result; > } > > __Z16TestVec4MultiplyR4vec4S0_S0_: > @ BB#0: > sub sp, #16 > vld1.32 {d16, d17}, [r1] > vld1.32 {d18, d19}, [r0] > mov r0, sp > vmul.f32 q8, q9, q8 > vst1.32 {d16, d17}, [r0] > vld1.32 {d16, d17}, [r0] > vst1.32 {d16, d17}, [r2] > add sp, #16 > bx lr > > Is there something about the use of intrinsics that prevents the > compiler optimizing out the redundant store on the stack?I recommend filing a bug report so that someone can look at this in detail. You can do this at llvm.org/bugs -- select "libraries" as the product, and then "Scalar Optimizations" as the component (that's probably right, and we can always change it if it turns out the problem lies elsewhere). In the mean time, I recommend trying to pass by value, instead of by reference, in your multiplication operator. It is hard to say without looking at the code in detail, but it is easier for the compiler to analyze: vec4 operator* (vec4 a, vec4 b) than to analyze: vec4 operator* (vec4& a, vec4& b) -Hal> Is there > any hope for this improving in the future, or anything I can do now > to improve the generated code? > > Thanks, > > Simon > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory
Renato Golin
2014-Dec-08 00:13 UTC
[LLVMdev] NEON intrinsics preventing redundant load optimization?
On 7 December 2014 at 19:15, Simon Taylor <simontaylor1 at ntlworld.com> wrote:> Is there something about the use of intrinsics that prevents the compiler optimizing out the redundant store on the stack? Is there any hope for this improving in the future, or anything I can do now to improve the generated code?If I had to guess, I'd say the intrinsic got in the way of recognising the pattern. vmulq_f32 got correctly lowered to IR as "fmul", but vld1q_f32 is still kept as an intrinsic, so register allocators and schedulers get confused and, when lowering to assembly, you're left with garbage around it. Creating a bug for this is probably the best thing to do, since this is a common pattern that needs looking into to produce optimal code. cheers, --renato
Simon Taylor
2014-Dec-08 09:05 UTC
[LLVMdev] NEON intrinsics preventing redundant load optimization?
On 8 Dec 2014, at 00:13, Renato Golin <renato.golin at linaro.org> wrote:> On 7 December 2014 at 19:15, Simon Taylor <simontaylor1 at ntlworld.com> wrote: >> Is there something about the use of intrinsics that prevents the compiler optimizing out the redundant store on the stack? Is there any hope for this improving in the future, or anything I can do now to improve the generated code? > > If I had to guess, I'd say the intrinsic got in the way of recognising > the pattern. vmulq_f32 got correctly lowered to IR as "fmul", but > vld1q_f32 is still kept as an intrinsic, so register allocators and > schedulers get confused and, when lowering to assembly, you're left > with garbage around it. > > Creating a bug for this is probably the best thing to do, since this > is a common pattern that needs looking into to produce optimal code.Thanks for the responses. I’ve filed bug #21778 for this: http://llvm.org/bugs/show_bug.cgi?id=21778 I’ve also tried replacing the vst1.32 with setting the data[i] elements individually with vgetq_lane, which gets at least the single multiply case back to optimal code. There’s still an unneeded temporary when doing res = a * b * c though. Anyway, let’s continue this on the bug tracker :) Simon