thr3ads.net - llvm dev - [LLVMdev] NEON intrinsics preventing redundant load optimization? [Dec 2014]

If this information is useful, please help other people find it:
Share via:

Simon Taylor

2014-Dec-07 19:15 UTC

[LLVMdev] NEON intrinsics preventing redundant load optimization?

Hi all,

I’m not sure if this is the right list, so apologies if not.

Doing some profiling I noticed some of my hand-tuned matrix multiply code with
NEON intrinsics was much slower through a C++ template wrapper vs calling the
intrinsics function directly. It turned out clang/LLVM was unable to eliminate a
temporary even though the case seemed quite straightforward. Unfortunately any
loads directly after NEON stores seem to be bad news on many arm cores (the
wrapped version that stores to a temporary, then loads and stores back to the
final location was almost 4x slower than the direct version without the
temporary).

I'm using the clang in the latest XCode + iOS SDK: Apple LLVM version 6.0
(clang-600.0.56) (based on LLVM 3.5svn)

Here's a simplified test case:

struct vec4
{
	float data[4];
};

vec4 operator* (vec4& a, vec4& b)
{
	vec4 result;
	for(int i = 0; i < 4; ++i)
		result.data[i] = a.data[i] * b.data[i];

	return result;
}

void TestVec4Multiply(vec4& a, vec4& b, vec4& result)
{
	result = a * b;
}

With -O3 the loop gets vectorized and the code generated looks optimal:

__Z16TestVec4MultiplyR4vec4S0_S0_:
@ BB#0:
	vld1.32	{d16, d17}, [r1]
	vld1.32	{d18, d19}, [r0]
	vmul.f32	q8, q9, q8
	vst1.32	{d16, d17}, [r2]
	bx	lr

However if I replace the operator* with a NEON intrinsic implementation (I know
the vectorizer figured out optimal code in this case anyway, but that wasn't
true for my real situation) then the temporary "result" seems to be
kept in the generated code for the test function, and triggers the bad penalty
of a load after a NEON store.

vec4 operator* (vec4& a, vec4& b)
{
	vec4 result;
	
	float32x4_t result_data = vmulq_f32(vld1q_f32(a.data), vld1q_f32(b.data));
	vst1q_f32(result.data, result_data);

	return result;
}

__Z16TestVec4MultiplyR4vec4S0_S0_:
@ BB#0:
	sub	sp, #16
	vld1.32	{d16, d17}, [r1]
	vld1.32	{d18, d19}, [r0]
	mov	r0, sp
	vmul.f32	q8, q9, q8
	vst1.32	{d16, d17}, [r0]
	vld1.32	{d16, d17}, [r0]
	vst1.32	{d16, d17}, [r2]
	add	sp, #16
	bx	lr

Is there something about the use of intrinsics that prevents the compiler
optimizing out the redundant store on the stack? Is there any hope for this
improving in the future, or anything I can do now to improve the generated code?

Thanks,

Simon

Hal Finkel

2014-Dec-07 23:57 UTC

head link

[LLVMdev] NEON intrinsics preventing redundant load optimization?

----- Original Message -----> From: "Simon Taylor" <simontaylor1 at ntlworld.com>
> To: llvmdev at cs.uiuc.edu
> Sent: Sunday, December 7, 2014 1:15:51 PM
> Subject: [LLVMdev] NEON intrinsics preventing redundant load optimization?
> 
> Hi all,
> 
> I’m not sure if this is the right list, so apologies if not.
This is not a bad place ;)
> 
> Doing some profiling I noticed some of my hand-tuned matrix multiply
> code with NEON intrinsics was much slower through a C++ template
> wrapper vs calling the intrinsics function directly. It turned out
> clang/LLVM was unable to eliminate a temporary even though the case
> seemed quite straightforward. Unfortunately any loads directly after
> NEON stores seem to be bad news on many arm cores (the wrapped
> version that stores to a temporary, then loads and stores back to
> the final location was almost 4x slower than the direct version
> without the temporary).
> 
> I'm using the clang in the latest XCode + iOS SDK: Apple LLVM version
> 6.0 (clang-600.0.56) (based on LLVM 3.5svn)
> 
> Here's a simplified test case:
> 
> struct vec4
> {
> 	float data[4];
> };
> 
> vec4 operator* (vec4& a, vec4& b)
> {
> 	vec4 result;
> 	for(int i = 0; i < 4; ++i)
> 		result.data[i] = a.data[i] * b.data[i];
> 
> 	return result;
> }
> 
> void TestVec4Multiply(vec4& a, vec4& b, vec4& result)
> {
> 	result = a * b;
> }
> 
> With -O3 the loop gets vectorized and the code generated looks
> optimal:
> 
> __Z16TestVec4MultiplyR4vec4S0_S0_:
> @ BB#0:
> 	vld1.32	{d16, d17}, [r1]
> 	vld1.32	{d18, d19}, [r0]
> 	vmul.f32	q8, q9, q8
> 	vst1.32	{d16, d17}, [r2]
> 	bx	lr
> 
> However if I replace the operator* with a NEON intrinsic
> implementation (I know the vectorizer figured out optimal code in
> this case anyway, but that wasn't true for my real situation) then
> the temporary "result" seems to be kept in the generated code for
> the test function, and triggers the bad penalty of a load after a
> NEON store.
> 
> vec4 operator* (vec4& a, vec4& b)
> {
> 	vec4 result;
> 	
> 	float32x4_t result_data = vmulq_f32(vld1q_f32(a.data),
> 	vld1q_f32(b.data));
> 	vst1q_f32(result.data, result_data);
> 
> 	return result;
> }
> 
> __Z16TestVec4MultiplyR4vec4S0_S0_:
> @ BB#0:
> 	sub	sp, #16
> 	vld1.32	{d16, d17}, [r1]
> 	vld1.32	{d18, d19}, [r0]
> 	mov	r0, sp
> 	vmul.f32	q8, q9, q8
> 	vst1.32	{d16, d17}, [r0]
> 	vld1.32	{d16, d17}, [r0]
> 	vst1.32	{d16, d17}, [r2]
> 	add	sp, #16
> 	bx	lr
> 
> Is there something about the use of intrinsics that prevents the
> compiler optimizing out the redundant store on the stack?
I recommend filing a bug report so that someone can look at this in detail. You
can do this at llvm.org/bugs -- select "libraries" as the product, and
then "Scalar Optimizations" as the component (that's probably
right, and we can always change it if it turns out the problem lies elsewhere).

In the mean time, I recommend trying to pass by value, instead of by reference,
in your multiplication operator. It is hard to say without looking at the code
in detail, but it is easier for the compiler to analyze:
  vec4 operator* (vec4 a, vec4 b)
than to analyze:
  vec4 operator* (vec4& a, vec4& b)

 -Hal
> Is there
> any hope for this improving in the future, or anything I can do now
> to improve the generated code?
> 
> Thanks,
> 
> Simon
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Renato Golin

2014-Dec-08 00:13 UTC

head link

[LLVMdev] NEON intrinsics preventing redundant load optimization?

On 7 December 2014 at 19:15, Simon Taylor <simontaylor1 at ntlworld.com>
wrote:> Is there something about the use of intrinsics that prevents the compiler
optimizing out the redundant store on the stack? Is there any hope for this
improving in the future, or anything I can do now to improve the generated code?
If I had to guess, I'd say the intrinsic got in the way of recognising
the pattern. vmulq_f32 got correctly lowered to IR as "fmul", but
vld1q_f32 is still kept as an intrinsic, so register allocators and
schedulers get confused and, when lowering to assembly, you're left
with garbage around it.

Creating a bug for this is probably the best thing to do, since this
is a common pattern that needs looking into to produce optimal code.

cheers,
--renato

Simon Taylor

2014-Dec-08 09:05 UTC

head link

[LLVMdev] NEON intrinsics preventing redundant load optimization?

On 8 Dec 2014, at 00:13, Renato Golin <renato.golin at linaro.org> wrote:
> On 7 December 2014 at 19:15, Simon Taylor <simontaylor1 at
ntlworld.com> wrote:
>> Is there something about the use of intrinsics that prevents the
compiler optimizing out the redundant store on the stack? Is there any hope for
this improving in the future, or anything I can do now to improve the generated
code?
> 
> If I had to guess, I'd say the intrinsic got in the way of recognising
> the pattern. vmulq_f32 got correctly lowered to IR as "fmul", but
> vld1q_f32 is still kept as an intrinsic, so register allocators and
> schedulers get confused and, when lowering to assembly, you're left
> with garbage around it.
> 
> Creating a bug for this is probably the best thing to do, since this
> is a common pattern that needs looking into to produce optimal code.
Thanks for the responses. I’ve filed bug #21778 for this:
http://llvm.org/bugs/show_bug.cgi?id=21778

I’ve also tried replacing the vst1.32 with setting the data[i] elements
individually with vgetq_lane, which gets at least the single multiply case back
to optimal code. There’s still an unneeded temporary when doing res = a * b * c
though. Anyway, let’s continue this on the bug tracker :)

Simon

llvm dev - Dec 2014 - [LLVMdev] NEON intrinsics preventing redundant load optimization?

[LLVMdev] NEON intrinsics preventing redundant load optimization?

[LLVMdev] NEON intrinsics preventing redundant load optimization?

[LLVMdev] NEON intrinsics preventing redundant load optimization?

[LLVMdev] NEON intrinsics preventing redundant load optimization?