thr3ads.net - llvm dev - [LLVMdev] NEON intrinsics preventing redundant load optimization? [Jan 2015]

Simon Taylor

2014-Dec-10 09:19 UTC

[LLVMdev] NEON intrinsics preventing redundant load optimization?

On 9 Dec 2014, at 02:20, Jim Grosbach <grosbach at apple.com> wrote:
>> On Dec 8, 2014, at 1:05 AM, Simon Taylor <simontaylor1 at ntlworld.com> wrote:
>>
>> On 8 Dec 2014, at 00:13, Renato Golin <renato.golin at linaro.org> wrote:
>>
>>> On 7 December 2014 at 19:15, Simon Taylor <simontaylor1 at ntlworld.com> wrote:
>>>> Is there something about the use of intrinsics that prevents the
>>>> compiler optimizing out the redundant store on the stack? Is there any
>>>> hope for this improving in the future, or anything I can do now to
>>>> improve the generated code?
>>>
>>> If I had to guess, I'd say the intrinsic got in the way of recognising
>>> the pattern. vmulq_f32 got correctly lowered to IR as "fmul", but
>>> vld1q_f32 is still kept as an intrinsic, so register allocators and
>>> schedulers get confused and, when lowering to assembly, you're left
>>> with garbage around it.
>
> FWIW, with top of tree clang, I get the same (good) code for both of the
> implementations of operator* in the original email. That appears to be a
> fairly recent improvement, though I haven’t bisected it down or anything.

Thanks for the note. I’m building a recent checkout now to see if I see the same
behaviour.

Looking at the -emit-llvm output from the Xcode build shows
"load <4 x float>* %1, align 4, !tbaa !3" for the C implementation, where
redundant temporaries are successfully eliminated. With the intrinsics the IR
still contains
"tail call <4 x float> @llvm.arm.neon.vld1.v4f32(i8* %1, i32 4)".
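
For readers without the original email to hand, the "C implementation" referred to here might look roughly like the sketch below. The original code is not reproduced in this thread, so the vec4 layout (a struct wrapping a float array named data, inferred from the a.data accesses in the later message) and the loop body are assumptions rather than the author's actual source.

```cpp
// Assumed layout: the thread only shows that vec4 exposes a float array `data`.
struct vec4 {
    float data[4];
};

// Plain element-wise multiply with no intrinsics. The compiler is free to
// auto-vectorize this loop into a single 4-wide load, multiply and store.
inline vec4 operator*(const vec4& a, const vec4& b)
{
    vec4 result;
    for (int i = 0; i < 4; ++i) {
        result.data[i] = a.data[i] * b.data[i];
    }
    return result;
}
```

Compiling a file like this with clang at -O3 and -S -emit-llvm is one way to check whether the loads appear as plain "load <4 x float>" instructions or as calls to the NEON intrinsic.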

Perhaps this is because NEON supports interleaved loads, which read an array of
structs and place each struct member in its own vector register. I suspect that
isn’t very common across architectures, so it wouldn’t surprise me if there was
no IR instruction for the interleaved cases (vld[234].*). It seems that vld1.*
and vst1.* do have direct IR representations though.

It’s great news if this is fixed in the current tip, but in the short term (for
app store builds using the official toolchain) are there any LLVM-specific
extensions to initialise a float32x4_t that will get lowered to the
"load <4 x float>* %1" form? Or is that more a question for the clang folks?

Simon

Simon Taylor

2014-Dec-10 11:13 UTC

[LLVMdev] NEON intrinsics preventing redundant load optimization?

On 10 Dec 2014, at 09:19, Simon Taylor <simontaylor1 at ntlworld.com> wrote:
> On 9 Dec 2014, at 02:20, Jim Grosbach <grosbach at apple.com> wrote:
>>
>> FWIW, with top of tree clang, I get the same (good) code for both of
>> the implementations of operator* in the original email. That appears to
>> be a fairly recent improvement, though I haven’t bisected it down or
>> anything.
>
> Thanks for the note. I’m building a recent checkout now to see if I see the
> same behaviour.

Things have definitely improved with the top of tree, and as Jim reported the
examples I started with now generate the expected code. The NEON load/stores
still appear in the IR rather than the load <4 x float> from the
auto-vectorized C.

There’s an additional example in the bug report
[http://llvm.org/bugs/show_bug.cgi?id=21778] that tests chained multiply (i.e.
res = a * b * c). In this case the current top of tree clang still has one
redundant temporary.

> It’s great news if this is fixed in the current tip, but in the short term
> (for app store builds using the official toolchain) are there any
> LLVM-specific extensions to initialise a float32x4_t that will get lowered
> to the "load <4 x float>* %1" form? Or is that more a question for the
> clang folks?

I’ve managed to replace the load/store intrinsics with pointer dereferences
(along with a typedef to get the alignment correct). This generates 100% the
same IR + asm as the auto-vectorized C version (both using -O3), and works with
the toolchain in the latest Xcode. Are there any concerns around doing this?

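#include <arm_neon.h>

// float32x4_t that only assumes 4-byte (float) alignment when dereferenced.
typedef float32x4_t __attribute__((aligned(4))) f32x4_align4_t;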

vec4 operator* (const vec4& a, const vec4& b)
{
  vec4 result;

  float32x4_t a_data = *((f32x4_align4_t*)a.data);
  float32x4_t b_data = *((f32x4_align4_t*)b.data);

  float32x4_t result_data = vmulq_f32(a_data, b_data);

  *((f32x4_align4_t*)result.data) = result_data;

  return result;
}
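
The same pointer-dereference idea can be tried on a non-ARM machine with the GCC/Clang vector extension standing in for <arm_neon.h>. This is only a sketch under stated assumptions: f32x4, f32x4_align4, the vec4 layout and mul3 are hypothetical names that do not come from the thread, and the may_alias attribute is added conservatively so that reading a float array through the vector type is explicitly permitted.

```cpp
// Hypothetical stand-in for float32x4_t: four packed floats (GCC/Clang only).
typedef float f32x4 __attribute__((vector_size(16)));

// Same vector type for memory access: only 4-byte alignment is assumed, and
// the type may alias the underlying float array.
typedef float f32x4_align4 __attribute__((vector_size(16), aligned(4), may_alias));

// Assumed layout: the thread only shows that vec4 exposes a float array `data`.
struct vec4 {
    float data[4];
};

static_assert(sizeof(f32x4) == sizeof(vec4), "vec4 must be four packed floats");

inline vec4 operator*(const vec4& a, const vec4& b)
{
    vec4 result;

    // Plain dereferences instead of load intrinsics.
    f32x4 a_data = *reinterpret_cast<const f32x4_align4*>(a.data);
    f32x4 b_data = *reinterpret_cast<const f32x4_align4*>(b.data);

    // Element-wise multiply; plays the role of vmulq_f32 in the NEON version.
    f32x4 result_data = a_data * b_data;

    // Plain store instead of a store intrinsic.
    *reinterpret_cast<f32x4_align4*>(result.data) = result_data;

    return result;
}

// Chained multiply, mirroring the res = a * b * c case from the bug report.
inline vec4 mul3(const vec4& a, const vec4& b, const vec4& c)
{
    return a * b * c;
}
```

Inspecting the -O3 output for mul3 is a quick way to see whether the intermediate vec4 temporary survives as a stack store and reload on a given compiler version.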

Renato Golin

2015-Jan-02 15:37 UTC

[LLVMdev] NEON intrinsics preventing redundant load optimization?

On 10 December 2014 at 11:13, Simon Taylor <simontaylor1 at ntlworld.com> wrote:
> I’ve managed to replace the load/store intrinsics with pointer dereferences
> (along with a typedef to get the alignment correct). This generates 100% the
> same IR + asm as the auto-vectorized C version (both using -O3), and works
> with the toolchain in the latest XCode. Are there any concerns around doing
> this?

My view is that you should only use intrinsics where the language has
no semantics for it. Since this is not the case, using pointers is
probably the best way, anyway.

There is still the "bug" where the load/store intrinsics don't map to
simple pointer references, but since you found a better work-around,
that has lower priority now.

I changed the bug to reflect that.

cheers,
--renato
