thr3ads.net - llvm dev - [LLVMdev] NEON intrinsics preventing redundant load optimization? [Jan 2015]

Simon Taylor

2015-Jan-05 10:14 UTC

[LLVMdev] NEON intrinsics preventing redundant load optimization?

On 4 Jan 2015, at 21:06, Tim Northover <t.p.northover at gmail.com> wrote:
>>> I’ve managed to replace the load/store intrinsics with pointer
>>> dereferences (along with a typedef to get the alignment correct). This
>>> generates 100% the same IR + asm as the auto-vectorized C version (both
>>> using -O3), and works with the toolchain in the latest XCode. Are there
>>> any concerns around doing this?
>> 
>> My view is that you should only use intrinsics where the language has
>> no semantics for it. Since this is not the case, using pointers is
>> probably the best way, anyway.
> 
> I think dereferencing pointers is explicitly discouraged in the
> documentation for portability reasons. It may well have issues on
> wrong-endian targets.

The ARM ACLE docs recommend against the GCC extension that allows an initializer
list because of potential endianness issues:

float32x4_t values = {1, 2, 3, 4};

I don’t recall seeing anything about pointer dereferencing, but it may have the
same issues. I’m a bit hazy on endianness issues with NEON anyway (in terms of
element numbering, casts between types, etc) but it seems like all the
smartphone platform ABIs are defined to be little-endian so I haven’t spent too
much time worrying about it.
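
One way to sidestep the initializer-list question is to keep the constants in a plain float array and load them with vld1q_f32, which fills lane i from array element i. The following is only a minimal sketch: the helper name scaled_1234 is hypothetical (it is not from this thread), and the scalar fallback is there purely so the snippet also builds on hosts without NEON.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: writes {1, 2, 3, 4} * k into out[0..3]. */
void scaled_1234(float k, float out[4])
{
    /* The constants live in a scalar array, not in a vector initializer list. */
    static const float init[4] = {1.0f, 2.0f, 3.0f, 4.0f};

    assert(out != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t values = vld1q_f32(init);     /* lane i <- init[i] */
    vst1q_f32(out, vmulq_n_f32(values, k));   /* out[i] <- lane i * k */
#else
    for (size_t i = 0; i < 4; ++i)            /* scalar fallback, non-NEON hosts */
        out[i] = init[i] * k;
#endif
}
```

Calling scaled_1234(2.0f, out) leaves 2, 4, 6, 8 in out[0] through out[3] on either path, because both the load and the store go through the element-ordered intrinsics.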

Simon

Renato Golin

2015-Jan-05 10:24 UTC

On 5 January 2015 at 10:14, Simon Taylor <simontaylor1 at ntlworld.com> wrote:
> I don’t recall seeing anything about pointer dereferencing, but it may have
> the same issues. I’m a bit hazy on endianness issues with NEON anyway (in
> terms of element numbering, casts between types, etc) but it seems like all
> the smartphone platform ABIs are defined to be little-endian so I haven’t
> spent too much time worrying about it.

Tim is right, this is a potential danger, but no more so than other
endianness or type-size issues. If you're writing portable code, I assume
you'll already be mindful of those issues.

This is why I said it's still a problem, but not a critical one. Maybe
adding a comment to your code explaining the issue will help you in
the future to move it back to NEON loads/stores once this is fixed.

cheers,
--renato

James Molloy

2015-Jan-05 12:13 UTC

Hi all,

Sorry for arriving late to the party. First, some context:

vld1 is not the same as a pointer dereference. The alignment requirements
are different (which I saw you hacked around in your testcase using
__attribute__((aligned(4)))), and in big-endian environments they do totally
different things (VLD1 byteswaps element-wise, whereas a pointer dereference
byteswaps the entire 128-bit number).
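
The big-endian difference described above can be modelled in portable C. This is only an illustrative sketch, not actual NEON behaviour: the two function names are hypothetical, and it assumes 32-bit lanes in a 16-byte block. One function reverses the bytes inside each lane, the other reverses all 16 bytes at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of an element-wise swap: reverse the bytes inside each 32-bit lane.
   The order of the four lanes is unchanged. */
void swap_per_lane32(const uint8_t in[16], uint8_t out[16])
{
    assert(in != out);  /* the sketch does not support in-place use */
    for (size_t lane = 0; lane < 4; ++lane)
        for (size_t b = 0; b < 4; ++b)
            out[lane * 4 + b] = in[lane * 4 + (3 - b)];
}

/* Model of a whole-register swap: reverse all 16 bytes as one 128-bit number.
   This also reverses the order of the lanes. */
void swap_whole128(const uint8_t in[16], uint8_t out[16])
{
    assert(in != out);
    for (size_t i = 0; i < 16; ++i)
        out[i] = in[15 - i];
}
```

With input bytes 0 through 15, the per-lane model starts 3 2 1 0 7 6 5 4 while the whole-register model starts 15 14 13 12 11 10 9 8: the bytes within each lane agree, but what one model puts in lane 0 the other puts in lane 3.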

While pointer dereference does work just as well as VLD1 (and better, given
this defect), it is explicitly *not supported*. The ACLE mandates that
there are only certain ways to legitimately "create" a vector object:
vcreate, vcombine, vreinterpret and the vld1-style loads. NEON intrinsic
types don't exist in memory (memory is modelled as a sequence of scalars, as
in the C model). For this reason, Renato, I don't think we should advise
people to work around the API, as who knows what problems that will cause
later.
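
For reference, the two forms under discussion look roughly like the sketch below. The original testcase is not reproduced in this thread excerpt, so the function names and the typedef name are hypothetical; the reduced-alignment typedef relies on a GCC/Clang extension, and the scalar fallback exists only so the snippet builds on hosts without NEON.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
/* Assumed shape of the alignment typedef mentioned in the thread:
   a float32x4_t whose alignment is lowered to that of a float. */
typedef float32x4_t __attribute__((aligned(4))) float32x4_a4_t;
#endif

/* Supported form: vectors enter and leave memory via load/store intrinsics. */
void mul4_intrinsics(const float *a, const float *b, float *dst)
{
    assert(a && b && dst);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    vst1q_f32(dst, vmulq_f32(vld1q_f32(a), vld1q_f32(b)));
#else
    for (size_t i = 0; i < 4; ++i)
        dst[i] = a[i] * b[i];
#endif
}

/* Workaround form: plain pointer dereference. Not supported by the ACLE,
   and not equivalent to the intrinsics on big-endian targets. */
void mul4_deref(const float *a, const float *b, float *dst)
{
    assert(a && b && dst);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    *(float32x4_a4_t *)dst =
        vmulq_f32(*(const float32x4_a4_t *)a, *(const float32x4_a4_t *)b);
#else
    for (size_t i = 0; i < 4; ++i)
        dst[i] = a[i] * b[i];
#endif
}
```

On a little-endian target both functions should produce the same four products; the difference is in what the compiler sees (a target intrinsic versus an ordinary load/store) and in which form the ACLE actually permits.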

This is why we map a vld1q_f32() onto a NEON intrinsic instead of a generic
IR load. Looking at your testcase, even with tip-of-trunk clang we generate
redundant loads and stores:

vld1.32 {d16, d17}, [r1]
vld1.32 {d18, d19}, [r0]
mov r0, sp
vmul.f32 q8, q9, q8
vst1.32 {d16, d17}, [r0]
vld1.64 {d16, d17}, [r0:128]
vst1.32 {d16, d17}, [r2]

Whereas for AArch64, we don't (and neither do we for the chained multiply
case):

ldr q0, [x0]
ldr q1, [x1]
fmul v0.4s, v0.4s, v1.4s
str q0, [x2]
ret

So this is handled, and I think there's something wrong/missing in the
optimizer for AArch32. This is a legitimate bug and should be fixed (even
if a workaround is required in the interim!)

Cheers,

James
