James Molloy
2015-Jan-05 12:13 UTC
[LLVMdev] NEON intrinsics preventing redundant load optimization?
Hi all,

Sorry for arriving late to the party. First, some context: vld1 is not the same as a pointer dereference. The alignment requirements are different (which I saw you hacked around in your testcase using __attribute__((aligned(4)))), and in big endian environments they do totally different things (VLD1 does element-wise byteswapping, while a pointer dereference byteswaps the entire 128-bit number).

While a pointer dereference does work just as well as VLD1 (and better, given this defect), it is explicitly *not supported*. The ACLE mandates that there are only certain ways to legitimately "create" a vector object - vcreate, vcombine, vreinterpret and the vld* loads. NEON intrinsic types don't exist in memory (memory is modelled as a sequence of scalars, as in the C model). For this reason, Renato, I don't think we should advise people to work around the API, as who knows what problems that will cause later.

The reason above is why we map a vld1q_f32() into a NEON intrinsic instead of a generic IR load.

Looking at your testcase, even with tip-of-trunk clang we generate redundant loads and stores:

    vld1.32  {d16, d17}, [r1]
    vld1.32  {d18, d19}, [r0]
    mov      r0, sp
    vmul.f32 q8, q9, q8
    vst1.32  {d16, d17}, [r0]
    vld1.64  {d16, d17}, [r0:128]
    vst1.32  {d16, d17}, [r2]

Whereas for AArch64, we don't (and neither do we for the chained multiply case):

    ldr  q0, [x0]
    ldr  q1, [x1]
    fmul v0.4s, v0.4s, v1.4s
    str  q0, [x2]
    ret

So this is handled on AArch64, and I think there's something wrong/missing in the optimizer for AArch32. This is a legitimate bug and should be fixed (even if a workaround is required in the interim!)

Cheers,

James

On Mon Jan 05 2015 at 10:46:10 AM Renato Golin <renato.golin at linaro.org> wrote:

> On 5 January 2015 at 10:14, Simon Taylor <simontaylor1 at ntlworld.com> wrote:
>> I don't recall seeing anything about pointer dereferencing, but it may
>> have the same issues. I'm a bit hazy on endianness issues with NEON anyway
>> (in terms of element numbering, casts between types, etc) but it seems like
>> all the smartphone platform ABIs are defined to be little-endian so I
>> haven't spent too much time worrying about it.
>
> Tim is right, this can be a potential danger, but not more than other
> endian or type size issues. If you're writing portable code, I assume
> you'll already be mindful of those issues.
>
> This is why I said it's still a problem, but not a critical one. Maybe
> adding a comment to your code explaining the issue will help you in
> the future to move it back to NEON loads/stores once this is fixed.
>
> cheers,
> --renato
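For concreteness, a minimal sketch of the kind of testcase under discussion (the function and parameter names are illustrative, not necessarily the original code): vectors are created with vld1q_f32, multiplied, and stored with vst1q_f32, and on AArch32 the result was observed to take a redundant round trip through the stack.

    #include <arm_neon.h>

    /* Hypothetical reconstruction of the testcase: create the vectors the
     * ACLE-sanctioned way (vld1q_f32), multiply, store with vst1q_f32. */
    void mul4(const float *a, const float *b, float *out)
    {
        float32x4_t va = vld1q_f32(a);
        float32x4_t vb = vld1q_f32(b);
        vst1q_f32(out, vmulq_f32(va, vb));
    }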
Renato Golin
2015-Jan-05 13:08 UTC
[LLVMdev] NEON intrinsics preventing redundant load optimization?
On 5 January 2015 at 12:13, James Molloy <james at jamesmolloy.co.uk> wrote:
> For this reason, Renato, I don't think we should advise people to work around
> the API, as who knows what problems that will cause later.

I stand corrected (twice). But we changed the subject a bit, so things got different midway through.

There were only two versions of the code: full C, with pointers, sometimes vectorized; and full NEON, with extra loads. The mix between the two came later, as a quick fix. My comment on that, I agree, was misplaced, and both you and Tim are correct in asking for it not to be used as a final solution. But as an interim solution, with due care around alignment, endianness and all, it is "fine".

> So this is handled, and I think there's something wrong/missing in the
> optimizer for AArch32. This is a legitimate bug and should be fixed (even if
> a workaround is required in the interim!)

It is legitimate, and that's why I created the bug in the first place. The only difference is that it's now lower priority, since it:

1. is a performance issue, not a conformance one
2. has a workaround that is not too ugly

I now agree that the second point is not as strong as I judged (while drinking Savoy wine and eating Gruyere cheese). Feel free to revamp the priority.

cheers,
--renato
Simon Taylor
2015-Jan-13 11:57 UTC
[LLVMdev] NEON intrinsics preventing redundant load optimization?
> On 5 Jan 2015, at 13:08, Renato Golin <renato.golin at linaro.org> wrote:
>
> On 5 January 2015 at 12:13, James Molloy <james at jamesmolloy.co.uk> wrote:
>> For this reason, Renato, I don't think we should advise people to work around
>> the API, as who knows what problems that will cause later.
>
> I stand corrected (twice). But we changed the subject a bit, so things
> got different midway through.
>
> There were only two versions of the code: full C, with pointers, sometimes
> vectorized; and full NEON, with extra loads. The mix between the two came
> later, as a quick fix. My comment on that, I agree, was misplaced, and
> both you and Tim are correct in asking for it not to be used as a
> final solution. But as an interim solution, with due care around
> alignment, endianness and all, it is "fine".

After a bit more testing applying the pointer-dereferencing workaround in my real code (which has some layers of templates and inline functions), I've decided against using it in practice. GCC produced incorrect code unless -fno-strict-aliasing was specified. It's probably entirely entitled to do that, so the workaround just seems too flaky to recommend in any case.

Simon
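For reference, a sketch of the pointer-dereference workaround in question (names are illustrative). The casted dereference accesses the floats through an unrelated vector type, so it falls foul of C's strict aliasing rules, which is consistent with GCC only behaving once -fno-strict-aliasing was passed:

    #include <arm_neon.h>

    /* The interim workaround discussed above: create the vectors by
     * dereferencing casted pointers instead of using vld1q_f32. This
     * sidesteps the redundant loads/stores on AArch32, but the access
     * through an unrelated type is undefined behaviour under strict
     * aliasing (hence -fno-strict-aliasing for GCC), and it is not an
     * ACLE-sanctioned way to create a vector. */
    void mul4_workaround(const float *a, const float *b, float *out)
    {
        float32x4_t va = *(const float32x4_t *)a;
        float32x4_t vb = *(const float32x4_t *)b;
        *(float32x4_t *)out = vmulq_f32(va, vb);
    }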
Tim Northover
2015-Jan-13 17:22 UTC
[LLVMdev] NEON intrinsics preventing redundant load optimization?
> While a pointer dereference does work just as well as VLD1 (and better, given
> this defect), it is explicitly *not supported*.

I'm actually less convinced of this now. I'd argue that you shouldn't *create* vectors like that (e.g. by casting a pointer to another type and loading through it), but the ACLE can't really dictate what you do with those vectors after they've been created, and passing around pointers to them (or storing them in an array, as in int8x8xN_t and so on) is entirely valid.

I do hope we haven't broken that assumption with our big-endian model...

Cheers.

Tim.
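A small sketch of the distinction being drawn (names illustrative): taking the address of a vector that already exists, or keeping vectors in aggregates such as the array-of-vector types, uses the vector type as "a type like any other"; the contentious case is only conjuring a vector out of a pointer to scalars.

    #include <arm_neon.h>

    /* Passing around a pointer to an existing vector: fine. */
    void scale_in_place(float32x4_t *v)
    {
        *v = vmulq_n_f32(*v, 2.0f);
    }

    /* Keeping vectors in an array-of-vectors aggregate: also fine. */
    float32x4_t sum_pair(float32x4x2_t pair)
    {
        return vaddq_f32(pair.val[0], pair.val[1]);
    }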
James Molloy
2015-Jan-13 17:28 UTC
[LLVMdev] NEON intrinsics preventing redundant load optimization?
Hi Tim,

> I do hope we haven't broken that assumption with our big-endian model...

I should rephrase. I'm concerned solely with creating vectors by dereferencing casted pointers to scalars (and vice versa):

    uint32_t *g;
    uint32x4_t f = *(uint32x4_t *)g;

What you're referring to works perfectly fine in big endian mode and should be supported - they're a type like any other. The important difference is the implicit reinterpret_cast, which can make things go wrong fast in big endian mode (I haven't quite wrapped my head around whether there is an easy scenario in which things could break...).

Cheers,

James

On Tue Jan 13 2015 at 5:23:21 PM Tim Northover <t.p.northover at gmail.com> wrote:

>> While a pointer dereference does work just as well as VLD1 (and better, given
>> this defect), it is explicitly *not supported*.
>
> I'm actually less convinced of this now. I'd argue that you shouldn't
> *create* vectors like that (e.g. by casting a pointer to another type
> and loading through it), but the ACLE can't really dictate what you do
> with those vectors after they've been created, and passing around
> pointers to them (or storing them in an array, as in int8x8xN_t and so
> on) is entirely valid. I do hope we haven't broken that assumption with
> our big-endian model...
>
> Cheers.
>
> Tim.
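To illustrate why the implicit reinterpret is the worrying part (a sketch of the contrast, not a statement of what every big-endian configuration does): vld1q_u32 is defined element-wise, so lane 0 always comes from g[0], whereas the casted dereference loads the value as one 128-bit object, and that is exactly where the big-endian treatments of VLD1 and plain loads diverge.

    #include <arm_neon.h>
    #include <stdint.h>

    /* ACLE-sanctioned creation: element-wise load, lane order is
     * well defined in either endianness. */
    uint32x4_t make_supported(const uint32_t *g)
    {
        return vld1q_u32(g);
    }

    /* The implicit reinterpret under discussion: the lane-to-address
     * mapping follows the in-memory layout of the whole 128-bit value,
     * which is endian-sensitive rather than element-wise. */
    uint32x4_t make_unsupported(const uint32_t *g)
    {
        return *(const uint32x4_t *)g;
    }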