thr3ads.net - llvm dev - [llvm-dev] [arm, aarch64] Alignment checking in interleaved access pass [Sep 2016]

Alina Sbirlea via llvm-dev

2016-Sep-19 20:52 UTC

[llvm-dev] [arm, aarch64] Alignment checking in interleaved access pass

Hi,

As a follow up to Patch D23646 <https://reviews.llvm.org/D23646>, I'm
trying to figure out if there should be an alignment check and what the
correct approach is.

Some background:
For stores, the pass turns:
%i.vec = shuffle <8 x i32> %v0, <8 x i32> %v1,
                 <0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11>
store <12 x i32> %i.vec, <12 x i32>* %ptr
Into:
%sub.v0 = shuffle <8 x i32> %v0, <8 x i32> %v1, <0, 1, 2, 3>
%sub.v1 = shuffle <8 x i32> %v0, <8 x i32> %v1, <4, 5, 6, 7>
%sub.v2 = shuffle <8 x i32> %v0, <8 x i32> %v1, <8, 9, 10, 11>
call void llvm.aarch64.neon.st3(%sub.v0, %sub.v1, %sub.v2, %ptr)

The purpose of the above patch is to enable more general patterns such as
turning:
%i.vec = shuffle <32 x i32> %v0, <32 x i32> %v1,
                <4, 32, 16, 5, 33, 17, 6, 34, 18, 7, 35, 19>
store <12 x i32> %i.vec, <12 x i32>* %ptr
Into:
%sub.v0 = shuffle <32 x i32> %v0, <32 x i32> %v1, <4, 5, 6, 7>
%sub.v1 = shuffle <32 x i32> %v0, <32 x i32> %v1, <32, 33, 34, 35>
%sub.v2 = shuffle <32 x i32> %v0, <32 x i32> %v1, <16, 17, 18, 19>
call void llvm.aarch64.neon.st3(%sub.v0, %sub.v1, %sub.v2, %ptr)
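
To make the mask decomposition above concrete, here is a minimal Python sketch of the check it implies: take every third mask element to form one sub-mask per st3 operand, and accept the pattern only if each sub-mask is a run of consecutive indices. The function name split_interleaved_mask is hypothetical and this is not the pass's actual code, only an illustration of the two examples.

```python
def split_interleaved_mask(mask, factor):
    """Split an interleaved shuffle mask into `factor` sub-masks.

    Returns the list of sub-masks when every sub-mask is a run of
    consecutive indices, otherwise None. Hypothetical helper, used
    only to illustrate the pattern; not taken from LLVM.
    """
    if factor < 2 or len(mask) % factor != 0:
        return None
    # Lane k of the interleaved store takes mask elements k, k+factor, ...
    sub_masks = [mask[lane::factor] for lane in range(factor)]
    for sub in sub_masks:
        # Each sub-vector must select consecutive source elements.
        if any(b - a != 1 for a, b in zip(sub, sub[1:])):
            return None
    return sub_masks


# Original pattern: sub-vectors start at 0, 4, 8.
print(split_interleaved_mask([0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11], 3))
# More general pattern: sub-vectors start at 4, 32, 16.
print(split_interleaved_mask([4, 32, 16, 5, 33, 17, 6, 34, 18, 7, 35, 19], 3))
```

In the general pattern the sub-vectors start at arbitrary positions (4, 32, 16) rather than at multiples of the sub-vector length, which is what prompts the alignment question below.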

The question I'm trying to get answered is whether there should have been an
alignment check in the original pass and, similarly, whether there should be
an expanded one for the more general pattern.
In the example above, I was looking to check if the data at positions 4,
16, 32 is aligned, but I cannot get a clear picture on the impact on
performance (hence the side question below).
Also, some preliminary alignment checks I added break some ARM tests (and
not their AArch64 counterparts). The cause is getting "not fast" from
allowsMisalignedMemoryAccesses, from checking hasV7Ops.
I'd appreciate getting some guidance on how to best address and analyze
this.

Side question for Tim and other ARM folks, could I get a recommendation on
reading material for performance tuning for the different ARM archs?

Thank you,
Alina

Alina Sbirlea via llvm-dev

2016-Oct-06 18:58 UTC


All,

Gentle reminder in the hopes of getting some answers to the questions in
the original email.

Thank you,
Alina



Renato Golin via llvm-dev

2016-Oct-08 13:26 UTC


On 19 September 2016 at 21:52, Alina Sbirlea via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> The question I'm trying to get answered if there should have been an
> alignment check for the original pass, and, similarly, if there should be an
> expanded one for the more general pattern.

Hi Alina,

IIRC, the initial implementation was very simple and straightforward
to make use of VLDn instructions on ARM/AArch64 NEON.

Your patterns allow simple vector instructions in the trivial case,
but not in the cases where VLDn would make a difference.

The examples were:

for (i..N)
  out[i] = in[i] * Factor; // R
  out[i+1] = in[i+1] * Factor; // G
  out[i+2] = in[i+2] * Factor; // B

This pattern is easily vectorised on most platforms, since loads, muls
and stores are the exact same operation, which can be combined.

for (i..N)
  out[i] = in[i] * FactorR; // R
  out[i+1] = in[i+1] * FactorG; // G
  out[i+2] = in[i+2] * FactorB; // B

This still can be vectorised easily, since the Factor vector can be
easily constructed.

for (i..N)
  out[i] = in[i] + FactorR; // R
  out[i+1] = in[i+1] - FactorG; // G
  out[i+2] = in[i+2] * FactorB; // B

Now it gets complicated, because the operations are not the same. In
this case, VLDn helps, because you shuffle [0, 1, 2, 3, 4, 5] -> VADD
[0, 3] + VSUB [1, 4] + VMUL [2, 5].
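
As an illustration of that shuffle, here is a small Python sketch of what the de-interleaving buys you: splitting [0, 1, 2, 3, 4, 5] by a stride of 3 gives one lane per channel, each of which can then take a different operation before being re-interleaved for the store. The helper names are hypothetical and plain lists stand in for vector registers; this models the data movement only, not the actual NEON instructions.

```python
def deinterleave(values, factor):
    """Model of VLDn: lane k gets elements k, k+factor, k+2*factor, ..."""
    return [values[lane::factor] for lane in range(factor)]


def interleave(lanes):
    """Model of VSTn: weave equally sized lanes back into one sequence."""
    return [x for group in zip(*lanes) for x in group]


def process_rgb(pixels, factor_r, factor_g, factor_b):
    """Apply a different operation to each channel of RGB-interleaved data."""
    r, g, b = deinterleave(pixels, 3)
    r = [x + factor_r for x in r]  # VADD on the R lane
    g = [x - factor_g for x in g]  # VSUB on the G lane
    b = [x * factor_b for x in b]  # VMUL on the B lane
    return interleave([r, g, b])


print(deinterleave([0, 1, 2, 3, 4, 5], 3))  # [[0, 3], [1, 4], [2, 5]]
print(process_rgb([0, 1, 2, 3, 4, 5], 10, 1, 2))
```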

Your case seems to be more like:

for (i..N)
  out[i] = in[i] * FactorR; // R
  out[i+4] = in[i+4] * FactorG; // G
  out[i+8] = in[i+8] * FactorB; // B

In which VLDn won't help, but re-shuffling the vectors like the second
case above will.

Even this case:

for (i..N)
  out[i] = in[i] + FactorR; // R
  out[i+4] = in[i+4] - FactorG; // G
  out[i+8] = in[i+8] * FactorB; // B

can work, if the ranges are not overlapping. So, [0, 4, 8] would work
on a 4-way vector, but [0, 2, 4] would only work on a 2-way vector.
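
A rough Python sketch of that overlap rule, assuming each access covers `width` consecutive elements starting at its offset: the widest usable vector is the smallest gap between neighbouring offsets. Both function names are hypothetical, and this is only the arithmetic, not how the vectoriser's cost model actually decides.

```python
def ranges_overlap(offsets, width):
    """True if any two accesses of `width` elements would overlap."""
    ordered = sorted(offsets)
    return any(a + width > b for a, b in zip(ordered, ordered[1:]))


def max_vector_width(offsets):
    """Largest width for which the accesses stay disjoint."""
    if len(offsets) < 2:
        raise ValueError("need at least two offsets to compare")
    ordered = sorted(offsets)
    return min(b - a for a, b in zip(ordered, ordered[1:]))


print(max_vector_width([0, 4, 8]))   # 4
print(max_vector_width([0, 2, 4]))   # 2
print(ranges_overlap([0, 2, 4], 4))  # True
```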

> In the example above, I was looking to check if the data at positions 4, 16,
> 32 is aligned, but I cannot get a clear picture on the impact on performance

On modern ARM and AArch64, misaligned loads are not a problem. This is
true at least from A15 onwards, possibly A9 (James may know better).

If your ranges overlap, you may be forced to reduce the vectorisation
factor, thus reducing performance, but the vectoriser should be able
to pick that up from the cost analysis pass (2-way vs 4-way).

> Also, some preliminary alignment checks I added break some ARM tests (and
> not their AArch64 counterparts). The cause is getting "not fast" from
> allowsMisalignedMemoryAccesses, from checking hasV7Ops.

What do you mean by "break"? Bad codegen? Slower code?

> Side question for Tim and other ARM folks, could I get a recommendation on
> reading material for performance tuning for the different ARM archs?

ARM has a list of manuals on each core, including optimisation guides:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.set.cortexa/index.html

cheers,
--renato

Alina Sbirlea via llvm-dev

2016-Oct-10 18:39 UTC


Hi Renato,

Thank you for the answers!

First, let me clarify a couple of things and give some context.

The patch is looking at VSTn rather than VLDn (the "right" patterns seem to
be somewhat harder to get for stores; the pass is already doing a good job
for loads).

The examples you gave come mostly from loop vectorization, which, as I
understand it, was the reason for adding the interleaved access pass. I'm
looking at a different usecase. The code in question is generated by a DSL
(Halide), and it's directly generating LLVM bitcode. The computations do
come originally from loops, but they are pre-processed by Halide, followed
by vector code generation as bitcode. The patterns I'm targeting are not
generated, AFAIK, by the loop vectorization or SLP passes.

Now, for ARM archs Halide is currently generating explicit VSTn intrinsics,
with some of the patterns I described, and I found no reason why Halide
shouldn't generate a single shuffle, followed by a generic vector store and
rely on the interleaved access pass to generate the right intrinsic.
Performance-wise, it is worth using the VSTns in the scenarios they
encounter; it's mostly a question of where they get generated.

The alignment question is orthogonal to the patch up for review. There was
no alignment check before, and I didn't have enough background of the
architectures to conclude if this was needed or not. I added a simple check
to test if this would make any difference and some of the ARM tests started
failing. The break just meant that the interleaved access pass was not
replacing the "shuffle+store" with the "vstn", so the checks were failing.
If the alignment is not an issue, it simplifies things.

Also, thank you for the reference!

Best,
Alina

