Alina Sbirlea via llvm-dev
2016-Oct-10 21:16 UTC
[llvm-dev] [arm, aarch64] Alignment checking in interleaved access pass
On Mon, Oct 10, 2016 at 1:14 PM, Renato Golin <renato.golin at linaro.org> wrote:

> On 10 October 2016 at 19:39, Alina Sbirlea <alina.sbirlea at gmail.com> wrote:
> > Now, for ARM archs Halide is currently generating explicit VSTn intrinsics,
> > with some of the patterns I described, and I found no reason why Halide
> > shouldn't generate a single shuffle, followed by a generic vector store,
> > and rely on the interleaved access pass to generate the right intrinsic.
>
> IIRC, the shuffle that unscrambles the interleaved pattern <0, 4, 8, 1 ...>
> -> <0, 1, 2 ...> is how the strided algorithm works, so that the back-end
> can match the patterns and emit a VSTn / STn because the access is
> "sequential" in an n-hop way.
>
> If the shuffle doesn't present itself in that way, the back-end doesn't
> match and you end up with a long list of VMOVs.

That's what the patch addresses. It generalizes the algorithm to accept a
more generic interleaved pattern and emit a VSTn/STn. The access no longer
needs to be sequential in an n-hop way, only a pattern amenable to the use
of the store intrinsic:

  [x, y, ... z, x+1, y+1, ... z+1, x+2, y+2, ... z+2, ...]

> Also, the vectorisers today make sure that the sequence is continuous
> and contiguous, which doesn't seem to be a hard requirement for Halide.
> I don't think there's a better reason than "we haven't thought about
> the general case yet".

That's right, perhaps because Halide is not a regular vectorizer, which
opens up new cases. To give a bit more insight, here's a simple example
where the data is still continuous, [0 .. 32), but needs to be split to
use multiple VSTns/STns.
This is what Halide generates for aarch64:

  %uglygep242243 = bitcast i8* %uglygep242 to <16 x i32>*
  %114 = shufflevector <16 x i32> %112, <16 x i32> %113, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %115 = shufflevector <16 x i32> %112, <16 x i32> %113, <4 x i32> <i32 8, i32 9, i32 10, i32 11>
  %116 = shufflevector <16 x i32> %112, <16 x i32> %113, <4 x i32> <i32 16, i32 17, i32 18, i32 19>
  %117 = shufflevector <16 x i32> %112, <16 x i32> %113, <4 x i32> <i32 24, i32 25, i32 26, i32 27>
  %118 = bitcast <16 x i32>* %uglygep242243 to <4 x i32>*
  call void @llvm.aarch64.neon.st4.v4i32.p0v4i32(<4 x i32> %114, <4 x i32> %115, <4 x i32> %116, <4 x i32> %117, <4 x i32>* %118)
  %scevgep241 = getelementptr <16 x i32>, <16 x i32>* %uglygep242243, i64 1
  %119 = shufflevector <16 x i32> %112, <16 x i32> %113, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %120 = shufflevector <16 x i32> %112, <16 x i32> %113, <4 x i32> <i32 12, i32 13, i32 14, i32 15>
  %121 = shufflevector <16 x i32> %112, <16 x i32> %113, <4 x i32> <i32 20, i32 21, i32 22, i32 23>
  %122 = shufflevector <16 x i32> %112, <16 x i32> %113, <4 x i32> <i32 28, i32 29, i32 30, i32 31>
  %123 = bitcast <16 x i32>* %scevgep241 to <4 x i32>*
  call void @llvm.aarch64.neon.st4.v4i32.p0v4i32(<4 x i32> %119, <4 x i32> %120, <4 x i32> %121, <4 x i32> %122, <4 x i32>* %123)

IMO, it makes sense to have Halide generate this instead:

  %114 = shufflevector <16 x i32> %112, <16 x i32> %113, <16 x i32> <i32 0, i32 8, i32 16, i32 24, i32 1, i32 9, i32 17, i32 25, i32 2, i32 10, i32 18, i32 26, i32 3, i32 11, i32 19, i32 27>
  store <16 x i32> %114, <16 x i32>* %sunkaddr262
  %115 = shufflevector <16 x i32> %112, <16 x i32> %113, <16 x i32> <i32 4, i32 12, i32 20, i32 28, i32 5, i32 13, i32 21, i32 29, i32 6, i32 14, i32 22, i32 30, i32 7, i32 15, i32 23, i32 31>
  store <16 x i32> %115, <16 x i32>* %scevgep241

With the changes from the patch, this translates to the code above, and it
is arch independent.
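To see why the two forms are interchangeable, the st4 semantics can be
modeled in a few lines of Python. This is purely an illustrative model
(the `st4` and `shuffle` helpers below are simplified stand-ins for the
@llvm.aarch64.neon.st4 intrinsic and the shufflevector instruction, not
code from the patch):

```python
def st4(groups):
    """Model of NEON st4: interleave 4 equal-length vectors lane by lane."""
    return [groups[lane][i] for i in range(len(groups[0])) for lane in range(4)]

def shuffle(src, mask):
    """Model of shufflevector: pick elements of src by mask index."""
    return [src[m] for m in mask]

src = list(range(32))  # stands in for the concatenation of %112 and %113

# Form 1: what Halide emits today — two st4 calls on de-interleaved quarters.
mem1 = st4([src[0:4], src[8:12], src[16:20], src[24:28]]) \
     + st4([src[4:8], src[12:16], src[20:24], src[28:32]])

# Form 2: two wide shuffles plus plain stores, using the masks from the mail.
mask_a = [0, 8, 16, 24, 1, 9, 17, 25, 2, 10, 18, 26, 3, 11, 19, 27]
mask_b = [4, 12, 20, 28, 5, 13, 21, 29, 6, 14, 22, 30, 7, 15, 23, 31]
mem2 = shuffle(src, mask_a) + shuffle(src, mask_b)

assert mem1 == mem2  # both forms lay out the same 32 elements in memory
```

Under this model, the memory image produced by the two st4 calls is
elementwise identical to the two shuffled wide stores, which is what lets
the interleaved access pass recover the st4 form from the second shape.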
AFAIK there isn't an LLVM pass going one step further to do such splits if
given a single store. But you're right that, in general, the data is not
necessarily continuous and contiguous in Halide, and I think the pass
should address those cases as well.

> One way to test the back-end pattern matching is to emit textual IR and
> manually change it, removing the intrinsics, or changing the shuffles,
> and see what happens after `opt`.

Yes, I did that with some of the code generated by Halide; it's what led
to patch D23646 to extend the patterns. The new code being generated is
the "expected" one.

Also, benchmarking some of their apps showed that llvm's pass (after the
patch) does the job as well as the custom code generation they were using
before. (Note that Halide's code generation was written before the
interleaved access pass was added, so it made sense at the time.)

> > Performance-wise, it is worth using the VSTns in the scenarios they
> > encounter, it's mostly a question of where they get generated.
>
> I'm confused. AFAIK, VSTn is AArch32 while STn is AArch64, and for the
> one lane variant, the latency and throughput seem to be identical.

That's right, the aim is to have llvm generate VSTn/STn depending on the
arch.

Best,
Alina
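As an aside for readers following the thread: the generalized store
pattern discussed above, [x, y, ..., z, x+1, y+1, ..., z+1, ...], can be
expressed as a small predicate on the shuffle mask. The sketch below is a
hypothetical illustration of the idea, not the actual check added in
D23646:

```python
def is_interleave_mask(mask, factor):
    """Return True if `mask` has the shape
    [x, y, ..., z, x+1, y+1, ..., z+1, ...] for `factor` arbitrary
    start indices x, y, ..., z (each sub-sequence advances by 1 per group)."""
    if factor == 0 or len(mask) % factor != 0:
        return False
    num_groups = len(mask) // factor
    starts = mask[:factor]  # x, y, ..., z — no constraint on their values
    for g in range(num_groups):
        for lane in range(factor):
            if mask[g * factor + lane] != starts[lane] + g:
                return False
    return True

# The classic n-hop mask matches, but so do masks whose start indices are
# not 0, n, 2n, ... — e.g. the second store's mask from the example above.
classic = [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
shifted = [4, 12, 20, 28, 5, 13, 21, 29, 6, 14, 22, 30, 7, 15, 23, 31]
assert is_interleave_mask(classic, 4) and is_interleave_mask(shifted, 4)
```

A mask passing this predicate with factor n is exactly one that a single
VSTn/STn can realize, regardless of where each sub-sequence starts.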
Renato Golin via llvm-dev
2016-Oct-14 14:38 UTC
[llvm-dev] [arm, aarch64] Alignment checking in interleaved access pass
On 10 October 2016 at 22:16, Alina Sbirlea <alina.sbirlea at gmail.com> wrote:
> IMO, it makes sense to have Halide generate this instead:
>   %114 = shufflevector <16 x i32> %112, <16 x i32> %113, <16 x i32> <i32 0, i32 8, i32 16, i32 24, i32 1, i32 9, i32 17, i32 25, i32 2, i32 10, i32 18, i32 26, i32 3, i32 11, i32 19, i32 27>
>   store <16 x i32> %114, <16 x i32>* %sunkaddr262
>   %115 = shufflevector <16 x i32> %112, <16 x i32> %113, <16 x i32> <i32 4, i32 12, i32 20, i32 28, i32 5, i32 13, i32 21, i32 29, i32 6, i32 14, i32 22, i32 30, i32 7, i32 15, i32 23, i32 31>
>   store <16 x i32> %115, <16 x i32>* %scevgep241
> With the changes from the patch, this translates to the code above, and it
> is arch independent.

Right, this makes sense.

This should generate 2 VST4/ST4, which together will be contiguous, but
not individually.

> Yes, I did that with some of the codes generated by Halide, it's what led
> to patch D23646 to extend the patterns. The new code being generated is
> the "expected" one.

I have added some comments on the review, but I think overall it makes
sense, and it's a much simpler patch than I was expecting to find working
all the way to the end. :)

> Also, benchmarking some of their apps showed that llvm's pass (after the
> patch) does the job as well as the custom code generation they were using
> before. (Note, that Halide's code generation was written before the
> interleaved access pass was added, so it made sense at the time.)

Nice!

cheers,
--renato
Alina Sbirlea via llvm-dev
2016-Oct-14 18:36 UTC
[llvm-dev] [arm, aarch64] Alignment checking in interleaved access pass
On Fri, Oct 14, 2016 at 7:38 AM, Renato Golin <renato.golin at linaro.org> wrote:

> On 10 October 2016 at 22:16, Alina Sbirlea <alina.sbirlea at gmail.com> wrote:
> > IMO, it makes sense to have Halide generate this instead:
> >   %114 = shufflevector <16 x i32> %112, <16 x i32> %113, <16 x i32> <i32 0, i32 8, i32 16, i32 24, i32 1, i32 9, i32 17, i32 25, i32 2, i32 10, i32 18, i32 26, i32 3, i32 11, i32 19, i32 27>
> >   store <16 x i32> %114, <16 x i32>* %sunkaddr262
> >   %115 = shufflevector <16 x i32> %112, <16 x i32> %113, <16 x i32> <i32 4, i32 12, i32 20, i32 28, i32 5, i32 13, i32 21, i32 29, i32 6, i32 14, i32 22, i32 30, i32 7, i32 15, i32 23, i32 31>
> >   store <16 x i32> %115, <16 x i32>* %scevgep241
> > With the changes from the patch, this translates to the code above, and
> > it is arch independent.
>
> Right, this makes sense.
>
> This should generate 2 VST4/ST4, which together will be contiguous,
> but not individually.
>
> > Yes, I did that with some of the codes generated by Halide, it's what
> > led to patch D23646 to extend the patterns. The new code being generated
> > is the "expected" one.
>
> I have added some comments on the review, but I think overall, it
> makes sense and it's a much simpler patch than I was expecting to find
> working all the way to the end. :)

Thanks a lot for looking at the patch, much appreciated! :) Updated and
followed up on it.

Alina

> > Also, benchmarking some of their apps showed that llvm's pass (after the
> > patch) does the job as well as the custom code generation they were
> > using before. (Note, that Halide's code generation was written before
> > the interleaved access pass was added, so it made sense at the time.)
>
> Nice!
>
> cheers,
> --renato