thr3ads.net - llvm dev - [LLVMdev] First attempt at recognizing pointer reduction [Oct 2013]

If this information is useful, please help other people find it:
Share via:

Renato Golin

2013-Oct-21 18:00 UTC

[LLVMdev] First attempt at recognizing pointer reduction

Hi Arnold,

To sum up my intentions, I want to understand how the reduction/induction
variable detection works in LLVM, so that I can know better how to detect
different patterns in memory, not just the stride vectorization.

For instance, even if the relationship between each loop would be
complicated, I know that in each loop, all three reads are sequential. So,
at least, I could use a Load<3> rather than three loads. I guess this is
why Nadav was hinting for the SLP vectorizer, not the loop vec. Since the
operation on all three loaded values are exactly the same AND the writes
are also sequential, I can reduce that to: load<3> -> op<3> ->
store<3>.
That should work even on machines that don't have de-interleaved access
(assuming they're pointers to simple types, etc).

On 21 October 2013 17:29, Arnold Schwaighofer <aschwaighofer at
apple.com>wrote:
> can you post a hand-created vectorized IR of how a reduction would work on
> your example?
>
I'm not there yet, just trying to understand the induction/reduction code
first and what comes out of it.

I think the right approach to get examples like this is that we need
to> recognize strided pointer inductions (we only support strides by one
> currently).
>
I see. I'll have a look at IK_PtrInduction and see what patterns I can spot.

Do you envisage a new IK type for strided induction, or just work with the
PtrInduction to extend the concept of a non-unit stride (ie. separate Step
from ElementSize)?

cheers,
--renato
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131021/4320c3b6/attachment.html>

Arnold Schwaighofer

2013-Oct-21 19:58 UTC

head link

[LLVMdev] First attempt at recognizing pointer reduction

On Oct 21, 2013, at 1:00 PM, Renato Golin <renato.golin at linaro.org>
wrote:
> Hi Arnold,
> 
> To sum up my intentions, I want to understand how the reduction/induction
variable detection works in LLVM, so that I can know better how to detect
different patterns in memory, not just the stride vectorization.
To detect memory access patterns you will want to look at the SCEV of a pointer
(or at a set of SCEVs/pointers). This way you get a canonical form.

For example these should be the SCEVs of “int a[2*i] = ; a[2*i+1] =”:

{ptr,   +, 8}_loop
{ptr+4, +, 8}_loop

Each access on its own requires a gather/scather (2 loads/stores when vectorized
(VF=2) + inserts/extracts). But when we look at both at once we see that we only
need two load/store in total (plus some interleaving operations).

What other patterns (than strided accesses) do you want to vectorize?
> 
> For instance, even if the relationship between each loop would be
complicated, I know that in each loop, all three reads are sequential. So, at
least, I could use a Load<3> rather than three loads.

Yes. Detecting this is what is described in the paper I referred in a post
before (Auto-vectorization of interleaved data for SIMD). And what gcc is doing
(AFAICT). You want to take a set of accesses in the loop and recognize that they
are consecutive in memory (currently we look at every access on it is own, if an
access is not sequential we assume we have to gather/scather). Once you know
that you have consecutive accesses spread across different instructions you can
generate more efficient code instead of scather/gathers. You would want to do
the evaluation of which accesses are consecutive in SCEV I think.

For your example, you want to recognize strided accesses (this is separate from
the induction/reduction mechanism), first:

for (i = 0 .. n, +1)
 a[2*i] = …
 a[2*1+1] = …

You want this part first because without it the loop vectorizer is not going to
vectorize because of the cost of gather/scather. But if it can recognize that in
cases like this the “gather/scather” is not as costly and we emit the right code
we will start to vectorize such loops.

Once we can do that. We need to support strided pointer inductions to get your
example.

for (i = 0..n, +1)
 *a++ *a++
> I guess this is why Nadav was hinting for the SLP vectorizer, not the loop
vec. Since the operation on all three loaded values are exactly the same AND the
writes are also sequential, I can reduce that to: load<3> ->
op<3> ->
> store<3>. That should work even on machines that don't have
de-interleaved access (assuming they're pointers to simple types, etc).
Getting this example in the slp vectorizer is easier but we won’t get the most
efficient code (i.e. the one that gcc emits) because we would have <3 x
i8> stores/loads. With vectorization of interleaved data you can load/store
more elements (from several iterations) with a single load.
> 
> 
> On 21 October 2013 17:29, Arnold Schwaighofer <aschwaighofer at
apple.com> wrote:
> can you post a hand-created vectorized IR of how a reduction would work on
your example?
> 
> I'm not there yet, just trying to understand the induction/reduction
code first and what comes out of it.
> 
> 
> I think the right approach to get examples like this is that we need to
recognize strided pointer inductions (we only support strides by one currently).
> 
> I see. I'll have a look at IK_PtrInduction and see what patterns I can
spot.
> 
> Do you envisage a new IK type for strided induction, or just work with the
PtrInduction to extend the concept of a non-unit stride (ie. separate Step from
ElementSize)?
Either representation should be fine. I think the bigger task is not recognizing
the induction but recognizing consecutive strided memory accesses, though.
First, I think we want to be able to do:

for (i = 0 … n, +1)
  a[3*i]   a[3*i+1]   a[3*i+2] 
And next,

for (i = 0 … n, +1)
  *a++   *a++   *a++   
Because to get the latter, you need the former.

Have you compared the performance of the kernel (gcc vectorized) you showed vs a
scalar version? I would be curious about the speed-up.

Thanks,
Arnold

Renato Golin

2013-Oct-21 20:40 UTC

head link

[LLVMdev] First attempt at recognizing pointer reduction

On 21 October 2013 20:58, Arnold Schwaighofer <aschwaighofer at
apple.com>wrote:
> For example these should be the SCEVs of “int a[2*i] = ; a[2*i+1] =”:
>
> {ptr,   +, 8}_loop
> {ptr+4, +, 8}_loop
>
> Each access on its own requires a gather/scather (2 loads/stores when
> vectorized (VF=2) + inserts/extracts). But when we look at both at once we
> see that we only need two load/store in total (plus some interleaving
> operations).
>

Yes, I've been studying SCEV when trying to understand some other patterns
where the vectorizer was unable to detect the exit count (basically this
case, with a nested loop). It does make things easier to spot patterns in
the code.

The patch I attached here was not to help vectorize anything, but to let me
jump over the validation step, so that I could start working with the
patterns themselves during the actual vectorization. The review request was
only to understand if the checks I was making made sense, but it turned out
a lot more valuable than that.


Getting this example in the slp vectorizer is easier but we won’t get
the> most efficient code (i.e. the one that gcc emits) because we would have
<3
> x i8> stores/loads. With vectorization of interleaved data you can
> load/store more elements (from several iterations) with a single load.
>
So, this was the other patterns I was looking for, as a stepping stone into
the full vectorizer. But I'm not sure this will help in any way the strided
access.



Either representation should be fine. I think the bigger task is
not> recognizing the induction but recognizing consecutive strided memory
> accesses, though. First, I think we want to be able to do:
>
> for (i = 0 … n, +1)
>   a[3*i] >   a[3*i+1] >   a[3*i+2] >
> And next,
>
> for (i = 0 … n, +1)
>   *a++ >   *a++ >   *a++ >
> Because to get the latter, you need the former.
>
Makes total sense. I'll change my approach.


Have you compared the performance of the kernel (gcc vectorized) you
showed> vs a scalar version? I would be curious about the speed-up.
>
4x faster, on both Cortex A9 and A15. :)

Thanks for the tips, I hope I can find more time to work on it this week,
since Linaro Connect is in the coming week and the US dev meeting is on the
next.

cheers,
--renato
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131021/c0d72575/attachment.html>

Possibly Parallel Threads

Search for more possibly parallel threads

llvm dev - Oct 2013 - [LLVMdev] First attempt at recognizing pointer reduction

[LLVMdev] First attempt at recognizing pointer reduction

[LLVMdev] First attempt at recognizing pointer reduction

[LLVMdev] First attempt at recognizing pointer reduction

Possibly Parallel Threads