thr3ads.net - llvm dev - [LLVMdev] First attempt at recognizing pointer reduction [Oct 2013]

If this information is useful, please help other people find it:
Share via:

Renato Golin

2013-Oct-23 20:10 UTC

[LLVMdev] First attempt at recognizing pointer reduction

On 23 October 2013 16:05, Arnold Schwaighofer <aschwaighofer at
apple.com>wrote:
> In the examples you gave there are no reduction variables in the loop
> vectorizer’s sense. But, they all have memory accesses that are strided.
>
This is what I don't get. As far as I understood, a reduction variable is
the one that aggregates the computation done by the loop, and is used
outside the loop.

In my example, I'm aggregating a computation in an array and returning this
array for later use, what am I missing here?

1.) analyze the memory accesses in the loop for strided accesses and
deal> with them appropriately during analysis and transformation
> 2.) be able to handle loops with non-unit stride integer induction
> variables - this only makes sense if we have support for 1.)
>
So, despite all this, we still have to pass validation, so canVectorize()
must return true. As it stands, all my examples can't vectorize because of
the extra memory PHI, and your example below can't vectorize because it
can't find the array bounds.

I'm assuming that, as soon as I teach the validation to accept your loop
(given additional checks), and teach about the strides and costs, it will
vectorize. But if I go back to my original code, it won't, because of the
reduction PHI.

So, again, I agree with you, one step at a time, I'll work with your loop,
because it's the straightest path from here. But at some time, I'll have
to
either identify the memory PHIs or transform the loop to avoid them at all.
Does that make sense?

cheers,
--renato
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131023/47c22e7d/attachment.html>

Arnold Schwaighofer

2013-Oct-23 22:05 UTC

head link

[LLVMdev] First attempt at recognizing pointer reduction

On Oct 23, 2013, at 3:10 PM, Renato Golin <renato.golin at linaro.org>
wrote:
> On 23 October 2013 16:05, Arnold Schwaighofer <aschwaighofer at
apple.com> wrote:
> In the examples you gave there are no reduction variables in the loop
vectorizer’s sense. But, they all have memory accesses that are strided.
> 
> This is what I don't get. As far as I understood, a reduction variable
is the one that aggregates the computation done by the loop, and is used outside
the loop.
> 
Yes. But:

for (i = …)
  a[i] = b[i]
  a[i+1] = b[i+1]
  …

is not such a reduction. Not even when the code disguises like this:

for (i= …)
 *a++ = *b++
 *a++ = *b++

A reduction is something like:

for (i= …) {
 r+= a[i];
}
return r;

here the value of “r” is only truly used by the reduction itself and outside of
the loop.

In your example you have

for ()
  *a   a++;
  …

the “*a=“ part being a true use in the loop. This is a problem because at that
point you would have to reduce the vectorized reduction.

I.e if we wrote “a" as (pointer reduction, VF=2) as:

  %a_red = phi <2 x i32*> [preheader, <%a, 0>], [loop, %a_red_next]

  %a_red_next = gep %a_red, <3, 3>

you would have to compute the value of “a” for the loop on every iteration from
the vectorized reduction.

  %a_red = phi <2 x i32*> [preheader, <%a, 0>], [loop, %a_red_next]

  %a = horizontal_add %a_red // This is executed on every iteration.
    = load a
  …
  %a_red_next = gep %a_red, <3, 3>

Something like the above wants to be a simple induction

  %a_ind = phi i32* [preheader, %a], [loop, %a_next]
   … use of %a_ind
   ...
  %a_next = gep %a_ind, 6

The horizontal reduction i mention above is not needed if you have no use inside
the loop like in the case of:

r=0
for (i = …) {
  r += a[i];
}
return r;

This is simply (i am leaving out the induction variable for “i”):

  %r_red = phi <2 x i32> [preheader, <%r, 0>] , [loop, %r_red_next]
  %a_ptr = gep %a, %i
  %val_of_a_sub_i = load <2 x i32> * %a_ptr

  %r_red_next = add %r_red, %val_of_a_sub_i

Outside of the loop we reduce the vectorized reduction to the final value of “r”

 %r= horizontal_add %r_red_next
 ret %r
 > In my example, I'm aggregating a computation in an array and returning
this array for later use, what am I missing here?
See above. You are not really aggregating in the spirit of reductions as used in
the loop vectorizer. Every array element is only accesses once so this is not
really an aggregation.

> 
> 
> 1.) analyze the memory accesses in the loop for strided accesses and deal
with them appropriately during analysis and transformation
> 2.) be able to handle loops with non-unit stride integer induction
variables - this only makes sense if we have support for 1.)
> 
> So, despite all this, we still have to pass validation, so canVectorize()
must return true. As it stands, all my examples can't vectorize because of
the extra memory PHI, and your example below can't vectorize because it
can't find the array bounds.
Yes because we have to teach the vectorizer about pointer induction variables
that stride for your example.

Your code - as far as I can reconstruct it from memory - looks something like

preheader:

loop:
%strided_induction_pointer = phi [preheader, %b], [loop, %str_ind_ptr_inc]

     = load %strided_induction_pointer
%b_1 = gep %strided_induction_pointer, 1

…    = load %b_1

%b_2 = gep %strided_induction_pointer, 2

…    = load %b_2

%str_ind_ptr_inc = gep %strided_induction_pointer, 3 // Induction variable that
strides by 3
%cmp = …
br %cmp, loop, exit

exit:

%strided_induction_pointer here is a pointer induction not a reduction, because
a reduction precludes uses like the loads in the loop.

My example probably gives you this answer because it is trying to add  runtime
checks that ensure that

a[3*i] a[3*i+1] a[3*i+2] 

don’t overlap for i in {0 …n}. Since we check overlapping by checking a range
this would fail anyways.

Teaching the legality logic that we can still vectorize this code - because the
accesses are strided (and don’t overlap) - is part of the problem to solve when
I say we have to teach the vectorizer how to  handle interleaved/strided data.

> 
> I'm assuming that, as soon as I teach the validation to accept your
loop (given additional checks), and teach about the strides and costs, it will
vectorize. But if I go back to my original code, it won't, because of the
reduction PHI.
> 
> So, again, I agree with you, one step at a time, I'll work with your
loop, because it's the straightest path from here. But at some time,
I'll have to either identify the memory PHIs or transform the loop to avoid
them at all. Does that make sense?

You don’t need to transform them in the legality phase. Believe me ;). Look at
how we handle stride one pointer inductions at the moment (they are also memory
phis) - they are based off a canonical induction variable that we create during
the actual vectorization. Everything before that is done virtually without
having to transform code until we actually know we want to vectorize it.

Basically for pointer inductions we store the start value. When we come to
actually vectorize the loop we create a canonical induction variable that starts
at zero and strides by the vector factor (VF). Any use of the original pointer
induction variable in the vectorized loop is simply:

  Builder.CreateGEP(II.StartValue, cannonical_induction_variable);

For a non-unit pointer induction this would be

  Builder.CreateGEP(II.StartValue, stride * cannonical_induction_variable);

(I have simplified the problem a little the code in the vectorizer is a little
bit more complicated but essentially boils down to the above)

Thanks,
Arnold> 
> cheers,
> --renato

Renato Golin

2013-Oct-24 07:40 UTC

head link

[LLVMdev] First attempt at recognizing pointer reduction

On 23 October 2013 23:05, Arnold Schwaighofer <aschwaighofer at
apple.com>wrote:
> A reduction is something like:
>
> for (i= …) {
>  r+= a[i];
> }
> return r;
>
Ok, so "reduction" is just a reduction in the map-reduce sense, and
nothing
else.


You don’t need to transform them in the legality phase. Believe me ;).
Look> at how we handle stride one pointer inductions at the moment (they are also
> memory phis) - they are based off a canonical induction variable that we
> create during the actual vectorization. Everything before that is done
> virtually without having to transform code until we actually know we want
> to vectorize it.
>
Oh, so that's what I was missing. When Nadav said about pointer reduction,
I thought that was how we were should be dealing with memory PHIs in the
end. I'll see how stride 1 pointer induction traverses the code and where
to add stride N (but not sooner than I try to teach strides to non-pointer
cases).

I'll also add the reduction case, like:

for (i .. N/3) {
  r += a[3*i] ..;
  r += a[3*i+1] ..;
  r += a[3*i+2] ..;
}
return r;

And see how it should work. (later, too).


Basically for pointer inductions we store the start value. When we come
to> actually vectorize the loop we create a canonical induction variable that
> starts at zero and strides by the vector factor (VF). Any use of the
> original pointer induction variable in the vectorized loop is simply:
>
Thanks, I'll have a look (again). ;)

I'll open some Bugzilla bugs and assign to me, one for each type of loop to
be vectorized, in sequence (depends on) and using some of the test we
discussed in the mailing list. That way, we'll be able to agree on the
order and implementation beforehand. Feel free to share them with me if you
have time in the interim, I'll mainly be two weeks away from now (Connect,
holidays, LLVM meeting).

So far, we have covered:

1. simple stride: for (N/3) { a[3*i] = b[3*i]; a[3*i+1] = b[3*i+1];
a[3*i+2] = b[3*i+2]; }
 * The simplest form, needs only basic checks & vectorization logic

2. re-rolling possible: for (N/3) { a[3*i] = b[3*i] + K; a[3*i+1] b[3*i+1] + K;
a[3*i+2] = b[3*i+2] + K; }
 * can be re-rolled for simple optimization (already implemented)
 * doesn't need interleaved data, even if vectorized directly via stride
access

3. re-rolling impossible: for (N/3) { a[3*i] = b[3*i] + I; a[3*i+1] b[3*i+1] +
J; a[3*i+2] = b[3*i+2] + K; }
 * can't be re-rolled and does need interleaved data access

4. reduction stride: for (N/3) { a += b[3*i]; a += b[3*i+1]; a += b[3*i+2];
} return a;
 * May work with case above implemented, may need additional reduction logic
 * can be any of the three types above

5. memory stride: for (N/3) { a++ = b++; a++ = b++; a++ = b++; }
 * should be transformed into one of the cases above first, than vectorized
as them
 * can be any of the three types above (1, 2, 3)

With the dependency being:

1 -> { 2, 3} -> {4, 5}

cheers,
--renato
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131024/41e81dc6/attachment.html>

Maybe Matching Threads

Search for more seemingly similar threads

llvm dev - Oct 2013 - [LLVMdev] First attempt at recognizing pointer reduction

[LLVMdev] First attempt at recognizing pointer reduction

[LLVMdev] First attempt at recognizing pointer reduction

[LLVMdev] First attempt at recognizing pointer reduction

Maybe Matching Threads