thr3ads.net - llvm dev - [LLVMdev] Vectorization of loops with conditional dereferencing [Nov 2013]

If this information is useful, please help other people find it:
Share via:

Hal Finkel

2013-Nov-01 13:40 UTC

[LLVMdev] Vectorization of loops with conditional dereferencing

Nadav, Arnold, et al.,

I have a number of loops that I would like us to be able to autovectorize
(common, for example, in n-body inter-particle force kernels), and the problem
is that they look like this:

for (int i = 0; i < N; ++i) {
  if (r[i] > 0)
    v += m[i]*...;
}

where, as written, m[i] is not accessed unless the condition is true. The
general problem (as is noted by the loop vectorizer), is that we don't know
that dereferencing m[i] is safe for values where the guarding condition is not
true, and so if-converting the block directly is not allowed.

One possibility, if we disallow for m[i] to have been munmaped or page protected
in the middle, is to first collect the first and last index for which the
condition is true, and then run the vector loop only on that portion. For the
loop above, I mean something like:

FirstI = -1, LastI = -1, I = 0;
for (; I < N; ++I)
  if (r[i] > 0) {
    FirstI = I;
    break;
  }

for (; I < N; ++I) {
  if (r[i] > 0)
    LastI = I;
}

[In this case, since the guard dominates all side-effect, we don't need to
do anything outside that range, otherwise, we'd need to run the scalar loop
until FirstI, and then again from LastI until N]

for (I = FirstI; I <= LastI; ...) {
  // Run the scalar/vector loop sequence as before.
}

On the other hand, if we can't make this range assumption generally, then
I'd assume that we could make it per-page (where a page is 4kB or similar).
Then the scheme becomes more complicated, because we need to break the ranges on
page boundaries (maybe something like this):

Done = false;
FirstI = -1, LastI = -1;
while (!Done) {
  for (I = FirstI+1; I < N; ++I)
    if (r[i] > 0) {
      FirstI = I;
      break;
    }

  for (; I < N && !page_bound(&m[i]) && ...; ++I) {
    if (r[i] > 0)
      LastI = I;
  }

  Done = I == N;

  for (I = FirstI; I <= LastI; ...) {
    // Run the scalar/vector loop sequence as before.
  }  
}

I hope that we don't need to do something like this, but I've been
disappointed by C/POSIX semantics before. On the other hand, we might want to
break these into groups anyway, and allocate a local stack buffer to hold the
mask computed from the condition, so that we don't have to reevaluate the
condition in the loop.

Also, I'm certainly open to other ways to do this. Are there any other ways?
Thoughts?

Thanks again,
Hal

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Nadav Rotem

2013-Nov-01 16:38 UTC

head link

[LLVMdev] Vectorization of loops with conditional dereferencing

Hi Hal,

Yes, I agree that this is a problem that prevents vectorization in many loops. 
Another problem that we have is that sunk loads don’t preserve their control
dependence properties.  For example in the code below, if we sink the load into
the branch then we can't vectorize the loop.

x = A[i]
if (cond) {
	sum += x;	
}

I agree with you that checking the first and last element for each 4k page has
many disadvantages and I would like to explore other options. Yesterday Pekka
proposed marking some operations as 'no-trap'.  Maybe we should consider
marking some memory regions as ‘mapped'.  Every time we malloc memory or
pass a pointer to statically allocated memory we could propagate the information
about the availability and size of the memory.  What do you think ?

Thanks,
Nadav 



On Nov 1, 2013, at 6:40 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> Nadav, Arnold, et al.,
> 
> I have a number of loops that I would like us to be able to autovectorize
(common, for example, in n-body inter-particle force kernels), and the problem
is that they look like this:
> 
> for (int i = 0; i < N; ++i) {
>  if (r[i] > 0)
>    v += m[i]*...;
> }
> 
> where, as written, m[i] is not accessed unless the condition is true. The
general problem (as is noted by the loop vectorizer), is that we don't know
that dereferencing m[i] is safe for values where the guarding condition is not
true, and so if-converting the block directly is not allowed.
> 
> One possibility, if we disallow for m[i] to have been munmaped or page
protected in the middle, is to first collect the first and last index for which
the condition is true, and then run the vector loop only on that portion. For
the loop above, I mean something like:
> 
> FirstI = -1, LastI = -1, I = 0;
> for (; I < N; ++I)
>  if (r[i] > 0) {
>    FirstI = I;
>    break;
>  }
> 
> for (; I < N; ++I) {
>  if (r[i] > 0)
>    LastI = I;
> }
> 
> [In this case, since the guard dominates all side-effect, we don't need
to do anything outside that range, otherwise, we'd need to run the scalar
loop until FirstI, and then again from LastI until N]
> 
> for (I = FirstI; I <= LastI; ...) {
>  // Run the scalar/vector loop sequence as before.
> }
> 
> On the other hand, if we can't make this range assumption generally,
then I'd assume that we could make it per-page (where a page is 4kB or
similar). Then the scheme becomes more complicated, because we need to break the
ranges on page boundaries (maybe something like this):
> 
> Done = false;
> FirstI = -1, LastI = -1;
> while (!Done) {
>  for (I = FirstI+1; I < N; ++I)
>    if (r[i] > 0) {
>      FirstI = I;
>      break;
>    }
> 
>  for (; I < N && !page_bound(&m[i]) && ...; ++I) {
>    if (r[i] > 0)
>      LastI = I;
>  }
> 
>  Done = I == N;
> 
>  for (I = FirstI; I <= LastI; ...) {
>    // Run the scalar/vector loop sequence as before.
>  }  
> }
> 
> I hope that we don't need to do something like this, but I've been
disappointed by C/POSIX semantics before. On the other hand, we might want to
break these into groups anyway, and allocate a local stack buffer to hold the
mask computed from the condition, so that we don't have to reevaluate the
condition in the loop.
> 
> Also, I'm certainly open to other ways to do this. Are there any other
ways? Thoughts?
> 
> Thanks again,
> Hal
> 
> -- 
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory

Nadav Rotem

2013-Nov-01 16:47 UTC

head link

[LLVMdev] Vectorization of loops with conditional dereferencing

I forgot to mention that it would be interesting to disable this safety check in
the vectorizer and run the LLVM test suite.  I wonder how many tests will run
faster and how many will crash.

On Nov 1, 2013, at 6:40 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> Nadav, Arnold, et al.,
> 
> I have a number of loops that I would like us to be able to autovectorize
(common, for example, in n-body inter-particle force kernels), and the problem
is that they look like this:
> 
> for (int i = 0; i < N; ++i) {
>  if (r[i] > 0)
>    v += m[i]*...;
> }
> 
> where, as written, m[i] is not accessed unless the condition is true. The
general problem (as is noted by the loop vectorizer), is that we don't know
that dereferencing m[i] is safe for values where the guarding condition is not
true, and so if-converting the block directly is not allowed.
> 
> One possibility, if we disallow for m[i] to have been munmaped or page
protected in the middle, is to first collect the first and last index for which
the condition is true, and then run the vector loop only on that portion. For
the loop above, I mean something like:
> 
> FirstI = -1, LastI = -1, I = 0;
> for (; I < N; ++I)
>  if (r[i] > 0) {
>    FirstI = I;
>    break;
>  }
> 
> for (; I < N; ++I) {
>  if (r[i] > 0)
>    LastI = I;
> }
> 
> [In this case, since the guard dominates all side-effect, we don't need
to do anything outside that range, otherwise, we'd need to run the scalar
loop until FirstI, and then again from LastI until N]
> 
> for (I = FirstI; I <= LastI; ...) {
>  // Run the scalar/vector loop sequence as before.
> }
> 
> On the other hand, if we can't make this range assumption generally,
then I'd assume that we could make it per-page (where a page is 4kB or
similar). Then the scheme becomes more complicated, because we need to break the
ranges on page boundaries (maybe something like this):
> 
> Done = false;
> FirstI = -1, LastI = -1;
> while (!Done) {
>  for (I = FirstI+1; I < N; ++I)
>    if (r[i] > 0) {
>      FirstI = I;
>      break;
>    }
> 
>  for (; I < N && !page_bound(&m[i]) && ...; ++I) {
>    if (r[i] > 0)
>      LastI = I;
>  }
> 
>  Done = I == N;
> 
>  for (I = FirstI; I <= LastI; ...) {
>    // Run the scalar/vector loop sequence as before.
>  }  
> }
> 
> I hope that we don't need to do something like this, but I've been
disappointed by C/POSIX semantics before. On the other hand, we might want to
break these into groups anyway, and allocate a local stack buffer to hold the
mask computed from the condition, so that we don't have to reevaluate the
condition in the loop.
> 
> Also, I'm certainly open to other ways to do this. Are there any other
ways? Thoughts?
> 
> Thanks again,
> Hal
> 
> -- 
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory

Hal Finkel

2013-Nov-01 19:41 UTC

head link

[LLVMdev] Vectorization of loops with conditional dereferencing

----- Original Message -----> Hi Hal,
> 
> Yes, I agree that this is a problem that prevents vectorization in
> many loops.  Another problem that we have is that sunk loads don’t
> preserve their control dependence properties.  For example in the
> code below, if we sink the load into the branch then we can't
> vectorize the loop.
> 
> x = A[i]
> if (cond) {
> 	sum += x;
> }
> 
> I agree with you that checking the first and last element for each 4k
> page has many disadvantages and I would like to explore other
> options.
Yes -- So far, I've failed to think of a better way that will work
automatically (meaning without adding pragmas, etc.) :(
> Yesterday Pekka proposed marking some operations as
> 'no-trap'.  Maybe we should consider marking some memory regions as
> ‘mapped'.  Every time we malloc memory or pass a pointer to
> statically allocated memory we could propagate the information about
> the availability and size of the memory.  What do you think ?
I think that this is a good idea (and I suspect that code already exists to do
some of this; poolalloc perhaps?), although I fear that it generally depends on
LTO to work for general codes.

In the short term, I think a pragma, and a compiler option that changes the
default, may be the best solution.

Realistically, I know of 0 real applications that use loop-dependent conditions
to prevent dereferencing invalid pointers. As you say, it will be interesting to
turn off the safety check, and see what in the test suite is affected.

Thanks again,
Hal
> 
> Thanks,
> Nadav
> 
> 
> 
> On Nov 1, 2013, at 6:40 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
> > Nadav, Arnold, et al.,
> > 
> > I have a number of loops that I would like us to be able to
> > autovectorize (common, for example, in n-body inter-particle force
> > kernels), and the problem is that they look like this:
> > 
> > for (int i = 0; i < N; ++i) {
> >  if (r[i] > 0)
> >    v += m[i]*...;
> > }
> > 
> > where, as written, m[i] is not accessed unless the condition is
> > true. The general problem (as is noted by the loop vectorizer), is
> > that we don't know that dereferencing m[i] is safe for values
> > where the guarding condition is not true, and so if-converting the
> > block directly is not allowed.
> > 
> > One possibility, if we disallow for m[i] to have been munmaped or
> > page protected in the middle, is to first collect the first and
> > last index for which the condition is true, and then run the
> > vector loop only on that portion. For the loop above, I mean
> > something like:
> > 
> > FirstI = -1, LastI = -1, I = 0;
> > for (; I < N; ++I)
> >  if (r[i] > 0) {
> >    FirstI = I;
> >    break;
> >  }
> > 
> > for (; I < N; ++I) {
> >  if (r[i] > 0)
> >    LastI = I;
> > }
> > 
> > [In this case, since the guard dominates all side-effect, we don't
> > need to do anything outside that range, otherwise, we'd need to
> > run the scalar loop until FirstI, and then again from LastI until
> > N]
> > 
> > for (I = FirstI; I <= LastI; ...) {
> >  // Run the scalar/vector loop sequence as before.
> > }
> > 
> > On the other hand, if we can't make this range assumption
> > generally, then I'd assume that we could make it per-page (where a
> > page is 4kB or similar). Then the scheme becomes more complicated,
> > because we need to break the ranges on page boundaries (maybe
> > something like this):
> > 
> > Done = false;
> > FirstI = -1, LastI = -1;
> > while (!Done) {
> >  for (I = FirstI+1; I < N; ++I)
> >    if (r[i] > 0) {
> >      FirstI = I;
> >      break;
> >    }
> > 
> >  for (; I < N && !page_bound(&m[i]) && ...;
++I) {
> >    if (r[i] > 0)
> >      LastI = I;
> >  }
> > 
> >  Done = I == N;
> > 
> >  for (I = FirstI; I <= LastI; ...) {
> >    // Run the scalar/vector loop sequence as before.
> >  }
> > }
> > 
> > I hope that we don't need to do something like this, but I've
been
> > disappointed by C/POSIX semantics before. On the other hand, we
> > might want to break these into groups anyway, and allocate a local
> > stack buffer to hold the mask computed from the condition, so that
> > we don't have to reevaluate the condition in the loop.
> > 
> > Also, I'm certainly open to other ways to do this. Are there any
> > other ways? Thoughts?
> > 
> > Thanks again,
> > Hal
> > 
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
> 
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Ralf Karrenberg

2013-Nov-14 08:16 UTC

head link

[LLVMdev] Vectorization of loops with conditional dereferencing

Hi Hal,

I may have missed the point here, but what about the "naive" solution
of
using a cascade of guarded, scalar loads in the vectorized code?

for (int i = 0; i < N; i+=4) {
   v0 = ...

   mask cond = r[i] > 0;
   vector l;
   if (mask[0])
     l[0] = m[0];
   if (mask[1])
     l[0] = m[1];
   ...

   v1 = v0 + l*...;
   blend(cond, v0, v1);
}

(I used array notation for accessing vector elements)

The rest of the code can remain vectorized (e.g. the += and *).
Or is it the case that the load is exactly the instruction you want 
vectorized because it dominates the runtime of the loop?

Oh, and sorry for the late reply, I don't have the time to check LLVMdev 
regularly atm :).

Cheers,
Ralf

On 01/11/13 14:40, Hal Finkel wrote:> Nadav, Arnold, et al.,
>
> I have a number of loops that I would like us to be able to autovectorize
(common, for example, in n-body inter-particle force kernels), and the problem
is that they look like this:
>
> for (int i = 0; i < N; ++i) {
>    if (r[i] > 0)
>      v += m[i]*...;
> }
>
> where, as written, m[i] is not accessed unless the condition is true. The
general problem (as is noted by the loop vectorizer), is that we don't know
that dereferencing m[i] is safe for values where the guarding condition is not
true, and so if-converting the block directly is not allowed.
>
> One possibility, if we disallow for m[i] to have been munmaped or page
protected in the middle, is to first collect the first and last index for which
the condition is true, and then run the vector loop only on that portion. For
the loop above, I mean something like:
>
> FirstI = -1, LastI = -1, I = 0;
> for (; I < N; ++I)
>    if (r[i] > 0) {
>      FirstI = I;
>      break;
>    }
>
> for (; I < N; ++I) {
>    if (r[i] > 0)
>      LastI = I;
> }
>
> [In this case, since the guard dominates all side-effect, we don't need
to do anything outside that range, otherwise, we'd need to run the scalar
loop until FirstI, and then again from LastI until N]
>
> for (I = FirstI; I <= LastI; ...) {
>    // Run the scalar/vector loop sequence as before.
> }
>
> On the other hand, if we can't make this range assumption generally,
then I'd assume that we could make it per-page (where a page is 4kB or
similar). Then the scheme becomes more complicated, because we need to break the
ranges on page boundaries (maybe something like this):
>
> Done = false;
> FirstI = -1, LastI = -1;
> while (!Done) {
>    for (I = FirstI+1; I < N; ++I)
>      if (r[i] > 0) {
>        FirstI = I;
>        break;
>      }
>
>    for (; I < N && !page_bound(&m[i]) && ...; ++I) {
>      if (r[i] > 0)
>        LastI = I;
>    }
>
>    Done = I == N;
>
>    for (I = FirstI; I <= LastI; ...) {
>      // Run the scalar/vector loop sequence as before.
>    }
> }
>
> I hope that we don't need to do something like this, but I've been
disappointed by C/POSIX semantics before. On the other hand, we might want to
break these into groups anyway, and allocate a local stack buffer to hold the
mask computed from the condition, so that we don't have to reevaluate the
condition in the loop.
>
> Also, I'm certainly open to other ways to do this. Are there any other
ways? Thoughts?
>
> Thanks again,
> Hal
>

Renato Golin

2013-Nov-14 10:18 UTC

head link

[LLVMdev] Vectorization of loops with conditional dereferencing

On 1 November 2013 13:40, Hal Finkel <hfinkel at anl.gov> wrote:
> Done = false;
> FirstI = -1, LastI = -1;
> while (!Done) {
>   for (I = FirstI+1; I < N; ++I)
>     if (r[i] > 0) {
>       FirstI = I;
>       break;
>     }
>
>   for (; I < N && !page_bound(&m[i]) && ...; ++I) {
>     if (r[i] > 0)
>       LastI = I;
>   }
>
>   Done = I == N;
>
>   for (I = FirstI; I <= LastI; ...) {
>     // Run the scalar/vector loop sequence as before.
>   }
> }
>
Hi Hal,

Even if you could do something like this, there is no guarantee that (Last
- First) will be multiple of VF, or even bigger than VF, and you'll have to
cope with boundary conditions on every external loop, which, depending on
the distribution of r[] could be as many as half the size of r[] itself.

I can't see an algorithmically (compile-time or run-time) way to guarantee
that the number of clusters will be small without scanning the whole array.
So, even if you add such a check in run-time and only vectorize if the
number of clusters is small, the worst case scenario will run twice as many
times, (the initial check can be vectorized, as it's a read-only loop), but
the later cannot. in the best case scenario, you'll have only a handful of
loops, which will run in parallel.

Worst case: n/VF + n
Best case: n/VF + ammortized n/VF

For VF == 2,
 * best case is as fast as scalar code, considering overheads.
 * worst case is 50% slower

For VF == 4,
 * best case is 50% faster than scalar code
 * worst case is 25% slower

And all that, depends on each workload, so it'll change for every different
set of arguments, which in an n-body simulation, changes dynamically.

cheers,
--renato
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131114/1cbbfbff/attachment.html>

Nadav Rotem

2013-Nov-14 15:03 UTC

head link

[LLVMdev] Vectorization of loops with conditional dereferencing

I think that the best way to move forward with the vectorization of this loop is
to make progress on the vectorization pragmas.  The LoopVectorizer is already
prepared for handling pragmas and we just need to add the clang-side support. Is
anyone planning to work on this ?


On Nov 14, 2013, at 2:18 AM, Renato Golin <renato.golin at linaro.org>
wrote:
> On 1 November 2013 13:40, Hal Finkel <hfinkel at anl.gov> wrote:
> Done = false;
> FirstI = -1, LastI = -1;
> while (!Done) {
>   for (I = FirstI+1; I < N; ++I)
>     if (r[i] > 0) {
>       FirstI = I;
>       break;
>     }
> 
>   for (; I < N && !page_bound(&m[i]) && ...; ++I) {
>     if (r[i] > 0)
>       LastI = I;
>   }
> 
>   Done = I == N;
> 
>   for (I = FirstI; I <= LastI; ...) {
>     // Run the scalar/vector loop sequence as before.
>   }
> }
> 
> Hi Hal,
> 
> Even if you could do something like this, there is no guarantee that (Last
- First) will be multiple of VF, or even bigger than VF, and you'll have to
cope with boundary conditions on every external loop, which, depending on the
distribution of r[] could be as many as half the size of r[] itself.
> 
> I can't see an algorithmically (compile-time or run-time) way to
guarantee that the number of clusters will be small without scanning the whole
array. So, even if you add such a check in run-time and only vectorize if the
number of clusters is small, the worst case scenario will run twice as many
times, (the initial check can be vectorized, as it's a read-only loop), but
the later cannot. in the best case scenario, you'll have only a handful of
loops, which will run in parallel.
> 
> Worst case: n/VF + n
> Best case: n/VF + ammortized n/VF
> 
> For VF == 2,
>  * best case is as fast as scalar code, considering overheads.
>  * worst case is 50% slower
> 
> For VF == 4,
>  * best case is 50% faster than scalar code
>  * worst case is 25% slower
> 
> And all that, depends on each workload, so it'll change for every
different set of arguments, which in an n-body simulation, changes dynamically.
> 
> cheers,
> --renato
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131114/a52a3df7/attachment.html>

Possibly Parallel Threads

Search for more possibly parallel threads

llvm dev - Nov 2013 - [LLVMdev] Vectorization of loops with conditional dereferencing

[LLVMdev] Vectorization of loops with conditional dereferencing

[LLVMdev] Vectorization of loops with conditional dereferencing

[LLVMdev] Vectorization of loops with conditional dereferencing

[LLVMdev] Vectorization of loops with conditional dereferencing

[LLVMdev] Vectorization of loops with conditional dereferencing

[LLVMdev] Vectorization of loops with conditional dereferencing

[LLVMdev] Vectorization of loops with conditional dereferencing

Possibly Parallel Threads