Hal Finkel
2013-Nov-01  13:40 UTC
[LLVMdev] Vectorization of loops with conditional dereferencing
Nadav, Arnold, et al.,
I have a number of loops that I would like us to be able to autovectorize
(common, for example, in n-body inter-particle force kernels), and the problem
is that they look like this:
for (int i = 0; i < N; ++i) {
  if (r[i] > 0)
    v += m[i]*...;
}
where, as written, m[i] is not accessed unless the condition is true. The
general problem (as is noted by the loop vectorizer), is that we don't know
that dereferencing m[i] is safe for values where the guarding condition is not
true, and so if-converting the block directly is not allowed.
One possibility, if we disallow for m[i] to have been munmaped or page protected
in the middle, is to first collect the first and last index for which the
condition is true, and then run the vector loop only on that portion. For the
loop above, I mean something like:
FirstI = -1, LastI = -1, I = 0;
for (; I < N; ++I)
  if (r[i] > 0) {
    FirstI = I;
    break;
  }
for (; I < N; ++I) {
  if (r[i] > 0)
    LastI = I;
}
[In this case, since the guard dominates all side-effect, we don't need to
do anything outside that range, otherwise, we'd need to run the scalar loop
until FirstI, and then again from LastI until N]
for (I = FirstI; I <= LastI; ...) {
  // Run the scalar/vector loop sequence as before.
}
On the other hand, if we can't make this range assumption generally, then
I'd assume that we could make it per-page (where a page is 4kB or similar).
Then the scheme becomes more complicated, because we need to break the ranges on
page boundaries (maybe something like this):
Done = false;
FirstI = -1, LastI = -1;
while (!Done) {
  for (I = FirstI+1; I < N; ++I)
    if (r[i] > 0) {
      FirstI = I;
      break;
    }
  for (; I < N && !page_bound(&m[i]) && ...; ++I) {
    if (r[i] > 0)
      LastI = I;
  }
  Done = I == N;
  for (I = FirstI; I <= LastI; ...) {
    // Run the scalar/vector loop sequence as before.
  }  
}
I hope that we don't need to do something like this, but I've been
disappointed by C/POSIX semantics before. On the other hand, we might want to
break these into groups anyway, and allocate a local stack buffer to hold the
mask computed from the condition, so that we don't have to reevaluate the
condition in the loop.
Also, I'm certainly open to other ways to do this. Are there any other ways?
Thoughts?
Thanks again,
Hal
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
Nadav Rotem
2013-Nov-01  16:38 UTC
[LLVMdev] Vectorization of loops with conditional dereferencing
Hi Hal,
Yes, I agree that this is a problem that prevents vectorization in many loops. 
Another problem that we have is that sunk loads don’t preserve their control
dependence properties.  For example in the code below, if we sink the load into
the branch then we can't vectorize the loop.
x = A[i]
if (cond) {
	sum += x;	
}
I agree with you that checking the first and last element for each 4k page has
many disadvantages and I would like to explore other options. Yesterday Pekka
proposed marking some operations as 'no-trap'.  Maybe we should consider
marking some memory regions as ‘mapped'.  Every time we malloc memory or
pass a pointer to statically allocated memory we could propagate the information
about the availability and size of the memory.  What do you think ?
Thanks,
Nadav 
On Nov 1, 2013, at 6:40 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> Nadav, Arnold, et al.,
> 
> I have a number of loops that I would like us to be able to autovectorize
(common, for example, in n-body inter-particle force kernels), and the problem
is that they look like this:
> 
> for (int i = 0; i < N; ++i) {
>  if (r[i] > 0)
>    v += m[i]*...;
> }
> 
> where, as written, m[i] is not accessed unless the condition is true. The
general problem (as is noted by the loop vectorizer), is that we don't know
that dereferencing m[i] is safe for values where the guarding condition is not
true, and so if-converting the block directly is not allowed.
> 
> One possibility, if we disallow for m[i] to have been munmaped or page
protected in the middle, is to first collect the first and last index for which
the condition is true, and then run the vector loop only on that portion. For
the loop above, I mean something like:
> 
> FirstI = -1, LastI = -1, I = 0;
> for (; I < N; ++I)
>  if (r[i] > 0) {
>    FirstI = I;
>    break;
>  }
> 
> for (; I < N; ++I) {
>  if (r[i] > 0)
>    LastI = I;
> }
> 
> [In this case, since the guard dominates all side-effect, we don't need
to do anything outside that range, otherwise, we'd need to run the scalar
loop until FirstI, and then again from LastI until N]
> 
> for (I = FirstI; I <= LastI; ...) {
>  // Run the scalar/vector loop sequence as before.
> }
> 
> On the other hand, if we can't make this range assumption generally,
then I'd assume that we could make it per-page (where a page is 4kB or
similar). Then the scheme becomes more complicated, because we need to break the
ranges on page boundaries (maybe something like this):
> 
> Done = false;
> FirstI = -1, LastI = -1;
> while (!Done) {
>  for (I = FirstI+1; I < N; ++I)
>    if (r[i] > 0) {
>      FirstI = I;
>      break;
>    }
> 
>  for (; I < N && !page_bound(&m[i]) && ...; ++I) {
>    if (r[i] > 0)
>      LastI = I;
>  }
> 
>  Done = I == N;
> 
>  for (I = FirstI; I <= LastI; ...) {
>    // Run the scalar/vector loop sequence as before.
>  }  
> }
> 
> I hope that we don't need to do something like this, but I've been
disappointed by C/POSIX semantics before. On the other hand, we might want to
break these into groups anyway, and allocate a local stack buffer to hold the
mask computed from the condition, so that we don't have to reevaluate the
condition in the loop.
> 
> Also, I'm certainly open to other ways to do this. Are there any other
ways? Thoughts?
> 
> Thanks again,
> Hal
> 
> -- 
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
Nadav Rotem
2013-Nov-01  16:47 UTC
[LLVMdev] Vectorization of loops with conditional dereferencing
I forgot to mention that it would be interesting to disable this safety check in the vectorizer and run the LLVM test suite. I wonder how many tests will run faster and how many will crash. On Nov 1, 2013, at 6:40 AM, Hal Finkel <hfinkel at anl.gov> wrote:> Nadav, Arnold, et al., > > I have a number of loops that I would like us to be able to autovectorize (common, for example, in n-body inter-particle force kernels), and the problem is that they look like this: > > for (int i = 0; i < N; ++i) { > if (r[i] > 0) > v += m[i]*...; > } > > where, as written, m[i] is not accessed unless the condition is true. The general problem (as is noted by the loop vectorizer), is that we don't know that dereferencing m[i] is safe for values where the guarding condition is not true, and so if-converting the block directly is not allowed. > > One possibility, if we disallow for m[i] to have been munmaped or page protected in the middle, is to first collect the first and last index for which the condition is true, and then run the vector loop only on that portion. For the loop above, I mean something like: > > FirstI = -1, LastI = -1, I = 0; > for (; I < N; ++I) > if (r[i] > 0) { > FirstI = I; > break; > } > > for (; I < N; ++I) { > if (r[i] > 0) > LastI = I; > } > > [In this case, since the guard dominates all side-effect, we don't need to do anything outside that range, otherwise, we'd need to run the scalar loop until FirstI, and then again from LastI until N] > > for (I = FirstI; I <= LastI; ...) { > // Run the scalar/vector loop sequence as before. > } > > On the other hand, if we can't make this range assumption generally, then I'd assume that we could make it per-page (where a page is 4kB or similar). Then the scheme becomes more complicated, because we need to break the ranges on page boundaries (maybe something like this): > > Done = false; > FirstI = -1, LastI = -1; > while (!Done) { > for (I = FirstI+1; I < N; ++I) > if (r[i] > 0) { > FirstI = I; > break; > } > > for (; I < N && !page_bound(&m[i]) && ...; ++I) { > if (r[i] > 0) > LastI = I; > } > > Done = I == N; > > for (I = FirstI; I <= LastI; ...) { > // Run the scalar/vector loop sequence as before. > } > } > > I hope that we don't need to do something like this, but I've been disappointed by C/POSIX semantics before. On the other hand, we might want to break these into groups anyway, and allocate a local stack buffer to hold the mask computed from the condition, so that we don't have to reevaluate the condition in the loop. > > Also, I'm certainly open to other ways to do this. Are there any other ways? Thoughts? > > Thanks again, > Hal > > -- > Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory
Hal Finkel
2013-Nov-01  19:41 UTC
[LLVMdev] Vectorization of loops with conditional dereferencing
----- Original Message -----> Hi Hal, > > Yes, I agree that this is a problem that prevents vectorization in > many loops. Another problem that we have is that sunk loads don’t > preserve their control dependence properties. For example in the > code below, if we sink the load into the branch then we can't > vectorize the loop. > > x = A[i] > if (cond) { > sum += x; > } > > I agree with you that checking the first and last element for each 4k > page has many disadvantages and I would like to explore other > options.Yes -- So far, I've failed to think of a better way that will work automatically (meaning without adding pragmas, etc.) :(> Yesterday Pekka proposed marking some operations as > 'no-trap'. Maybe we should consider marking some memory regions as > ‘mapped'. Every time we malloc memory or pass a pointer to > statically allocated memory we could propagate the information about > the availability and size of the memory. What do you think ?I think that this is a good idea (and I suspect that code already exists to do some of this; poolalloc perhaps?), although I fear that it generally depends on LTO to work for general codes. In the short term, I think a pragma, and a compiler option that changes the default, may be the best solution. Realistically, I know of 0 real applications that use loop-dependent conditions to prevent dereferencing invalid pointers. As you say, it will be interesting to turn off the safety check, and see what in the test suite is affected. Thanks again, Hal> > Thanks, > Nadav > > > > On Nov 1, 2013, at 6:40 AM, Hal Finkel <hfinkel at anl.gov> wrote: > > > Nadav, Arnold, et al., > > > > I have a number of loops that I would like us to be able to > > autovectorize (common, for example, in n-body inter-particle force > > kernels), and the problem is that they look like this: > > > > for (int i = 0; i < N; ++i) { > > if (r[i] > 0) > > v += m[i]*...; > > } > > > > where, as written, m[i] is not accessed unless the condition is > > true. The general problem (as is noted by the loop vectorizer), is > > that we don't know that dereferencing m[i] is safe for values > > where the guarding condition is not true, and so if-converting the > > block directly is not allowed. > > > > One possibility, if we disallow for m[i] to have been munmaped or > > page protected in the middle, is to first collect the first and > > last index for which the condition is true, and then run the > > vector loop only on that portion. For the loop above, I mean > > something like: > > > > FirstI = -1, LastI = -1, I = 0; > > for (; I < N; ++I) > > if (r[i] > 0) { > > FirstI = I; > > break; > > } > > > > for (; I < N; ++I) { > > if (r[i] > 0) > > LastI = I; > > } > > > > [In this case, since the guard dominates all side-effect, we don't > > need to do anything outside that range, otherwise, we'd need to > > run the scalar loop until FirstI, and then again from LastI until > > N] > > > > for (I = FirstI; I <= LastI; ...) { > > // Run the scalar/vector loop sequence as before. > > } > > > > On the other hand, if we can't make this range assumption > > generally, then I'd assume that we could make it per-page (where a > > page is 4kB or similar). Then the scheme becomes more complicated, > > because we need to break the ranges on page boundaries (maybe > > something like this): > > > > Done = false; > > FirstI = -1, LastI = -1; > > while (!Done) { > > for (I = FirstI+1; I < N; ++I) > > if (r[i] > 0) { > > FirstI = I; > > break; > > } > > > > for (; I < N && !page_bound(&m[i]) && ...; ++I) { > > if (r[i] > 0) > > LastI = I; > > } > > > > Done = I == N; > > > > for (I = FirstI; I <= LastI; ...) { > > // Run the scalar/vector loop sequence as before. > > } > > } > > > > I hope that we don't need to do something like this, but I've been > > disappointed by C/POSIX semantics before. On the other hand, we > > might want to break these into groups anyway, and allocate a local > > stack buffer to hold the mask computed from the condition, so that > > we don't have to reevaluate the condition in the loop. > > > > Also, I'm certainly open to other ways to do this. Are there any > > other ways? Thoughts? > > > > Thanks again, > > Hal > > > > -- > > Hal Finkel > > Assistant Computational Scientist > > Leadership Computing Facility > > Argonne National Laboratory > >-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory
Ralf Karrenberg
2013-Nov-14  08:16 UTC
[LLVMdev] Vectorization of loops with conditional dereferencing
Hi Hal,
I may have missed the point here, but what about the "naive" solution
of
using a cascade of guarded, scalar loads in the vectorized code?
for (int i = 0; i < N; i+=4) {
   v0 = ...
   mask cond = r[i] > 0;
   vector l;
   if (mask[0])
     l[0] = m[0];
   if (mask[1])
     l[0] = m[1];
   ...
   v1 = v0 + l*...;
   blend(cond, v0, v1);
}
(I used array notation for accessing vector elements)
The rest of the code can remain vectorized (e.g. the += and *).
Or is it the case that the load is exactly the instruction you want 
vectorized because it dominates the runtime of the loop?
Oh, and sorry for the late reply, I don't have the time to check LLVMdev 
regularly atm :).
Cheers,
Ralf
On 01/11/13 14:40, Hal Finkel wrote:> Nadav, Arnold, et al.,
>
> I have a number of loops that I would like us to be able to autovectorize
(common, for example, in n-body inter-particle force kernels), and the problem
is that they look like this:
>
> for (int i = 0; i < N; ++i) {
>    if (r[i] > 0)
>      v += m[i]*...;
> }
>
> where, as written, m[i] is not accessed unless the condition is true. The
general problem (as is noted by the loop vectorizer), is that we don't know
that dereferencing m[i] is safe for values where the guarding condition is not
true, and so if-converting the block directly is not allowed.
>
> One possibility, if we disallow for m[i] to have been munmaped or page
protected in the middle, is to first collect the first and last index for which
the condition is true, and then run the vector loop only on that portion. For
the loop above, I mean something like:
>
> FirstI = -1, LastI = -1, I = 0;
> for (; I < N; ++I)
>    if (r[i] > 0) {
>      FirstI = I;
>      break;
>    }
>
> for (; I < N; ++I) {
>    if (r[i] > 0)
>      LastI = I;
> }
>
> [In this case, since the guard dominates all side-effect, we don't need
to do anything outside that range, otherwise, we'd need to run the scalar
loop until FirstI, and then again from LastI until N]
>
> for (I = FirstI; I <= LastI; ...) {
>    // Run the scalar/vector loop sequence as before.
> }
>
> On the other hand, if we can't make this range assumption generally,
then I'd assume that we could make it per-page (where a page is 4kB or
similar). Then the scheme becomes more complicated, because we need to break the
ranges on page boundaries (maybe something like this):
>
> Done = false;
> FirstI = -1, LastI = -1;
> while (!Done) {
>    for (I = FirstI+1; I < N; ++I)
>      if (r[i] > 0) {
>        FirstI = I;
>        break;
>      }
>
>    for (; I < N && !page_bound(&m[i]) && ...; ++I) {
>      if (r[i] > 0)
>        LastI = I;
>    }
>
>    Done = I == N;
>
>    for (I = FirstI; I <= LastI; ...) {
>      // Run the scalar/vector loop sequence as before.
>    }
> }
>
> I hope that we don't need to do something like this, but I've been
disappointed by C/POSIX semantics before. On the other hand, we might want to
break these into groups anyway, and allocate a local stack buffer to hold the
mask computed from the condition, so that we don't have to reevaluate the
condition in the loop.
>
> Also, I'm certainly open to other ways to do this. Are there any other
ways? Thoughts?
>
> Thanks again,
> Hal
>
Renato Golin
2013-Nov-14  10:18 UTC
[LLVMdev] Vectorization of loops with conditional dereferencing
On 1 November 2013 13:40, Hal Finkel <hfinkel at anl.gov> wrote:> Done = false; > FirstI = -1, LastI = -1; > while (!Done) { > for (I = FirstI+1; I < N; ++I) > if (r[i] > 0) { > FirstI = I; > break; > } > > for (; I < N && !page_bound(&m[i]) && ...; ++I) { > if (r[i] > 0) > LastI = I; > } > > Done = I == N; > > for (I = FirstI; I <= LastI; ...) { > // Run the scalar/vector loop sequence as before. > } > } >Hi Hal, Even if you could do something like this, there is no guarantee that (Last - First) will be multiple of VF, or even bigger than VF, and you'll have to cope with boundary conditions on every external loop, which, depending on the distribution of r[] could be as many as half the size of r[] itself. I can't see an algorithmically (compile-time or run-time) way to guarantee that the number of clusters will be small without scanning the whole array. So, even if you add such a check in run-time and only vectorize if the number of clusters is small, the worst case scenario will run twice as many times, (the initial check can be vectorized, as it's a read-only loop), but the later cannot. in the best case scenario, you'll have only a handful of loops, which will run in parallel. Worst case: n/VF + n Best case: n/VF + ammortized n/VF For VF == 2, * best case is as fast as scalar code, considering overheads. * worst case is 50% slower For VF == 4, * best case is 50% faster than scalar code * worst case is 25% slower And all that, depends on each workload, so it'll change for every different set of arguments, which in an n-body simulation, changes dynamically. cheers, --renato -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131114/1cbbfbff/attachment.html>
Nadav Rotem
2013-Nov-14  15:03 UTC
[LLVMdev] Vectorization of loops with conditional dereferencing
I think that the best way to move forward with the vectorization of this loop is to make progress on the vectorization pragmas. The LoopVectorizer is already prepared for handling pragmas and we just need to add the clang-side support. Is anyone planning to work on this ? On Nov 14, 2013, at 2:18 AM, Renato Golin <renato.golin at linaro.org> wrote:> On 1 November 2013 13:40, Hal Finkel <hfinkel at anl.gov> wrote: > Done = false; > FirstI = -1, LastI = -1; > while (!Done) { > for (I = FirstI+1; I < N; ++I) > if (r[i] > 0) { > FirstI = I; > break; > } > > for (; I < N && !page_bound(&m[i]) && ...; ++I) { > if (r[i] > 0) > LastI = I; > } > > Done = I == N; > > for (I = FirstI; I <= LastI; ...) { > // Run the scalar/vector loop sequence as before. > } > } > > Hi Hal, > > Even if you could do something like this, there is no guarantee that (Last - First) will be multiple of VF, or even bigger than VF, and you'll have to cope with boundary conditions on every external loop, which, depending on the distribution of r[] could be as many as half the size of r[] itself. > > I can't see an algorithmically (compile-time or run-time) way to guarantee that the number of clusters will be small without scanning the whole array. So, even if you add such a check in run-time and only vectorize if the number of clusters is small, the worst case scenario will run twice as many times, (the initial check can be vectorized, as it's a read-only loop), but the later cannot. in the best case scenario, you'll have only a handful of loops, which will run in parallel. > > Worst case: n/VF + n > Best case: n/VF + ammortized n/VF > > For VF == 2, > * best case is as fast as scalar code, considering overheads. > * worst case is 50% slower > > For VF == 4, > * best case is 50% faster than scalar code > * worst case is 25% slower > > And all that, depends on each workload, so it'll change for every different set of arguments, which in an n-body simulation, changes dynamically. > > cheers, > --renato-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131114/a52a3df7/attachment.html>
Maybe Matching Threads
- [LLVMdev] Vectorization of loops with conditional dereferencing
- [LLVMdev] Vectorization of loops with conditional dereferencing
- [LLVMdev] Vectorization of loops with conditional dereferencing
- [LLVMdev] Vectorization of loops with conditional dereferencing
- [LLVMdev] Vectorization of loops with conditional dereferencing