thr3ads.net - llvm dev - [LLVMdev] Vectorization of loops with conditional dereferencing [Nov 2013]

If this information is useful, please help other people find it:
Share via:

Nadav Rotem

2013-Nov-14 15:03 UTC

[LLVMdev] Vectorization of loops with conditional dereferencing

I think that the best way to move forward with the vectorization of this loop is
to make progress on the vectorization pragmas.  The LoopVectorizer is already
prepared for handling pragmas and we just need to add the clang-side support. Is
anyone planning to work on this ?


On Nov 14, 2013, at 2:18 AM, Renato Golin <renato.golin at linaro.org>
wrote:
> On 1 November 2013 13:40, Hal Finkel <hfinkel at anl.gov> wrote:
> Done = false;
> FirstI = -1, LastI = -1;
> while (!Done) {
>   for (I = FirstI+1; I < N; ++I)
>     if (r[i] > 0) {
>       FirstI = I;
>       break;
>     }
> 
>   for (; I < N && !page_bound(&m[i]) && ...; ++I) {
>     if (r[i] > 0)
>       LastI = I;
>   }
> 
>   Done = I == N;
> 
>   for (I = FirstI; I <= LastI; ...) {
>     // Run the scalar/vector loop sequence as before.
>   }
> }
> 
> Hi Hal,
> 
> Even if you could do something like this, there is no guarantee that (Last
- First) will be multiple of VF, or even bigger than VF, and you'll have to
cope with boundary conditions on every external loop, which, depending on the
distribution of r[] could be as many as half the size of r[] itself.
> 
> I can't see an algorithmically (compile-time or run-time) way to
guarantee that the number of clusters will be small without scanning the whole
array. So, even if you add such a check in run-time and only vectorize if the
number of clusters is small, the worst case scenario will run twice as many
times, (the initial check can be vectorized, as it's a read-only loop), but
the later cannot. in the best case scenario, you'll have only a handful of
loops, which will run in parallel.
> 
> Worst case: n/VF + n
> Best case: n/VF + ammortized n/VF
> 
> For VF == 2,
>  * best case is as fast as scalar code, considering overheads.
>  * worst case is 50% slower
> 
> For VF == 4,
>  * best case is 50% faster than scalar code
>  * worst case is 25% slower
> 
> And all that, depends on each workload, so it'll change for every
different set of arguments, which in an n-body simulation, changes dynamically.
> 
> cheers,
> --renato
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131114/a52a3df7/attachment.html>

Renato Golin

2013-Nov-14 15:09 UTC

head link

[LLVMdev] Vectorization of loops with conditional dereferencing

On 14 November 2013 15:03, Nadav Rotem <nrotem at apple.com> wrote:
> I think that the best way to move forward with the vectorization of this
> loop is to make progress on the vectorization pragmas.  The LoopVectorizer
> is already prepared for handling pragmas and we just need to add the
> clang-side support. Is anyone planning to work on this ?
>
I'm not. :(

What kind of pragmas would work for this loop? Something telling that it's
safe to speculatively read from m[] at any position? In this reduction case
it might be enough. But if this would be an induction store like:

for () {
  if (a[i] > 0)
    x[i] = ... + m[i];

then, the store would be a more complicated way to write to memory and
you'd need the read-pragma to not affect such cases.

cheers,
--renato
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131114/93d3c2a5/attachment.html>

Pekka Jääskeläinen

2013-Nov-14 15:45 UTC

head link

[LLVMdev] Vectorization of loops with conditional dereferencing

Hi all,

On 11/14/2013 05:09 PM, Renato Golin wrote:> What kind of pragmas would work for this loop?
Wouldn't the "all-trap-type-covering" 'notrap' attribute
take
care of this one too?

-- 
--Pekka

Nadav Rotem

2013-Nov-14 16:38 UTC

head link

[LLVMdev] Vectorization of loops with conditional dereferencing

> 
> I'm not. :(
> 
I think that this is probably the most important feature for the vectorizer
right now.  Other features require adding complexity to the vectorizer while
this feature is relatively simple.
> What kind of pragmas would work for this loop? Something telling that
it's safe to speculatively read from m[] at any position? In this reduction
case it might be enough. But if this would be an induction store like:
> 
> for () {
>   if (a[i] > 0)
>     x[i] = ... + m[i];
> 
Sure.  Vectorization of stores is done by loading the current value from memory,
blending the new value and saving it back to memory.

> then, the store would be a more complicated way to write to memory and
you'd need the read-pragma to not affect such cases.

There is no need for read pragma or even a special attribute.   The ‘vectorize’
pragma tells the vectorizer that it is safe to access the predicated memory
(read or write).

Apparently Analagous Threads

Search for more possibly parallel threads

llvm dev - Nov 2013 - [LLVMdev] Vectorization of loops with conditional dereferencing

[LLVMdev] Vectorization of loops with conditional dereferencing

[LLVMdev] Vectorization of loops with conditional dereferencing

[LLVMdev] Vectorization of loops with conditional dereferencing

[LLVMdev] Vectorization of loops with conditional dereferencing

Apparently Analagous Threads