Pekka Jääskeläinen
2013-Jan-25 11:35 UTC
[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization
On 01/25/2013 09:56 AM, Nadav Rotem wrote:
> Thanks for checking the Loop Vectorizer, I am interested in hearing your
> feedback. The Loop Vectorizer does not fit here. OpenCL vectorization is
> completely different because the language itself is data-parallel. You
> don't need all of the legality checks that the loop vectorizer has.

I'm aware of this, and it was my point in the original post. However, I do
not see why the loop vectorizer wouldn't fit this use case, given how
pocl's "kernel compiler" is structured.

As I see it, the data-parallel input simply makes the vectorizer's job
easier (some of the legality checks can be skipped) while most of the
implementation is reused (e.g. cost estimation, unrolling decisions, the
vector instruction formation itself, predication/if-conversion,
speculative execution+blend, etc.).

pocl's kernel compiler detects the "parallel regions" (the regions between
work-group barriers) and generates a new function suitable for executing
multiple work items (WI) of the work group. One way to generate such
functions is to produce embarrassingly parallel "for loops" (wiloops) that
realize the multi-WI data-parallel execution. That is, the loop executes
the code in a parallel region once for each work item in the work group.

This step is needed to make the multi-WI kernel executable on
non-SIMD/SIMT platforms (read: CPUs). On "SPMD-tailored" processors (many
GPUs) this step is not always necessary, as they can take the instructions
of the single-WI kernel and do the "spreading" on the fly. We have a
different method to generate the WG functions for such targets.

> Moreover, OpenCL has lots of language specific APIs such as
> "get_global_id" and builtin function calls, and without knowledge of
> these calls it is impossible to vectorize OpenCL.

In pocl the whole kernel is "flattened", that is, the processed kernel
code does not usually have function calls. printf() and some intrinsics
calls might be exceptions; in such cases vectorization can simply be
skipped and the parallelization attempted with some other method (e.g.
pure unrolling), as usual.

get_local_id is converted to regular iteration variables (the local id
space x, y, z) of the wiloop.

Yesterday I played a bit with kludge-hacking the LoopVectorizer code to
skip the canVectorizeMemory() check for these wiloop constructs, and it
managed to vectorize a kernel as expected.

> You need to implement something like Whole Function Vectorization
> (http://dl.acm.org/citation.cfm?id=2190061). The loop vectorizer can't
> help you here. Ralf Karrenberg open sourced his implementation on
> github. You should take a look.

I think the WFV paper has plenty of good ideas that could be applied to
*improve* the vectorizability of data-parallel code/parallel loops (e.g.
the mask generation for diverging branches where traditional
if-conversion won't do, especially for intra-kernel for loops), but the
actual vectorization could be modularized into generic passes to, e.g.,
allow choosing target-specific parallelization methods later on.

--
Pekka
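(A minimal C++ sketch to illustrate the wiloop formation described above,
rather than pocl's actual IR-level output; all function and variable
names here are illustrative.)

    // A barrier-free kernel body with get_local_id(0..2) already
    // replaced by plain iteration variables x/y/z ...
    static void kernel_body(float *a, const float *b,
                            int x, int y, int z, int lsx, int lsy) {
      int idx = (z * lsy + y) * lsx + x;   // flattened local id
      a[idx] = b[idx] * 2.0f;
    }

    // ... and the generated work-group function: three nested
    // embarrassingly parallel "wiloops" over the local id space. Every
    // iteration is independent by OpenCL semantics, which is exactly
    // the property a vectorizer would want to be told about.
    void run_workgroup(float *a, const float *b,
                       int lsx, int lsy, int lsz) {
      for (int z = 0; z < lsz; ++z)
        for (int y = 0; y < lsy; ++y)
          for (int x = 0; x < lsx; ++x)
            kernel_body(a, b, x, y, z, lsx, lsy);
    }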
Hal Finkel
2013-Jan-25 14:00 UTC
[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization
----- Original Message -----
> From: "Pekka Jääskeläinen" <pekka.jaaskelainen at tut.fi>
> To: "Nadav Rotem" <nrotem at apple.com>
> Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Sent: Friday, January 25, 2013 5:35:16 AM
> Subject: Re: [LLVMdev] LoopVectorizer in OpenCL C work group autovectorization
>
> [...]
>
> Yesterday I played a bit with kludge-hacking the LoopVectorizer code to
> skip the canVectorizeMemory() check for these wiloop constructs, and it
> managed to vectorize a kernel as expected.

Based on this experience, can you propose some metadata that would allow
this to happen (so that the LoopVectorizer would be generally useful for
POCL)? I suspect this same metadata might be useful in other contexts
(such as implementing iteration-independence pragmas).

-Hal
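(To make the pragma idea concrete, a minimal C++ sketch of an
iteration-independence pragma in the style of Intel's and Cray's "ivdep";
the spelling is compiler-specific, and at the time of this thread LLVM
had no loop-level assertion of its own.)

    // Sketch: an "ivdep"-style pragma asserting that iterations of the
    // following loop carry no cross-iteration memory dependences, so
    // the compiler may vectorize without dependence checks. The pragma
    // spelling follows the Intel/Cray convention and is illustrative.
    void scale(float *a, const float *b, int n) {
    #pragma ivdep
      for (int i = 0; i < n; ++i)
        a[i] = b[i] * 2.0f;
    }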
Pekka Jääskeläinen
2013-Jan-25 14:14 UTC
[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization
On 01/25/2013 04:00 PM, Hal Finkel wrote:
> Based on this experience, can you propose some metadata that would allow
> this to happen (so that the LoopVectorizer would be generally useful for
> POCL)? I suspect this same metadata might be useful in other contexts
> (such as implementing iteration-independence pragmas).

I cannot yet. In this hack I simply changed the LoopVectorizer to assume
that all loops it sees are parallel (as the kernels I tried didn't have
loops inside) in order to see where the other potential vectorization
obstacles are.

Next I'm planning to try an approach where I add metadata to the loop
header basic block that simply marks the loop as parallel. When the loop
vectorizer sees such metadata on a loop, it can skip the cross-iteration
memory dependency checks.

If you think this is a dead end, please let me know. Otherwise, I'll try
it and see how it works.

BR,
--
Pekka
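(A minimal sketch of what such a marker could look like through the LLVM
C++ API. The metadata kind name "llvm.loop.parallel" is a hypothetical
placeholder; the scheme LLVM eventually adopted, llvm.loop plus
llvm.mem.parallel_loop_access, postdates this thread.)

    #include "llvm/Analysis/LoopInfo.h"
    #include "llvm/IR/BasicBlock.h"
    #include "llvm/IR/Instruction.h"
    #include "llvm/IR/Metadata.h"

    using namespace llvm;

    // Attach a (hypothetical) "this loop is parallel" marker. Metadata
    // hangs off instructions, so a natural anchor is the terminator of
    // the loop header block.
    static void markLoopParallel(Loop *L) {
      Instruction *Term = L->getHeader()->getTerminator();
      LLVMContext &Ctx = Term->getContext();
      Term->setMetadata("llvm.loop.parallel",
                        MDNode::get(Ctx, ArrayRef<Metadata *>()));
    }

    // The vectorizer's legality check could consult the marker and skip
    // the cross-iteration memory dependence analysis when it is present.
    static bool isMarkedParallel(const Loop *L) {
      return L->getHeader()->getTerminator()
                 ->getMetadata("llvm.loop.parallel") != nullptr;
    }

One general caveat: LLVM passes are free to drop metadata they do not
understand, so a marker like this can only be a "may skip checks" hint
whose absence must remain safe.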
Nadav Rotem
2013-Jan-25 17:11 UTC
[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization
Hi Pekka,

> As I see it, the data-parallel input simply makes the vectorizer's job
> easier (some of the legality checks can be skipped) while most of the
> implementation is reused (e.g. cost estimation, unrolling decisions, the
> vector instruction formation itself, predication/if-conversion,
> speculative execution+blend, etc.).

What you need is outer loop vectorization, while the loop vectorizer is
an inner loop vectorizer. If you decide to use the Loop Vectorizer then
you won't be able to vectorize kernels that have inner loops or kernels
that have barriers in them. If you look at the AMD OpenCL SDK you will
see that most of the workloads have barriers and inner loops.

Another problem that you may run into is 'early exits'. In many kernels
you will see something like

    if (get_global_id(0) > N) return;

Not to mention that it will be very important for you to vectorize
function calls. Vectorization needs to happen before inlining because you
don't want to vectorize a cloud of instructions when you can convert a
single function call. Think about image samplers or slightly more complex
builtins that have control flow in them.

> [...]
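(A C++ sketch of the outer/inner distinction; names are illustrative. In
the wiloop formulation the implicit work-item loop is the outermost one,
while an inner-loop vectorizer considers the innermost loop, here the
kernel's own reduction loop.)

    // Work-group function for a kernel that contains its own loop. The
    // wiloop over x is what OpenCL semantics guarantee to be parallel,
    // but an inner-loop vectorizer would instead look at the loop over
    // i, which carries a dependence through 'acc'.
    void workgroup_rowsum(float *out, const float *in,
                          int n, int local_size_x) {
      for (int x = 0; x < local_size_x; ++x) {   // outer wiloop (parallel)
        float acc = 0.0f;
        for (int i = 0; i < n; ++i)              // inner kernel loop (sequential)
          acc += in[x * n + i];
        out[x] = acc;
      }
    }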
Pekka Jääskeläinen
2013-Jan-25 19:18 UTC
[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization
Hi Nadav,

On 01/25/2013 07:11 PM, Nadav Rotem wrote:
> What you need is outer loop vectorization, while the loop vectorizer is
> an inner loop vectorizer. If you decide to use the Loop Vectorizer then
> you won't be able to vectorize kernels that have inner loops or kernels
> that have barriers in them. If you look at the AMD OpenCL SDK you will
> see that most of the workloads have barriers and inner loops.

Barriers are the problem of the "parallel region formation" phase of
pocl. It's a problem distinct from the actual parallelization method,
such as vectorization (or, e.g., unroll + VLIW scheduling).

Kernel loops with non-divergent iteration counts can be executed in lock
step and thus also vectorized. The parallel region/wiloop can be formed
inside the kernel loop, which can then be vectorized:

    kernel_for_loop {
      parallel_wiloop over x {
        ... the original kernel loop body ...
      }
    }

Vectorizing divergent loops needs masking or similar, e.g., as presented
in the WFV paper, but this doesn't need to be an OpenCL-specific
optimization as it helps vectorization in general. It's a case of the
inner-loop iteration count depending on the outer loop:

    parallel_wiloop over x {
      kernel_for_loop i := 0...x {   // or a similar range depending on x
        ...
      }
    }

to

    kernel_for_loop {
      parallel_wiloop {
        // the whole body predicated with the kernel_for_loop condition
        // that includes 'x' somewhere
      }
    }

> Another problem that you may run into is 'early exits'. In many kernels
> you will see something like "if (get_global_id(0) > N) return;"

In pocl this currently ends up as a parallel region similar to:

    parallel_wiloop over x {
      if (x > N) goto ret;
      ... kernel code here ...
    ret:
    }

Not the easiest case to parallelize, but it might be doable because N can
be used to trim the wiloop's iteration range:

    parallel_wiloop x := 0...N-1 {
      ... kernel code here ...
    }

Anyway, one cannot be expected to defeat all bad kernel coding practices.

> Not to mention that it will be very important for you to vectorize
> function calls. Vectorization needs to happen before inlining because
> you don't want to vectorize a cloud of instructions when you can
> convert a single function call. Think about image samplers or slightly
> more complex builtins that have control flow in them.

I think function calls are one thing, builtins/intrinsics another.
Vectorizing builtins is partially OpenCL-specific (if the builtins
themselves are OpenCL-specific), but there should also be benefit in a
generic implementation of that case, i.e., converting builtin/intrinsic
calls to their vector counterparts, if available. Say,

    for_loop {
      call @llvm.sin.f32(a[x]);
      ...
    }

is useful to vectorize if the target ISA can do a SIMD sinf.

In any case, it's clear that some kernels are not (at least beneficially)
vectorizable, e.g. due to non-predicatable (huh!) control flow, but that
does not have much to do with the actual vectorization method or the
input language used.

BR,
--
Pekka
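(A C++ sketch of the generic scalar-to-vector builtin mapping described
above; the float4 type and vsinf4 routine are hypothetical placeholders
for a machine vector type and a target SIMD sine.)

    #include <cmath>

    // Placeholder 4-wide vector type and SIMD sine; on a real target
    // these would be a machine vector type and an ISA or vector-library
    // routine the vectorizer knows about.
    struct float4 { float v[4]; };

    static float4 vsinf4(float4 x) {
      float4 r;
      for (int i = 0; i < 4; ++i) r.v[i] = std::sin(x.v[i]);
      return r;
    }

    // The scalar loop as written in the kernel; the call lowers to the
    // @llvm.sin.f32 intrinsic ...
    void scalar_sin(float *out, const float *in, int n) {
      for (int i = 0; i < n; ++i)
        out[i] = std::sin(in[i]);
    }

    // ... and the shape the vectorizer would produce when the target
    // reports a 4-wide counterpart for the sine builtin (trip count
    // assumed divisible by 4 for brevity).
    void vector_sin(float4 *out, const float4 *in, int n4) {
      for (int i = 0; i < n4; ++i)
        out[i] = vsinf4(in[i]);
    }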