thr3ads.net - llvm dev - [LLVMdev] LoopVectorizer in OpenCL C work group autovectorization [Jan 2013]

Pekka Jääskeläinen

2013-Jan-25 21:54 UTC

[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

> I am in favor of adding metadata to control different aspects of
> vectorization, mainly for supporting user-level pragmas [1] but also for
> DSLs. Before we start adding metadata to the IR we need to define the
> semantics of the tags. "Parallel_for" is too general. We also want to
> control vectorization factor, unroll factor, cost model, etc.

These are used to control *how* the loops are parallelized.
The generic "parallel_for" lets the compiler try to make the actual
parallelization decisions based on the target (aiming for performance
portability). So, both have their uses.

> Doug Gregor suggested to add the metadata to the branch instruction of the
> latch block in the loop.

OK, that should work better. I'll look into it next week.

> My main concern is that your approach for vectorizing OpenCL is wrong.
> OpenCL was designed for SPMD/outer-loop vectorization and any good OpenCL
> vectorizer should be able to vectorize 100% of the workloads. The Loop
> Vectorizer vectorizes innermost loops only. It has a completely different
> cost model and legality checks. You also have no use for reduction
> variables, reverse iterators, etc. If all you are interested in is the
> widening of instructions then you can easily implement it.

Sorry, I still don't see the problem in the "modular" approach vs.
generating vector instructions directly in pocl -- but then again, I'm not
a vectorization expert. All I'm really trying to do is to delegate the
"widening of instructions" and the related tasks to the loop vectorizer.
If it doesn't need all of the vectorizer's features, it should not be a
problem AFAIU. I think it's better for me to just play with it a bit and
experience the possible problems in it.

-- 
--Pekka

Ralf Karrenberg

2013-Jan-31 15:44 UTC

Hi Pekka, hi Nadav,

I didn't find the time to read this thread until now, sorry for that.

I actually think you are both right :).
As for the current status, the loop vectorizer is only able to vectorize 
inner loops and (I think) does not handle function calls and memory 
operations well. This will prevent it from vectorizing a large group of 
OpenCL kernels, and certainly all "interesting", more complex ones.
However, in the long run, I think the only difference between WFV-like 
approaches and classic loop vectorization a la LoopVectorizer in an 
OpenCL context is the following:
WFV assumes that there is at least one outer loop that has increments of
one, runs for a multiple of the SIMD width iterations, and whose
iterations are all independent (barriers can be handled by the OpenCL
driver *after* WFV).
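To make these assumptions concrete, here is a minimal C sketch (not libWFV code). The function names and the width `SIMD_WIDTH = 4` are made up for illustration, and the inner lane loop merely stands in for what would be one vector instruction after widening.

```c
#include <stddef.h>

#define SIMD_WIDTH 4 /* hypothetical SIMD width, chosen for illustration */

/* Scalar form: unit-stride outer loop, every iteration independent. */
void saxpy_scalar(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Widened form: assumes n is a multiple of SIMD_WIDTH, so no scalar
   remainder loop is needed. The lane loop models one vector operation. */
void saxpy_widened(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i += SIMD_WIDTH)
        for (size_t lane = 0; lane < SIMD_WIDTH; ++lane)
            y[i + lane] = a * x[i + lane] + y[i + lane];
}
```

Because each iteration only touches its own element, both forms produce the same result for any `n` that is a multiple of `SIMD_WIDTH`.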

On the other hand, LoopVectorizer may not be aimed at covering all kinds 
of code inside the body and/or instead focus more on things not required 
by WFV, such as handling reductions and other kinds of loop-carried 
dependencies.

In any case, since our own OpenCL driver is more of a proof-of-concept
implementation and not very robust, I'd be willing to try integrating
the current libWFV into pocl. This should boost performance
quite a bit for many kernels without too much effort ;). I just don't
know (yet) where to start - can you give me a hint, Pekka?

Cheers,
Ralf

Pekka Jääskeläinen

2013-Jan-31 17:15 UTC

Hi Ralf,

On 01/31/2013 05:44 PM, Ralf Karrenberg wrote:
> As for the current status, the loop vectorizer is only able to vectorize
> inner loops and (I think) does not handle function calls and memory
> operations well. This will prevent it from vectorizing a large group of
> OpenCL kernels, and certainly all "interesting", more complex ones.

Agreed -- but not being able to handle function calls/intrinsics is
not an OpenCL-specific limitation. Any vectorizable input suffers from
that. Also, an inner loop vectorizer might be able to handle outer loops
e.g. via loop interchange. I'm planning to look into that if time allows.
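As a rough illustration of the interchange idea, here is a C sketch with invented names and an invented data layout (this is not pocl or LLVM code). It assumes the kernel loop has the same trip count for every work item, which is what makes swapping the two loops legal here.

```c
#include <stddef.h>

/* Before: the parallel work-item loop is the outer loop, so an
   innermost-loop vectorizer only sees the sequential kernel loop. */
void before_interchange(size_t wi_count, size_t k_count,
                        const float *in, float *out)
{
    for (size_t wi = 0; wi < wi_count; ++wi)
        for (size_t k = 0; k < k_count; ++k)
            out[wi] += in[k * wi_count + wi];
}

/* After: the work-item loop is innermost, with unit-stride accesses and
   independent iterations, which is the shape an inner-loop vectorizer
   is built to handle. */
void after_interchange(size_t wi_count, size_t k_count,
                       const float *in, float *out)
{
    for (size_t k = 0; k < k_count; ++k)
        for (size_t wi = 0; wi < wi_count; ++wi)
            out[wi] += in[k * wi_count + wi];
}
```

Each `out[wi]` still accumulates its inputs in the same order of `k`, so the two versions compute identical values.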
> However, in the long run, I think the only difference between WFV-like
> approaches and classic loop vectorization a la LoopVectorizer in an
> OpenCL context is the following:
> WFV assumes that there is at least one outer loop that has increments of
> one, runs a multiple of the SIMD width iterations, and that every
> iteration is independent (barriers can be handled by the OpenCL driver
> *after* WFV).

Yes, this is the case with the "wiloops" work group generation method of
pocl. The parallel outer loops are the (at most three) dimensions of the
local space. The actual work-group barrier calls are converted to no-ops
(compiler barriers) for the current targets.
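The shape of such a work group function can be sketched in C roughly as follows. This is only an illustration of the idea, not pocl's generated code: `wi_context`, `kernel_body` and `work_group_function` are hypothetical names, and the kernel body is a trivial stand-in.

```c
#include <stddef.h>

/* Hypothetical work-item context; not pocl's actual data structure. */
typedef struct {
    size_t local_id[3];
    size_t local_size[3];
} wi_context;

/* Stand-in for the body of a kernel, executed once per work item. */
static void kernel_body(const wi_context *ctx, int *buf)
{
    size_t flat = (ctx->local_id[2] * ctx->local_size[1] + ctx->local_id[1])
                  * ctx->local_size[0] + ctx->local_id[0];
    buf[flat] = (int)(2 * flat);
}

/* Work group function: the kernel body wrapped in up to three parallel
   work-item loops, one per dimension of the local space. */
void work_group_function(const size_t local_size[3], int *buf)
{
    wi_context ctx;
    for (int d = 0; d < 3; ++d)
        ctx.local_size[d] = local_size[d];

    for (size_t z = 0; z < local_size[2]; ++z)
        for (size_t y = 0; y < local_size[1]; ++y)
            for (size_t x = 0; x < local_size[0]; ++x) {
                ctx.local_id[0] = x;
                ctx.local_id[1] = y;
                ctx.local_id[2] = z;
                kernel_body(&ctx, buf);
            }
}
```

With no barrier between work items in the body, the iterations of all three loops are independent, so any of them is a candidate for vectorization.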
> On the other hand, LoopVectorizer may not be aimed at covering all kinds
> of code inside the body and/or instead focus more on things not required
> by WFV, such as handling reductions and other kinds of loop-carried
> dependencies.

It is true that the feature set of the LoopVectorizer goes beyond the
"embarrassingly parallel loops" that the implicit WI loops are. However,
I don't see this as a show-stopper for trying to provide a modularized
approach to work group vectorization.

Moreover, parallelization-helping optimizations such as "loop masking" for
the diverging inner loops (kernel loops) are more generally useful and, IMHO,
should eventually be added to LLVM upstream (not only to an OpenCL
implementation) as generic loop vectorization routines.
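For readers unfamiliar with the term, loop masking can be sketched in plain C as below. This is an illustration under assumed names (`LANES`, `sum_scalar`, `sum_masked`), not code from LLVM or any OpenCL implementation. Each lane models one work item whose kernel loop has its own trip count.

```c
#include <stdbool.h>

#define LANES 4 /* hypothetical number of work items run in lock-step */

/* Scalar reference: each lane runs its own, possibly different, loop. */
void sum_scalar(const int trip[LANES], int acc[LANES])
{
    for (int lane = 0; lane < LANES; ++lane) {
        acc[lane] = 0;
        for (int i = 0; i < trip[lane]; ++i)
            acc[lane] += i;
    }
}

/* Masked form: all lanes step through the loop together. A per-lane mask
   disables lanes that have already left the loop, and the loop ends once
   no lane is active. */
void sum_masked(const int trip[LANES], int acc[LANES])
{
    bool mask[LANES];
    for (int lane = 0; lane < LANES; ++lane)
        acc[lane] = 0;

    for (int i = 0; ; ++i) {
        bool any_active = false;
        for (int lane = 0; lane < LANES; ++lane) {
            mask[lane] = i < trip[lane];
            any_active = any_active || mask[lane];
        }
        if (!any_active)
            break;
        for (int lane = 0; lane < LANES; ++lane)
            if (mask[lane])
                acc[lane] += i; /* models a masked vector add */
    }
}
```

The masked loop runs for as many iterations as the longest lane needs, which is the usual cost of handling divergence this way.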
> In any case, since our own OpenCL driver is more of a proof-of-concept
> implementation and not very robust, I'd be willing to give it a try to
> integrate the current libWFV into pocl. This should boost performance
> quite a bit for many kernels without too much effort ;). I just don't
> know (yet) where to start - can you give me a hint, Pekka?

I'm very glad to hear this! Luckily, the pocl code base has been modularized
to allow easily switching the "work group function generation method", which
I think your WFV work actually is.

Perhaps detailed instructions on how to start are off topic here, so
you might want to join the pocl-devel list (and #pocl) where the pocl
developers can give more hints. See http://pocl.sourceforge.net/discussion.html.

BR,
-- 
Pekka
