thr3ads.net - llvm dev - [LLVMdev] LoopVectorizer in OpenCL C work group autovectorization [Jan 2013]

Pekka Jääskeläinen

2013-Jan-25 17:16 UTC

[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

On 01/25/2013 04:21 PM, Hal Finkel wrote:
> My point is that I specifically think that you should try it. I'm curious
> to see how what you come up with might apply to other use cases as well.

OK, attached is the first quick attempt towards this. I'm not
proposing committing this, but would like to get comments
to possibly move towards something committable.

It simply looks for metadata named 'parallel_for' on any of the
instructions in the loop's header and assumes the loop is parallel
if it is found. This metadata is added by pocl's wiloops
generation routine. The patch passes the pocl test suite when enabled, but
probably cannot vectorize many kernels, at least because of the missing
intra-kernel vector scalarizer.
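
The detection described above can be sketched outside LLVM roughly as follows. This is a minimal stand-in, not the attached patch: `Instr`, `Block`, and `headerMarksParallelLoop` are hypothetical names, and metadata is modelled as a set of strings per instruction rather than through LLVM's real metadata API.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for LLVM's Instruction and BasicBlock.
struct Instr {
  std::string Opcode;
  std::set<std::string> MetadataKinds; // names of the attached metadata
};

struct Block {
  std::vector<Instr> Instrs;
};

// Returns true if any instruction in the loop header carries the
// 'parallel_for' tag. Because every header instruction is tagged, the
// check still succeeds if optimizations drop or replace some of them.
bool headerMarksParallelLoop(const Block &Header) {
  for (auto I = Header.Instrs.begin(), E = Header.Instrs.end(); I != E; ++I)
    if (I->MetadataKinds.count("parallel_for"))
      return true;
  return false;
}
```

The sketch only shows the "any tagged instruction is enough" rule; it says nothing about which passes actually preserve metadata.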

Some known problems that need addressing:

- Metadata can only be attached to Instructions (not Loops or even
BasicBlocks), hence the brute-force approach of marking all
instructions in the header BB in the hope that the optimizers
retain at least one of them. A special intrinsic call, for example,
might be a better solution.

- The loop header can be potentially shared with multilevel loops where the
outer or inner levels might not be parallel. Not a problem in the pocl use
case as the wiloops are fully parallel at all the three levels, but needs
to be sorted out in a general solution.

Perhaps it would be better to attach the metadata to the iteration
count increment/check instruction(s) or similar to better identify the
parallel (for) loop in question.

- Are there optimizations that might push code *illegally* into the parallel
loop from outside it? If there is, e.g., a non-parallel loop inside
a parallel loop, loop-invariant code motion might move code from the
inner loop to the parallel loop's body. To my understanding that should be
a safe optimization (it preserves the ordering semantics), but I wonder if
there are others that might cause breakage.

--
Pekka
-------------- next part --------------
A non-text attachment was scrubbed...
Name: llvm-3.3-loopvectorizer-parallel_for-metadata-detection.patch
Type: text/x-patch
Size: 1761 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130125/e4f8f53b/attachment.bin>

Nadav Rotem

2013-Jan-25 20:38 UTC

[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

Pekka, 

I am in favor of adding metadata to control different aspects of vectorization,
mainly for supporting user-level pragmas [1] but also for DSLs.
Before we start adding metadata to the IR we need to define the semantics of the
tags. "Parallel_for" is too general. We also want to control the
vectorization factor, unroll factor, cost model, etc.

Doug Gregor suggested adding the metadata to the branch instruction of the latch
block in the loop.
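
A rough sketch of what checking the latch branch could look like, using simplified hypothetical types (`Block`, `Loop`, `latchMarksParallelLoop`) rather than LLVM's classes. The tag name 'parallel_for' is carried over from the patch under discussion, and requiring every latch to carry the tag is an assumption made here, not something stated in the suggestion.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified CFG types; not LLVM's classes.
struct Block {
  std::vector<std::size_t> Successors;      // indices of successor blocks
  std::set<std::string> TerminatorMetadata; // tags on the terminating branch
};

struct Loop {
  std::size_t Header;              // index of the header block
  std::vector<std::size_t> Blocks; // indices of all blocks in the loop
};

// A latch is a block inside the loop that branches back to the header.
// The loop is treated as parallel only if it has at least one latch and
// every latch branch carries the tag (an assumption of this sketch).
bool latchMarksParallelLoop(const std::vector<Block> &CFG, const Loop &L) {
  bool FoundLatch = false;
  for (std::size_t Index : L.Blocks) {
    const Block &B = CFG[Index];
    bool IsLatch = false;
    for (std::size_t Succ : B.Successors)
      if (Succ == L.Header)
        IsLatch = true;
    if (!IsLatch)
      continue;
    FoundLatch = true;
    if (!B.TerminatorMetadata.count("parallel_for"))
      return false;
  }
  return FoundLatch;
}
```

Compared with tagging every header instruction, this looks at a single, well-defined instruction per back edge.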

My main concern is that your approach for vectorizing OpenCL is wrong. OpenCL
was designed for SPMD/outer-loop vectorization and any good OpenCL vectorizer
should be able to vectorize 100% of the workloads.  The Loop Vectorizer
vectorizes innermost loops only. It has a completely different cost model and
legality checks. You also have no use for reduction variables, reverse
iterators, etc. If all you are interested in is the widening of instructions
then you can easily implement it.

- Nadav

[1] http://software.intel.com/en-us/articles/vectorization-with-the-intel-compilers-part-i



Pekka Jääskeläinen

2013-Jan-25 21:54 UTC

[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

> I am in favor of adding metadata to control different aspects of
> vectorization, mainly for supporting user-level pragmas [1] but also for
> DSLs. Before we start adding metadata to the IR we need to define the
> semantics of the tags. "Parallel_for" is too general. We also want to
> control vectorization factor, unroll factor, cost model, etc.

These are used to control *how* the loops are parallelized.
The generic "parallel_for" lets the compiler (try to) make the actual
parallelization decisions based on the target (aiming for performance
portability). So, both have their uses.

> Doug Gregor suggested to add the metadata to the branch instruction of the
> latch block in the loop.
OK that should work better. I'll look into it next week.
> My main concern is that your approach for vectorizing OpenCL is wrong. OpenCL
> was designed for SPMD/outer-loop vectorization and any good OpenCL vectorizer
> should be able to vectorize 100% of the workloads. The Loop Vectorizer
> vectorizes innermost loops only. It has a completely different cost model and
> legality checks. You also have no use for reduction variables, reverse
> iterators, etc. If all you are interested in is the widening of instructions
> then you can easily implement it.

Sorry, I still don't see the problem with the "modular" approach vs. generating
vector instructions directly in pocl, but then again, I'm not a vectorization
expert. All I'm really trying to do is to delegate the "widening of
instructions" and the related tasks to the loop vectorizer. If it doesn't
need all of the vectorizer's features, that should not be a problem AFAIU. I think
it's better for me to just play with it a bit and experience the possible problems
with it.

-- 
--Pekka

Nick Lewycky

2013-Jan-28 10:17 UTC

[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

Pekka Jääskeläinen wrote:
> On 01/25/2013 04:21 PM, Hal Finkel wrote:
>> My point is that I specifically think that you should try it. I'm curious
>> to see how what you come up with might apply to other use cases as well.
>
> OK, attached is the first quick attempt towards this. I'm not
> proposing committing this, but would like to get comments
> to possibly move towards something committable.
>
> It simply looks for a metadata named 'parallel_for' in any of the
> instructions in the loop's header and assumes the loop is a parallel
> one if such is found.
Aren't all loops in OpenCL parallel? Or are you planning to inline 
non-OpenCL code into your OpenCL code before running the vectorizer? If 
not, just have the vectorizer run as part of the pipeline you set up 
when producing IR from OpenCL code. That it would miscompile non-OpenCL 
code is irrelevant.


+  for (BasicBlock::iterator ii = header->begin();
+       ii != header->end(); ii++) {

http://llvm.org/docs/CodingStandards.html#don-t-evaluate-end-every-time-through-a-loop
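
For illustration, the loop shape that the linked coding standard asks for evaluates `end()` once and pre-increments the iterator. The sketch below uses `std::list<std::string>` as a stand-in for a basic block's instruction list; `countNamed` is a hypothetical helper, not part of the patch.

```cpp
#include <cassert>
#include <list>
#include <string>

// Counts entries equal to Name. The std::list stands in for the
// instruction list of a basic block.
int countNamed(const std::list<std::string> &Instrs, const std::string &Name) {
  int Count = 0;
  // end() is evaluated once into E, and the iterator is pre-incremented.
  for (std::list<std::string>::const_iterator I = Instrs.begin(),
                                              E = Instrs.end();
       I != E; ++I)
    if (*I == Name)
      ++Count;
  return Count;
}
```

Caching `end()` is only valid when the loop body does not add or remove elements of the container.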

Nick


Pekka Jääskeläinen

2013-Jan-28 11:53 UTC

[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

Hi Nick,

On 01/28/2013 12:17 PM, Nick Lewycky wrote:
> Aren't all loops in OpenCL parallel? Or are you planning to inline

The intra-kernel loops (what the OpenCL C programmer writes) are not by
default parallel. Only the implicit "work group loops" (that iterate
over the work items in the local work space for the regions between
barriers) are.
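
To make the distinction concrete, here is a minimal sketch of an implicit work group loop wrapped around a kernel body. It assumes a one-dimensional work group and a kernel without barriers; `scaleKernelBody` and `runWorkGroup` are hypothetical names, and this is not the code pocl generates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Kernel body for a single work item: Out[id] = In[id] * Scale.
// The local id is passed in explicitly in this sketch.
void scaleKernelBody(std::size_t LocalId, const std::vector<float> &In,
                     std::vector<float> &Out, float Scale) {
  Out[LocalId] = In[LocalId] * Scale;
}

// The implicit work group loop: one iteration per work item. Iterations
// do not depend on each other, so they may run in any order or as lanes
// of a vector.
void runWorkGroup(std::size_t LocalSize, const std::vector<float> &In,
                  std::vector<float> &Out, float Scale) {
  for (std::size_t Id = 0; Id != LocalSize; ++Id)
    scaleKernelBody(Id, In, Out, Scale);
}
```

Any `for` loop written inside the kernel body itself would remain an ordinary sequential loop; only the loop in `runWorkGroup` is known to be parallel.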
> non-OpenCL code into your OpenCL code before running the vectorizer? If
> not, just have the vectorizer run as part of the pipeline you set up
> when producing IR from OpenCL code. That it would miscompile non-OpenCL
> code is irrelevant.
I (still) think a cleaner and a more modularized approach is to simply add
parallel loop-awareness to the regular vectorizer. This should help
other parallel languages with parallel loop constructs, too.

The basic idea is to use a loop interchange-style optimization to convert
the work group function to a generic inner loop vectorization problem.
This effectively does outer-loop vectorization, like Nadav Rotem
suggested. Let's see how it goes.
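
A small sketch of the interchange idea, under the assumptions that the kernel loop has the same trip count for every work item and contains no barriers. Each work item sums one column of a row-major matrix; the function names are hypothetical and the code only illustrates the loop shapes, not the planned implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Work-item loop outside, kernel loop inside: the parallel loop is the
// outer one, which an innermost-loop vectorizer does not look at.
std::vector<float> sumColumnsOuterWorkItems(const std::vector<float> &In,
                                            std::size_t Rows,
                                            std::size_t Width) {
  std::vector<float> Out(Width, 0.0f);
  for (std::size_t W = 0; W != Width; ++W)  // parallel work-item loop
    for (std::size_t R = 0; R != Rows; ++R) // sequential kernel loop
      Out[W] += In[R * Width + W];
  return Out;
}

// After interchange the parallel work-item loop is innermost and walks
// memory with unit stride, so it looks like a regular inner loop.
std::vector<float> sumColumnsInnerWorkItems(const std::vector<float> &In,
                                            std::size_t Rows,
                                            std::size_t Width) {
  std::vector<float> Out(Width, 0.0f);
  for (std::size_t R = 0; R != Rows; ++R)    // sequential kernel loop
    for (std::size_t W = 0; W != Width; ++W) // parallel work-item loop
      Out[W] += In[R * Width + W];
  return Out;
}
```

For each work item the additions still happen in the same row order, so both versions compute the same values.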
> +  for (BasicBlock::iterator ii = header->begin();
> +       ii != header->end(); ii++) {
>
> http://llvm.org/docs/CodingStandards.html#don-t-evaluate-end-every-time-through-a-loop

Thanks. I'll send an updated patch shortly in a separate
email thread.

BR,
-- 
Pekka
