thr3ads.net - llvm dev - [LLVMdev] LoopVectorizer in OpenCL C work group autovectorization [Jan 2013]

Pekka Jääskeläinen

2013-Jan-24 17:47 UTC

[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

Hi,

I started to play with the LoopVectorizer of LLVM trunk
on the work-item loops produced by pocl's OpenCL C
kernel compiler, in hopes of implementing multi-work-item
work group autovectorization in a modular manner.

The vectorizer seems to refuse to vectorize the loop if it sees
multiple writes to the same memory object within the
same iteration. In case of parallel loops such as
the work-item loops, it could just assume vectorization is doable
from the data dependency point of view -- no matter what kind of
memory accesses the single iteration does.
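
To illustrate the kind of loop meant here, a minimal sketch in plain C. The function and parameter names are hypothetical and this is not actual pocl output. Each iteration stands for one work item and writes the same element of `out` twice, which is the pattern described above:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work-item loop: iteration i plays the role of work item i.
 * The body writes out[i] twice within a single iteration. */
void wi_loop_two_writes(float *out, const float *a, const float *b, size_t n)
{
    assert(out != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i];            /* first write to out[i] */
        out[i] = out[i] + b[i];   /* second write to out[i] */
    }
}
```

Since the iterations come from independent work items, they can be executed in any order or in lockstep. A conservative dependence check cannot know that from the loop alone.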

What would be the cleanest way to communicate the parallel loop
information to the vectorizer? There was some discussion of
parallelism information in LLVM some time ago in this list, but
it somehow died. Was adding some parallelism information to
the LLVM IR decided to be a bad idea? Any conclusion in that?

Another thing with OpenCL C autovectorization is that the
language itself has vector datatypes. In order to autovectorize
multi-WI work groups efficiently, it might be beneficial to
break the vectors in the single work item to scalars to get more
efficient vector hardware utilization. Is there an existing pass
that breaks vectors to scalars and that works on the LLVM IR level?
There seems to be such at the code gen level according to
this blog post: http://blog.llvm.org/2011/12/llvm-31-vector-changes.html
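
A rough sketch of what breaking vectors to scalars means at the source level, written in C with the GCC/Clang `vector_size` extension as a stand-in for OpenCL C's `float4` (an assumption made for illustration; the function names are hypothetical). The second function extracts each lane, does scalar arithmetic, and rebuilds the vector, so that a later pass is free to combine the same lane across several work items:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for OpenCL C float4, using a GCC/Clang extension. */
typedef float float4 __attribute__((vector_size(16)));

/* Original form: each work item performs one 4-wide multiply. */
void wi_loop_vector(float4 *out, const float4 *a, const float4 *b, size_t n)
{
    assert(out != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i];
}

/* Scalarized form: extract lanes, compute with scalars, reassemble. */
void wi_loop_scalarized(float4 *out, const float4 *a, const float4 *b, size_t n)
{
    assert(out != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i) {
        float x = a[i][0] * b[i][0];
        float y = a[i][1] * b[i][1];
        float z = a[i][2] * b[i][2];
        float w = a[i][3] * b[i][3];
        float4 r = { x, y, z, w };
        out[i] = r;
    }
}
```

Both functions compute the same values; only the shape of the per-work-item code differs.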

Thanks,
-- 
Pekka

Nadav Rotem

2013-Jan-25 07:56 UTC

[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

Hi Pekka, 
> Hi,
> 
> I started to play with the LoopVectorizer of LLVM trunk
> on the work-item loops produced by pocl's OpenCL C
> kernel compiler, in hopes of implementing multi-work-item
> work group autovectorization in a modular manner.
> 
Thanks for checking the Loop Vectorizer, I am interested in hearing your
feedback. The Loop Vectorizer does not fit here. OpenCL vectorization is
completely different because the language itself is data-parallel. You don't
need all of the legality checks that the loop vectorizer has. Moreover, OpenCL
has lots of language specific APIs such as "get_global_id" and builtin
function calls, and without knowledge of these calls it is impossible to
vectorize OpenCL.
> The vectorizer seems to refuse to vectorize the loop if it sees
> multiple writes to the same memory object within the
> same iteration. In case of parallel loops such as
> the work-item loops, it could just assume vectorization is doable
> from the data dependency point of view -- no matter what kind of
> memory accesses the single iteration does.
> 
Yep. 
> What would be the cleanest way to communicate the parallel loop
> information to the vectorizer? There was some discussion of
> parallelism information in LLVM some time ago in this list, but
> it somehow died. Was adding some parallelism information to
> the LLVM IR decided to be a bad idea? Any conclusion in that?
> 
You need to implement something like Whole Function Vectorization
(http://dl.acm.org/citation.cfm?id=2190061). The loop vectorizer can't help
you here. Ralf Karrenberg open sourced his implementation on github. You should
take a look.

> Another thing with OpenCL C autovectorization is that the
> language itself has vector datatypes.
Unfortunately yes. And OpenCL compilers scalarize these vector operations at
some point in the compilation pipeline.
> In order to autovectorize
> multi-WI work groups efficiently, it might be beneficial to
> break the vectors in the single work item to scalars to get more
> efficient vector hardware utilization. Is there an existing pass
> that breaks vectors to scalars and that works on the LLVM IR level?
No. But this pass needs to be OpenCL specific because you want to scalarize
function calls. OpenCL is "blessed" with lots of function calls, even
for trivial type conversions.
> There seems to be such at the code gen level according to
> this blog post: http://blog.llvm.org/2011/12/llvm-31-vector-changes.html
Yes but you can't use it because you need to do this at IR-level.

- Nadav
> 
> Thanks,
> -- 
> Pekka

Renato Golin

2013-Jan-25 08:29 UTC

[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

On 25 January 2013 07:56, Nadav Rotem <nrotem at apple.com> wrote:
> You need to implement something like Whole Function Vectorization (
> http://dl.acm.org/citation.cfm?id=2190061). The loop vectorizer can't
> help you here. Ralf Karrenberg open sourced his implementation on github.
> You should take a look.
>
It'd be great to have this in LLVM, though some care must be taken to
keep it relevant (unlike the C back-end, for example). There are lots of
secrets around GPUs and concrete OpenCL implementations, which could make
it very hard to predict or model costs for each different GPU.

cheers,
--renato

Pekka Jääskeläinen

2013-Jan-25 11:35 UTC

[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

On 01/25/2013 09:56 AM, Nadav Rotem wrote:
> Thanks for checking the Loop Vectorizer, I am interested in hearing your
> feedback. The Loop Vectorizer does not fit here. OpenCL vectorization is
> completely different because the language itself is data-parallel. You
> don't need all of the legality checks that the loop vectorizer has.
I'm aware of this and it was my point in the original post.
However, I do not see why the loop vectorizer wouldn't fit
this use case given how the pocl's "kernel compiler" is
structured.

The way I see it, the data parallel input simply makes the vectorizer's job
easier (it can skip some of the legality checks) while reusing most of the
implementation (e.g. cost estimation, unrolling decisions, the
vector instruction formation itself, predication/if-conversion,
speculative execution+blend, etc.).

Now pocl's kernel compiler detects the "parallel regions" (the
regions between work group barriers) and generates a new function suitable
for executing multiple work items (WI) in the work group. One method to
generate such functions is to generate embarrassingly parallel "for-loops"
(wiloops) that produce the multi-WI DLP execution. That is, the loop
executes the code in the parallel regions for each work item in the work
group.
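
A minimal sketch of that transformation in plain C, under stated assumptions: the kernel in the comment, the function name `wg_func`, and its parameters are all made up for illustration and do not show pocl's real output. The barrier splits the kernel into two parallel regions, and each region becomes its own loop over the local ids:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-work-item kernel (OpenCL C), shown for reference:
 *
 *   __kernel void k(__global float *buf, __local float *tmp) {
 *       size_t lid = get_local_id(0);
 *       tmp[lid] = buf[lid] * 2.0f;                      // region 1
 *       barrier(CLK_LOCAL_MEM_FENCE);
 *       buf[lid] = tmp[(lid + 1) % get_local_size(0)];   // region 2
 *   }
 */

/* Work-group function: one wiloop per parallel region. */
void wg_func(float *buf, float *tmp, size_t local_size)
{
    assert(buf != NULL && tmp != NULL && local_size > 0);

    /* Parallel region 1: executed for every work item. */
    for (size_t lid = 0; lid < local_size; ++lid)
        tmp[lid] = buf[lid] * 2.0f;

    /* The barrier becomes the boundary between the two loops:
     * all work items finish region 1 before any enters region 2. */

    /* Parallel region 2: executed for every work item. */
    for (size_t lid = 0; lid < local_size; ++lid)
        buf[lid] = tmp[(lid + 1) % local_size];
}
```

Within each loop the iterations are independent by construction, which is the property a vectorizer would want to be told about.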

This step is needed to make the multi-WI kernel executable on
non-SIMD/SIMT platforms (read: CPUs). On the "SPMD-tailored" processors
(many GPUs) this step is not always necessary as they can input the single
kernel instructions and do the "spreading" on the fly. We have a different
method to generate the WG functions for such targets.
> Moreover, OpenCL has lots of language specific APIs such as
> "get_global_id" and builtin function calls, and without knowledge of these
> calls it is impossible to vectorize OpenCL.
In pocl the whole kernel is "flattened", that is, the processed kernel code
does not usually have function calls. Well, printf() and some intrinsic
calls might be exceptions. In such cases the vectorization could
simply be skipped and the parallelization attempted using some other
method (e.g. pure unrolling), as usual.

get_local_id is converted to regular iteration variables (local id space x,
y,z) in the wiloop.
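
As a sketch of what that could look like for a 3-dimensional local id space (plain C; the function name, the parameters, and the value stored are hypothetical and chosen only so the result is easy to check), the calls to get_local_id(0..2) turn into the induction variables of a loop nest:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 3-D wiloop nest: x, y, z replace get_local_id(0), (1), (2).
 * Each work item stores a value derived from its own local id. */
void wg_func_3d(float *out, size_t size_x, size_t size_y, size_t size_z)
{
    assert(out != NULL);
    for (size_t z = 0; z < size_z; ++z) {
        for (size_t y = 0; y < size_y; ++y) {
            for (size_t x = 0; x < size_x; ++x) {
                /* Flattened index of this work item in the work group. */
                size_t flat = (z * size_y + y) * size_x + x;
                out[flat] = (float)(x + 10 * y + 100 * z);
            }
        }
    }
}
```

With x as the innermost loop, consecutive work items along dimension 0 touch consecutive elements, which is the access pattern that is convenient to vectorize.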

I played yesterday a bit by kludge-hacking the LoopVectorizer code to
skip the canVectorizeMemory() check for these wiloop constructs and it
managed to vectorize a kernel as expected.
> You need to implement something like Whole Function Vectorization
> (http://dl.acm.org/citation.cfm?id=2190061). The loop vectorizer can't
> help you here. Ralf Karrenberg open sourced his implementation on github.
> You should take a look.
I think the WFV paper has plenty of good ideas that could be applied to
*improve* the vectorizability of DLP code/parallel loops (e.g. the mask
generation for diverging branches where the traditional if-conversion won't
do, especially intra kernel for-loops), but the actual vectorization
could be modularized to generic passes to, e.g., allow the choice of 
target-specific parallelization methods later on.
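
For reference, a small sketch of the traditional if-conversion mentioned above, in plain C with hypothetical function names. The branch inside the work-item loop is replaced by computing both sides and selecting with a mask, which is the scalar analogue of a vector blend. It does not cover the harder case of diverging loops inside the kernel:

```c
#include <assert.h>
#include <stddef.h>

/* Work-item loop with a data-dependent branch. */
void wi_loop_branchy(float *out, const float *in, size_t n)
{
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (in[i] > 0.0f)
            out[i] = in[i] * 2.0f;
        else
            out[i] = -in[i];
    }
}

/* If-converted form: both sides are computed speculatively and a
 * per-work-item mask picks the result, like a vector blend would. */
void wi_loop_if_converted(float *out, const float *in, size_t n)
{
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < n; ++i) {
        int mask = in[i] > 0.0f;          /* 1 where the "then" side is taken */
        float then_val = in[i] * 2.0f;    /* speculated "then" side */
        float else_val = -in[i];          /* speculated "else" side */
        out[i] = mask ? then_val : else_val;
    }
}
```

Speculating both sides is only safe here because neither side has side effects or can fault.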

-- 
Pekka
