Pekka Jääskeläinen
2013-Jan-25 21:54 UTC
[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization
> I am in favor of adding metadata to control different aspects of
> vectorization, mainly for supporting user-level pragmas [1] but also for
> DSLs. Before we start adding metadata to the IR we need to define the
> semantics of the tags. "Parallel_for" is too general. We also want to control
> vectorization factor, unroll factor, cost model, etc.

These are used to control *how* the loops are parallelized. The generic "parallel_for" lets the compiler try to make the actual parallelization decisions based on the target (aiming for performance portability). So, both have their uses.

> Doug Gregor suggested to add the metadata to the branch instruction of the
> latch block in the loop.

OK, that should work better. I'll look into it next week.

> My main concern is that your approach for vectorizing OpenCL is wrong. OpenCL
> was designed for SPMD/outer-loop vectorization and any good OpenCL vectorizer
> should be able to vectorize 100% of the workloads. The Loop Vectorizer
> vectorizes innermost loops only. It has a completely different cost model and
> legality checks. You also have no use for reduction variables, reverse
> iterators, etc. If all you are interested in is the widening of instructions
> then you can easily implement it.

Sorry, I still don't see the problem in the "modular" approach vs. generating vector instructions directly in pocl -- but then again, I'm not a vectorization expert. All I'm really trying to do is to delegate the "widening of instructions" and the related tasks to the loop vectorizer. If it doesn't need all of the vectorizer's features, that should not be a problem AFAIU. I think it's better for me to just play a bit with it and experience the possible problems first-hand.

--
--Pekka
Ralf Karrenberg
2013-Jan-31 15:44 UTC
[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization
Hi Pekka, hi Nadav,

I didn't find the time to read this thread until now, sorry for that. I actually think you are both right :).

As for the current status, the loop vectorizer is only able to vectorize inner loops and (I think) does not handle function calls and memory operations well. This will prevent it from vectorizing a large group of OpenCL kernels, and certainly all the "interesting", more complex ones.

However, in the long run, I think the only difference between WFV-like approaches and classic loop vectorization a la LoopVectorizer in an OpenCL context is the following: WFV assumes that there is at least one outer loop that has increments of one, runs for a multiple of the SIMD width iterations, and has fully independent iterations (barriers can be handled by the OpenCL driver *after* WFV). LoopVectorizer, on the other hand, may not aim to cover all kinds of code inside the loop body, and may instead focus more on things WFV does not require, such as handling reductions and other kinds of loop-carried dependencies.

In any case, since our own OpenCL driver is more of a proof-of-concept implementation and not very robust, I'd be willing to try integrating the current libWFV into pocl. This should boost performance quite a bit for many kernels without too much effort ;). I just don't know (yet) where to start -- can you give me a hint, Pekka?

Cheers,
Ralf
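[Editor's sketch] The three WFV assumptions listed in the message above (unit-stride outer loop, trip count that is a multiple of the SIMD width, independent iterations) can be illustrated with a small C example. This is only a sketch under stated assumptions, not libWFV or pocl code: `SIMD_WIDTH`, `kernel_body`, `run_scalar` and `run_widened` are hypothetical names, and the inner lane loop merely stands in for what a vectorizer would emit as a single vector instruction.

```c
#include <assert.h>
#include <stddef.h>

#define SIMD_WIDTH 4 /* hypothetical vector width, chosen for illustration */

/* Hypothetical per-work-item kernel body with no loop-carried state. */
static float kernel_body(float a, float b) { return a * b + 1.0f; }

/* Reference form: one outer-loop iteration per work-item, increment of one. */
void run_scalar(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = kernel_body(a[i], b[i]);
}

/* Widened form: each outer iteration covers SIMD_WIDTH consecutive
 * work-items. This is only valid because the iterations are independent
 * and n is a multiple of SIMD_WIDTH, so no remainder loop is needed. */
void run_widened(const float *a, const float *b, float *out, size_t n) {
    assert(n % SIMD_WIDTH == 0);
    for (size_t i = 0; i < n; i += SIMD_WIDTH) {
        /* Stand-in for one vector operation over SIMD_WIDTH lanes. */
        for (size_t lane = 0; lane < SIMD_WIDTH; ++lane)
            out[i + lane] = kernel_body(a[i + lane], b[i + lane]);
    }
}
```

Because no iteration reads a value written by another, both functions produce the same output for any input whose length is a multiple of `SIMD_WIDTH`. Reductions and other loop-carried dependencies would break this equivalence, which is the part of the problem the message attributes to LoopVectorizer.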
Pekka Jääskeläinen
2013-Jan-31 17:15 UTC
[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization
Hi Ralf,

On 01/31/2013 05:44 PM, Ralf Karrenberg wrote:
> As for the current status, the loop vectorizer is only able to vectorize
> inner loops and (I think) does not handle function calls and memory
> operations well. This will prevent it from vectorizing a large group of
> OpenCL kernels, and certainly all "interesting", more complex ones.

Agreed -- but not being able to handle function calls/intrinsics is not an OpenCL-specific limitation. Any vectorizable input suffers from that. Also, an inner loop vectorizer might be able to handle outer loops e.g. via loop interchange. I'm planning to look into that if time allows.

> However, in the long run, I think the only difference between WFV-like
> approaches and classic loop vectorization a la LoopVectorizer in an
> OpenCL context is the following:
> WFV assumes that there is at least one outer loop that has increments of
> one, runs a multiple of the SIMD width iterations, and that every
> iteration is independent (barriers can be handled by the OpenCL driver
> *after* WFV).

Yes, this is the case with the "wiloops" work group generation method of pocl. The parallel outer loops are the max 3 dimensions of the local space. The actual wg barrier calls are converted to no-ops (compiler barriers) for the current targets.

> On the other hand, LoopVectorizer may not be aimed at covering all kinds
> of code inside the body and/or instead focus more on things not required
> by WFV, such as handling reductions and other kinds of loop-carried
> dependencies.

It is true that the feature set of the LoopVectorizer goes beyond the "embarrassingly parallel loops" that the implicit WI loops are. However, I don't see this as a show-stopper for trying to provide a modularized approach to work group vectorization.
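[Editor's sketch] The "wiloops" idea described above, where the work-item body is wrapped in up to three parallel loops over the local dimensions, might look roughly like the following in plain C. This is a minimal illustration, not pocl's generated code: `wi_body_fn`, `run_work_group`, `fill_args` and `fill_linear_id` are hypothetical names, and barriers are not modeled at all.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work-item body: receives its local id in each dimension. */
typedef void (*wi_body_fn)(size_t lx, size_t ly, size_t lz, void *args);

/* Execute one work-group by looping over the (at most 3) local dimensions.
 * Every iteration is independent, so a loop vectorizer is free to treat
 * these loops as parallel. */
void run_work_group(wi_body_fn body, const size_t local_size[3], void *args) {
    assert(body != NULL);
    for (size_t z = 0; z < local_size[2]; ++z)
        for (size_t y = 0; y < local_size[1]; ++y)
            for (size_t x = 0; x < local_size[0]; ++x)
                body(x, y, z, args);
}

/* Example body: each work-item stores its linearized local id. */
struct fill_args {
    int *out;
    size_t size_x;
    size_t size_y;
};

void fill_linear_id(size_t lx, size_t ly, size_t lz, void *p) {
    struct fill_args *a = p;
    size_t id = (lz * a->size_y + ly) * a->size_x + lx;
    a->out[id] = (int)id;
}
```

In this form the innermost `x` loop is an ordinary counted loop with unit increments, which is the shape an inner-loop vectorizer expects to see.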
Moreover, parallelization-helping optimizations such as "loop masking" for the diverging inner loops (kernel loops) are more generally useful and, IMHO, should eventually be added to LLVM upstream (not only to an OpenCL implementation) as generic loop vectorization routines.

> In any case, since our own OpenCL driver is more of a proof-of-concept
> implementation and not very robust, I'd be willing to give it a try to
> integrate the current libWFV into pocl. This should boost performance
> quite a bit for many kernels without too much effort ;). I just don't
> know (yet) where to start - can you give me a hint, Pekka?

I'm very glad to hear this! Luckily, the pocl code base has been modularized to allow easy switching of the "work group function generation method", which I think your WFV work actually is.

Detailed instructions on how to start are probably off topic here, so you might want to join the pocl-devel list (and #pocl), where the pocl developers can give more hints. See http://pocl.sourceforge.net/discussion.html.

BR,
--
Pekka
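[Editor's sketch] The "loop masking" mentioned above for diverging inner loops can be illustrated in plain C. When several work-items are executed in lockstep and each has a different inner-loop trip count, the combined loop runs until no lane is active and a per-lane mask suppresses updates for lanes that have already finished. This is an assumption-laden illustration, not code from pocl, libWFV or LLVM: `LANES`, `sum_scalar` and `sum_masked` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LANES 4 /* hypothetical number of work-items executed in lockstep */

/* Scalar reference: work-item i sums 0 .. trip[i]-1 in its own inner loop. */
void sum_scalar(const int *trip, int *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        int acc = 0;
        for (int k = 0; k < trip[i]; ++k)
            acc += k;
        out[i] = acc;
    }
}

/* Masked form: LANES work-items share one inner loop that keeps running
 * while at least one lane is still active. */
void sum_masked(const int *trip, int *out, size_t n) {
    assert(n % LANES == 0);
    for (size_t base = 0; base < n; base += LANES) {
        int acc[LANES] = {0};
        bool any_active = true;
        for (int k = 0; any_active; ++k) {
            any_active = false;
            for (size_t lane = 0; lane < LANES; ++lane) {
                bool active = k < trip[base + lane]; /* per-lane mask */
                if (active)
                    acc[lane] += k; /* update only the active lanes */
                any_active |= active;
            }
        }
        for (size_t lane = 0; lane < LANES; ++lane)
            out[base + lane] = acc[lane];
    }
}
```

The masked version does redundant work for lanes with short trip counts, which is the usual trade-off of this technique: the control flow becomes uniform across lanes at the cost of some wasted iterations.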