Pekka Jääskeläinen
2013-Jan-31 17:15 UTC
[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization
Hi Ralf,

On 01/31/2013 05:44 PM, Ralf Karrenberg wrote:
> As for the current status, the loop vectorizer is only able to vectorize
> inner loops and (I think) does not handle function calls and memory
> operations well. This will prevent it from vectorizing a large group of
> OpenCL kernels, and certainly all "interesting", more complex ones.

Agreed -- but not being able to handle function calls/intrinsics is
not an OpenCL-specific limitation. Any vectorizable input suffers from
that. Also, an inner loop vectorizer might be able to handle outer loops
e.g. via loop interchange. I'm planning to look into that if time allows.

> However, in the long run, I think the only difference between WFV-like
> approaches and classic loop vectorization a la LoopVectorizer in an
> OpenCL context is the following:
> WFV assumes that there is at least one outer loop that has increments of
> one, runs a multiple of the SIMD width iterations, and that every
> iteration is independent (barriers can be handled by the OpenCL driver
> *after* WFV).

Yes, this is the case with the "wiloops" work group generation
method of pocl. The parallel outer loops are the max 3 dimensions of the
local space. The actual wg barrier calls are converted to no-ops (compiler
barriers) for the current targets.

> On the other hand, LoopVectorizer may not be aimed at covering all kinds
> of code inside the body and/or instead focus more on things not required
> by WFV, such as handling reductions and other kinds of loop-carried
> dependencies.

It is true that the feature set of the LoopVectorizer goes beyond the
"embarrassingly parallel loops" that the implicit WI loops are. However,
I don't see this as a show-stopper for trying to provide a modularized
approach to work group vectorization.
Moreover, parallelization-helping optimizations such as "loop masking" for
the diverging inner-loops (kernel loops) are more generally useful, and, IMHO
should be added to LLVM upstream (not to an OpenCL implementation only)
eventually as generic loop vectorization routines.

> In any case, since our own OpenCL driver is more of a proof-of-concept
> implementation and not very robust, I'd be willing to give it a try to
> integrate the current libWFV into pocl. This should boost performance
> quite a bit for many kernels without too much effort ;). I just don't
> know (yet) where to start - can you give me a hint, Pekka?

I'm very glad to hear this! Luckily, the pocl code base has been modularized
to allow easily switching the "work group function generation method" which I
think your WFV work actually is.

Perhaps the detailed instructions on how to start are out of topic here and
you might want to join the pocl-devel list (and #pocl) where the pocl
developers can give more hints. See
http://pocl.sourceforge.net/discussion.html.

BR,
--
Pekka
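For readers unfamiliar with the "wiloops" idea discussed in this message, the following is a rough C-level sketch of what wrapping a kernel body in work-item loops amounts to. It is only an illustration under stated assumptions: pocl performs this transformation on LLVM IR, and the names kernel_body, workgroup_fn and SIMD_WIDTH are hypothetical, not taken from pocl or libWFV. The assert mirrors the WFV assumption quoted above that the parallel loop runs a multiple of the SIMD width; barriers are not modeled.

```c
#include <assert.h>
#include <stddef.h>

#define SIMD_WIDTH 4 /* hypothetical target vector width */

/* Hypothetical kernel body for a single work-item: out[i] = 2 * in[i].
 * (lx, ly, lz) is the local id, (sx, sy) the local sizes in x and y. */
static void kernel_body(const float *in, float *out,
                        size_t lx, size_t ly, size_t lz,
                        size_t sx, size_t sy)
{
    size_t i = (lz * sy + ly) * sx + lx; /* flattened local index */
    out[i] = 2.0f * in[i];
}

/* Hypothetical work-group function: the up to three dimensions of the
 * local space become loops with unit increment and independent
 * iterations, which is the shape a loop vectorizer could work on. */
void workgroup_fn(const float *in, float *out,
                  size_t sx, size_t sy, size_t sz)
{
    /* Assumption from the thread: trip count is a multiple of the SIMD width. */
    assert(sx % SIMD_WIDTH == 0);

    for (size_t lz = 0; lz < sz; ++lz)
        for (size_t ly = 0; ly < sy; ++ly)
            for (size_t lx = 0; lx < sx; ++lx) /* candidate loop to vectorize */
                kernel_body(in, out, lx, ly, lz, sx, sy);
}
```

Calling workgroup_fn(in, out, 4, 2, 1) on 8-element buffers would process one 4x2x1 work group, with the innermost lx loop being the one a vectorizer would most naturally target.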
Hal Finkel
2013-Jan-31 17:47 UTC
[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization
----- Original Message -----
> From: "Pekka Jääskeläinen" <pekka.jaaskelainen at tut.fi>
> To: "Ralf Karrenberg" <Chareos at gmx.de>
> Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Sent: Thursday, January 31, 2013 11:15:43 AM
> Subject: Re: [LLVMdev] LoopVectorizer in OpenCL C work group autovectorization
>
> Hi Ralf,
>
> On 01/31/2013 05:44 PM, Ralf Karrenberg wrote:
> > As for the current status, the loop vectorizer is only able to vectorize
> > inner loops and (I think) does not handle function calls and memory
> > operations well. This will prevent it from vectorizing a large group of
> > OpenCL kernels, and certainly all "interesting", more complex ones.
>
> Agreed -- but not being able to handle function calls/intrinsics is
> not an OpenCL-specific limitation. Any vectorizable input suffers from
> that. Also, an inner loop vectorizer might be able to handle outer loops
> e.g. via loop interchange. I'm planning to look into that if time allows.

This is also on my TODO list. Let's collaborate when you have time.

> > However, in the long run, I think the only difference between WFV-like
> > approaches and classic loop vectorization a la LoopVectorizer in an
> > OpenCL context is the following:
> > WFV assumes that there is at least one outer loop that has increments of
> > one, runs a multiple of the SIMD width iterations, and that every
> > iteration is independent (barriers can be handled by the OpenCL driver
> > *after* WFV).
>
> Yes, this is the case with the "wiloops" work group generation
> method of pocl. The parallel outer loops are the max 3 dimensions of the
> local space. The actual wg barrier calls are converted to no-ops (compiler
> barriers) for the current targets.
> > On the other hand, LoopVectorizer may not be aimed at covering all kinds
> > of code inside the body and/or instead focus more on things not required
> > by WFV, such as handling reductions and other kinds of loop-carried
> > dependencies.
>
> It is true that the feature set of the LoopVectorizer goes beyond the
> "embarrassingly parallel loops" that the implicit WI loops are. However,
> I don't see this as a show-stopper for trying to provide a modularized
> approach to work group vectorization.
>
> Moreover, parallelization-helping optimizations such as "loop masking" for
> the diverging inner-loops (kernel loops) are more generally useful, and, IMHO
> should be added to LLVM upstream (not to an OpenCL implementation only)
> eventually as generic loop vectorization routines.

I completely agree.

> > In any case, since our own OpenCL driver is more of a proof-of-concept
> > implementation and not very robust, I'd be willing to give it a try to
> > integrate the current libWFV into pocl. This should boost performance
> > quite a bit for many kernels without too much effort ;). I just

Ralf, Does this mean that you're close to releasing the new version?

Thanks again,
Hal

> > don't
> > know (yet) where to start - can you give me a hint, Pekka?
>
> I'm very glad to hear this! Luckily, the pocl code base has been modularized
> to allow easily switching the "work group function generation method" which I
> think your WFV work actually is.
>
> Perhaps the detailed instructions on how to start are out of topic here and
> you might want to join the pocl-devel list (and #pocl) where the pocl
> developers can give more hints. See
> http://pocl.sourceforge.net/discussion.html.
>
> BR,
> --
> Pekka
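As an illustration of the loop interchange idea that both Pekka and Hal have on their TODO lists, here is a small C sketch. It is an assumption-laden example, not LoopVectorizer code: the function names and the ROWS/COLS sizes are made up. The rows are independent (the parallel "outer" loop), while each row carries a dependence along j. Interchanging the loops moves the independent loop innermost, where an inner-loop vectorizer could handle it. The interchange is legal here because the dependence distance (0, 1) in (i, j) order becomes (1, 0) in (j, i) order, which is still lexicographically positive.

```c
#include <assert.h>

#define ROWS 4 /* hypothetical sizes for the example */
#define COLS 8

/* Original nest: parallel loop (i) outside, loop-carried dependence
 * along j inside, so the inner loop is not directly vectorizable. */
void prefix_rows(float a[ROWS][COLS])
{
    assert(COLS >= 1);
    for (int i = 0; i < ROWS; ++i)
        for (int j = 1; j < COLS; ++j)
            a[i][j] += a[i][j - 1];
}

/* Interchanged nest: the dependence is now carried by the outer j loop,
 * and the inner i loop has independent iterations. */
void prefix_rows_interchanged(float a[ROWS][COLS])
{
    assert(COLS >= 1);
    for (int j = 1; j < COLS; ++j)
        for (int i = 0; i < ROWS; ++i) /* independent iterations */
            a[i][j] += a[i][j - 1];
}
```

Both functions perform the same additions in the same order within each row, so they produce identical results. Note that the interchanged inner loop walks the array with a stride of COLS elements, so whether this pays off in practice depends on the memory layout.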
Ralf Karrenberg
2013-Feb-01 07:45 UTC
[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization
Hi Pekka,

On 1/31/13 6:15 PM, Pekka Jääskeläinen wrote:
>> On the other hand, LoopVectorizer may not be aimed at covering all kinds
>> of code inside the body and/or instead focus more on things not required
>> by WFV, such as handling reductions and other kinds of loop-carried
>> dependencies.
>
> It is true that the feature set of the LoopVectorizer goes beyond the
> "embarrassingly parallel loops" that the implicit WI loops are. However,
> I don't see this as a show-stopper for trying to provide a modularized
> approach to work group vectorization.
>
> Moreover, parallelization-helping optimizations such as "loop masking" for
> the diverging inner-loops (kernel loops) are more generally useful, and,
> IMHO should be added to LLVM upstream (not to an OpenCL implementation only)
> eventually as generic loop vectorization routines.

Yes, I fully agree. I already told Nadav that he will immediately get
access to my new implementation when he reaches that point to prevent
him from re-implementing everything (or at least to have some code to
refer to ;) ).
The code is not released yet, but it is under LLVM license so there's no
problem with that.

>> In any case, since our own OpenCL driver is more of a proof-of-concept
>> implementation and not very robust, I'd be willing to give it a try to
>> integrate the current libWFV into pocl. This should boost performance
>> quite a bit for many kernels without too much effort ;). I just don't
>> know (yet) where to start - can you give me a hint, Pekka?
>
> I'm very glad to hear this! Luckily, the pocl code base has been modularized
> to allow easily switching the "work group function generation method" which I
> think your WFV work actually is.
>
> Perhaps the detailed instructions on how to start are out of topic here and
> you might want to join the pocl-devel list (and #pocl) where the pocl
> developers can give more hints. See
> http://pocl.sourceforge.net/discussion.html.

I'll do this now :).

Cheers,
Ralf
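To make the "loop masking" of diverging inner loops mentioned in this exchange more concrete, here is a scalar C emulation of the idea. It is a sketch under assumptions, not code from libWFV or LLVM: scalar_item, masked_group and SIMD_WIDTH are hypothetical names, and a real implementation would use vector instructions with a mask or blend operations instead of the per-lane loop. Each lane stands for one work-item whose kernel loop has its own trip count; the group keeps iterating while any lane is active, and inactive lanes are masked out.

```c
#include <assert.h>
#include <stdbool.h>

#define SIMD_WIDTH 4 /* hypothetical number of lanes */

/* Scalar reference: one work-item runs its own number of iterations. */
int scalar_item(int n)
{
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc += i;
    return acc;
}

/* Masked emulation: all lanes share one loop; a per-lane mask records
 * which work-items are still inside their (diverging) kernel loop. */
void masked_group(const int n[SIMD_WIDTH], int acc[SIMD_WIDTH])
{
    bool mask[SIMD_WIDTH];
    bool any_active = false;

    for (int l = 0; l < SIMD_WIDTH; ++l) {
        assert(n[l] >= 0);      /* assumption: non-negative trip counts */
        acc[l] = 0;
        mask[l] = n[l] > 0;     /* lane enters the loop only if n > 0 */
        any_active = any_active || mask[l];
    }

    for (int i = 0; any_active; ++i) {
        any_active = false;
        for (int l = 0; l < SIMD_WIDTH; ++l) {
            if (mask[l])
                acc[l] += i;                     /* masked loop body */
            mask[l] = mask[l] && (i + 1 < n[l]); /* update exit mask */
            any_active = any_active || mask[l];
        }
    }
}
```

For every lane l, masked_group is intended to produce the same value as scalar_item(n[l]); the shared loop simply runs as long as the longest-running lane needs.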
Ralf Karrenberg
2013-Feb-01 07:49 UTC
[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization
Hi Hal,

On 1/31/13 6:47 PM, Hal Finkel wrote:
>>> In any case, since our own OpenCL driver is more of a proof-of-concept
>>> implementation and not very robust, I'd be willing to give it a try to
>>> integrate the current libWFV into pocl. This should boost performance
>>> quite a bit for many kernels without too much effort ;). I just
>
> Ralf, Does this mean that you're close to releasing the new version?

It depends ;).
The new version is already running in our OpenCL driver, which means
that it is more or less at the same level of the old implementation now.
However, the exploitation of the divergence analysis as described in our
CC'12 paper is not fully implemented yet, I can't seem to find the time
for that right now :(.

Anyway, if you guys are interested, I can give you access to the
repository.

Best,
Ralf