Pekka Jääskeläinen
2013-Jan-28 11:58 UTC
[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer
Hi,

Attached is a patch which uses a simple "parallel_loop" metadata attached to the loop branch instruction in the loop latch for skipping cross-iteration memory dependency checking in the LoopVectorizer. This was briefly discussed in the email thread "LoopVectorizer in OpenCL C work group autovectorization".

It also converts the "min iteration count to vectorize" to a parameter so this can be controlled from the command line.

Comments welcomed.

Thanks in advance,
-- Pekka

Attachment: llvm-3.3-loopvectorizer-parallel_for-metadata-detection.patch (text/x-patch, 2049 bytes)
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130128/f580afe7/attachment.bin>
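As a rough illustration of the mechanism described in the mail above, here is a minimal sketch of how a vectorizer legality check might consult the annotation. It does not use the real LLVM classes: `ToyInstruction`, `ToyLoop`, `ToyVectorizerOptions`, `toyDependencesAllowVectorization`, and `toyCanVectorize` are hypothetical stand-ins, and the default threshold of 16 is an assumed value chosen only for the example. Only the metadata name "parallel_loop" and its placement on the latch terminator come from the mail.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for an IR instruction; not the real LLVM API.
struct ToyInstruction {
  std::set<std::string> metadataKinds;  // names of attached metadata nodes
  bool hasMetadata(const std::string &kind) const {
    return metadataKinds.count(kind) != 0;
  }
};

// Hypothetical stand-in for a loop with a known iteration count.
struct ToyLoop {
  ToyInstruction latchTerminator;  // the branch that closes the loop
  unsigned tripCount;
  ToyLoop() : tripCount(0) {}
};

// Models the "min iteration count to vectorize" parameter.
// The default of 16 is an assumption made for this sketch.
struct ToyVectorizerOptions {
  unsigned minTripCount;
  ToyVectorizerOptions() : minTripCount(16) {}
};

// Placeholder for the real cross-iteration dependence analysis.
// It conservatively reports that independence cannot be proven.
inline bool toyDependencesAllowVectorization(const ToyLoop &) { return false; }

inline bool toyCanVectorize(const ToyLoop &loop,
                            const ToyVectorizerOptions &opts) {
  // Loops with too few iterations are rejected up front.
  if (loop.tripCount < opts.minTripCount)
    return false;
  // Annotated loop: the producer promises independent iterations,
  // so the memory dependence check is skipped entirely.
  if (loop.latchTerminator.hasMetadata("parallel_loop"))
    return true;
  // Otherwise fall back to the usual dependence analysis.
  return toyDependencesAllowVectorization(loop);
}
```

In this model an annotated loop with 64 iterations is accepted without any dependence analysis, while the same loop with 4 iterations is accepted only once `minTripCount` is lowered to 4 or below.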
Tobias Grosser
2013-Jan-28 12:58 UTC
[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer
On 01/28/2013 12:58 PM, Pekka Jääskeläinen wrote:
> Hi,
>
> Attached is a patch which uses a simple "parallel_loop" metadata attached
> to the loop branch instruction in the loop latch for skipping
> cross-iteration memory dependency checking in the LoopVectorizer. This was
> briefly discussed in the email thread "LoopVectorizer in OpenCL C work
> group autovectorization".

Can you provide a test case?

> It also converts the "min iteration count to vectorize" to a parameter so
> this can be controlled from the command line.

This seems to be a separate patch?

Tobi
Renato Golin
2013-Jan-28 13:22 UTC
[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer
On 28 January 2013 11:58, Pekka Jääskeläinen <pekka.jaaskelainen at tut.fi> wrote:
> Attached is a patch which uses a simple "parallel_loop" metadata attached
> to the loop branch instruction in the loop latch for skipping
> cross-iteration memory dependency checking in the LoopVectorizer. This was
> briefly discussed in the email thread "LoopVectorizer in OpenCL C work
> group autovectorization".

I agree this is a good idea, but I am not sure about the implementation.

  + if (latch->getTerminator()->getMetadata("parallel_loop") != NULL) {

This seems an awfully specific check on a generic part of the code... Is this metadata standard in any form? Is it OpenCL specific? Do all OpenCL front-ends generate the same metadata in that way? Etc...

> It also converts the "min iteration count to vectorize" to a parameter so
> this can be controlled from the command line.

Is this really necessary? Do you have use cases where this would make sense?

I think you should send a test case with this patch, not separately.

cheers,
--renato
Pekka Jääskeläinen
2013-Jan-28 13:49 UTC
[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer
Hi Renato,

On 01/28/2013 03:22 PM, Renato Golin wrote:
> This seems an awfully specific check on a generic part of the code...

True. Perhaps the check is better encapsulated, e.g., in the Loop class? Or, if there is such a thing as a loop-carried data dependency analyzer, the correct place could be there, as a trivial "no deps" analysis.

> Is this metadata standard in any form? Is it OpenCL specific?

This metadata is not standard in any form, hence the request for comments. However, its meaning is generic, not OpenCL specific at all. It specifies that the loop iterations can be treated as independent, regardless of the memory operations the body contains. Thus, any cross-iteration memory dependencies that do exist can be considered a programming error.

> Do all OpenCL front-ends generate the same metadata in that way? Etc...

I have no knowledge of OpenCL implementations other than pocl, as I haven't seen their code.

> > It also converts the "min iteration count to vectorize" to a parameter so
> > this can be controlled from the command line.
>
> Is this really necessary? Do you have use cases where this would make sense?

Where could a lower threshold be useful? At least with loops that have long bodies, and with inner loops that an outer loop executes many times. In fact, shouldn't the default minimum be the minimum vector width of the machine? The cost estimation routine should then take care of the actual profitability estimate.

> I think you should send a test case with this patch, not separate.

As soon as there's a consensus on the metadata format and on where the check should reside, I'll prepare a proper patch with a vectorizer test case.

Thanks for the comments so far,
-- Pekka
Tobias Grosser
2013-Jan-29 08:51 UTC
[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer
On 01/28/2013 12:58 PM, Pekka Jääskeläinen wrote:
> Hi,
>
> Attached is a patch which uses a simple "parallel_loop" metadata attached
> to the loop branch instruction in the loop latch for skipping
> cross-iteration memory dependency checking in the LoopVectorizer. This was
> briefly discussed in the email thread "LoopVectorizer in OpenCL C work
> group autovectorization".
>
> It also converts the "min iteration count to vectorize" to a parameter so
> this can be controlled from the command line.

Hi Pekka,

I was a little bit fast in my last email. I would like to discuss the test cases I want to see a little bit more. I am especially interested to see that the metadata is robust under transformations.

Assume we have something like:

  # ignore assumed dependences.
  for (i = 0; i < 4; i++) {
    tmp1 = A[3*i+1];
    tmp2 = A[3*i+2];
    tmp3 = tmp1 + tmp2;
    A[3*i] = tmp3;
  }

Now I apply, for whatever reason, a partial reg2mem transformation:

  float tmp3[1];

  # ignore assumed dependences. // Still valid?
  for (i = 0; i < 4; i++) {
    tmp1 = A[3*i+1];
    tmp2 = A[3*i+2];
    tmp3[0] = tmp1 + tmp2;
    A[3*i] = tmp3[0];
  }

Is the metadata still valid now, or how do we ensure that the invalid metadata is removed?

I have the feeling it may be necessary to link the loop and the accesses for which we can ignore dependences together, and to mark the whole construct as "this set of memory accesses does not have dependences within the mentioned loop". That way, the introduction of additional memory accesses, which may very well introduce dependences that we need to preserve, can be easily detected.

The parallelism check would now be: "In case the llvm.loop.ignore_assumed_deps metadata is found in a loop header _and_ all memory accesses in the loop are referenced by this metadata, the loop is parallel. If there is a memory access that does not reference the ignore_assumed_deps metadata, we cannot assume anything about the loop."

Cheers,
Tobias
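The parallelism check proposed in the mail above can be sketched roughly as follows. This is only a toy model under stated assumptions: `ToyMemAccess`, `ToyAnnotatedLoop`, and `isLoopParallel` are hypothetical names, and encoding the link between an access and its loop as a list of integer loop ids is an assumption of the example, not a proposed metadata format.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical model of a load or store inside a loop body.
struct ToyMemAccess {
  // Ids of the loops for which this access is declared free of
  // assumed dependences (assumed encoding, for illustration only).
  std::vector<int> ignoreDepsLoopIds;
};

// Hypothetical model of a loop carrying the proposed header annotation.
struct ToyAnnotatedLoop {
  int id;
  bool headerHasIgnoreAssumedDeps;
  std::vector<ToyMemAccess> accesses;
  ToyAnnotatedLoop() : id(0), headerHasIgnoreAssumedDeps(false) {}
};

// The loop is treated as parallel only if the header is annotated AND
// every memory access in the body refers back to this loop.
inline bool isLoopParallel(const ToyAnnotatedLoop &loop) {
  if (!loop.headerHasIgnoreAssumedDeps)
    return false;
  for (std::vector<ToyMemAccess>::const_iterator it = loop.accesses.begin();
       it != loop.accesses.end(); ++it) {
    const std::vector<int> &ids = it->ignoreDepsLoopIds;
    if (std::find(ids.begin(), ids.end(), loop.id) == ids.end())
      return false;  // an unannotated access: assume nothing
  }
  return true;
}
```

The design point is that a pass which introduces a new memory access, such as the store to `tmp3[0]` in the reg2mem example, does not need to know about the annotation: the new access simply lacks the reference, so the check fails conservatively.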
Pekka Jääskeläinen
2013-Jan-29 10:03 UTC
[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer
Hi Tobias,

On 01/29/2013 10:51 AM, Tobias Grosser wrote:
> Is the metadata still valid now, or how do we ensure that the invalid
> metadata is removed?

It seems it's not valid anymore. Good catch. I was asking for such transformation cases earlier. There are probably more that have not been thought of yet.

> I have the feeling it may be necessary to link the loop and the accesses
> for which we can ignore dependences together, and to mark the whole
> construct as "this set of memory accesses does not have dependences within
> the mentioned loop". That way, the introduction of additional memory
> accesses, which may very well introduce dependences that we need to
> preserve, can be easily detected. The parallelism check would now be: "In
> case the llvm.loop.ignore_assumed_deps metadata is found in a loop header
> _and_ all memory accesses in the loop are referenced by this metadata, the
> loop is parallel. If there is a memory access that does not reference the
> ignore_assumed_deps metadata, we cannot assume anything about the loop."

Sounds reasonable, as it's too intrusive to start making all the passes parallel-loop-aware.

Also, I was thinking about how to *retain* this data during unrolling. At least in-order targets benefit from having the information in their alias analysis, as it gives the instruction scheduler more freedom when scheduling code from the unrolled parallel iterations. The BBVectorizer should also benefit from this.

Annotating the memory operations in the iteration should cover this use case too, provided the metadata also includes the iteration id. Then, at unroll time, the metadata can be replicated and the id incremented. An alias analyzer (similar to what we have in pocl for unrolled/chained work items) can then check for this metadata and return NO_ALIAS for accesses from different iterations.

Going even further, maybe this same (or a similar) format can be used to retain restricted pointer (noalias) information across inlining. The lack of this has been a bit of a nuisance for us in TCE. In the restricted pointer case one wants to mark pointers with some kind of context marker (e.g. a function name + pointer identification) which communicates that the accesses through that pointer do not alias with other pointers in the same function context.

The memory accesses in a parallel loop iteration could have the metadata "llvm.mem.parallel_loop_iter 0 0" or similar, where the first id is a loop id and the second the iteration id. Restricted pointer memory accesses could use the metadata "llvm.mem.restrict my_function my_pointer".

A parallel loop alias analyzer can use the parallel_loop_iter metadata (if the loop ids are equal but the iteration ids are not, return NO_ALIAS), and a restrict pointer AA can use the latter (if both of the queried pointers have it and it differs in the pointer part only, return NO_ALIAS).

-- Pekka
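The two alias rules and the unrolling step described in the mail above might look roughly like this as a toy model. Everything here is hypothetical: `ToyAccessInfo`, `ToyAliasResult`, `parallelLoopAlias`, `restrictAlias`, and `cloneForUnrolledIteration` are names invented for the sketch, and the flat struct only approximates what per-instruction metadata such as "llvm.mem.parallel_loop_iter" or "llvm.mem.restrict" would carry. It is not an LLVM alias analysis pass.

```cpp
#include <string>

enum ToyAliasResult { ToyNoAlias, ToyMayAlias };

// Hypothetical per-access annotations mirroring the two proposed
// metadata kinds.
struct ToyAccessInfo {
  bool hasParallelIter;   // models "llvm.mem.parallel_loop_iter <loop> <iter>"
  unsigned loopId;
  unsigned iterId;
  bool hasRestrict;       // models "llvm.mem.restrict <function> <pointer>"
  std::string function;
  std::string pointer;
  ToyAccessInfo()
      : hasParallelIter(false), loopId(0), iterId(0), hasRestrict(false) {}
};

// Same parallel loop, different iteration: the accesses cannot alias.
inline ToyAliasResult parallelLoopAlias(const ToyAccessInfo &a,
                                        const ToyAccessInfo &b) {
  if (a.hasParallelIter && b.hasParallelIter && a.loopId == b.loopId &&
      a.iterId != b.iterId)
    return ToyNoAlias;
  return ToyMayAlias;  // no information: stay conservative
}

// Both accesses tagged, same function context, different pointer: no alias.
inline ToyAliasResult restrictAlias(const ToyAccessInfo &a,
                                    const ToyAccessInfo &b) {
  if (a.hasRestrict && b.hasRestrict && a.function == b.function &&
      a.pointer != b.pointer)
    return ToyNoAlias;
  return ToyMayAlias;
}

// Unrolling replicates the access and bumps the iteration id, so copies
// from different original iterations stay distinguishable.
inline ToyAccessInfo cloneForUnrolledIteration(const ToyAccessInfo &orig,
                                               unsigned iterOffset) {
  ToyAccessInfo copy = orig;
  if (copy.hasParallelIter)
    copy.iterId += iterOffset;
  return copy;
}
```

Both queries fall back to "may alias" whenever either access lacks the annotation, which matches the conservative behaviour the thread asks for when a transformation introduces unannotated accesses.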
Nadav Rotem
2013-Jan-29 18:58 UTC
[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer
On Jan 29, 2013, at 12:51 AM, Tobias Grosser <tobias at grosser.es> wrote:
>
> # ignore assumed dependences.
> for (i = 0; i < 4; i++) {
>   tmp1 = A[3*i+1];
>   tmp2 = A[3*i+2];
>   tmp3 = tmp1 + tmp2;
>   A[3*i] = tmp3;
> }
>
> Now I apply, for whatever reason, a partial reg2mem transformation.
>
> float tmp3[1];
>
> # ignore assumed dependences. // Still valid?
> for (i = 0; i < 4; i++) {
>   tmp1 = A[3*i+1];
>   tmp2 = A[3*i+2];
>   tmp3[0] = tmp1 + tmp2;
>   A[3*i] = tmp3[0];
> }

The transformation that you described is illegal because it changes the behavior of the loop. In the first version only A is modified, and in the second version of the loop both A and tmp3 are modified. Can you think of another example that demonstrates why the per-instruction attribute is needed?

I am afraid that many different LLVM transformations would have to be modified to preserve parallelism. This is not something that I want to slip in. If we want to add new parallelism semantics to LLVM, then we need to discuss the bigger picture. We need to plan a mechanism that will allow us to implement support for a number of different models (vectorizers, SPMD languages such as GL and CL, parallel threads such as OpenMP, etc.).