Nadav Rotem
2013-Jan-29 18:58 UTC
[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer
On Jan 29, 2013, at 12:51 AM, Tobias Grosser <tobias at grosser.es> wrote:> > # ignore assumed dependences. > for (i = 0; i < 4; i++) { > tmp1 = A[3i+1]; > tmp2 = A[3i+2]; > tmp3 = tmp1 + tmp2; > A[3i] = tmp3; > } > > Now I apply for whatever reason a partial reg2mem transformation. > > float tmp3[1]; > > # ignore assumed dependences. // Still valid? > for (i = 0; i < 4; i++) { > tmp1 = A[3i+1]; > tmp2 = A[3i+2]; > tmp3[0] = tmp1 + tmp2; > A[3i] = tmp3[0]; > }The transformation that you described is illegal because it changes the behavior of the loop. In the first version only A is modified, and in the second version of the loop both A and tmp3 are modified. Can you think of another example that demonstrates why the per-instruction attribute is needed ? I am afraid that so many different llvm transformations will have to be modified to preserve parallelism. This is not something that I want to slip in. If we want to add new parallelism semantics to LLVM them we need to discuss the bigger picture. We need to plan a mechanism that will allow us to implement support for a number of different models (Vectorizers, SPMD languages such as GL and CL, parallel threads such as OpenMP, etc). -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130129/010444a8/attachment.html>
Pekka Jääskeläinen
2013-Jan-29 20:25 UTC
[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer
On 01/29/2013 08:58 PM, Nadav Rotem wrote:> > On Jan 29, 2013, at 12:51 AM, Tobias Grosser <tobias at grosser.es > <mailto:tobias at grosser.es>> wrote: > >> >> # ignore assumed dependences. >> for (i = 0; i < 4; i++) { >> tmp1 = A[3i+1]; >> tmp2 = A[3i+2]; >> tmp3 = tmp1 + tmp2; >> A[3i] = tmp3; >> } >> >> Now I apply for whatever reason a partial reg2mem transformation. >> >> float tmp3[1]; >> >> # ignore assumed dependences. // Still valid? >> for (i = 0; i < 4; i++) { >> tmp1 = A[3i+1]; >> tmp2 = A[3i+2]; >> tmp3[0] = tmp1 + tmp2; >> A[3i] = tmp3[0]; >> } > > > The transformation that you described is illegal because it changes the behavior > of the loop. In the first version only A is modified, and in the second version > of the loop both A and tmp3 are modified. Can you think of another example that > demonstrates why the per-instruction attribute is needed ?The problem here is that a "parallel loop" and a traditional C "sequential loop" are different beasts semantically and should be treated as such by all optimizations. I'm afraid the above is a legal transformation in sequential C code. That tmp3 is a stack object added by the compiler, thus the programmer's view of the program behavior doesn't change (similar to reg spilling). Due to the C's sequential semantics of loops, the additional loop carried dependency still retains the original intended behavior (but breaks the parallel loop semantics). The correct way would be to treat the loop as a parallel loop, not a sequential loop (as the programmer has wanted), thus add and *respect* the parallel loop metadata throughout LLVM. I.e., not allow optimizations like this because the loop's semantics become different. I think that's the best way to go but quite intrusive and risky (takes time to stabilize) therefore the other way around might make more sense until the cases have been shaken out. Or, we can just "jump straight into the cold water" and add the parallel loop information to the loop branch only and fix illegal optimizations as they appear. I'm OK with this also.> I am afraid that so many different llvm transformations will have to be modified > to preserve parallelism. This is not something that I want to slip in. If we > want to add new parallelism semantics to LLVM them we need to discuss the bigger > picture. We need to plan a mechanism that will allow us to implement support for > a number of different models (Vectorizers, SPMD languages such as GL and CL, > parallel threads such as OpenMP, etc).In my opionion we should start with the lowest hanging fruit (the parallel loops) and improve on this as we get experience on it. -- --Pekka
Tobias Grosser
2013-Jan-30 23:45 UTC
[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer
On 01/29/2013 07:58 PM, Nadav Rotem wrote:> > On Jan 29, 2013, at 12:51 AM, Tobias Grosser <tobias at grosser.es > <mailto:tobias at grosser.es>> wrote: > >> >> # ignore assumed dependences. >> for (i = 0; i < 4; i++) { >> tmp1 = A[3i+1]; >> tmp2 = A[3i+2]; >> tmp3 = tmp1 + tmp2; >> A[3i] = tmp3; >> } >> >> Now I apply for whatever reason a partial reg2mem transformation. >> >> float tmp3[1]; >> >> # ignore assumed dependences. // Still valid? >> for (i = 0; i < 4; i++) { >> tmp1 = A[3i+1]; >> tmp2 = A[3i+2]; >> tmp3[0] = tmp1 + tmp2; >> A[3i] = tmp3[0]; >> } > > > The transformation that you described is illegal because it changes the > behavior of the loop. In the first version only A is modified, and in > the second version of the loop both A and tmp3 are modified. Can you > think of another example that demonstrates why the per-instruction > attribute is needed ?Hi Nadav, I can not directly follow why this transformation would be illegal by itself. Introducing stack memory and performing calculations there is something -reg2mem does and that should be legal in the context of sequential LLVM-IR. Did I miss something? I think the transformation I describe is only 'illegal' in the sense that it makes the llvm.loop.parallel metadata incorrect. This is exactly what I wanted to point out. Metadata was until now always optional, meaning transformations that don't understand a piece of metadata would never transform code in a way that the metadata becomes incorrect. Instead, transformations either know the metadata and update it accordingly or the metadata will be automatically removed as soon as instructions are touched. My impression here comes e.g. from the blog post describing LLVM meta data [1]: "A subtle point that was touched on above is that we don't want the optimizers to have to know about metadata." You asked for another example. I had the feeling clang should generate this metadata automatically given certain user defined pragmas, right? Here a simple ".c" code: void foo(float *A) { # pragma vectorize for (long i = 0; i < 4; i++) { float tmp3 = A[i]; A[i + 4] = tmp3; } } Do you agree this code would be something we want to execute in parallel? Looking at the LLVM-IR 'clang -O0 -S' generates from it, we actually get the following: > define void @foo(float* %A) nounwind uwtable { > entry: > %A.addr = alloca float*, align 8 > %i = alloca i64, align 8 > %tmp3 = alloca float, align 4 > store float* %A, float** %A.addr, align 8 > store i64 0, i64* %i, align 8 > br label %for.cond > > for.cond: ; preds = %for.inc, > > %0 = load i64* %i, align 8 > %cmp = icmp slt i64 %0, 4 > br i1 %cmp, label %for.body, label %for.end > > for.body: ; preds = %for.cond > %1 = load i64* %i, align 8 > %2 = load float** %A.addr, align 8 > %arrayidx = getelementptr inbounds float* %2, i64 %1 > %3 = load float* %arrayidx, align 4 > store float %3, float* %tmp3, align 4 clang produces by default a lot of temporary stack arrays. This loop is not vectorizable before -mem2reg is executed. Attaching the loop.parallel metadata would either be incorrect or we would need to define precisely which memory references need to be moved to registers before the parallelism that was declared by the metadata is actually there.> I am afraid that so many different llvm transformations will have to be > modified to preserve parallelism. This is not something that I want to > slip in. If we want to add new parallelism semantics to LLVM them we > need to discuss the bigger picture.> We need to plan a mechanism that> will allow us to implement support for a number of different models > (Vectorizers, SPMD languages such as GL and CL, parallel threads such as > OpenMP, etc).I am not proposing to change the types of parallelism the proposed meta-data should cover. I just want to make sure the semantics of the proposed meta-data are well defined. Cheers Tobi [1] http://blog.llvm.org/2010/04/extensible-metadata-in-llvm-ir.html
Nadav Rotem
2013-Jan-31 00:17 UTC
[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer
Tobi, I completely agree with everything you wrote here. I share your concerns and I would also like to see a clear definition of the 'llvm.parallel' pragma. Thanks, Nadav On Jan 30, 2013, at 3:45 PM, Tobias Grosser <tobias at grosser.es> wrote:> On 01/29/2013 07:58 PM, Nadav Rotem wrote: >> >> On Jan 29, 2013, at 12:51 AM, Tobias Grosser <tobias at grosser.es >> <mailto:tobias at grosser.es>> wrote: >> >>> >>> # ignore assumed dependences. >>> for (i = 0; i < 4; i++) { >>> tmp1 = A[3i+1]; >>> tmp2 = A[3i+2]; >>> tmp3 = tmp1 + tmp2; >>> A[3i] = tmp3; >>> } >>> >>> Now I apply for whatever reason a partial reg2mem transformation. >>> >>> float tmp3[1]; >>> >>> # ignore assumed dependences. // Still valid? >>> for (i = 0; i < 4; i++) { >>> tmp1 = A[3i+1]; >>> tmp2 = A[3i+2]; >>> tmp3[0] = tmp1 + tmp2; >>> A[3i] = tmp3[0]; >>> } >> >> >> The transformation that you described is illegal because it changes the >> behavior of the loop. In the first version only A is modified, and in >> the second version of the loop both A and tmp3 are modified. Can you >> think of another example that demonstrates why the per-instruction >> attribute is needed ? > > Hi Nadav, > > I can not directly follow why this transformation would be illegal by itself. Introducing stack memory and performing calculations there is something -reg2mem does and that should be legal in the context of sequential LLVM-IR. Did I miss something? > > I think the transformation I describe is only 'illegal' in the sense that it makes the llvm.loop.parallel metadata incorrect. This is exactly what I wanted to point out. Metadata was until now always optional, meaning transformations that don't understand a piece of metadata would never transform code in a way that the metadata becomes incorrect. Instead, transformations either know the metadata and > update it accordingly or the metadata will be automatically removed as soon as instructions are touched. My impression here comes e.g. from the blog post describing LLVM meta data [1]: "A subtle point that was touched on above is that we don't want the optimizers to have to know about metadata." > > You asked for another example. I had the feeling clang should generate this metadata automatically given certain user defined pragmas, right? > Here a simple ".c" code: > > void foo(float *A) { > # pragma vectorize > for (long i = 0; i < 4; i++) { > float tmp3 = A[i]; > A[i + 4] = tmp3; > } > } > > Do you agree this code would be something we want to execute in parallel? Looking at the LLVM-IR 'clang -O0 -S' generates from it, we actually get the following: > > > define void @foo(float* %A) nounwind uwtable { > > entry: > > %A.addr = alloca float*, align 8 > > %i = alloca i64, align 8 > > %tmp3 = alloca float, align 4 > > store float* %A, float** %A.addr, align 8 > > store i64 0, i64* %i, align 8 > > br label %for.cond > > > > for.cond: ; preds = %for.inc, > > > %0 = load i64* %i, align 8 > > %cmp = icmp slt i64 %0, 4 > > br i1 %cmp, label %for.body, label %for.end > > > > for.body: ; preds = %for.cond > > %1 = load i64* %i, align 8 > > %2 = load float** %A.addr, align 8 > > %arrayidx = getelementptr inbounds float* %2, i64 %1 > > %3 = load float* %arrayidx, align 4 > > store float %3, float* %tmp3, align 4 > > clang produces by default a lot of temporary stack arrays. This loop is not vectorizable before -mem2reg is executed. Attaching the loop.parallel metadata would either be incorrect or we would need to define precisely which memory references need to be moved to registers > before the parallelism that was declared by the metadata is actually there. > >> I am afraid that so many different llvm transformations will have to be >> modified to preserve parallelism. This is not something that I want to >> slip in. If we want to add new parallelism semantics to LLVM them we >> need to discuss the bigger picture. > > We need to plan a mechanism that >> will allow us to implement support for a number of different models >> (Vectorizers, SPMD languages such as GL and CL, parallel threads such as >> OpenMP, etc). > > I am not proposing to change the types of parallelism the proposed meta-data should cover. I just want to make sure the semantics of the proposed meta-data are well defined. > > Cheers > Tobi > > [1] http://blog.llvm.org/2010/04/extensible-metadata-in-llvm-ir.html