thr3ads.net - llvm dev - [LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer [Jan 2013]

If this information is useful, please help other people find it:
Share via:

Nadav Rotem

2013-Jan-29 18:58 UTC

[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer

On Jan 29, 2013, at 12:51 AM, Tobias Grosser <tobias at grosser.es> wrote:
> 
> # ignore assumed dependences.
> for (i = 0; i < 4; i++) {
>   tmp1 = A[3i+1];
>   tmp2 = A[3i+2];
>   tmp3 = tmp1 + tmp2;
>   A[3i] = tmp3;
> }
> 
> Now I apply for whatever reason a partial reg2mem transformation.
> 
> float tmp3[1];
> 
> # ignore assumed dependences. // Still valid?
> for (i = 0; i < 4; i++) {
>   tmp1 = A[3i+1];
>   tmp2 = A[3i+2];
>   tmp3[0] = tmp1 + tmp2;
>   A[3i] = tmp3[0];
> }

The transformation that you described is illegal because it changes the behavior
of the loop. In the first version only A is modified, and in the second version
of the loop both A and tmp3 are modified. Can you think of another example that
demonstrates why the per-instruction attribute is needed ?

I am afraid that so many different llvm transformations will have to be modified
to preserve parallelism. This is not something that I want to slip in. If we
want to add new parallelism semantics to LLVM them we need to discuss the bigger
picture. We need to plan a mechanism that will allow us to implement support for
a number of different models (Vectorizers, SPMD languages such as GL and CL,
parallel threads such as OpenMP, etc).

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130129/010444a8/attachment.html>

Pekka Jääskeläinen

2013-Jan-29 20:25 UTC

head link

[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer

On 01/29/2013 08:58 PM, Nadav Rotem wrote:>
> On Jan 29, 2013, at 12:51 AM, Tobias Grosser <tobias at grosser.es
> <mailto:tobias at grosser.es>> wrote:
>
>>
>> # ignore assumed dependences.
>> for (i = 0; i < 4; i++) {
>>   tmp1 = A[3i+1];
>>   tmp2 = A[3i+2];
>>   tmp3 = tmp1 + tmp2;
>>   A[3i] = tmp3;
>> }
>>
>> Now I apply for whatever reason a partial reg2mem transformation.
>>
>> float tmp3[1];
>>
>> # ignore assumed dependences. // Still valid?
>> for (i = 0; i < 4; i++) {
>>   tmp1 = A[3i+1];
>>   tmp2 = A[3i+2];
>>   tmp3[0] = tmp1 + tmp2;
>>   A[3i] = tmp3[0];
>> }
>
>
> The transformation that you described is illegal because it changes the
behavior
> of the loop. In the first version only A is modified, and in the second
version
> of the loop both A and tmp3 are modified. Can you think of another example
that
> demonstrates why the per-instruction attribute is needed ?
The problem here is that a "parallel loop" and a traditional C
"sequential loop"
are different beasts semantically and should be treated as such by all
optimizations.

I'm afraid the above is a legal transformation in sequential C code. That
tmp3
is a stack object added by the compiler, thus the programmer's view of the
program behavior doesn't change (similar to reg spilling). Due to the
C's
sequential semantics of loops, the additional loop carried dependency
still retains the original intended behavior (but breaks the parallel
loop semantics).

The correct way would be to treat the loop as a parallel loop, not a
sequential loop (as the programmer has wanted), thus add and *respect* the
parallel loop metadata throughout LLVM. I.e., not allow optimizations like this
because the loop's semantics become different.

I think that's the best way to go but quite intrusive and risky (takes
time to stabilize) therefore the other way around might make more sense
until the cases have been shaken out. Or, we can just "jump straight
into the cold water" and add the parallel loop information to the loop
branch
only and fix illegal optimizations as they appear. I'm OK with this also.
> I am afraid that so many different llvm transformations will have to be
modified
> to preserve parallelism. This is not something that I want to slip in. If
we
> want to add new parallelism semantics to LLVM them we need to discuss the
bigger
> picture. We need to plan a mechanism that will allow us to implement
support for
> a number of different models (Vectorizers, SPMD languages such as GL and
CL,
> parallel threads such as OpenMP, etc).
In my opionion we should start with the lowest hanging fruit (the parallel
loops) and improve on this as we get experience on it.

-- 
--Pekka

Tobias Grosser

2013-Jan-30 23:45 UTC

head link

[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer

On 01/29/2013 07:58 PM, Nadav Rotem wrote:>
> On Jan 29, 2013, at 12:51 AM, Tobias Grosser <tobias at grosser.es
> <mailto:tobias at grosser.es>> wrote:
>
>>
>> # ignore assumed dependences.
>> for (i = 0; i < 4; i++) {
>>   tmp1 = A[3i+1];
>>   tmp2 = A[3i+2];
>>   tmp3 = tmp1 + tmp2;
>>   A[3i] = tmp3;
>> }
>>
>> Now I apply for whatever reason a partial reg2mem transformation.
>>
>> float tmp3[1];
>>
>> # ignore assumed dependences. // Still valid?
>> for (i = 0; i < 4; i++) {
>>   tmp1 = A[3i+1];
>>   tmp2 = A[3i+2];
>>   tmp3[0] = tmp1 + tmp2;
>>   A[3i] = tmp3[0];
>> }
>
>
> The transformation that you described is illegal because it changes the
> behavior of the loop. In the first version only A is modified, and in
> the second version of the loop both A and tmp3 are modified. Can you
> think of another example that demonstrates why the per-instruction
> attribute is needed ?
Hi Nadav,

I can not directly follow why this transformation would be illegal by 
itself. Introducing stack memory and performing calculations there is 
something -reg2mem does and that should be legal in the context of 
sequential LLVM-IR. Did I miss something?

I think the transformation I describe is only 'illegal' in the sense 
that it makes the llvm.loop.parallel metadata incorrect. This is exactly 
what I wanted to point out. Metadata was until now always optional, 
meaning transformations that don't understand a piece of metadata would 
never transform code in a way that the metadata becomes incorrect. 
Instead, transformations either know the metadata and
update it accordingly or the metadata will be automatically removed as 
soon as instructions are touched. My impression here comes e.g. from the 
blog post describing LLVM meta data [1]: "A subtle point that was 
touched on above is that we don't want the optimizers to have to know 
about metadata."

You asked for another example. I had the feeling clang should generate 
this metadata automatically given certain user defined pragmas, right?
Here a simple ".c" code:

void foo(float *A) {
         # pragma vectorize
         for (long i = 0; i < 4; i++) {
                 float tmp3 = A[i];
                 A[i + 4] = tmp3;
         }
}

Do you agree this code would be something we want to execute in 
parallel? Looking at the LLVM-IR 'clang -O0 -S' generates from it, we 
actually get the following:

 > define void @foo(float* %A) nounwind uwtable {
 > entry:
 >   %A.addr = alloca float*, align 8
 >   %i = alloca i64, align 8
 >   %tmp3 = alloca float, align 4
 >   store float* %A, float** %A.addr, align 8
 >   store i64 0, i64* %i, align 8
 >   br label %for.cond
 >
 > for.cond:                                         ; preds = %for.inc, >
 >   %0 = load i64* %i, align 8
 >   %cmp = icmp slt i64 %0, 4
 >   br i1 %cmp, label %for.body, label %for.end
 >
 > for.body:                                         ; preds = %for.cond
 >   %1 = load i64* %i, align 8
 >   %2 = load float** %A.addr, align 8
 >   %arrayidx = getelementptr inbounds float* %2, i64 %1
 >   %3 = load float* %arrayidx, align 4
 >   store float %3, float* %tmp3, align 4

clang produces by default a lot of temporary stack arrays. This loop is 
not vectorizable before -mem2reg is executed. Attaching the 
loop.parallel metadata would either be incorrect or we would need to 
define precisely which memory references need to be moved to registers
before the parallelism that was declared by the metadata is actually there.
> I am afraid that so many different llvm transformations will have to be
> modified to preserve parallelism. This is not something that I want to
> slip in. If we want to add new parallelism semantics to LLVM them we
> need to discuss the bigger picture.
 > We need to plan a mechanism that> will allow us to implement support for a number of different models
> (Vectorizers, SPMD languages such as GL and CL, parallel threads such as
> OpenMP, etc).
I am not proposing to change the types of parallelism the proposed 
meta-data should cover. I just want to make sure the semantics of the 
proposed meta-data are well defined.

Cheers
Tobi

[1] http://blog.llvm.org/2010/04/extensible-metadata-in-llvm-ir.html

Nadav Rotem

2013-Jan-31 00:17 UTC

head link

[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer

Tobi, 

I completely agree with everything you wrote here. I share your concerns and I
would also like to see a clear definition of the 'llvm.parallel' pragma.

Thanks,
Nadav

On Jan 30, 2013, at 3:45 PM, Tobias Grosser <tobias at grosser.es> wrote:
> On 01/29/2013 07:58 PM, Nadav Rotem wrote:
>> 
>> On Jan 29, 2013, at 12:51 AM, Tobias Grosser <tobias at grosser.es
>> <mailto:tobias at grosser.es>> wrote:
>> 
>>> 
>>> # ignore assumed dependences.
>>> for (i = 0; i < 4; i++) {
>>>  tmp1 = A[3i+1];
>>>  tmp2 = A[3i+2];
>>>  tmp3 = tmp1 + tmp2;
>>>  A[3i] = tmp3;
>>> }
>>> 
>>> Now I apply for whatever reason a partial reg2mem transformation.
>>> 
>>> float tmp3[1];
>>> 
>>> # ignore assumed dependences. // Still valid?
>>> for (i = 0; i < 4; i++) {
>>>  tmp1 = A[3i+1];
>>>  tmp2 = A[3i+2];
>>>  tmp3[0] = tmp1 + tmp2;
>>>  A[3i] = tmp3[0];
>>> }
>> 
>> 
>> The transformation that you described is illegal because it changes the
>> behavior of the loop. In the first version only A is modified, and in
>> the second version of the loop both A and tmp3 are modified. Can you
>> think of another example that demonstrates why the per-instruction
>> attribute is needed ?
> 
> Hi Nadav,
> 
> I can not directly follow why this transformation would be illegal by
itself. Introducing stack memory and performing calculations there is something
-reg2mem does and that should be legal in the context of sequential LLVM-IR. Did
I miss something?
> 
> I think the transformation I describe is only 'illegal' in the
sense that it makes the llvm.loop.parallel metadata incorrect. This is exactly
what I wanted to point out. Metadata was until now always optional, meaning
transformations that don't understand a piece of metadata would never
transform code in a way that the metadata becomes incorrect. Instead,
transformations either know the metadata and
> update it accordingly or the metadata will be automatically removed as soon
as instructions are touched. My impression here comes e.g. from the blog post
describing LLVM meta data [1]: "A subtle point that was touched on above is
that we don't want the optimizers to have to know about metadata."
> 
> You asked for another example. I had the feeling clang should generate this
metadata automatically given certain user defined pragmas, right?
> Here a simple ".c" code:
> 
> void foo(float *A) {
>        # pragma vectorize
>        for (long i = 0; i < 4; i++) {
>                float tmp3 = A[i];
>                A[i + 4] = tmp3;
>        }
> }
> 
> Do you agree this code would be something we want to execute in parallel?
Looking at the LLVM-IR 'clang -O0 -S' generates from it, we actually get
the following:
> 
> > define void @foo(float* %A) nounwind uwtable {
> > entry:
> >   %A.addr = alloca float*, align 8
> >   %i = alloca i64, align 8
> >   %tmp3 = alloca float, align 4
> >   store float* %A, float** %A.addr, align 8
> >   store i64 0, i64* %i, align 8
> >   br label %for.cond
> >
> > for.cond:                                         ; preds = %for.inc,
>
> >   %0 = load i64* %i, align 8
> >   %cmp = icmp slt i64 %0, 4
> >   br i1 %cmp, label %for.body, label %for.end
> >
> > for.body:                                         ; preds = %for.cond
> >   %1 = load i64* %i, align 8
> >   %2 = load float** %A.addr, align 8
> >   %arrayidx = getelementptr inbounds float* %2, i64 %1
> >   %3 = load float* %arrayidx, align 4
> >   store float %3, float* %tmp3, align 4
> 
> clang produces by default a lot of temporary stack arrays. This loop is not
vectorizable before -mem2reg is executed. Attaching the loop.parallel metadata
would either be incorrect or we would need to define precisely which memory
references need to be moved to registers
> before the parallelism that was declared by the metadata is actually there.
> 
>> I am afraid that so many different llvm transformations will have to be
>> modified to preserve parallelism. This is not something that I want to
>> slip in. If we want to add new parallelism semantics to LLVM them we
>> need to discuss the bigger picture.
> > We need to plan a mechanism that
>> will allow us to implement support for a number of different models
>> (Vectorizers, SPMD languages such as GL and CL, parallel threads such
as
>> OpenMP, etc).
> 
> I am not proposing to change the types of parallelism the proposed
meta-data should cover. I just want to make sure the semantics of the proposed
meta-data are well defined.
> 
> Cheers
> Tobi
> 
> [1] http://blog.llvm.org/2010/04/extensible-metadata-in-llvm-ir.html

Reasonably Related Threads

Search for more apparently analagous threads

llvm dev - Jan 2013 - [LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer

[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer

[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer

[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer

[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer

Reasonably Related Threads