thr3ads.net - llvm dev - [LLVMdev] parallel loop metadata simplification [Mar 2013]

If this information is useful, please help other people find it:
Share via:

Tobias Grosser

2013-Mar-02 18:44 UTC

[LLVMdev] parallel loop metadata simplification

On 03/01/2013 10:05 PM, Redmond, Paul wrote:
[...]> I have discovered that you can provide a custom inserter to IRBuilder (who
knew!). This has basically solved all my problems and allowed me to generate the
proper metadata with minimal changes to clang codegen. Currently it adds the
metadata to all loads and stores but I don't think this is a problem and can
be refined later if necessary.
Great that you found a good solution. I have one example, which we may 
want to have a look into:

Given the following input (test.c):

         #pragma ivdep
	for (long i = 0; i < 100; i++) {
		long t = i + 2;
		A[i] = t;
	}

clang creates the following IR for the loop body:

   %1 = load i64* %i, align 8
   %add = add nsw i64 %1, 2
   store i64 %add, i64* %t, align 8
   %2 = load i64* %t, align 8
   %3 = load i64* %i, align 8
   %4 = load i64** %A.addr, align 8
   %arrayidx = getelementptr inbounds i64* %4, i64 %3
   store i64 %2, i64* %arrayidx, align 8
   br label %for.inc

What are the implications of pragma ivdep on this code? My intuition 
would be that 't' is iteration private and that using pragma ivdep for
this loop is correct. If the use of ivdep is correct, it seems necessary 
to _not_ annotate the loads and stores from and to 't'. Only after
't'
is moved into a register, the loop is actually parallel on the IR level.

Cheers,
Tobi
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.c
Type: text/x-csrc
Size: 89 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130302/adb705ea/attachment.c>

Pekka Jääskeläinen

2013-Mar-03 10:29 UTC

head link

[LLVMdev] parallel loop metadata simplification

On 03/02/2013 08:44 PM, Tobias Grosser wrote:> If the use of ivdep is correct, it seems necessary to _not_ annotate the
loads
> and stores from and to 't'. Only after 't' is moved into a
register, the loop is
> actually parallel on the IR level.
I didn't realize this is a problem in general because in pocl we
explicitly "privatize" the OpenCL C kernel private variables in the
generated work-item loops. In this case, the 't' would not be a scalar,
but a per-iteration variable in an array with an element for each iteration or
a private vreg.

We have two cases for the kernel private variables in OpenCL C:

1) Variables accessed outside one loop (private variables that
span multiple parallel regions). These need to be allocated to function
scope per-iteration (work-item) arrays.
2) Temporary variables accessed only inside one loop. These can stay as loop
private variables. No need to allocate stack space for all the iterations at
once, but for only those executed in parallel.

Seems Hal was right that allocas need to be treated specially. But
how to deal with this type of cases without excluding the parallel-safe alloca
cases? E.g. a programmer-written array (in stack) that is accessed safely from
the parallel iterations or scalars that are only read inside the loop?

The general problem is that parallel loops, due to their differing
semantics, should be treated differently from standard serial C loops during
the Clang codegen to avoid cases like this properly. E.g., the alloca in this
case should not be simply pushed outside the loop because it converts the
parallel loop to a serial one. Instead, each iteration should have their own
private instance of the loop body scope temporary variables.

As a conclusion, at this state, it is not safe to just blindly annotate all
memory accesses with the llvm.mem.parallel_access. It seems quite easy to
produce broken code that way. The easy way forward is to skip marking allocas
altogether and hope mem2reg/SROA makes the loop parallel, but unfortunately it
serializes some of the valid parallel loop cases too. Improved version would
generate loop-scope (temporary) variables in a parallel-loop aware way.

BTW I noted Clik Plus has actually two different parallel loop constructs.
Have you, Cilk Plus developers, thought about the parallel loop code
generation yet?

BR,
-- 
--Pekka

Tobias Grosser

2013-Mar-03 12:34 UTC

head link

[LLVMdev] parallel loop metadata simplification

On 03/02/2013 07:44 PM, Tobias Grosser wrote:> On 03/01/2013 10:05 PM, Redmond, Paul wrote:
> [...]
>> I have discovered that you can provide a custom inserter to IRBuilder
>> (who knew!). This has basically solved all my problems and allowed me
>> to generate the proper metadata with minimal changes to clang codegen.
>> Currently it adds the metadata to all loads and stores but I don't
>> think this is a problem and can be refined later if necessary.
>
> Great that you found a good solution. I have one example, which we may
> want to have a look into:
>
> Given the following input (test.c):
>
>          #pragma ivdep
>      for (long i = 0; i < 100; i++) {
>          long t = i + 2;
>          A[i] = t;
>      }
I attached three more examples. All are vectorized by icc without 
multi-versioning if pragma ivdep is given.

 From my point of view, we should only add metadata to loads and stores 
that are inserted due to memory accesses in the source code. Meaning 
they are due to an array or pointer access.

Cheers,
Tobi

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ivdep_t_private.c
Type: text/x-csrc
Size: 131 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130303/add42a0a/attachment.c>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ivdep_t_non-private.c
Type: text/x-csrc
Size: 130 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130303/add42a0a/attachment-0001.c>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ivdep_pointer.c
Type: text/x-csrc
Size: 160 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130303/add42a0a/attachment-0002.c>

Pekka Jääskeläinen

2013-Mar-03 14:34 UTC

head link

[LLVMdev] parallel loop metadata simplification

On 03/03/2013 02:34 PM, Tobias Grosser wrote:> Meaning they are due to an array or pointer access.
What about loop-scope arrays?

void foo(long *A, long b) {
     long i;

     #pragma ivdep
     for (i = 0; i < 100; i++) {
         long t[100];
         t[0] = i + 2;
         A[i] = A[i+b] + t[0];
     }
}

Clang places the alloca for t to the entry block, creating
a new race condition.

In your example where you moved t outside the loop
it's a programmer's mistake (icc might vectorize it but the
results are undefined due to the dependency), but here I don't
think it is. The t array is supposed to be a loop-private variable,
and each parallel iteration refer to their own isolated instance.

-- 
--Pekka

Langmuir, Ben

2013-Mar-04 17:05 UTC

head link

[LLVMdev] parallel loop metadata simplification

> BTW I noted Clik Plus has actually two different parallel loop constructs.
> Have you, Cilk Plus developers, thought about the parallel loop code
generation yet?
Hi Pekka,

We've started thinking about the parallel loop code generation, but so far
mostly about cilk_for, which is the task-parallel loop construct.  For cilk_for
loops, we intend to outline the loop into a function in Clang using a captured
statement (described here:
http://lists.cs.uiuc.edu/pipermail/cfe-dev/2013-January/027540.html).

For #pragma simd (vector-parallel loop), the hope is that we can use the
parallel loop metadata in the same way that Paul is implementing #pragma ivdep.
There will also be some extra pieces needed to implement the clauses that
#pragma simd allows, such as vectorlength (specifies the maximum simd width).  
However, #pragma simd is otherwise very similar to ivdep, and we expect the
codegen to be similar.

Ben

Apparently Analagous Threads

Search for more reasonably related threads

llvm dev - Mar 2013 - [LLVMdev] parallel loop metadata simplification

[LLVMdev] parallel loop metadata simplification

[LLVMdev] parallel loop metadata simplification

[LLVMdev] parallel loop metadata simplification

[LLVMdev] parallel loop metadata simplification

[LLVMdev] parallel loop metadata simplification

Apparently Analagous Threads