On 03/01/2013 10:05 PM, Redmond, Paul wrote: [...]> I have discovered that you can provide a custom inserter to IRBuilder (who knew!). This has basically solved all my problems and allowed me to generate the proper metadata with minimal changes to clang codegen. Currently it adds the metadata to all loads and stores but I don't think this is a problem and can be refined later if necessary.Great that you found a good solution. I have one example, which we may want to have a look into: Given the following input (test.c): #pragma ivdep for (long i = 0; i < 100; i++) { long t = i + 2; A[i] = t; } clang creates the following IR for the loop body: %1 = load i64* %i, align 8 %add = add nsw i64 %1, 2 store i64 %add, i64* %t, align 8 %2 = load i64* %t, align 8 %3 = load i64* %i, align 8 %4 = load i64** %A.addr, align 8 %arrayidx = getelementptr inbounds i64* %4, i64 %3 store i64 %2, i64* %arrayidx, align 8 br label %for.inc What are the implications of pragma ivdep on this code? My intuition would be that 't' is iteration private and that using pragma ivdep for this loop is correct. If the use of ivdep is correct, it seems necessary to _not_ annotate the loads and stores from and to 't'. Only after 't' is moved into a register, the loop is actually parallel on the IR level. Cheers, Tobi -------------- next part -------------- A non-text attachment was scrubbed... Name: test.c Type: text/x-csrc Size: 89 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130302/adb705ea/attachment.c>
On 03/02/2013 08:44 PM, Tobias Grosser wrote:> If the use of ivdep is correct, it seems necessary to _not_ annotate the loads > and stores from and to 't'. Only after 't' is moved into a register, the loop is > actually parallel on the IR level.I didn't realize this is a problem in general because in pocl we explicitly "privatize" the OpenCL C kernel private variables in the generated work-item loops. In this case, the 't' would not be a scalar, but a per-iteration variable in an array with an element for each iteration or a private vreg. We have two cases for the kernel private variables in OpenCL C: 1) Variables accessed outside one loop (private variables that span multiple parallel regions). These need to be allocated to function scope per-iteration (work-item) arrays. 2) Temporary variables accessed only inside one loop. These can stay as loop private variables. No need to allocate stack space for all the iterations at once, but for only those executed in parallel. Seems Hal was right that allocas need to be treated specially. But how to deal with this type of cases without excluding the parallel-safe alloca cases? E.g. a programmer-written array (in stack) that is accessed safely from the parallel iterations or scalars that are only read inside the loop? The general problem is that parallel loops, due to their differing semantics, should be treated differently from standard serial C loops during the Clang codegen to avoid cases like this properly. E.g., the alloca in this case should not be simply pushed outside the loop because it converts the parallel loop to a serial one. Instead, each iteration should have their own private instance of the loop body scope temporary variables. As a conclusion, at this state, it is not safe to just blindly annotate all memory accesses with the llvm.mem.parallel_access. It seems quite easy to produce broken code that way. The easy way forward is to skip marking allocas altogether and hope mem2reg/SROA makes the loop parallel, but unfortunately it serializes some of the valid parallel loop cases too. Improved version would generate loop-scope (temporary) variables in a parallel-loop aware way. BTW I noted Clik Plus has actually two different parallel loop constructs. Have you, Cilk Plus developers, thought about the parallel loop code generation yet? BR, -- --Pekka
On 03/02/2013 07:44 PM, Tobias Grosser wrote:> On 03/01/2013 10:05 PM, Redmond, Paul wrote: > [...] >> I have discovered that you can provide a custom inserter to IRBuilder >> (who knew!). This has basically solved all my problems and allowed me >> to generate the proper metadata with minimal changes to clang codegen. >> Currently it adds the metadata to all loads and stores but I don't >> think this is a problem and can be refined later if necessary. > > Great that you found a good solution. I have one example, which we may > want to have a look into: > > Given the following input (test.c): > > #pragma ivdep > for (long i = 0; i < 100; i++) { > long t = i + 2; > A[i] = t; > }I attached three more examples. All are vectorized by icc without multi-versioning if pragma ivdep is given. From my point of view, we should only add metadata to loads and stores that are inserted due to memory accesses in the source code. Meaning they are due to an array or pointer access. Cheers, Tobi -------------- next part -------------- A non-text attachment was scrubbed... Name: ivdep_t_private.c Type: text/x-csrc Size: 131 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130303/add42a0a/attachment.c> -------------- next part -------------- A non-text attachment was scrubbed... Name: ivdep_t_non-private.c Type: text/x-csrc Size: 130 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130303/add42a0a/attachment-0001.c> -------------- next part -------------- A non-text attachment was scrubbed... Name: ivdep_pointer.c Type: text/x-csrc Size: 160 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130303/add42a0a/attachment-0002.c>
On 03/03/2013 02:34 PM, Tobias Grosser wrote:> Meaning they are due to an array or pointer access.What about loop-scope arrays? void foo(long *A, long b) { long i; #pragma ivdep for (i = 0; i < 100; i++) { long t[100]; t[0] = i + 2; A[i] = A[i+b] + t[0]; } } Clang places the alloca for t to the entry block, creating a new race condition. In your example where you moved t outside the loop it's a programmer's mistake (icc might vectorize it but the results are undefined due to the dependency), but here I don't think it is. The t array is supposed to be a loop-private variable, and each parallel iteration refer to their own isolated instance. -- --Pekka
> BTW I noted Clik Plus has actually two different parallel loop constructs. > Have you, Cilk Plus developers, thought about the parallel loop code generation yet?Hi Pekka, We've started thinking about the parallel loop code generation, but so far mostly about cilk_for, which is the task-parallel loop construct. For cilk_for loops, we intend to outline the loop into a function in Clang using a captured statement (described here: http://lists.cs.uiuc.edu/pipermail/cfe-dev/2013-January/027540.html). For #pragma simd (vector-parallel loop), the hope is that we can use the parallel loop metadata in the same way that Paul is implementing #pragma ivdep. There will also be some extra pieces needed to implement the clauses that #pragma simd allows, such as vectorlength (specifies the maximum simd width). However, #pragma simd is otherwise very similar to ivdep, and we expect the codegen to be similar. Ben