thr3ads.net - llvm dev - [LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer [Jan 2013]


Pekka Jääskeläinen

2013-Jan-28 11:58 UTC

[LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer

Hi,

Attached is a patch which uses a simple "parallel_loop" metadata attached
to the loop branch instruction in the loop latch for skipping cross-iteration
memory dependency checking in the LoopVectorizer. This was briefly discussed
in the email thread "LoopVectorizer in OpenCL C work group
autovectorization".

It also converts the "min iteration count to vectorize" to a parameter so
this can be controlled from the command line.

Comments welcomed.

Thanks in advance,
-- 
Pekka
-------------- next part --------------
A non-text attachment was scrubbed...
Name: llvm-3.3-loopvectorizer-parallel_for-metadata-detection.patch
Type: text/x-patch
Size: 2049 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130128/f580afe7/attachment.bin>
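Since the attachment itself was scrubbed, here is a minimal sketch of the kind of check the message describes. It uses toy stand-in types rather than LLVM's real classes: `ToyInstruction`, `ToyLoop`, and `canSkipDependencyCheck` are hypothetical names invented for illustration, and only the "parallel_loop" metadata name comes from the message.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy stand-ins for IR objects; these are NOT LLVM's real classes.
struct ToyInstruction {
    std::set<std::string> metadataKinds;  // names of metadata attached to it

    bool hasMetadata(const std::string &kind) const {
        return metadataKinds.count(kind) != 0;
    }
};

struct ToyLoop {
    ToyInstruction latchTerminator;  // the branch that closes the loop
};

// True when the loop latch branch carries "parallel_loop", in which case a
// vectorizer following the proposal would skip its cross-iteration memory
// dependency check.
bool canSkipDependencyCheck(const ToyLoop &loop) {
    return loop.latchTerminator.hasMetadata("parallel_loop");
}
```

A loop whose latch branch lacks the metadata would still go through the normal dependency analysis.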

Tobias Grosser

2013-Jan-28 12:58 UTC


On 01/28/2013 12:58 PM, Pekka Jääskeläinen wrote:
> Hi,
>
> Attached is a patch which uses a simple "parallel_loop" metadata attached
> to the loop branch instruction in the loop latch for skipping
> cross-iteration memory dependency checking in the LoopVectorizer. This was
> briefly discussed in the email thread "LoopVectorizer in OpenCL C work group
> autovectorization".

Can you provide a test case?

> It also converts the "min iteration count to vectorize" to a parameter so
> this can be controlled from the command line.

This seems to be a separate patch?

Tobi

Renato Golin

2013-Jan-28 13:22 UTC


On 28 January 2013 11:58, Pekka Jääskeläinen <pekka.jaaskelainen at tut.fi> wrote:
> Attached is a patch which uses a simple "parallel_loop" metadata attached
> to the loop branch instruction in the loop latch for skipping
> cross-iteration memory dependency checking in the LoopVectorizer. This was
> briefly discussed in the email thread "LoopVectorizer in OpenCL C work group
> autovectorization".
>
I agree this is a good idea, but not sure about the implementation.

+  if (latch->getTerminator()->getMetadata("parallel_loop") != NULL) {

This seems an awfully specific check on a generic part of the code... Is
this metadata standard in any form? Is this OpenCL specific? Do all
OpenCL front-ends generate the same meta-data in that way? Etc...


> It also converts the "min iteration count to vectorize" to a parameter so
> this can be controlled from the command line.
>
Is this really necessary? Do you have use cases where this would make sense?

I think you should send a test case with this patch, not separately.

cheers,
--renato

Pekka Jääskeläinen

2013-Jan-28 13:49 UTC


Hi Renato,

On 01/28/2013 03:22 PM, Renato Golin wrote:
> This seems an awfully specific check on a generic part of the code... Is

True. Perhaps the check is better encapsulated, e.g., in the Loop class?
Or, if there's such a thing as a loop-carried data dependency analyzer,
the correct place could be there, as a trivial "no deps" analysis.

> this metadata standard in any form? Is this OpenCL specific? Do all

This metadata is not standard in any form. Therefore the request
for comments. However, its meaning is generic, not OpenCL
specific at all. It specifies that the loop iterations can be
treated as independent, regardless of the memory operations the
body contains. Thus, the potential cross-iteration memory dependencies
can be considered a programming error.

> OpenCL front-ends generate the same meta-data in that way? Etc...

I have no knowledge of other OpenCL implementations than
pocl as I haven't seen their code.

>     It also converts the "min iteration count to vectorize" to a
>     parameter so this can be controlled from the command line.
>
> Is this really necessary? Do you have use cases where this would make sense?

Where a lower threshold could be useful? At least with loops having long
bodies and loops with outer loops that iterate the inner loop many
times.

In fact, shouldn't the default minimum be the minimum vector width of the
machine? The cost estimation routine should take care of the actual
profitability estimate?
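
To make the threshold discussion concrete, here is a small sketch of a trip-count gate whose minimum is a parameter instead of a hard-coded constant. Everything in it is hypothetical: `ToyVectorizerOptions`, `passesTripCountGate`, and `suggestedDefaults` are illustrative names, not the option or code from the patch.

```cpp
#include <cassert>

// Hypothetical option holder; the field name is made up for this sketch.
struct ToyVectorizerOptions {
    unsigned minTripCount;  // loops with fewer iterations are not vectorized
};

// Early gate only. Whether vectorizing actually pays off is left to a
// separate cost model, which this sketch does not include.
bool passesTripCountGate(unsigned tripCount, const ToyVectorizerOptions &opts) {
    return tripCount >= opts.minTripCount;
}

// The default suggested in the message above: one full vector of iterations.
ToyVectorizerOptions suggestedDefaults(unsigned vectorWidth) {
    return ToyVectorizerOptions{vectorWidth};
}
```

With an arbitrary fixed minimum of 16, an 8-iteration inner loop is rejected outright. With a minimum equal to a 4-lane vector width it passes the gate and the decision falls to the cost model.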

> I think you should send a test case with this patch, not separate.

As soon as there's a consensus on the metadata format and where
the check should reside, I'll prepare a proper patch with
a vectorizer test case.

Thanks for the comments so far,
-- 
Pekka

Tobias Grosser

2013-Jan-29 08:51 UTC


On 01/28/2013 12:58 PM, Pekka Jääskeläinen wrote:
> Hi,
>
> Attached is a patch which uses a simple "parallel_loop" metadata attached
> to the loop branch instruction in the loop latch for skipping
> cross-iteration memory dependency checking in the LoopVectorizer. This was
> briefly discussed in the email thread "LoopVectorizer in OpenCL C work group
> autovectorization".
>
> It also converts the "min iteration count to vectorize" to a parameter so
> this can be controlled from the command line.

Hi Pekka,

I was a little too quick in my last email. I would like to discuss the
test cases I want to see in a bit more detail. I am especially interested
in seeing that the metadata stays robust under transformations.

Assuming we have something like:

# ignore assumed dependences.
for (i = 0; i < 4; i++) {
    tmp1 = A[3*i+1];
    tmp2 = A[3*i+2];
    tmp3 = tmp1 + tmp2;
    A[3*i] = tmp3;
}

Now, for whatever reason, I apply a partial reg2mem transformation.

float tmp3[1];

# ignore assumed dependences. // Still valid?
for (i = 0; i < 4; i++) {
    tmp1 = A[3*i+1];
    tmp2 = A[3*i+2];
    tmp3[0] = tmp1 + tmp2;
    A[3*i] = tmp3[0];
}

Is the meta data now still valid or how do we ensure the invalid meta 
data is removed?
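
To see why the question matters, here is a small self-contained sketch that runs the original loop in order and runs the reg2mem variant in a statement-by-statement order, as a rough model of what a 4-wide vectorizer that skips dependence checks might produce. The function names and the 12-element buffer are assumptions made for illustration, and the statement-wise ordering is only a model, not the actual behaviour of any vectorizer.

```cpp
#include <array>
#include <cassert>

using Buffer = std::array<float, 12>;

// The original loop, run one iteration at a time.
Buffer runOriginal(Buffer A) {
    for (int i = 0; i < 4; ++i) {
        float tmp1 = A[3 * i + 1];
        float tmp2 = A[3 * i + 2];
        float tmp3 = tmp1 + tmp2;
        A[3 * i] = tmp3;
    }
    return A;
}

// The reg2mem variant, run the way a 4-wide vectorizer that ignores
// dependences might order it: each statement completes for all four
// iterations before the next statement starts.
Buffer runReg2MemStatementWise(Buffer A) {
    float tmp1[4], tmp2[4];
    float tmp3[1];
    for (int i = 0; i < 4; ++i) tmp1[i] = A[3 * i + 1];
    for (int i = 0; i < 4; ++i) tmp2[i] = A[3 * i + 2];
    for (int i = 0; i < 4; ++i) tmp3[0] = tmp1[i] + tmp2[i];  // same slot every time
    for (int i = 0; i < 4; ++i) A[3 * i] = tmp3[0];           // sees only the last sum
    return A;
}
```

For the input {0,1,2, 0,3,4, 0,5,6, 0,7,8}, the original loop stores 3, 7, 11, 15 at indices 0, 3, 6, 9, while the statement-wise reg2mem version stores 15 in all four places. The shared `tmp3[0]` slot is a cross-iteration dependence that the loop-level annotation no longer describes.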

I have the feeling it may be necessary to link together the loop and the
accesses for which we can ignore dependences, and mark the whole
construct as "this set of memory accesses does not have dependences
within the mentioned loop". This way, the introduction of additional
memory accesses, which may very well introduce dependences that we need
to preserve, can be easily detected. The parallelism check would now be:
"In case the llvm.loop.ignore_assumed_deps meta-data is found in a loop
header _and_ all memory accesses in the loop are referenced by this
meta-data, the loop is parallel. If there is a memory
access that does not reference the ignore_assumed_deps meta-data, we
cannot assume anything about the loop."
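
The check described above can be sketched with toy data structures as follows. This is only an illustration under assumed names: `ToyMemAccess`, `ToyAnnotatedLoop`, and `isLoopParallel` are hypothetical, and the way an access "references" the loop metadata is modelled as a plain integer id.

```cpp
#include <cassert>
#include <vector>

// Toy model of a memory access. ignoreDepsLoopId is the id of the loop whose
// ignore_assumed_deps metadata the access references, or -1 if it has none.
struct ToyMemAccess {
    int ignoreDepsLoopId;
};

struct ToyAnnotatedLoop {
    int id;
    bool headerHasIgnoreAssumedDeps;     // loop-level marker present?
    std::vector<ToyMemAccess> accesses;  // every memory access in the body
};

// Parallel only if the loop is marked AND every access points back at it.
// An access added later by some transformation carries no reference, so the
// loop conservatively stops being treated as parallel.
bool isLoopParallel(const ToyAnnotatedLoop &loop) {
    if (!loop.headerHasIgnoreAssumedDeps)
        return false;
    for (const ToyMemAccess &access : loop.accesses) {
        if (access.ignoreDepsLoopId != loop.id)
            return false;
    }
    return true;
}
```

In this model, a pass that introduces a new load or store does not need to know about parallel loops at all: the missing reference on the new access is enough to invalidate the claim.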

Cheers,
Tobias

Pekka Jääskeläinen

2013-Jan-29 10:03 UTC


Hi Tobias,

On 01/29/2013 10:51 AM, Tobias Grosser wrote:
> Is the meta data now still valid or how do we ensure the invalid meta data
> is removed?

It seems it's not valid anymore. Good catch. I was asking for exactly these
kinds of transformation cases earlier. There are probably more that nobody
has thought of yet.

> I have the feeling it may be necessary to link the loop as well as the
> accesses for which we can ignore dependences together and mark the whole
> construct as "this set of memory accesses does not have dependences within
> the mentioned loop". Like this, the introduction of additional memory
> accesses which may very well introduce dependences that we need to preserve,
> can be easily detected. The parallelism check would now be: "In case the
> llvm.loop.ignore_assumed_deps meta-data is found in a loop header _and_ all
> memory accesses in the loop are referenced by this meta-data, the loop is
> parallel. If there is a memory access that does not reference the
> ignore_assumed_deps meta-data, we can not assume anything about the loop."

Sounds reasonable, as it's too intrusive to start making all the passes
parallel loop-aware.

Also, I was thinking how to *retain* this data during unrolling. At least
in-order targets benefit from the information in their alias analysis to
produce more instruction scheduling freedom while scheduling code from the
unrolled parallel iterations. Also the BBVectorizer should benefit from this.

Annotating the memory operations in the iteration should also cover this use
case if the metadata includes the iteration id. Then, at unroll time,
the metadata can be replicated and the id incremented. An alias analyzer
(similar to what we have in pocl for unrolled/chained work items) can then
check for this metadata and return NO_ALIAS for accesses from different
iterations.
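
A rough sketch of the replication step, using toy types only: `ToyIterTag`, `ToyTaggedAccess`, and `unrollWithTags` are hypothetical names, and the body is modelled as a flat list of tagged accesses rather than real IR.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy per-access annotation: which parallel loop, and which iteration of it.
struct ToyIterTag {
    unsigned loopId;
    unsigned iterId;
};

struct ToyTaggedAccess {
    std::string name;  // label for the access, for readability only
    ToyIterTag tag;
};

// Unroll the body 'factor' times. Each copy keeps the loop id and gets its
// iteration id bumped by the copy index, so accesses from different original
// iterations stay distinguishable after unrolling.
std::vector<ToyTaggedAccess> unrollWithTags(const std::vector<ToyTaggedAccess> &body,
                                            unsigned factor) {
    std::vector<ToyTaggedAccess> unrolled;
    for (unsigned copy = 0; copy < factor; ++copy) {
        for (const ToyTaggedAccess &access : body) {
            ToyTaggedAccess clone = access;
            clone.tag.iterId = access.tag.iterId + copy;
            unrolled.push_back(clone);
        }
    }
    return unrolled;
}
```

Unrolling a two-access body by a factor of two yields four accesses: the first pair keeps iteration id 0 and the second pair gets iteration id 1, all under the same loop id.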

Going even further, maybe this same (or similar) format can be used to
retain restricted pointer (noalias) information across inlining. Lack of this
has been a bit of a nuisance for us in TCE.

In the restricted ptr case one wants to mark pointers with some kind of
context marker (e.g. a function name + pointer identification) which
communicates that the accesses through that pointer do not alias with
other pointers in the same function context.

The mem accesses in the parallel loop iteration could have a metadata
"llvm.mem.parallel_loop_iter 0 0" or similar where the first id is a
loop id, the latter the iteration id. Restricted pointer mem accesses
could use metadata "llvm.mem.restrict my_function my_pointer". A parallel
loop alias analyzer can use the parallel_loop_iter metadata (if the loop id
is equal but the iteration id isn't, return NO_ALIAS) and a restrict pointer
AA can use the latter (if both of the queried pointers have it and it differs
for the pointer part only, return NO_ALIAS).
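
The two alias rules can be written down as a pair of small functions. This is a sketch with toy types, not an LLVM alias analysis pass: `ToyAliasResult`, `ToyParallelIterTag`, `ToyRestrictTag`, and both function names are hypothetical, and a null pointer stands for "this access has no such metadata".

```cpp
#include <cassert>
#include <string>

enum class ToyAliasResult { NoAlias, MayAlias };

// Models "llvm.mem.parallel_loop_iter <loop id> <iteration id>".
struct ToyParallelIterTag {
    unsigned loopId;
    unsigned iterId;
};

// Models "llvm.mem.restrict <function> <pointer>".
struct ToyRestrictTag {
    std::string function;
    std::string pointer;
};

// Same parallel loop, different iterations: the accesses cannot alias.
ToyAliasResult aliasByParallelIter(const ToyParallelIterTag *a,
                                   const ToyParallelIterTag *b) {
    if (a && b && a->loopId == b->loopId && a->iterId != b->iterId)
        return ToyAliasResult::NoAlias;
    return ToyAliasResult::MayAlias;  // no information, stay conservative
}

// Same function context, different restricted pointers: cannot alias.
ToyAliasResult aliasByRestrict(const ToyRestrictTag *a, const ToyRestrictTag *b) {
    if (a && b && a->function == b->function && a->pointer != b->pointer)
        return ToyAliasResult::NoAlias;
    return ToyAliasResult::MayAlias;
}
```

Both functions fall back to `MayAlias` whenever either access lacks the metadata or the tags do not match the rule, so missing or stripped metadata only loses precision and never correctness in this model.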

-- 
--Pekka

Nadav Rotem

2013-Jan-29 18:58 UTC


On Jan 29, 2013, at 12:51 AM, Tobias Grosser <tobias at grosser.es> wrote:
> 
> # ignore assumed dependences.
> for (i = 0; i < 4; i++) {
>   tmp1 = A[3*i+1];
>   tmp2 = A[3*i+2];
>   tmp3 = tmp1 + tmp2;
>   A[3*i] = tmp3;
> }
>
> Now I apply for whatever reason a partial reg2mem transformation.
>
> float tmp3[1];
>
> # ignore assumed dependences. // Still valid?
> for (i = 0; i < 4; i++) {
>   tmp1 = A[3*i+1];
>   tmp2 = A[3*i+2];
>   tmp3[0] = tmp1 + tmp2;
>   A[3*i] = tmp3[0];
> }

The transformation that you described is illegal because it changes the behavior
of the loop. In the first version only A is modified, and in the second version
of the loop both A and tmp3 are modified. Can you think of another example that
demonstrates why the per-instruction attribute is needed?

I am afraid that so many different llvm transformations will have to be modified
to preserve parallelism. This is not something that I want to slip in. If we
want to add new parallelism semantics to LLVM then we need to discuss the bigger
picture. We need to plan a mechanism that will allow us to implement support for
a number of different models (Vectorizers, SPMD languages such as GL and CL,
parallel threads such as OpenMP, etc).

