Hal Finkel
2012-Aug-14 15:51 UTC
[LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
On Tue, 14 Aug 2012 10:22:35 +0300, Pekka Jääskeläinen <pekka.jaaskelainen at tut.fi> wrote:

> On 08/13/2012 10:54 PM, Hal Finkel wrote:
> > I had thought about uses for shared-memory OpenCL implementations,
> > but I don't know enough about the use cases to make a specific
> > proposal. Is your metadata documented anywhere?
>
> It is now a quick "brute force hack"; that's why I got interested in
> your proposal. We just wanted to communicate the OpenCL work-item
> information further down in the compiler as easily as possible and
> didn't have time to beautify it.
>
> Now all instructions of the "chained" OpenCL kernel instances
> (work items) are annotated with their work-item ID, their "parallel
> region ID" (i.e., which region between barriers the instruction
> originates from), and a sequence ID. So, lots of metadata bloat.
>
> These annotations allow finding the matching instructions later on, to
> vectorize multiple work items together by combining the matching
> instructions from the different WIs. The alias analyzer uses this
> metadata to return NO_ALIAS for any pair of memory accesses that come
> from different work items within the same parallel region (the spec
> says that if they do alias, the results are undefined, and thus it is
> the programmer's fault).
>
> With your annotations this hack could probably be cleaned up by using
> the "parallel for loop" metadata, which the vectorizer and/or "thread
> lib call injector" (or the static instruction scheduler for a
> VLIW/TTA) can then use to parallelize the kernel as desired.
>
> I'd note that its usefulness is not limited to a shared-memory
> multicore (or even a multicore) kernel execution device. All non-SIMT
> targets require laying out the code for all the work items (as if they
> were parallel for loops, unrolled or vectorized or not) for valid
> OpenCL kernel execution when there is more than 1 WI per work-group,
> and thus potentially benefit from this.

Fair enough. My thought process here was that, first, I was not going to propose anything specifically for non-shared-memory systems (those require data-copying directives, and I'd want to let others who have experience with those do the proposing), and second, I was not going to propose anything specifically for multi-target (heterogeneous) systems. I think that single-target shared-memory systems fall into the model I've sketched, and support for anything else will require further extension.

> > I agree, and this is specifically why I don't want to support
> > OpenMP by lowering it into runtime calls in the frontend. I want to
> > allow for other optimizations (vectorization, etc.) in combination
> > with (or instead of) multi-threading. I think that my current
> > proposal allows for that.
>
> Yes, it should, as far as I can see. If the loop body is a function and
> the iteration count (or a multiple of it) is known, one should be able
> to vectorize multiple copies of the function without dependence
> checking. In the multi-WI OpenCL C case this function would contain
> the code for a single work item within a region between barriers
> (implicit or not).
>
> I'm unsure whether forcing the function extraction of the parallel
> regions brings unnecessary problems or not. Another option would be to
> mark the basic blocks that form the parallel regions. Maybe all of the
> BBs could be marked with a PR identifier MD? This would require BB
> metadata (are they supported?).
I thought about this. There had been some patches provided for BB metadata (by Ralf Karrenberg back in May); I don't recall what happened with those. BB metadata might work, but I worry about existing optimization passes, which don't know about this metadata, moving things into and out of parallel regions in illegal ways -- for example, moving a call to some get_number_of_threads() function, or some inline-assembly region, into or out of a parallel region. Putting things in functions just seemed safer (and BB metadata is not upstream). Also, it would require extra checking to keep the parallel basic blocks together. Furthermore, in many cases, the parallel regions need to end up as separate functions anyway (because they're passed as callbacks to the runtime library). (A sketch of what that outlining might look like appears at the end of this message.)

> >> Also, one user of this metadata could be the alias analysis: it
> >> should be easy to write an AA that can exploit the parallelism
> >> information. Parallel regions by definition do not have (defined)
> >> dependencies between each other (between synchronization points),
> >> which should be useful information for optimization purposes even
> >> if parallel hardware was not targeted.
> >
> > I really like this idea! -- and it sounds like you may already have
> > something like this in POCL?
>
> Yes, an OpenCL AA that exploits the work-item independence and address
> space independence. With your annotations there could be a generic
> AA for the "independence information from parallelism metadata" part
> and a separate OpenCL-specific AA for the rest.
>
> > Regarding having scheduling be separate, care is required to ensure
> > correctness. A large constraint on the design of a metadata API is
> > that
>
> OK, I see.
>
> I suppose it's not a big deal to add the scheduling property, at
> least if one (later) allows adding scheduling modes supported by
> standards other than OpenMP as well -- i.e., not modes like "static"
> but "openmp31_static" or similar. For the OpenCL work-item loops the
> scheduling mode could be "auto" or left empty.

I think that this makes sense. For some things, like 'static', we can define backend-independent semantics. For other things, like OpenMP's 'runtime', which is tied to how the application calls OpenMP runtime functions, I agree, we should probably call that 'openmp_runtime' (or something like that).

> > I agree. I think that vectorization is best done earlier in the
> > optimization schedule. Vectorization, however, should appropriately
> > update loop metadata to allow for proper integration with
> > parallelization, etc. Lowering to runtime libraries (for
> > multi-threading in whatever form) should be done relatively late in
> > the process (because further higher-level optimizations are often
> > not possible after that point).
>
> Yes, to enable automatic mixing of vectorization and threading from
> a single (data-parallel) kernel.

Yep, that is exactly what I want to be able to do.

Thanks again,
Hal

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
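To make the outlining concrete, here is a minimal C sketch of extracting a parallel region into a callback handed to a runtime. The runtime entry points (`rt_parallel_start`, `rt_thread_num`) and the closure layout are hypothetical, not any particular library's API:

```c
#include <stddef.h>

/* Hypothetical runtime entry point (NOT any real library's API):
   runs fn(priv) on nthreads threads and waits for all of them. */
void rt_parallel_start(void (*fn)(void *), void *priv, unsigned nthreads);

/* Hypothetical query for the calling thread's index within the team. */
unsigned rt_thread_num(void);

/* The closure struct carries the variables shared with the region. */
struct region_priv {
  double *a;
  size_t n;
  unsigned nthreads;
};

/* The outlined parallel region: the separate function that the
   frontend (or an outlining pass) extracts and hands to the runtime. */
static void region_fn(void *p) {
  struct region_priv *rp = p;
  /* Each thread takes the iterations congruent to its own index,
     so there is no overlap between threads. */
  for (size_t i = rt_thread_num(); i < rp->n; i += rp->nthreads)
    rp->a[i] += 1.0;
}

void parallel_increment(double *a, size_t n, unsigned nthreads) {
  struct region_priv rp = { a, n, nthreads };
  rt_parallel_start(region_fn, &rp, nthreads);
}
```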
Sanjoy Das
2012-Sep-26 13:36 UTC
[LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
Hi,

Sorry for the hiatus; busy time at my university. :)

After a false start and some (hopefully cogent) thought, I am now of the opinion that it would be better to have LLVM natively support a somewhat different notion of parallel computation, and to have the frontend lower OpenMP directives (and possibly other such things) into that form.

In short, I propose an intrinsic-based approach which hinges on the concept of a "parallel map". The immediate effect of using intrinsics is that we no longer have to worry about missing metadata. Moreover, we are still free to lower the intrinsics in a variety of ways -- including vectorizing them or lowering them to calls into an actual OpenMP backend.

I have also tried to make this representation more orthogonal and general; mirroring a significant subset of the OpenMP directives inside LLVM's IR doesn't feel right. For instance, in the following proposal, the OpenMP TASK directive is lowered using the more general parallel_map construct. A pass lowering the intrinsics to an OpenMP backend may "reverse engineer" the mentioned pattern back into tasks when possible, but, in principle, I think the directives we introduce into LLVM's IR are best left as mutually exclusive as possible. Keeping the intrinsics simple and orthogonal should also help in asserting and verifying correctness.

I plan to first implement a null lowering pass which simply lowers the intrinsics to something semantically correct but with no multithreading. Once that is done, I'll try to lower the intrinsics into something more interesting, perhaps libgomp or maybe even a custom runtime.

The proposal
------------

I propose introducing four new intrinsics to LLVM:

1. void @llvm.parallel_map(i32 limit, void (i32, i8 *) fn, i8* priv)

Semantics:

Executes `limit` copies of fn, _possibly_ in parallel. The map index (an i32 ranging from 0 to (limit - 1), both inclusive) and `priv` are passed to each of the invocations. The only invariant is that the 0th iteration is always executed in the invoking thread.

It is legal to have calls to parallel_map inside a function being parallel_map'ed over.

2. void @llvm.sync_region(void (i32, i8 *) fn, i8 type)

Semantics:

It is only legal to call sync_region from within the dynamic extent of a parallel_map. It ensures that `limit` copies (the `limit` from the enclosing parallel_map) of fn are executed with mutual exclusion.

`type` can either be 0 (`Any`), signifying that the synchronized regions can be run in any order, or 1 (`Ordered`), signifying that the synchronized regions must be run in increasing order of the index.

3. i32 @llvm.get_num_threads()

Semantics:

Returns the number of threads in the thread pool.

4. i32 @llvm.set_num_threads(i32)

Sets the number of threads in the thread pool.

It should be possible to lower all OpenMP directives to the above four intrinsics in the frontend (please read this in conjunction with [1]):

Parallel regions can be lowered as a parallel_map with the result of @llvm.get_num_threads() as the limit.

```
#pragma PARALLEL
block
```

desugars to

```
@llvm.parallel_map(num_threads, block_closure, shared_closure)
...

void block_closure(i32 tid, i8* shared_vars) {
  block
}
```

Reductions are handled by a parallel_map followed by a regular reduction loop (exactly as in Hal's proposal).

Serial blocks reduce to a block conditional on the index, inside the function being mapped over in parallel.

We lower critical and ordered regions into calls to `sync_region`.
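To pin down what the null lowering has to guarantee, here is a minimal C sketch (purely illustrative; the names are made up): parallel_map runs every copy in the invoking thread, which trivially preserves the invariant that iteration 0 runs there, and with only one thread the mutual exclusion and ordering required by sync_region come for free:

```c
/* Illustrative C model of the null (serial) lowering. */
typedef void (*pmap_fn)(int index, char *priv);

/* parallel_map: run all `limit` copies in the invoking thread, in
   index order. Iteration 0 runs in the invoking thread, as required;
   so does everything else. */
static void parallel_map_null(int limit, pmap_fn fn, char *priv) {
  for (int i = 0; i < limit; ++i)
    fn(i, priv);
}

/* sync_region: with a single thread, mutual exclusion is automatic,
   and because parallel_map_null runs copies in increasing index
   order, the Ordered (type == 1) constraint is also satisfied; the
   lowering is a direct call in both cases. */
```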
Tasks are lowered (recursively) this way:

```
TASK
block
more_code_may_contain_more_tasks
TASK_WAIT
```

desugars to

```
@llvm.parallel_map(2, task_closure, task_priv)

void task_closure(i32 index, i8* private) {
  if (index == 0) {
    more_code_may_contain_more_tasks
  } else {
    block
  }
}
```

Parallel loops are basically `parallel_map`s (a sketch follows at the end of this message).

Thoughts?

[1] http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-August/052472.html

--
Sanjoy Das
http://playingwithpointers.com
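For instance, a hypothetical C-level view of lowering a parallel for loop onto parallel_map with a simple static chunking (illustrative only; a real lowering would consult the scheduling mode discussed elsewhere in this thread, and the struct layout is made up):

```c
/* Hypothetical closure for a chunked parallel for loop. */
typedef struct {
  int n;        /* total iteration count */
  int ncopies;  /* the `limit` handed to parallel_map */
  double *a;
} loop_priv;

static void loop_closure(int index, char *p) {
  loop_priv *lp = (loop_priv *)p;
  int chunk = (lp->n + lp->ncopies - 1) / lp->ncopies; /* ceil(n/ncopies) */
  int lo = index * chunk;
  int hi = lo + chunk < lp->n ? lo + chunk : lp->n;
  for (int i = lo; i < hi; ++i)   /* runs zero times if lo >= hi */
    lp->a[i] *= 2.0;              /* the original loop body */
}

/* The loop itself then becomes:
   @llvm.parallel_map(ncopies, loop_closure, (i8 *)&priv)            */
```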
Das, Dibyendu
2012-Sep-26 15:06 UTC
[LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
How are you representing things like the various scheduling mechanisms without metadata - extra parameters to the intrinsics?

- dibyendu

----- Original Message -----
From: Sanjoy Das [mailto:sanjoy at playingwithpointers.com]
Sent: Wednesday, September 26, 2012 08:36 AM
To: Hal Finkel <hfinkel at anl.gov>
Cc: llvmdev at cs.uiuc.edu <llvmdev at cs.uiuc.edu>; cfe-dev at cs.uiuc.edu <cfe-dev at cs.uiuc.edu>
Subject: Re: [LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)

[Sanjoy's proposal, quoted in full above, elided.]
Hal Finkel
2012-Sep-26 17:54 UTC
[LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
On Wed, 26 Sep 2012 19:06:29 +0530, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:

> [Sanjoy's proposal, quoted in full above, elided.]
>
> Thoughts?
Most of this sounds alright, but a few things worry me. My approach was designed to require minimal changes to the rest of the infrastructure. In your case you'll need to:

- Teach LoopInfo, ScalarEvolution, etc. how to 'see' the loops spanning the parallel_map calls. You'll need to teach LICM, etc. how to do their transformations in the presence of parallel_map. This may require that these passes be promoted to module-level passes, and may have other undesirable consequences.

- Teach the inliner and other associated passes to understand the parallel_map intrinsic. Especially when using the null implementation (or an implementation that does not actually support task dispatch), small tasks should be considered for inlining (see the sketch at the end of this message).

In short, I think that your approach will require a lot more work to get to a production-quality state than mine. Your approach has the advantage of a simpler API, but I want to make sure that you've thought through the various changes to the existing passes that will be necessary.

Thanks again,
Hal

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
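To illustrate the inlining concern in C terms (hypothetical; this is the shape of the problem, not IR): after a null lowering, a tiny task body sits behind a call through its closure function, and unless the lowered form is exposed to the inliner the call overhead dwarfs the work:

```c
/* Hypothetical C rendering of a lowered TASK with a trivially small
   body. After the null lowering, @llvm.parallel_map(2, ...) becomes
   the plain loop in example() below; the task body only becomes
   visible to scalar optimizations if task_closure gets inlined. */
static void task_closure(int index, char *priv) {
  int *sum = (int *)priv;
  if (index == 0) {
    /* more_code: nothing further to spawn in this example */
  } else {
    *sum += 1; /* the "task" itself is tiny */
  }
}

void example(int *sum) {
  /* was: @llvm.parallel_map(2, task_closure, (i8 *)sum) */
  for (int i = 0; i < 2; ++i)
    task_closure(i, (char *)sum);
}
```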
greened at obbligato.org
2012-Oct-02 01:16 UTC
[LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
Sanjoy Das <sanjoy at playingwithpointers.com> writes:

> In short, I propose an intrinsic-based approach which hinges on the
> concept of a "parallel map". The immediate effect of using intrinsics
> is that we no longer have to worry about missing metadata. Moreover,
> we are still free to lower the intrinsics in a variety of ways --
> including vectorizing them or lowering them to calls into an actual
> OpenMP backend.

I'll re-ask here, since this is in its own thread: why can't we just make ordinary function calls to runtime routines?

-David