Hal Finkel
2012-Aug-10  20:06 UTC
[LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
Hello,
I'd like to see support in clang/LLVM for multi-core parallelism,
especially support for OpenMP. I think that the best way to do this is
by designing an LLVM-based API (metadata and intrinsics) for
expressing parallelism constructs, and having clang lower OpenMP code
to that API. This will allow maximal preservation of optimization
capabilities including target-specific lowering. What follows outlines
a set of metadata and intrinsics which should allow support for the
full OpenMP specification, and I'd like to know what the community
thinks about this.
As a general note: My intent here is to make the metadata safe in the
traditional sense: it can be removed by optimization passes that don't
understand it, and while this might result in the loss of the
parallelization, the removal will not be otherwise unsafe. I believe
that many existing passes will require minor modification in order to
preserve the metadata as appropriate, but I think these changes are
relatively small. In addition, the authors of passes that preserve
parallelization by dealing with parallelization metadata will need to
explicitly think about how to handle it; hopefully, this will yield
fewer bugs.
In the following I will outline the API and explain how OpenMP will be
lowered. My idea is to follow OpenMP's semantics, so if these differ
from the OpenMP spec, then I'd like to correct that. If there are other
parallelism models that we would like to support, then I think those
can be incorporated as well (maybe something with lightweight tasks
such as Cilk).
---- Parallel Regions ----
Inside a parallel region, a team of threads execute the sequence of
instructions.
A parallel region is specified by a function. This function may be
executed by one or more threads in parallel. In terms of OpenMP:
private() variables become variables local to the function.
firstprivate() variables become parameters to the function. shared()
variables become pass-by-pointer parameters. If the shared variable is
not a global, then we allocate a local copy, using alloca followed by a
store, and pass the new pointer to the function. For copyin()
variables, we pass a copy of the variable to the function, and the
function then uses that copy to update the thread's version of the
(TLS) variable. The function should have private (or internal) linkage
for optimization purposes.
To mark this function as a parallel region, a module-level 'parallel'
metadata entry is created. The call site(s) of this function are marked
with this metadata,. The metadata has entries:
 - The string "region"
 - A reference to the parallel-region function
 - If applicable, a list of metadata references specifying
special-handling child regions (parallel loops and serialized/critical
regions)
If the special-handling region metadata is no longer referenced by code
within the parallel region, then the region has become invalid, and
will be removed (meaning all parallelization metadata will be removed)
by the ParallelizationCleanup. The same is true for all other
cross-referenced metadata below.
Note that parallel regions can be nested.
As a quick example, something like:
int main() {
  int a;
#pragma omp parallel firstprivate(a) 
  do_something(a)
  ...
}
becomes something like:
define private void @parreg(i32 %a) {
entry:
  call void @do_something(i32 %a)
  ret
}
define i32 @main() {
entry:
...
call void @parreg1(i32 %a) !parallel !0
...
!0 = metadata !{ metadata !"region", @parreg }
-- Reductions --
To handle reductions, first, the variable is converted into a output
pass-by-pointer parameter to the function. The pointer refers to an
array of values, one for each thread that will execute the region.
After the region completes, a loop must be created to actually perform
the requested reduction. Inside the parallel region, each thread
accesses its value using its thread id as the index. See the nthreads
and tidx intrinsics below.
-- Special handling regions --
- Serial Regions -
Serial regions within parallel blocks (called 'single' in OpenMP) are
executed only by one thread. As with parallel regions themselves, they
are lowered as functions; the call site(s) of these functions are
tagged with 'parallel' metadata. This metadata has entries:
  - The string "serial"
  - A reference to the single-region function
  - A metadata reference to the parent parallel-region or loop metadata
  - Optionally, a type: "master" or "any" (the default)
For regions with "master" only the master thread may execute the
region.
- Critical Regions -
Critical regions are like serial regions, but they are executed by all
threads with mutual-exclusion. These are identified by 'parallel'
metadata with entries:
  - The string "critical"
  - A reference to the critical-region function
  - A metadata reference to the parent parallel-region, loop or task
metadata
  - Optionally, a global name string used for non-local synchronization
(all regions with the same name string are mutually exclusive)
- Loops -
Parallel loops are indicated by tagging all backedge branches with
'parallel' metadata. This metadata has the following entries:
  - The string "loop"
  - A metadata reference to the parent parallel-region metadata
  - Optionally, a string specifying the scheduling mode: "static",
"dynamic", "guided", "runtime", or
"auto" (the default)
  - Optionally, an integer specifying the number of loop levels over
which to parallelize (the default is 1)
  - If applicable, a list of metadata references specifying ordered and
serial/critical regions within the loop.
Note that what makes this metadata safe is the cross referencing
between the parent region metadata, the loop metadata and the metadata
references on the instructions. If any of these are removed or become
inconsistent, then the whole parallel region must be removed. The
ParallelizationCleanup pass will check this prior to lowering.
To lower lastprivate() OpenMP variables, first we allocate a copy of
the variable outside the loop. At the end of the loop body we insert a
check to determine if the current iteration is the last one (over all
threads), and if so, we update the common copy with the local version.
Note that for OpenMP loops that have private, firstprivate, etc.
clauses that cannot be made part of the parent parallel region, these
loops will also need to be placed into their own functions to handle
the relevant scope issues.
Ordered regions (those which much execute in the original iteration
order) are lowered as functions, much in the same way as serial
regions. The call site(s) are tagged with 'parallel' metadata. This
metadata has entries:
  - The string "ordered"
  - A reference to the function specifying the ordered region
  - A metadata reference to the parent parallel loop
Serial regions and loop that don't have the 'nowait' OpenMP clause
must
be followed by a barrier intrinsic.
- Tasks -
Explicit tasks are also lowered as functions similar to other special
handling regions. Their call site(s) are marked with 'parallel'
metadata. Depending on the implementation, they may not actually start
executing until the main thread executes a taskwait intrinsic or
reaches the end of the parallel region. The task metadata has:
  - The string "task"
  - A reference to the function specifying the task
  - A metadata reference to the parent region, task, loop, etc.
  - Optionally, an affinity mode: "untied" or "tied" (the
default). In
tied mode, once a task starts executing in a particular thread, it must
continue to execute in that thread until completion. An untied task can
be passed in between threads.
  - If applicable, a list of metadata references specifying ordered and
serial/critical regions within the task.
-- Intrinsics --
Because metadata does not count as a variable use, and some runtime
controls take general expressions, supporting these requires
intrinsics. Many of these intrinsics are tied to their parent parallel
regions by taking a metadata parameter specifying the parallel region,
loop, etc.
void @llvm.parallel.if(i1, !) - Takes a boolean expression controlling
whether the referenced region (or task) is executed in parallel (the
true case) or in serial (the false case). For a task, this controls the
choice between queued or immediate in-place execution.
void @llvm.parallel.final(i1, !) - Takes a boolean expression
controlling whether the referenced task is considered final. A final
task can have no subtasks (or, for that matter, nested parallel
regions).
void @llvm.parallel.setnt(i32, !) - Specify the number of threads used
to execute the parallel region.
i32 @llvm.parallel.nthreads(!) - Determine the total number of threads
that will be used to execute the referenced parallel region (this is
used to setup the array for reductions).
i32 @llvm.parallel.tidx(!) - Obtain the current thread index; this is
not the global thread id, or even the application-specific thread id.
These indices run only from 0 through one less than the total number of
threads active in the referenced region (this is used to access
elements in a reduction array).
void @llvm.parallel.chunksz(i32 or i64, !) - Specify the size of the
chunks used to decompose a parallel loop. The metadata reference is to
the metadata which tags the loop backedges.
void @llvm.parallel.barrier() - A barrier for all threads in the
current parallel region.
void @llvm.parallel.taskwait() - Wait for all child tasks of the
current task (or all top-level tasks).
void @llvm.parallel.taskyield() - Optionally yield execution to other
tasks.
---- Parallel Sections ----
OpenMP parallel sections are lowered as parallel loops. The loop
executes a fixed number of times (once per section), and within the
loop body a switch statement selects the correct section (in order)
based on the iteration number.
---- Thread-Local Data ----
#pragma omp threadprivate(<variable-list>), which applies only to
global variables, is handled by declaring global variables with the
existing thread_local attribute.
---- Atomic Operations ----
OpenMP atomic operations are encoded using existing LLVM atomic
intrinsics.
---- Flush ----
In general, an OpenMP flush operation, regardless of the contents of
the variable list, can be lowered as: fence seq_cst.
---- Passes ----
-- Early Passes --
ParallelRegionWidening - This is an early pass that tries to combine
consecutive parallel regions. Non-parallel "in between" regions can be
converted into serialized blocks. This can be done so long as any
reductions can be delayed until the end of the last region, and any
converted serial regions do not have external function calls or inline
assembly regions (both of which could be sensitive to the real number
of active threads). This not only reduces thread-startup overhead, but
will also allow other optimizations, such as loop fusion.
-- Late Passes (Lowering) --
The parallelization lowering will be done by IR level passes in CodeGen
prior to SelectionDAG conversion. Currently, this means after
loop-strength reduction. Like loop-strength reduction, these IR level
passes will get a TLI object pointer and will have target-specific
override capabilities.
ParallelizationCleanup - This pass will be scheduled prior to the other
parallelization lowering passes (and anywhere else we decide). Its job
is to remove parallelization metadata that had been rendered
inconsistent by earlier optimization passes. When a parallelization
region is removed, any parallelization intrinsics that can be removed
are then also removed.
ParallelizationLowering - This pass will actual lower paralleliztion
constructs into a combination of runtime-library calls and, optionally,
target-specific intrinsics. I think that an initial generic
implementation will target libgomp.
* I would like to see support for OpenMP 3.1 [1] plus an extension for
  user-defined-reductions (UDRs) [2]. 
[1] OpenMP Specification 3.1. July, 2011.
    http://www.openmp.org/mp-documents/OpenMP3.1.pdf
[2] A. Duran, et al. "A proposal for User-Defined Reductions in
OpenMP". IWOMP, 2010.
http://www.ccs.tsukuba.ac.jp/workshop/IWOMP2010/slides/Alex-udrs.pdf
Thanks again,
Hal
-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
Pekka Jääskeläinen
2012-Aug-13  09:38 UTC
[LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
Hi, On 08/10/2012 11:06 PM, Hal Finkel wrote: > I'd like to see support in clang/LLVM for multi-core parallelism, > especially support for OpenMP. I think that the best way to do this is > by designing an LLVM-based API (metadata and intrinsics) for > expressing parallelism constructs, and having clang lower OpenMP code > to that API. This will allow maximal preservation of optimization > capabilities including target-specific lowering. What follows outlines > a set of metadata and intrinsics which should allow support for the > full OpenMP specification, and I'd like to know what the community > thinks about this. Something like this would be useful also for OpenCL C work group parallelization. At the moment in pocl we do this in a hackish way with an "overkill" OpenCL C-specific metadata that is fed to a modified bb-vectorizer of yours for autovectorization and a custom alias analyzer for AA benefits. I'd like to remind that multithreading is just one option on how to map the "parallel regions/loops" in parallel programs to parallel hardware. Within a single core, vectorization/DLP (SIMD/vector extensions) and static ILP (basically VLIW) are the other interesting ones. In order to exploit all the parallel resources one could try to intelligently combine the mapping over all of those. Also, one user of this metadata could be the alias analysis: it should be easy to write an AA that can exploit the parallelism information. Parallel regions by definition do not have (defined) dependencies between each other (between synchronization points) which should be useful information for optimization purposes even if parallel hardware was not targeted.> - Loops - > > Parallel loops are indicated by tagging all backedge branches with > 'parallel' metadata. This metadata has the following entries: > - The string "loop" > - A metadata reference to the parent parallel-region metadata > - Optionally, a string specifying the scheduling mode: "static", > "dynamic", "guided", "runtime", or "auto" (the default) > - Optionally, an integer specifying the number of loop levels over > which to parallelize (the default is 1) > - If applicable, a list of metadata references specifying ordered and > serial/critical regions within the loop.IMHO the generic metadata used to mark parallelism (basically to denote independence of iterations in this case) should be separated from OpenMP- specific ones such as the scheduling mode. After all, there are and will be more of parallel programming languages/standards in the future than just OpenMP that could generate this new metadata and get the mapping to the parallel hardware (via thread library calls or autovectorization, for example) automagically.> -- Late Passes (Lowering) -- > > The parallelization lowering will be done by IR level passes in CodeGen > prior to SelectionDAG conversion. Currently, this means after > loop-strength reduction. Like loop-strength reduction, these IR level > passes will get a TLI object pointer and will have target-specific > override capabilities. > > ParallelizationCleanup - This pass will be scheduled prior to the other > parallelization lowering passes (and anywhere else we decide). Its job > is to remove parallelization metadata that had been rendered > inconsistent by earlier optimization passes. When a parallelization > region is removed, any parallelization intrinsics that can be removed > are then also removed. > > ParallelizationLowering - This pass will actual lower paralleliztion > constructs into a combination of runtime-library calls and, optionally, > target-specific intrinsics. I think that an initial generic > implementation will target libgomp.A vectorization pass could trivially vectorize parallel loops without calls etc. here. BR, -- Pekka
Hal Finkel
2012-Aug-13  19:54 UTC
[LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
On Mon, 13 Aug 2012 12:38:02 +0300 Pekka Jääskeläinen <pekka.jaaskelainen at tut.fi> wrote:> Hi, > > On 08/10/2012 11:06 PM, Hal Finkel wrote: > > I'd like to see support in clang/LLVM for multi-core parallelism, > > especially support for OpenMP. I think that the best way to do > > this is by designing an LLVM-based API (metadata and intrinsics) > > for expressing parallelism constructs, and having clang lower > > OpenMP code to that API. This will allow maximal preservation of > > optimization capabilities including target-specific lowering. What > > follows outlines a set of metadata and intrinsics which should > > allow support for the full OpenMP specification, and I'd like to > > know what the community thinks about this. > > Something like this would be useful also for OpenCL C > work group parallelization. At the moment in pocl we do thisI had thought about uses for shared-memory OpenCL implementations, but I don't know enough about the use cases to make a specific proposal. Is your metadata documented anywhere?> in a > hackish way with an "overkill" OpenCL C-specific metadata that is fed > to a modified bb-vectorizer of yours for autovectorization and > a custom alias analyzer for AA benefits. > > I'd like to remind that multithreading is just one option on how > to map the "parallel regions/loops" in parallel programs to parallel > hardware. Within a single core, vectorization/DLP (SIMD/vector > extensions) and static ILP (basically VLIW) are the other interesting > ones. In order to exploit all the parallel resources one could try to > intelligently combine the mapping over all of those.I agree, and this is specifically why I don't want to support OpenMP by lowering it into runtime calls in the frontend. I want to allow for other optimizations (vectorization, etc.) in combination with (or instead of) multi-threading. I think that my current proposal allows for that.> > Also, one user of this metadata could be the alias analysis: it should > be easy to write an AA that can exploit the parallelism > information. Parallel regions by definition do not have (defined) > dependencies between each other (between synchronization points) which > should be useful information for optimization purposes even if > parallel hardware was not targeted.I really like this idea! -- and it sounds like you may already have something like this in POCL?> > > - Loops - > > > > Parallel loops are indicated by tagging all backedge branches with > > 'parallel' metadata. This metadata has the following entries: > > - The string "loop" > > - A metadata reference to the parent parallel-region metadata > > - Optionally, a string specifying the scheduling mode: "static", > > "dynamic", "guided", "runtime", or "auto" (the default) > > - Optionally, an integer specifying the number of loop levels > > over which to parallelize (the default is 1) > > - If applicable, a list of metadata references specifying > > ordered and serial/critical regions within the loop. > > IMHO the generic metadata used to mark parallelism (basically to > denote independence of iterations in this case) should be separated > from OpenMP- specific ones such as the scheduling mode. After all, > there are and will be more of parallel programming > languages/standards in the future than just OpenMP that could > generate this new metadata and get the mapping to the parallel > hardware (via thread library calls or autovectorization, for example) > automagically.I think that making the metadata more modular sounds like a good idea. Regarding having scheduling be separate, care is required to ensure correctness. A large constraint on the design of a metadata API is that different pieces of metadata can be independently dropped by transformation passes, and that must be made safe w.r.t. the correctness of the code. For example, if a user specified that an OpenMP loop is to be parallelized with runtime scheduling, then if an OpenMP parallel loop is generated, we need to be sure to honor the runtime scheduling mode. I've tried propose metadata with a sufficient amount of cross-referencing so that dropping any piece of metadata will preserve correctness (even if that means loosing a parallel region).> > > -- Late Passes (Lowering) -- > > > > The parallelization lowering will be done by IR level passes in > > CodeGen prior to SelectionDAG conversion. Currently, this means > > after loop-strength reduction. Like loop-strength reduction, these > > IR level passes will get a TLI object pointer and will have > > target-specific override capabilities. > > > > ParallelizationCleanup - This pass will be scheduled prior to the > > other parallelization lowering passes (and anywhere else we > > decide). Its job is to remove parallelization metadata that had > > been rendered inconsistent by earlier optimization passes. When a > > parallelization region is removed, any parallelization intrinsics > > that can be removed are then also removed. > > > > ParallelizationLowering - This pass will actual lower paralleliztion > > constructs into a combination of runtime-library calls and, > > optionally, target-specific intrinsics. I think that an initial > > generic implementation will target libgomp. > > A vectorization pass could trivially vectorize parallel loops > without calls etc. here.I agree. I think that vectorization is best done earlier in the optimization schedule. Vectorization, however, should appropriately update loop metadata to allow for proper integration with parallelization, etc. Lowering to runtime libraries (for multi-threading in whatever form) should be done relatively late in the process (because further higher-level optimizations are often not possible after that point). Thanks for your comments! Please feel free to propose specific metadata forms and/or intrinsics to capture your ideas; then we can work on combining them. -Hal> > BR,-- Hal Finkel Postdoctoral Appointee Leadership Computing Facility Argonne National Laboratory
Apparently Analagous Threads
- [LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
- [LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
- [LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
- [LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
- [LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)