Hal Finkel
2012-Aug-10 20:06 UTC
[LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
Hello, I'd like to see support in clang/LLVM for multi-core parallelism, especially support for OpenMP. I think that the best way to do this is by designing an LLVM-based API (metadata and intrinsics) for expressing parallelism constructs, and having clang lower OpenMP code to that API. This will allow maximal preservation of optimization capabilities, including target-specific lowering. What follows outlines a set of metadata and intrinsics which should allow support for the full OpenMP specification, and I'd like to know what the community thinks about this.

As a general note: my intent here is to make the metadata safe in the traditional sense: it can be removed by optimization passes that don't understand it, and while this might result in the loss of the parallelization, the removal will not be otherwise unsafe. I believe that many existing passes will require minor modifications in order to preserve the metadata as appropriate, but I think these changes are relatively small. In addition, the authors of passes that preserve parallelization by dealing with parallelization metadata will need to explicitly think about how to handle it; hopefully, this will yield fewer bugs.

In the following I will outline the API and explain how OpenMP will be lowered. My idea is to follow OpenMP's semantics, so if these differ from the OpenMP spec, then I'd like to correct that. If there are other parallelism models that we would like to support, then I think those can be incorporated as well (maybe something with lightweight tasks such as Cilk).

---- Parallel Regions ----

Inside a parallel region, a team of threads executes the sequence of instructions. A parallel region is specified by a function. This function may be executed by one or more threads in parallel. In terms of OpenMP:

private() variables become variables local to the function.

firstprivate() variables become parameters to the function.

shared() variables become pass-by-pointer parameters. If the shared variable is not a global, then we allocate a local copy, using an alloca followed by a store, and pass the new pointer to the function.

For copyin() variables, we pass a copy of the variable to the function, and the function then uses that copy to update the thread's version of the (TLS) variable.

The function should have private (or internal) linkage for optimization purposes.

To mark this function as a parallel region, a module-level 'parallel' metadata entry is created. The call site(s) of this function are marked with this metadata. The metadata has entries:
 - The string "region"
 - A reference to the parallel-region function
 - If applicable, a list of metadata references specifying special-handling child regions (parallel loops and serialized/critical regions)

If the special-handling region metadata is no longer referenced by code within the parallel region, then the region has become invalid, and will be removed (meaning all parallelization metadata will be removed) by the ParallelizationCleanup pass. The same is true for all other cross-referenced metadata below. Note that parallel regions can be nested.

As a quick example, something like:

int main() {
  int a;
#pragma omp parallel firstprivate(a)
  do_something(a);
  ...
}

becomes something like:

define private void @parreg(i32 %a) {
entry:
  call void @do_something(i32 %a)
  ret void
}

define i32 @main() {
entry:
  ...
  call void @parreg(i32 %a) !parallel !0
  ...
}

!0 = metadata !{ metadata !"region", @parreg }
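For comparison, a shared() variable that is not a global might be lowered along the following lines. This is only a rough sketch using the same notation as the example above; the names (@parreg_sh, %b, !1) are invented for illustration, the reload after the call is an assumption about how the updated value would be read back, and in real code the update inside the region would of course need synchronization (e.g., a critical region):

define private void @parreg_sh(i32* %b) {
entry:
  %v = load i32* %b                 ; every thread works through the same pointer
  %v1 = add i32 %v, 1
  store i32 %v1, i32* %b
  ret void
}

define void @caller(i32 %b.init) {
entry:
  %b.addr = alloca i32              ; local copy for the shared variable
  store i32 %b.init, i32* %b.addr   ; initialize it with the current value
  call void @parreg_sh(i32* %b.addr) !parallel !1
  %b.final = load i32* %b.addr      ; value after the parallel region (assumed)
  ret void
}

!1 = metadata !{ metadata !"region", @parreg_sh }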
-- Reductions --

To handle reductions, first, the variable is converted into an output pass-by-pointer parameter to the function. The pointer refers to an array of values, one for each thread that will execute the region. After the region completes, a loop must be created to actually perform the requested reduction. Inside the parallel region, each thread accesses its value using its thread index as the array index. See the nthreads and tidx intrinsics below.

-- Special handling regions --

- Serial Regions -

Serial regions within parallel blocks (called 'single' in OpenMP) are executed by only one thread. As with parallel regions themselves, they are lowered as functions; the call site(s) of these functions are tagged with 'parallel' metadata. This metadata has entries:
 - The string "serial"
 - A reference to the single-region function
 - A metadata reference to the parent parallel-region or loop metadata
 - Optionally, a type: "master" or "any" (the default)

For regions with "master", only the master thread may execute the region.

- Critical Regions -

Critical regions are like serial regions, but they are executed by all threads with mutual exclusion. These are identified by 'parallel' metadata with entries:
 - The string "critical"
 - A reference to the critical-region function
 - A metadata reference to the parent parallel-region, loop, or task metadata
 - Optionally, a global name string used for non-local synchronization (all regions with the same name string are mutually exclusive)

- Loops -

Parallel loops are indicated by tagging all backedge branches with 'parallel' metadata. This metadata has the following entries:
 - The string "loop"
 - A metadata reference to the parent parallel-region metadata
 - Optionally, a string specifying the scheduling mode: "static", "dynamic", "guided", "runtime", or "auto" (the default)
 - Optionally, an integer specifying the number of loop levels over which to parallelize (the default is 1)
 - If applicable, a list of metadata references specifying ordered and serial/critical regions within the loop

Note that what makes this metadata safe is the cross-referencing between the parent region metadata, the loop metadata, and the metadata references on the instructions. If any of these are removed or become inconsistent, then the whole parallel region must be removed. The ParallelizationCleanup pass will check this prior to lowering.

To lower lastprivate() OpenMP variables, first we allocate a copy of the variable outside the loop. At the end of the loop body we insert a check to determine whether the current iteration is the last one (over all threads), and if so, we update the common copy with the local version.

Note that OpenMP loops with private, firstprivate, etc. clauses that cannot be made part of the parent parallel region will also need to be placed into their own functions to handle the relevant scope issues.

Ordered regions (those which must execute in the original iteration order) are lowered as functions, much in the same way as serial regions. The call site(s) are tagged with 'parallel' metadata. This metadata has entries:
 - The string "ordered"
 - A reference to the function specifying the ordered region
 - A metadata reference to the parent parallel loop

Serial regions and loops that don't have the 'nowait' OpenMP clause must be followed by a barrier intrinsic.
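To make the cross-referencing concrete, here is a rough sketch of a parallel loop inside a region, using the same notation as the earlier example; the names (@parreg2, !2, !3) and the "static" scheduling entry are invented for illustration:

define private void @parreg2(i32* %data) {
entry:
  br label %loop
loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %p = getelementptr i32* %data, i32 %i
  store i32 0, i32* %p
  %i.next = add i32 %i, 1
  %cond = icmp slt i32 %i.next, 1024
  ; the backedge branch carries the loop metadata
  br i1 %cond, label %loop, label %exit, !parallel !3
exit:
  ret void
}

; the call site of @parreg2 would be tagged with !parallel !2, as before

!2 = metadata !{ metadata !"region", @parreg2, metadata !3 }         ; region lists its child loop
!3 = metadata !{ metadata !"loop", metadata !2, metadata !"static" } ; loop refers back to its region

If !2, !3, or the !parallel tag on the backedge is dropped by some pass, the cross-references become inconsistent and ParallelizationCleanup would strip the whole region.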
- Tasks -

Explicit tasks are also lowered as functions, similar to the other special-handling regions. Their call site(s) are marked with 'parallel' metadata. Depending on the implementation, they may not actually start executing until the main thread executes a taskwait intrinsic or reaches the end of the parallel region. The task metadata has:
 - The string "task"
 - A reference to the function specifying the task
 - A metadata reference to the parent region, task, loop, etc.
 - Optionally, an affinity mode: "untied" or "tied" (the default). In tied mode, once a task starts executing in a particular thread, it must continue to execute in that thread until completion. An untied task can be passed between threads.
 - If applicable, a list of metadata references specifying ordered and serial/critical regions within the task

-- Intrinsics --

Because metadata does not count as a variable use, and some runtime controls take general expressions, supporting these requires intrinsics. Many of these intrinsics are tied to their parent parallel regions by taking a metadata parameter specifying the parallel region, loop, etc.

void @llvm.parallel.if(i1, !) - Takes a boolean expression controlling whether the referenced region (or task) is executed in parallel (the true case) or in serial (the false case). For a task, this controls the choice between queued or immediate in-place execution.

void @llvm.parallel.final(i1, !) - Takes a boolean expression controlling whether the referenced task is considered final. A final task can have no subtasks (or, for that matter, nested parallel regions).

void @llvm.parallel.setnt(i32, !) - Specifies the number of threads used to execute the parallel region.

i32 @llvm.parallel.nthreads(!) - Determines the total number of threads that will be used to execute the referenced parallel region (this is used to set up the array for reductions).

i32 @llvm.parallel.tidx(!) - Obtains the current thread index; this is not the global thread id, or even the application-specific thread id. These indices run only from 0 through one less than the total number of threads active in the referenced region (this is used to access elements in a reduction array).

void @llvm.parallel.chunksz(i32 or i64, !) - Specifies the size of the chunks used to decompose a parallel loop. The metadata reference is to the metadata which tags the loop backedges.

void @llvm.parallel.barrier() - A barrier for all threads in the current parallel region.

void @llvm.parallel.taskwait() - Waits for all child tasks of the current task (or all top-level tasks).

void @llvm.parallel.taskyield() - Optionally yields execution to other tasks.
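As an illustration of how the reduction array and the nthreads/tidx intrinsics above might fit together, here is a rough sketch; the intrinsics do not exist yet, the names (@reduce_driver, @parreg_red, !5) are invented for this example, and the zero-initialization and final combining loop are elided:

declare i32 @llvm.parallel.nthreads(metadata)
declare i32 @llvm.parallel.tidx(metadata)

define i32 @reduce_driver() {
entry:
  %nt = call i32 @llvm.parallel.nthreads(metadata !5)
  %partial = alloca i32, i32 %nt            ; one reduction slot per thread
  ; ... zero-initialize the array ...
  call void @parreg_red(i32* %partial) !parallel !5
  ; ... loop from 0 to %nt-1, combining the slots into the final value ...
  ret i32 0
}

define private void @parreg_red(i32* %partial) {
entry:
  %idx = call i32 @llvm.parallel.tidx(metadata !5)
  %slot = getelementptr i32* %partial, i32 %idx
  ; ... this thread accumulates its contribution through %slot ...
  ret void
}

!5 = metadata !{ metadata !"region", @parreg_red }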
---- Parallel Sections ----

OpenMP parallel sections are lowered as parallel loops. The loop executes a fixed number of times (once per section), and within the loop body a switch statement selects the correct section (in order) based on the iteration number.

---- Thread-Local Data ----

#pragma omp threadprivate(<variable-list>), which applies only to global variables, is handled by declaring the global variables with the existing thread_local attribute.

---- Atomic Operations ----

OpenMP atomic operations are encoded using existing LLVM atomic intrinsics.

---- Flush ----

In general, an OpenMP flush operation, regardless of the contents of the variable list, can be lowered as: fence seq_cst.

---- Passes ----

-- Early Passes --

ParallelRegionWidening - This is an early pass that tries to combine consecutive parallel regions. Non-parallel "in between" regions can be converted into serialized blocks. This can be done so long as any reductions can be delayed until the end of the last region, and any converted serial regions do not have external function calls or inline assembly (both of which could be sensitive to the real number of active threads). This not only reduces thread-startup overhead, but also allows other optimizations, such as loop fusion.

-- Late Passes (Lowering) --

The parallelization lowering will be done by IR level passes in CodeGen prior to SelectionDAG conversion. Currently, this means after loop-strength reduction. Like loop-strength reduction, these IR level passes will get a TLI object pointer and will have target-specific override capabilities.

ParallelizationCleanup - This pass will be scheduled prior to the other parallelization lowering passes (and anywhere else we decide). Its job is to remove parallelization metadata that had been rendered inconsistent by earlier optimization passes. When a parallelization region is removed, any parallelization intrinsics that can be removed are then also removed.

ParallelizationLowering - This pass will actually lower parallelization constructs into a combination of runtime-library calls and, optionally, target-specific intrinsics. I think that an initial generic implementation will target libgomp.

* I would like to see support for OpenMP 3.1 [1] plus an extension for user-defined reductions (UDRs) [2].

[1] OpenMP Specification 3.1. July, 2011. http://www.openmp.org/mp-documents/OpenMP3.1.pdf
[2] A. Duran, et al. "A proposal for User-Defined Reductions in OpenMP". IWOMP, 2010. http://www.ccs.tsukuba.ac.jp/workshop/IWOMP2010/slides/Alex-udrs.pdf

Thanks again,
Hal

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
Pekka Jääskeläinen
2012-Aug-13 09:38 UTC
[LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
Hi,

On 08/10/2012 11:06 PM, Hal Finkel wrote:
> I'd like to see support in clang/LLVM for multi-core parallelism, especially support for OpenMP. I think that the best way to do this is by designing an LLVM-based API (metadata and intrinsics) for expressing parallelism constructs, and having clang lower OpenMP code to that API. This will allow maximal preservation of optimization capabilities including target-specific lowering. What follows outlines a set of metadata and intrinsics which should allow support for the full OpenMP specification, and I'd like to know what the community thinks about this.

Something like this would be useful also for OpenCL C work group parallelization. At the moment in pocl we do this in a hackish way with an "overkill" OpenCL C-specific metadata that is fed to a modified bb-vectorizer of yours for autovectorization and a custom alias analyzer for AA benefits.

I'd like to point out that multithreading is just one option for mapping the "parallel regions/loops" of parallel programs to parallel hardware. Within a single core, vectorization/DLP (SIMD/vector extensions) and static ILP (basically VLIW) are the other interesting ones. In order to exploit all the parallel resources, one could try to intelligently combine the mapping over all of those.

Also, one user of this metadata could be alias analysis: it should be easy to write an AA that can exploit the parallelism information. Parallel regions by definition do not have (defined) dependencies between each other (between synchronization points), which should be useful information for optimization purposes even if parallel hardware was not targeted.

> - Loops -
>
> Parallel loops are indicated by tagging all backedge branches with 'parallel' metadata. This metadata has the following entries:
> - The string "loop"
> - A metadata reference to the parent parallel-region metadata
> - Optionally, a string specifying the scheduling mode: "static", "dynamic", "guided", "runtime", or "auto" (the default)
> - Optionally, an integer specifying the number of loop levels over which to parallelize (the default is 1)
> - If applicable, a list of metadata references specifying ordered and serial/critical regions within the loop

IMHO the generic metadata used to mark parallelism (basically to denote independence of iterations in this case) should be separated from OpenMP-specific entries such as the scheduling mode. After all, there are and will be more parallel programming languages/standards in the future than just OpenMP that could generate this new metadata and get the mapping to the parallel hardware (via thread library calls or autovectorization, for example) automagically.

> -- Late Passes (Lowering) --
>
> The parallelization lowering will be done by IR level passes in CodeGen prior to SelectionDAG conversion. Currently, this means after loop-strength reduction. Like loop-strength reduction, these IR level passes will get a TLI object pointer and will have target-specific override capabilities.
>
> ParallelizationCleanup - This pass will be scheduled prior to the other parallelization lowering passes (and anywhere else we decide). Its job is to remove parallelization metadata that had been rendered inconsistent by earlier optimization passes. When a parallelization region is removed, any parallelization intrinsics that can be removed are then also removed.
> ParallelizationLowering - This pass will actually lower parallelization constructs into a combination of runtime-library calls and, optionally, target-specific intrinsics. I think that an initial generic implementation will target libgomp.

A vectorization pass could trivially vectorize parallel loops without calls etc. here.

BR,
--
Pekka
Hal Finkel
2012-Aug-13 19:54 UTC
[LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
On Mon, 13 Aug 2012 12:38:02 +0300, Pekka Jääskeläinen <pekka.jaaskelainen at tut.fi> wrote:

> Hi,
>
> On 08/10/2012 11:06 PM, Hal Finkel wrote:
> > I'd like to see support in clang/LLVM for multi-core parallelism, especially support for OpenMP. I think that the best way to do this is by designing an LLVM-based API (metadata and intrinsics) for expressing parallelism constructs, and having clang lower OpenMP code to that API. This will allow maximal preservation of optimization capabilities including target-specific lowering. What follows outlines a set of metadata and intrinsics which should allow support for the full OpenMP specification, and I'd like to know what the community thinks about this.
>
> Something like this would be useful also for OpenCL C work group parallelization. At the moment in pocl we do this

I had thought about uses for shared-memory OpenCL implementations, but I don't know enough about the use cases to make a specific proposal. Is your metadata documented anywhere?

> in a hackish way with an "overkill" OpenCL C-specific metadata that is fed to a modified bb-vectorizer of yours for autovectorization and a custom alias analyzer for AA benefits.
>
> I'd like to point out that multithreading is just one option for mapping the "parallel regions/loops" of parallel programs to parallel hardware. Within a single core, vectorization/DLP (SIMD/vector extensions) and static ILP (basically VLIW) are the other interesting ones. In order to exploit all the parallel resources, one could try to intelligently combine the mapping over all of those.

I agree, and this is specifically why I don't want to support OpenMP by lowering it into runtime calls in the frontend. I want to allow for other optimizations (vectorization, etc.) in combination with (or instead of) multi-threading. I think that my current proposal allows for that.

> Also, one user of this metadata could be alias analysis: it should be easy to write an AA that can exploit the parallelism information. Parallel regions by definition do not have (defined) dependencies between each other (between synchronization points), which should be useful information for optimization purposes even if parallel hardware was not targeted.

I really like this idea! -- and it sounds like you may already have something like this in POCL?

> > - Loops -
> >
> > Parallel loops are indicated by tagging all backedge branches with 'parallel' metadata. This metadata has the following entries:
> > - The string "loop"
> > - A metadata reference to the parent parallel-region metadata
> > - Optionally, a string specifying the scheduling mode: "static", "dynamic", "guided", "runtime", or "auto" (the default)
> > - Optionally, an integer specifying the number of loop levels over which to parallelize (the default is 1)
> > - If applicable, a list of metadata references specifying ordered and serial/critical regions within the loop
>
> IMHO the generic metadata used to mark parallelism (basically to denote independence of iterations in this case) should be separated from OpenMP-specific entries such as the scheduling mode. After all, there are and will be more parallel programming languages/standards in the future than just OpenMP that could generate this new metadata and get the mapping to the parallel hardware (via thread library calls or autovectorization, for example) automagically.

I think that making the metadata more modular sounds like a good idea.
Regarding having scheduling be separate, care is required to ensure correctness. A large constraint on the design of a metadata API is that different pieces of metadata can be independently dropped by transformation passes, and that must be made safe w.r.t. the correctness of the code. For example, if a user specified that an OpenMP loop is to be parallelized with runtime scheduling, and a parallel loop is in fact generated, then we need to be sure to honor the runtime scheduling mode. I've tried to propose metadata with a sufficient amount of cross-referencing so that dropping any piece of metadata will preserve correctness (even if that means losing a parallel region).

> > -- Late Passes (Lowering) --
> >
> > The parallelization lowering will be done by IR level passes in CodeGen prior to SelectionDAG conversion. Currently, this means after loop-strength reduction. Like loop-strength reduction, these IR level passes will get a TLI object pointer and will have target-specific override capabilities.
> >
> > ParallelizationCleanup - This pass will be scheduled prior to the other parallelization lowering passes (and anywhere else we decide). Its job is to remove parallelization metadata that had been rendered inconsistent by earlier optimization passes. When a parallelization region is removed, any parallelization intrinsics that can be removed are then also removed.
> >
> > ParallelizationLowering - This pass will actually lower parallelization constructs into a combination of runtime-library calls and, optionally, target-specific intrinsics. I think that an initial generic implementation will target libgomp.
>
> A vectorization pass could trivially vectorize parallel loops without calls etc. here.

I agree. I think that vectorization is best done earlier in the optimization schedule. Vectorization, however, should appropriately update loop metadata to allow for proper integration with parallelization, etc. Lowering to runtime libraries (for multi-threading in whatever form) should be done relatively late in the process (because further higher-level optimizations are often not possible after that point).

Thanks for your comments! Please feel free to propose specific metadata forms and/or intrinsics to capture your ideas; then we can work on combining them.

 -Hal

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory