Hal Finkel
2012-Aug-13 19:54 UTC
[LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
On Mon, 13 Aug 2012 12:38:02 +0300
Pekka Jääskeläinen <pekka.jaaskelainen at tut.fi> wrote:

> Hi,
>
> On 08/10/2012 11:06 PM, Hal Finkel wrote:
> > I'd like to see support in clang/LLVM for multi-core parallelism,
> > especially support for OpenMP. I think that the best way to do
> > this is by designing an LLVM-based API (metadata and intrinsics)
> > for expressing parallelism constructs, and having clang lower
> > OpenMP code to that API. This will allow maximal preservation of
> > optimization capabilities including target-specific lowering. What
> > follows outlines a set of metadata and intrinsics which should
> > allow support for the full OpenMP specification, and I'd like to
> > know what the community thinks about this.
>
> Something like this would be useful also for OpenCL C
> work group parallelization. At the moment in pocl we do this in a
> hackish way with an "overkill" OpenCL C-specific metadata that is fed
> to a modified bb-vectorizer of yours for autovectorization and
> a custom alias analyzer for AA benefits.

I had thought about uses for shared-memory OpenCL implementations, but
I don't know enough about the use cases to make a specific proposal.
Is your metadata documented anywhere?

> I'd like to remind that multithreading is just one option on how
> to map the "parallel regions/loops" in parallel programs to parallel
> hardware. Within a single core, vectorization/DLP (SIMD/vector
> extensions) and static ILP (basically VLIW) are the other interesting
> ones. In order to exploit all the parallel resources one could try to
> intelligently combine the mapping over all of those.

I agree, and this is specifically why I don't want to support OpenMP
by lowering it into runtime calls in the frontend. I want to allow for
other optimizations (vectorization, etc.) in combination with (or
instead of) multi-threading. I think that my current proposal allows
for that.

> Also, one user of this metadata could be the alias analysis: it should
> be easy to write an AA that can exploit the parallelism
> information. Parallel regions by definition do not have (defined)
> dependencies between each other (between synchronization points) which
> should be useful information for optimization purposes even if
> parallel hardware was not targeted.

I really like this idea! -- and it sounds like you may already have
something like this in POCL?

> > - Loops -
> >
> > Parallel loops are indicated by tagging all backedge branches with
> > 'parallel' metadata. This metadata has the following entries:
> >  - The string "loop"
> >  - A metadata reference to the parent parallel-region metadata
> >  - Optionally, a string specifying the scheduling mode: "static",
> >    "dynamic", "guided", "runtime", or "auto" (the default)
> >  - Optionally, an integer specifying the number of loop levels
> >    over which to parallelize (the default is 1)
> >  - If applicable, a list of metadata references specifying
> >    ordered and serial/critical regions within the loop.
>
> IMHO the generic metadata used to mark parallelism (basically to
> denote independence of iterations in this case) should be separated
> from OpenMP-specific ones such as the scheduling mode. After all,
> there are and will be more parallel programming
> languages/standards in the future than just OpenMP that could
> generate this new metadata and get the mapping to the parallel
> hardware (via thread library calls or autovectorization, for example)
> automagically.

I think that making the metadata more modular sounds like a good idea.
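To make the loop tagging concrete, here is a rough sketch of what a
tagged backedge might look like under the proposal (illustrative only;
the exact node layout is not final, and the contents of the region
node are elided):

    ; The backedge branch of a parallel loop carries 'parallel' metadata
    ; that refers back to its parent parallel-region node.
    for.body:
      ...
      br i1 %exitcond, label %for.end, label %for.body, !parallel !1

    !0 = metadata !{metadata !"region"}   ; parent parallel-region node (details elided)
    !1 = metadata !{metadata !"loop", metadata !0, metadata !"static", i32 1}
      ; entries: "loop", parent region, optional scheduling mode, optional nesting depth

The cross-reference from the loop node back to its region node is
deliberate; it matters for the metadata-dropping issue discussed next.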
Regarding having scheduling be separate, care is required to ensure
correctness. A large constraint on the design of a metadata API is
that different pieces of metadata can be independently dropped by
transformation passes, and that must be made safe w.r.t. the
correctness of the code. For example, if a user specified that an
OpenMP loop is to be parallelized with runtime scheduling, then
whenever an OpenMP parallel loop is actually generated, we need to be
sure to honor the runtime scheduling mode. I've tried to propose
metadata with a sufficient amount of cross-referencing so that
dropping any piece of metadata will preserve correctness (even if that
means losing a parallel region).

> > -- Late Passes (Lowering) --
> >
> > The parallelization lowering will be done by IR level passes in
> > CodeGen prior to SelectionDAG conversion. Currently, this means
> > after loop-strength reduction. Like loop-strength reduction, these
> > IR level passes will get a TLI object pointer and will have
> > target-specific override capabilities.
> >
> > ParallelizationCleanup - This pass will be scheduled prior to the
> > other parallelization lowering passes (and anywhere else we
> > decide). Its job is to remove parallelization metadata that had
> > been rendered inconsistent by earlier optimization passes. When a
> > parallelization region is removed, any parallelization intrinsics
> > that can be removed are then also removed.
> >
> > ParallelizationLowering - This pass will actually lower
> > parallelization constructs into a combination of runtime-library
> > calls and, optionally, target-specific intrinsics. I think that an
> > initial generic implementation will target libgomp.
>
> A vectorization pass could trivially vectorize parallel loops
> without calls etc. here.

I agree. I think that vectorization is best done earlier in the
optimization schedule. Vectorization, however, should appropriately
update loop metadata to allow for proper integration with
parallelization, etc. Lowering to runtime libraries (for
multi-threading in whatever form) should be done relatively late in
the process (because further higher-level optimizations are often not
possible after that point).

Thanks for your comments! Please feel free to propose specific
metadata forms and/or intrinsics to capture your ideas; then we can
work on combining them.

 -Hal

> BR,

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
Pekka Jääskeläinen
2012-Aug-14 07:22 UTC
[LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
On 08/13/2012 10:54 PM, Hal Finkel wrote:
> I had thought about uses for shared-memory OpenCL implementations, but
> I don't know enough about the use cases to make a specific proposal. Is
> your metadata documented anywhere?

It is now a quick "brute force hack", that's why I got interested in
your proposal. We just wanted to communicate the OpenCL work item
information further down in the compiler as easily as possible and
didn't have time to beautify it.

Now all instructions of the "chained" OpenCL kernel instances (work
items) are annotated with their work item ID, their "parallel region
ID" (from which region between barriers the instruction originates)
and a sequence ID. So, lots of metadata bloat.

These annotations allow finding the matching instructions later on to
vectorize multiple work items together by just combining the matching
instructions from the different WIs. The alias analyzer uses this
metadata to return NO_ALIAS for any memory access combination where
the accesses are from different work items within the same parallel
region (the specs say if they do alias, the results are undefined,
thus a programmer's fault).

With your annotations this hack could probably be cleaned up by using
the "parallel for loop" metadata which the vectorizer and/or "thread
lib call injector" (or the static instruction scheduler for a
VLIW/TTA) can then use to parallelize the kernel as desired.

I'd remind that its usefulness is not limited to a shared memory
multicore (or even multicore) for the kernel execution device. All
non-SIMT targets require laying out the code for all the work-items
(like they were parallel for loops, unrolled or vectorized or not) for
valid OpenCL kernel execution when there are more than 1 WI per
work-group, and thus potentially benefit from this.

> I agree, and this is specifically why I don't want to support OpenMP by
> lowering it into runtime calls in the frontend. I want to allow for
> other optimizations (vectorization, etc.) in combination
> with (or instead of) multi-threading. I think that my current proposal
> allows for that.

Yes it should, as far as I can see. If the loop body is a function and
the iteration count (or its multiple) is known, one should be able to
vectorize multiple copies of the function without dependence checking.
In the multi-WI OpenCL C case this function would contain the code for
a single work item within a region between barriers (implicit or not).

I'm unsure if forcing the function extraction of the parallel regions
brings unnecessary problems or not. Another option would be to mark
the basic blocks that form parallel regions. Maybe all of the BBs
could be marked with a PR identifier MD? This would require BB
metadata (are they supported?).

>> Also, one user of this metadata could be the alias analysis: it should
>> be easy to write an AA that can exploit the parallelism
>> information. Parallel regions by definition do not have (defined)
>> dependencies between each other (between synchronization points) which
>> should be useful information for optimization purposes even if
>> parallel hardware was not targeted.
>
> I really like this idea! -- and it sounds like you may already have
> something like this in POCL?

Yes, an OpenCL AA that exploits the work-item independence and address
space independence.
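Roughly, the idea is something like this (not pocl's actual metadata
encoding, just an illustration): each memory access carries a node
with its parallel-region ID, work-item ID and sequence ID, and the AA
looks at the first two:

    ; Illustration only -- not pocl's real encoding.
    %x = load i32* %p, !wi !1
    %y = load i32* %q, !wi !2
    ...
    !1 = metadata !{i32 0, i32 0, i32 7}   ; region 0, work item 0, seq 7
    !2 = metadata !{i32 0, i32 1, i32 7}   ; region 0, work item 1, seq 7
    ; Same region but different work items, so the OpenCL AA can answer
    ; NO_ALIAS for %p vs. %q (the spec makes such aliasing undefined).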
With your annotations there could be a generic AA for the
"independence information from parallelism metadata" part and a
separate OpenCL-specific AA for the rest.

> Regarding having scheduling be separate, care is required to ensure
> correctness. A large constraint on the design of a metadata API is
> that

OK, I see.

I suppose it's not a big deal to add the scheduling property. At least
if one (later) allows adding scheduling modes supported by standards
other than OpenMP as well, i.e., not modes like "static" but
"openmp31_static" or similar. For OpenCL work item loops the
scheduling mode could be "auto" or left empty.

> I agree. I think that vectorization is best done earlier in the
> optimization schedule. Vectorization, however, should appropriately
> update loop metadata to allow for proper integration with
> parallelization, etc. Lowering to runtime libraries (for
> multi-threading in whatever form) should be done relatively late in
> the process (because further higher-level optimizations are often not
> possible after that point).

Yes, to enable automatic mixing of vectorization and threading from
the single (data parallel) kernel.

-- 
Pekka
Hal Finkel
2012-Aug-14 15:51 UTC
[LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
On Tue, 14 Aug 2012 10:22:35 +0300
Pekka Jääskeläinen <pekka.jaaskelainen at tut.fi> wrote:

> On 08/13/2012 10:54 PM, Hal Finkel wrote:
> > I had thought about uses for shared-memory OpenCL implementations,
> > but I don't know enough about the use cases to make a specific
> > proposal. Is your metadata documented anywhere?
>
> It is now a quick "brute force hack", that's why I got interested in
> your proposal. We just wanted to communicate the OpenCL work item
> information further down in the compiler as easily as possible and
> didn't have time to beautify it.
>
> Now all instructions of the "chained" OpenCL kernel instances
> (work items) are annotated with their work item ID, their "parallel
> region ID" (from which region between barriers the instruction
> originates) and a sequence ID. So, lots of metadata bloat.
>
> These annotations allow finding the matching instructions later on to
> vectorize multiple work items together by just combining the matching
> instructions from the different WIs. The alias analyzer uses this
> metadata to return NO_ALIAS for any memory access combination where
> the accesses are from different work items within the same parallel
> region (the specs say if they do alias, the results are undefined,
> thus a programmer's fault).
>
> With your annotations this hack could probably be cleaned up by using
> the "parallel for loop" metadata which the vectorizer and/or "thread
> lib call injector" (or the static instruction scheduler for a
> VLIW/TTA) can then use to parallelize the kernel as desired.
>
> I'd remind that its usefulness is not limited to a shared memory
> multicore (or even multicore) for the kernel execution device. All
> non-SIMT targets require laying out the code for all the work-items
> (like they were parallel for loops, unrolled or vectorized or not) for
> valid OpenCL kernel execution when there are more than 1 WI per
> work-group, and thus potentially benefit from this.

Fair enough. My thought process here was that, first, I was not going
to propose anything specifically for non-shared-memory systems (those
require data-copying directives, and I'd want to let others who have
experience with those do the proposing), and second, I was not going
to propose anything specifically for multi-target (heterogeneous)
systems. I think that single-target shared-memory systems fall into
the model I've sketched, and support for anything else will require
further extension.

> > I agree, and this is specifically why I don't want to support
> > OpenMP by lowering it into runtime calls in the frontend. I want to
> > allow for other optimizations (vectorization, etc.) in combination
> > with (or instead of) multi-threading. I think that my current
> > proposal allows for that.
>
> Yes it should, as far as I can see. If the loop body is a function and
> the iteration count (or its multiple) is known, one should be able to
> vectorize multiple copies of the function without dependence
> checking. In the multi-WI OpenCL C case this function would contain
> the code for a single work item within a region between barriers
> (implicit or not).
>
> I'm unsure if forcing the function extraction of the parallel
> regions brings unnecessary problems or not. Another option would be to
> mark the basic blocks that form parallel regions. Maybe all of the BBs
> could be marked with a PR identifier MD? This would require BB
> metadata (are they supported?).

I thought about this.
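Just to make sure we mean the same thing, the BB-marking alternative
would presumably look something like the following (purely
hypothetical; there is currently no way to attach metadata to basic
blocks, so the tags appear only as comments here):

    ; Hypothetical illustration -- basic blocks cannot carry metadata today.
    parregion.bb1:   ; would carry something like !parallel.region !0
      ...
    parregion.bb2:   ; would carry !parallel.region !0 as well
      ...
    !0 = metadata !{metadata !"region"}   ; identifies the parallel region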
There had been some patches provided for BB metadata (by Ralf
Karrenberg back in May); I don't recall what happened with those. BB
metadata might work, but I worry about existing optimization passes,
which don't know about this metadata, moving things in and out of
parallel regions in illegal ways: for example, moving a call to some
get_number_of_threads() function, or some inline assembly region, in
or out of a parallel region. Putting things in functions just seemed
safer (and BB metadata is not upstream). Also, it would require extra
checking to keep the parallel basic blocks together. Furthermore, in
many cases, the parallel regions need to end up as separate functions
anyway (because they're passed as callbacks to the runtime library).

> >> Also, one user of this metadata could be the alias analysis: it
> >> should be easy to write an AA that can exploit the parallelism
> >> information. Parallel regions by definition do not have (defined)
> >> dependencies between each other (between synchronization points)
> >> which should be useful information for optimization purposes even
> >> if parallel hardware was not targeted.
> >
> > I really like this idea! -- and it sounds like you may already have
> > something like this in POCL?
>
> Yes, an OpenCL AA that exploits the work-item independence and address
> space independence. With your annotations there could be a generic
> AA for the "independence information from parallelism metadata" part
> and a separate OpenCL-specific AA for the rest.
>
> > Regarding having scheduling be separate, care is required to ensure
> > correctness. A large constraint on the design of a metadata API is
> > that
>
> OK, I see.
>
> I suppose it's not a big deal to add the scheduling property. At
> least if one (later) allows adding scheduling modes supported by
> standards other than OpenMP as well, i.e., not modes like "static"
> but "openmp31_static" or similar. For OpenCL work item loops the
> scheduling mode could be "auto" or left empty.

I think that this makes sense. For some things, like 'static', we can
define backend-independent semantics. For other things, like OpenMP's
'runtime', which is tied to how the application calls OpenMP runtime
functions, I agree, we should probably call that 'openmp_runtime' (or
something like that).

> > I agree. I think that vectorization is best done earlier in the
> > optimization schedule. Vectorization, however, should appropriately
> > update loop metadata to allow for proper integration with
> > parallelization, etc. Lowering to runtime libraries (for
> > multi-threading in whatever form) should be done relatively late in
> > the process (because further higher-level optimizations are often
> > not possible after that point).
>
> Yes, to enable automatic mixing of vectorization and threading from
> the single (data parallel) kernel.

Yep, that is exactly what I want to be able to do.

Thanks again,
Hal

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory