Vinay Madhusudan via llvm-dev
2020-Feb-14 18:20 UTC
[llvm-dev] About OpenMP dialect in MLIR
Thanks for the reply! It sounds like LLVM IR is being considered for optimizations in OpenMP constructs. There seem to be plans for improving the LLVM IR framework to provide what is required for OpenMP / flang(?). Are there any design considerations which contain pros and cons about using MLIR vs. LLVM IR for the various OpenMP-related optimizations / transformations? The latest RFC [ (3) in my original post ] mentions that:

> So there exist some questions regarding where the optimisations should be carried out.

Could you please provide more details on this?

I would like to quote Chris here:

“if you ignore the engineering expense, it would clearly make sense to reimplement the mid-level LLVM optimizers on top of MLIR and replace include/llvm/IR with a dialect definition in MLIR instead.“ -- http://lists.llvm.org/pipermail/llvm-dev/2020-January/138341.html

*Rest of the comments are inlined.*

On Thu, Feb 13, 2020 at 11:48 PM Johannes Doerfert <jdoerfert at anl.gov> wrote:

> Hi Vinay,
>
> Thanks for taking an interest and the detailed discussion.
>
> To start by picking a few paragraphs from your email to clarify a couple of things that led to the current design or that might otherwise need clarification. We can talk about other points later as well.
>
> [ Side notes:
> 1) I'm not an MLIR person.
> 2) It seems unfortunate that we do not have a mlir-dev list. ]
>
> > 1. With the current design, the number of transformations / optimizations that one can write on OpenMP constructs would become limited as there can be any custom loop structure with custom operations / types inside it.
>
> OpenMP, as an input language, does not make many assumptions about the code inside of constructs*.

This isn’t entirely correct because the current OpenMP API specification (https://www.openmp.org/spec-html/5.0/openmpch1.html) assumes that the code inside the constructs belongs to C, C++ and Fortran programs.

> So, inside a parallel can be almost anything the base language has to offer, both lexically and dynamically.

I am mostly concerned with the MLIR side of things for the OpenMP representation.

MLIR can support not only operations for general-purpose languages like C, C++, Fortran, etc. but also various domain-specific language representations as dialects (for example, ML, etc.). Note that there is also the SPIR-V dialect, which is again meant for “parallel compute”.

It becomes important to define the scope of the dialects / operations / types supported inside OpenMP operations in MLIR.

> Assuming otherwise is not going to work. Analyzing a "generic" OpenMP representation in order to determine if it can be represented as a more restricted "op" seems at least plausible. You will run into various issues, some mentioned explicitly below.

Isn’t it the other way around? For example, it doesn’t make much sense to wrap OpenMP operations around SPIR-V operations / types. I think it is important to specify (in the design) which existing MLIR dialects are supported in this effort and the various lowerings / transformations / optimizations which are planned for them.

> For starters, you still have to generate proper OpenMP runtime calls, e.g., from your GPU dialect, even if it is "just" to make sure the OMPD/OMPT interfaces expose useful information.

You can have a well-defined call-like mlir::Operation which calls the GPU kernel, which makes it easier to perform all cross-device transformations. This operation can then be lowered to OpenMP runtime calls during LLVM dialect conversion. 
I think this is much better than directly having calls to the OpenMP runtime library based on a kernel name mentioned in llvm::GlobalVariable.> > * I preclude the `omp loop` construct here as it is not even implemented > anywhere as far as I know. > > > > 2. It would also be easier to transform the Loop nests containing OpenMP > > constructs if the body of the OpenMP operations is well defined (i.e., > does > > not accept arbitrary loop structures). Having nested redundant > "parallel" , > > "target" and "do" regions seems unnecessary. > > As mentioned above, you cannot start with the assumption OpenMP input is > structured this this way. You have to analyze it first. This is the same > reason we cannot simply transform C/C++ `for loops` into `affine.for` > without proper analysis of the loop body. > > Now, more concrete. Nested parallel and target regions are not > necessarily redundant, nor can/should we require the user not to have > them. Nested parallelism can easily make sense, depending on the problem > decomposition. Nested target will make a lot of sense with reverse > offload, which is already in the standard, and it also should be allowed > for the sake of a modular (user) code base. >Just to be clear, having all three of “target”, “parallel” and “do” doesn’t represent “Nested parallelism” at all in the proposed design! ( 2(d) ). omp.target { omp.parallel { omp.do { …... } } } Above invokes a call to the tgt_target() for the code inside omp.do as mentioned in the proposal.> > > 3. There would also be new sets of loop structures in new dialects when > > C/C++ is compiled to MLIR. It would complicate the number of possible > > combinations inside the OpenMP region. > > Is anyone working on this? If so, what is the timeline? I personally was > not expecting Clang to switch over to MLIR any time soon but I am happy > if someone wants to correct me on this. I mention this only because it > interacts with the arguments I will make below. > > > > E. Lowering of target constructs mentioned in ( 2(d) ) specifies direct > > lowering to LLVM IR ignoring all the advantages that MLIR provides. Being > > able to compile the code for heterogeneous hardware is one of the biggest > > advantages that MLIR brings to the table. That is being completely missed > > here. This also requires solving the problem of handling target > information > > in MLIR. But that is a problem which needs to be solved anyway. Using GPU > > dialect also gives us an opportunity to represent offloading semantics in > > MLIR. > > I'm unsure what the problem with "handling target information in MLIR" is > but > whatever design we end up with, we need to know about the target > (triple) in all stages of the pipeline, even if it is just to pass it > down. > > > > Given the ability to represent multiple ModuleOps and the existence of > GPU > > dialect, couldn't higher level optimizations on offloaded code be done at > > MLIR level?. The proposed design would lead us to the same problems that > we > > are currently facing in LLVM IR. > > > > Also, OpenMP codegen will automatically benefit from the GPU dialect > based > > optimizations. For example, it would be way easier to hoist a memory > > reference out of GPU kernel in MLIR than in LLVM IR. > > While I agree with the premise that you can potentially reuse MLIR > transformations, it might not be as simple in practice. > > As mentioned above, you cannot assume much about OpenMP codes, almost > nothing for a lot of application codes I have seen. 
Some examples: > > If you have a function call, or any synchronization event for that > matter, located between two otherwise adjacent target regions (see > below), you cannot assume the two target regions will be offloaded to > the same device. > ``` > #omp target > {} > foo(); > #omp target > {} > ``` >These kinds of optimizations are much easier to write in MLIR: LLVM IR for the above code would contain a series of instructions of OpenMP runtime call setup and foo() in the middle followed by another set of OpenMP runtime related instructions. The body of the two target constructs would be in two different outlined functions (if not modules). It takes quite a bit of code to do analysis / transformation to write any optimization on the generated LLVM IR. vs. MLIR provides a way to represent the operations closer to the source. It is as simple as checking the next operation(s) in the mlir::Block. OpenMP target operation contains an inlined region which can easily be fused/ split / or any other valid transformation for that matter. Note that you can also perform various Control Structure Analysis / Transformations much easier in MLIR. For example, you can decide to execute foo() based on certain conditions, and you can merge the two target regions in the else path.> Similarly, you cannot assume a `omp parallel` is allowed to be executed > with more than a single thread, or that a `omp [parallel] for` does not > have loop carried data-dependences, ... >With multi-dimensional index support for arrays, wouldn’t it be better to do the data dependence analysis in MLIR? LLVM IR has linearized subscripts for multi-dimensional arrays. llvm::DependenceAnalysis tries to “guess” the indices based on different patterns in SCEV. It takes an intrinsic <http://llvm.org/devmtg/2020-04/talks.html#LightningTalk_88> or metadata or some other mechanism of communication from the front end (not the built-in set of instructions) to solve this problem.> Data-sharing attributes are also something that has to be treated > carefully: > ``` > x = 5; > #omp task > x = 3; > print(x); > ``` > Should print 5, not 3. >You can have “x” as a locally defined variable inside the “task” contained region in MLIR OR custom data-sharing attributes in OpenMP dialect.> > I hope I convinced you that OpenMP is not trivially mappable to existing > dialects without proper analysis. If not, please let me know why you > expect it to be. > > I do not see much reason why the issues you mentioned can’t trivially bemapped to the MLIR infrastructure. There is an easy way to define custom operations / types / attributes in OpenMP dialect and perform optimizations based on the *IR that is created especially for OpenMP*. The analysis / transformations required can be easily written on the custom operations defined rather than having a lowered form in the LLVM IR. The various dialects / transformations in MLIR are in development / early phase (Example, GPU dialect) waiting to be improved with use cases such as this!> > Now when it comes to code analyses, LLVM-IR offers a variety of > interesting features, ranging from a mature set of passes to the > cross-language LTO capabilities. We are working on the missing parts, > e.g., heterogeneous llvm::Modules as we speak. Simple OpenMP > optimizations are already present in LLVM and interesting ones are > prototyped for a while now (let me know if you want to see more not-yet > merged patches/optimizations). I also have papers, results, and > talks that might be interesting here. 
Let me know if you need pointers > to them. > > > Cheers, > Johannes > > > > On 02/13, Vinay Madhusudan via llvm-dev wrote: > > Hi, > > > > I have few questions / concerns regarding the design of OpenMP dialect in > > MLIR that is currently being implemented, mainly for the f18 compiler. > > Below, I summarize the current state of various efforts in clang / f18 / > > MLIR / LLVM regarding this. Feel free to add to the list in case I have > > missed something. > > > > 1. [May 2019] An OpenMPIRBuilder in LLVM was proposed for flang and clang > > frontends. Note that this proposal was before considering MLIR for FIR. > > > > a. llvm-dev proposal : > > > http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-May/000197.html > > > > b. Patches in review: https://reviews.llvm.org/D70290. This also > includes > > the clang codegen changes. > > > > 2. [July - September 2019] OpenMP dialect for MLIR was discussed / > > proposed with respect to the f18 compilation stack (keeping FIR in mind). > > > > a. flang-dev discussion link: > > https://lists.llvm.org/pipermail/flang-dev/2019-September/000020.html > > > > b. Design decisions captured in PPT: > > https://drive.google.com/file/d/1vU6LsblsUYGA35B_3y9PmBvtKOTXj1Fu/view > > > > c. MLIR google groups discussion: > > > https://groups.google.com/a/tensorflow.org/forum/#!topic/mlir/4Aj_eawdHiw > > > > d. Target constructs design: > > > http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-September/000285.html > > > > e. SIMD constructs design: > > > http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-September/000278.html > > > > 3. [Jan 2020] OpenMP dialect RFC in llvm discourse : > > https://llvm.discourse.group/t/rfc-openmp-dialect-in-mlir/397 > > > > 4. [Jan- Feb 2020] Implementation of OpenMP dialect in MLIR: > > > > a. The first patch which introduces the OpenMP dialect was pushed. > > > > b. Review of barrier construct is in progress: > > https://reviews.llvm.org/D72962 > > > > I have tried to list below different topics of interest (to different > > people) around this work. Most of these are in the design phase (or very > > new) and multiple parties are interested with different sets of goals in > > mind. > > > > I. Flang frontend and its integration > > > > II. Fortran representation in MLIR / FIR development > > > > III. OpenMP development for flang, OpenMP builder in LLVM. > > > > IV. Loop Transformations in MLIR / LLVM with respect to OpenMP. > > > > It looks like the design has evolved over time and there is no one place > > which contains the latest design decisions that fits all the different > > pieces of the puzzle. I will try to deduce it from the above mentioned > > references. Please correct me If I am referring to anything which has > > changed. > > > > A. For most OpenMP design discussions, FIR examples are used (as seen in > > (2) and (3)). The MLIR examples mentioned in the design only talks about > > FIR dialect and LLVM dialect. > > > > This completely ignores the likes of standard, affine (where most loop > > transformations are supposed to happen) and loop dialects. I think it is > > critical to decouple the OpenMP dialect development in MLIR from the > > current flang / FIR effort. It would be useful if someone can mention > these > > examples using existing dialects in MLIR and also how the different > > transformations / lowerings are planned. > > > > B. 
In latest RFC(3), it is mentioned that the initial OpenMP dialect > > version will be as follows, > > > > omp.parallel { > > > > omp.do { > > > > fir.do %i = 0 to %ub3 : !fir.integer { > > > > ... > > > > } > > > > } > > > > } > > > > and then after the "LLVM conversion" it is converted as follows: > > > > omp.parallel { > > > > %ub3 > > > > omp.do %i = 0 to %ub3 : !llvm.integer { > > > > ... > > > > } > > > > } > > > > > > a. Is it the same omp.do operation which now contains the bounds and > > induction variables of the loop after the LLVM conversion? If so, will > the > > same operation have two different semantics during a single compilation? > > > > b. Will there be different lowerings for various loop operations from > > different dialects? loop.for and affine.for under omp operations would > need > > different OpenMP / LLVM lowerings. Currently, both of them are lowered to > > the CFG based loops during the LLVM dialect conversion (which is much > > before the proposed OpenMP dialect lowering). > > > > There would be no standard way to represent OpenMP operations (especially > > the ones which involve loops) in MLIR. This would drastically complicate > > lowering. > > > > C. It is also not mentioned how clauses like firstprivate, shared, > private, > > reduce, map, etc are lowered to OpenMP dialect. The example in the RFC > > contains FIR and LLVM types and nothing about std dialect types. Consider > > the below example: > > > > #pragma omp parallel for reduction(+:x) > > > > for (int i = 0; i < N; ++i) > > > > x += a[i]; > > > > How would the above be represented in OpenMP dialect? and What type would > > "x" be in MLIR? It is not mentioned in the design as to how the various > > SSA values for various OpenMP clauses are passed around in OpenMP > > operations. > > > > D. Because of (A), (B) and (C), it would be beneficial to have an omp. > > parallel_do operation which has semantics similar to other loop > structures > > (may not be LoopLikeInterface) in MLIR. To me, it looks like having > OpenMP > > operations based on standard MLIR types and operations (scalars and > memrefs > > mainly) is the right way to go. > > > > Why not have omp.parallel_do operation with AffineMap based bounds, so as > > to decouple it from Value/Type similar to affine.for? > > > > 1. With the current design, the number of transformations / optimizations > > that one can write on OpenMP constructs would become limited as there can > > be any custom loop structure with custom operations / types inside it. > > > > 2. It would also be easier to transform the Loop nests containing OpenMP > > constructs if the body of the OpenMP operations is well defined (i.e., > does > > not accept arbitrary loop structures). Having nested redundant > "parallel" , > > "target" and "do" regions seems unnecessary. > > > > 3. There would also be new sets of loop structures in new dialects when > > C/C++ is compiled to MLIR. It would complicate the number of possible > > combinations inside the OpenMP region. > > > > E. Lowering of target constructs mentioned in ( 2(d) ) specifies direct > > lowering to LLVM IR ignoring all the advantages that MLIR provides. Being > > able to compile the code for heterogeneous hardware is one of the biggest > > advantages that MLIR brings to the table. That is being completely missed > > here. This also requires solving the problem of handling target > information > > in MLIR. But that is a problem which needs to be solved anyway. 
Using GPU > > dialect also gives us an opportunity to represent offloading semantics in > > MLIR. > > > > Given the ability to represent multiple ModuleOps and the existence of > GPU > > dialect, couldn't higher level optimizations on offloaded code be done at > > MLIR level?. The proposed design would lead us to the same problems that > we > > are currently facing in LLVM IR. > > > > Also, OpenMP codegen will automatically benefit from the GPU dialect > based > > optimizations. For example, it would be way easier to hoist a memory > > reference out of GPU kernel in MLIR than in LLVM IR. >
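For reference, a small C++/OpenMP sketch of the adjacent-target-region case discussed above and the kind of rewrite described (guard foo() on a condition and merge the two regions on the other path). The helpers and the condition are hypothetical, and whether the merge is legal depends on what foo() does -- exactly the analysis question raised in the reply.

```cpp
// Hypothetical illustration (not from the thread): two adjacent target
// regions separated by arbitrary host code, and a conditional merge.
void foo() { /* stands in for arbitrary host code / synchronization */ }

void original(double *a, double *b, int n) {
  #pragma omp target map(tofrom: a[0:n])
  for (int i = 0; i < n; ++i) a[i] += 1.0;

  foo();

  #pragma omp target map(tofrom: b[0:n])
  for (int i = 0; i < n; ++i) b[i] += 1.0;
}

void rewritten(double *a, double *b, int n, bool needs_foo) {
  if (needs_foo) {
    original(a, b, n);               // keep the two separate offloads
  } else {
    #pragma omp target map(tofrom: a[0:n], b[0:n])   // single merged offload
    for (int i = 0; i < n; ++i) {
      a[i] += 1.0;
      b[i] += 1.0;
    }
  }
}
```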
Johannes Doerfert via llvm-dev
2020-Feb-14 19:20 UTC
[llvm-dev] About OpenMP dialect in MLIR
On 02/14, Vinay Madhusudan wrote:> It sounds like LLVM IR is being considered for optimizations in OpenMP > constructs. There seems to be plans regarding improvement of LLVM IR > Framework for providing things required for OpenMP / flang(?)LLVM has the OpenMPOpt pass now [0] in which we can put OpenMP specific transformations. For now it is simple but we have some more downstream patches, e.g., parallel region expansion [Section 5, 1]. Other optimizations [Section 3 & 4, 1], will be performed by the Attributor (see [4] after [2,3]) after one missing piece (basically [5] with some more plumming) was put in place, see [2,3] for details on the idea. Please feel free to ask questions on any of this. [0] https://reviews.llvm.org/D69930 [1] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf [2] https://www.youtube.com/watch?v=zfiHaPaoQPc [3] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt_lcpc18.pdf [4] https://youtu.be/CzWkc_JcfS0 [5] https://reviews.llvm.org/D71505 It might also worth looking into [6,7] mentioned below.> Are there any design considerations which contain pros and cons about using > the MLIR vs LLVM IR for various OpenMP related optimizations/ > transformations?The biggest pro for LLVM-IR is that it works for C/C++ right now. In addition, as I mentioned before, LLVM-IR has mature analysis and transformation passes for real world programs and support things like LTO out of the box.> The latest RFC [ (3) in my original post ] mentions that: > > > So there exist some questions regarding where the optimisations should be > carried out. > > Could you please provide more details on this? > > I would like to quote Chris here: > > “if you ignore the engineering expense, it would clearly make sense to > reimplement the mid-level LLVM optimizers on top of MLIR and replace > include/llvm/IR with a dialect definition in MLIR instead.“ -- > http://lists.llvm.org/pipermail/llvm-dev/2020-January/138341.html > > *Rest of the comment are inlined.* > > On Thu, Feb 13, 2020 at 11:48 PM Johannes Doerfert <jdoerfert at anl.gov> > wrote: > > > Hi Vinay, > > > > Thanks for taking an interest and the detailed discussion. > > > > To start by picking a few paragraph from your email to clarify a couple > > of things that lead to the current design or that might otherwise need > > clarification. We can talk about other points later as well. > > > > [ > > Site notes: > > 1) I'm not an MLIR person. > > 2) It seems unfortnuate that we do not have a mlir-dev list. > > ] > > > > > > > 1. With the current design, the number of transformations / optimizations > > > that one can write on OpenMP constructs would become limited as there can > > > be any custom loop structure with custom operations / types inside it. > > > > OpenMP, as an input language, does not make many assumptions about the > > code inside of constructs*. > > > This isn’t entirely correct because the current OpenMP API specification ( > https://www.openmp.org/spec-html/5.0/openmpch1.html) assumes that the code > inside the constructs belong to C, C++ and Fortran programs.(FWIW, my next sentence specified that I talk about the base language but anyway.) While technically true, I will recommend not to make that assumption. We do already allow non-base language constructs, e.g., CUDA intrinsics in target regions, and that will not go away because it is required to maximize performance.> > So, inside a parallel can be almost anything > > the base language has to offer, both lexically and dynamically. 
> > > > > I am mostly concerned with the MLIR side of things for OpenMP > representation. > > MLIR can not only support operations for General Purpose languages like > C,C++, Fortran, etc but also various Domain Specific Language > representations as dialects (Example, ML, etc.). Note that there is also > SPIR V dialect which is again meant for “Parallel Compute”. > > It becomes important to define the scope of the dialects / operations / > types supported inside OpenMP operations in MLIR.Arguably, the OpenMP dialect in MLIR should match the OpenMP directives and clauses as defined by the standard. Anything else is "not OpenMP".> > Assuming otherwise is not going to work. Analyzing a "generic" OpenMP > > representation in order to determine if can be represented as a more > > restricted "op" seems at least plausible. You will run into various > > issue, some mentioned explicitly below. > > Isn’t it the other way around? For example, it doesn’t make much sense to > wrap OpenMP operations for SPIR-V operations / types.I maybe misunderstanding but I thought you want to use something like the GPU / Affine dialect to represent an OpenMP target region / loop. That is plausible if you analyze the target region / loop and verify it fits into the more generic dialect semantics.> I think it is important to specify (in the design) which existing MLIR > dialects are supported in this effort and the various lowerings / > transformations / optimizations which are planned for them.That I cannot really help you with. TBH, I don't even know what transformations people plan to do on OpenMP MLIR (and why).> > For starters, you still have to > > generate proper OpenMP runtime calls, e.g., from your GPU dialect, even > > if it is "just" to make sure the OMPD/OMPT interfaces expose useful > > information. > > > > > You can have a well-defined call-like mlir::Operation which calls the GPU > kernel. Perform all cross-device transformations in an easier way. > Then, this operation can be lowered to OpenMP runtime calls during LLVM > dialect conversion.You missed my point I made in the other email. An OpenMP target region is statically not a GPU offload so you should not model it as such "for some time".> I think this is much better than directly having calls > to the OpenMP runtime library based on a kernel name mentioned in > llvm::GlobalVariable.(Current) implementation is not semantics. There is no reason not to change the way we lower OpenMP, e.g., by getting rid of the global variables. They are present for a reason but not intrinsically required. See the TRegions for example [6,7], they totally change the GPU lowering, making it device agnostic and easy to analyze and optimize in the middle end. Arguing the current encoding of OpenMP in LLVM-IR is problematic is the same as arguing MLIR's LLVM dialect doesn't support atomic_rmw, it might be true but its changeable. [6] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11 [7] http://parallel.auckland.ac.nz/iwomp2019/slides_TRegion.pdf> > * I preclude the `omp loop` construct here as it is not even implemented > > anywhere as far as I know. > > > > > > > 2. It would also be easier to transform the Loop nests containing OpenMP > > > constructs if the body of the OpenMP operations is well defined (i.e., > > does > > > not accept arbitrary loop structures). Having nested redundant > > "parallel" , > > > "target" and "do" regions seems unnecessary. 
> > > > As mentioned above, you cannot start with the assumption OpenMP input is > > structured this this way. You have to analyze it first. This is the same > > reason we cannot simply transform C/C++ `for loops` into `affine.for` > > without proper analysis of the loop body. > > > > Now, more concrete. Nested parallel and target regions are not > > necessarily redundant, nor can/should we require the user not to have > > them. Nested parallelism can easily make sense, depending on the problem > > decomposition. Nested target will make a lot of sense with reverse > > offload, which is already in the standard, and it also should be allowed > > for the sake of a modular (user) code base. > > > > Just to be clear, having all three of “target”, “parallel” and “do” doesn’t > represent “Nested parallelism” at all in the proposed design! ( 2(d) ). > > omp.target { > > omp.parallel { > > omp.do { > > …... > > } > > } > > } > > Above invokes a call to the tgt_target() for the code inside omp.do as > mentioned in the proposal.I do not follow. Just to make sure, the above should be roughly equivalent to the code below, correct? There is no "nested" parallelism, sure, but I thought you were talking about the case where there is, e.g. add another `#pragma omp parallel` inside the one that already is there. That is nested parallelism which can happen and make total sense for the application. #pragma omp target { #pragma omp parallel { #pragma omp for for (...) { ... } } }> > > > > 3. There would also be new sets of loop structures in new dialects when > > > C/C++ is compiled to MLIR. It would complicate the number of possible > > > combinations inside the OpenMP region. > > > > Is anyone working on this? If so, what is the timeline? I personally was > > not expecting Clang to switch over to MLIR any time soon but I am happy > > if someone wants to correct me on this. I mention this only because it > > interacts with the arguments I will make below. > > > > > > > E. Lowering of target constructs mentioned in ( 2(d) ) specifies direct > > > lowering to LLVM IR ignoring all the advantages that MLIR provides. Being > > > able to compile the code for heterogeneous hardware is one of the biggest > > > advantages that MLIR brings to the table. That is being completely missed > > > here. This also requires solving the problem of handling target > > information > > > in MLIR. But that is a problem which needs to be solved anyway. Using GPU > > > dialect also gives us an opportunity to represent offloading semantics in > > > MLIR. > > > > I'm unsure what the problem with "handling target information in MLIR" is > > but > > whatever design we end up with, we need to know about the target > > (triple) in all stages of the pipeline, even if it is just to pass it > > down. > > > > > > > Given the ability to represent multiple ModuleOps and the existence of > > GPU > > > dialect, couldn't higher level optimizations on offloaded code be done at > > > MLIR level?. The proposed design would lead us to the same problems that > > we > > > are currently facing in LLVM IR. > > > > > > Also, OpenMP codegen will automatically benefit from the GPU dialect > > based > > > optimizations. For example, it would be way easier to hoist a memory > > > reference out of GPU kernel in MLIR than in LLVM IR. > > > > While I agree with the premise that you can potentially reuse MLIR > > transformations, it might not be as simple in practice. 
> > > > As mentioned above, you cannot assume much about OpenMP codes, almost > > nothing for a lot of application codes I have seen. Some examples: > > > > If you have a function call, or any synchronization event for that > > matter, located between two otherwise adjacent target regions (see > > below), you cannot assume the two target regions will be offloaded to > > the same device. > > ``` > > #omp target > > {} > > foo(); > > #omp target > > {} > > ``` > > > > These kinds of optimizations are much easier to write in MLIR: > > LLVM IR for the above code would contain a series of instructions of OpenMP > runtime call setup and foo() in the middle followed by another set of > OpenMP runtime related instructions. The body of the two target constructs > would be in two different outlined functions (if not modules). > > It takes quite a bit of code to do analysis / transformation to write any > optimization on the generated LLVM IR.You are right about the module's being a problem. As I mentioned in my last email, we are working on that by not having them in different ones during the optimization pipeline. If we make the `target` `parallel` instead we can simulate that right now. The bodies are in different functions, sure, but does it matter? Let's walk through parallel region expansion (see above [Section 5, 1]) so you can judge for yourself: ``` #omp parallel { body0 } some_code #omp parallel { body1 } ``` will become ``` __kmpc_fork_call(..., @body0_fn, ...) some_code __kmpc_fork_call(..., @body1_fn, ...) ``` in IR. Simplified, there are 3 cases here: 1) some_code is harmless, meaning all of it can be executed redundantly. 2) parts of some some_code need to be guarded to be sequential but they can be executed in a parallel region otherwise, e.g., the code will not observe the difference through runtime calls. 3) parts of some some_code cannot be executed in a parallel region as they might observe the difference through runtime calls. First note that you need to do the classification regardless of your encoding (=IR). In case of 3) we are done and nothing is happening. Let's consider case 2) as 1) is basically a special case of it. As shown in the paper [1], you need to broadcast values created by some_code across all threads and synchronize appropriately to preserve semantic. Other than that, the transformation is straight forward: A) Create a function "@body01_fn" that is basically the outlined region in which code is then guarded and __kmpc_fork_call are replaced by direct calls. It looks like this: ``` call @body0_fn(...) #omp master some_code #omp barrier call @body1_fn(...) ``` B) Replace the region you put in the new function with a __kmpc_fork_call to it: ``` __kmpc_fork_call(..., @body01_fn, ...) ``` C) Done. If you are interested in the implementation I'll add you as a reviewer once I put it on Phab. I'm in the process of cleaning up my stand alone pass and moving it into the OpenMPOpt pass instead.> vs. > > MLIR provides a way to represent the operations closer to the source. It is > as simple as checking the next operation(s) in the mlir::Block. OpenMP > target operation contains an inlined region which can easily be fused/ > split / or any other valid transformation for that matter. > > Note that you can also perform various Control Structure Analysis / > Transformations much easier in MLIR. For example, you can decide to execute > foo() based on certain conditions, and you can merge the two target regions > in the else path.At the end, it's an encoding difference. 
Sure, the handling might be easier in certain situations but all the validity checks, hence code analyses, are still required. The actual "rewrite" is usually not the hard part.> > Similarly, you cannot assume a `omp parallel` is allowed to be executed > > with more than a single thread, or that a `omp [parallel] for` does not > > have loop carried data-dependences, ... > > > > With multi-dimensional index support for arrays, wouldn’t it be better to > do the data dependence analysis in MLIR?Yes, probably.> LLVM IR has linearized subscripts for multi-dimensional arrays. > llvm::DependenceAnalysis tries to “guess” the indices based on different > patterns in SCEV. It takes an intrinsic > <http://llvm.org/devmtg/2020-04/talks.html#LightningTalk_88> or metadata or > some other mechanism of communication from the front end (not the built-in > set of instructions) to solve this problem.Not disagreeing with you on this one ;) The only caveat is that we still live in a world in which C/C++ is a thing.> > Data-sharing attributes are also something that has to be treated > > carefully: > > ``` > > x = 5; > > #omp task > > x = 3; > > print(x); > > ``` > > Should print 5, not 3. > > > > You can have “x” as a locally defined variable inside the “task” contained > region in MLIR OR custom data-sharing attributes in OpenMP dialect.I'm not saying it is impossible or even hard, but maybe not as straight forward as one might think. Your encoding is for example very reasonable. In the example below you need to print 3, not 5, e.g., constant prop on the outer level should not happen. ``` x = 5; #omp task shared(x) { x = 3; some_form_of_sync(); ... } some_form_of_sync(); print(x); ```> > I hope I convinced you that OpenMP is not trivially mappable to existing > > dialects without proper analysis. If not, please let me know why you > > expect it to be. > > > I do not see much reason why the issues you mentioned can’t trivially be > mapped to the MLIR infrastructure. There is an easy way to define custom > operations / types / attributes in OpenMP dialect and perform optimizations > based on the *IR that is created especially for OpenMP*. The analysis / > transformations required can be easily written on the custom operations > defined rather than having a lowered form in the LLVM IR.You can totally define your OpenMP dialect and map it to that. Mapping to other dialects is the problematic part. As I mentioned, `omp parallel` does not mean "parallel" or "dependence-free". Since you mention it, why do you think it is conceptually or practically harder to write an analysis/transformations on IR? I mean, you teach your analysis what the op "omp.parallel" means, right? Why not teach an (interprocedural) analysis what __kmpc_fork_call() does (see [2,3] above)? FWIW, there are LLVM analyses and transformations that already know about the transitive call made by __kmpc_fork_call and pthread_create (see [4] above). It is done in a way that you can easily annotate your own C/C++ or IR to make use of it, e.g., for your own transitive callbacks: https://clang.llvm.org/docs/AttributeReference.html#callback https://llvm.org/docs/LangRef.html#callback-metadata> The various dialects / transformations in MLIR are in development / early > phase (Example, GPU dialect) waiting to be improved with use cases such as > this!Great! I am eagerly looking forward to this. 
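For reference, a minimal compileable sketch of the Clang `callback` attribute linked above; the broker/work functions are hypothetical, but this is the annotation mechanism the text refers to for describing transitive calls (as done for __kmpc_fork_call and pthread_create) to interprocedural analyses.

```cpp
#include <cstdio>

// The attribute tells interprocedural analyses that broker() transitively
// invokes `work` with `payload` as its argument.
__attribute__((callback(work, payload)))
void broker(void (*work)(void *), void *payload) {
  work(payload);  // the transitive call the attribute describes
}

static void print_it(void *p) { std::printf("%d\n", *static_cast<int *>(p)); }

int main() {
  int x = 42;
  broker(print_it, &x);
  return 0;
}
```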
Cheers, Johannes> > > > Now when it comes to code analyses, LLVM-IR offers a variety of > > interesting features, ranging from a mature set of passes to the > > cross-language LTO capabilities. We are working on the missing parts, > > e.g., heterogeneous llvm::Modules as we speak. Simple OpenMP > > optimizations are already present in LLVM and interesting ones are > > prototyped for a while now (let me know if you want to see more not-yet > > merged patches/optimizations). I also have papers, results, and > > talks that might be interesting here. Let me know if you need pointers > > to them. > > > > > > Cheers, > > Johannes > > > > > > > > On 02/13, Vinay Madhusudan via llvm-dev wrote: > > > Hi, > > > > > > I have few questions / concerns regarding the design of OpenMP dialect in > > > MLIR that is currently being implemented, mainly for the f18 compiler. > > > Below, I summarize the current state of various efforts in clang / f18 / > > > MLIR / LLVM regarding this. Feel free to add to the list in case I have > > > missed something. > > > > > > 1. [May 2019] An OpenMPIRBuilder in LLVM was proposed for flang and clang > > > frontends. Note that this proposal was before considering MLIR for FIR. > > > > > > a. llvm-dev proposal : > > > > > http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-May/000197.html > > > > > > b. Patches in review: https://reviews.llvm.org/D70290. This also > > includes > > > the clang codegen changes. > > > > > > 2. [July - September 2019] OpenMP dialect for MLIR was discussed / > > > proposed with respect to the f18 compilation stack (keeping FIR in mind). > > > > > > a. flang-dev discussion link: > > > https://lists.llvm.org/pipermail/flang-dev/2019-September/000020.html > > > > > > b. Design decisions captured in PPT: > > > https://drive.google.com/file/d/1vU6LsblsUYGA35B_3y9PmBvtKOTXj1Fu/view > > > > > > c. MLIR google groups discussion: > > > > > https://groups.google.com/a/tensorflow.org/forum/#!topic/mlir/4Aj_eawdHiw > > > > > > d. Target constructs design: > > > > > http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-September/000285.html > > > > > > e. SIMD constructs design: > > > > > http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-September/000278.html > > > > > > 3. [Jan 2020] OpenMP dialect RFC in llvm discourse : > > > https://llvm.discourse.group/t/rfc-openmp-dialect-in-mlir/397 > > > > > > 4. [Jan- Feb 2020] Implementation of OpenMP dialect in MLIR: > > > > > > a. The first patch which introduces the OpenMP dialect was pushed. > > > > > > b. Review of barrier construct is in progress: > > > https://reviews.llvm.org/D72962 > > > > > > I have tried to list below different topics of interest (to different > > > people) around this work. Most of these are in the design phase (or very > > > new) and multiple parties are interested with different sets of goals in > > > mind. > > > > > > I. Flang frontend and its integration > > > > > > II. Fortran representation in MLIR / FIR development > > > > > > III. OpenMP development for flang, OpenMP builder in LLVM. > > > > > > IV. Loop Transformations in MLIR / LLVM with respect to OpenMP. > > > > > > It looks like the design has evolved over time and there is no one place > > > which contains the latest design decisions that fits all the different > > > pieces of the puzzle. I will try to deduce it from the above mentioned > > > references. Please correct me If I am referring to anything which has > > > changed. 
> > > > > > A. For most OpenMP design discussions, FIR examples are used (as seen in > > > (2) and (3)). The MLIR examples mentioned in the design only talks about > > > FIR dialect and LLVM dialect. > > > > > > This completely ignores the likes of standard, affine (where most loop > > > transformations are supposed to happen) and loop dialects. I think it is > > > critical to decouple the OpenMP dialect development in MLIR from the > > > current flang / FIR effort. It would be useful if someone can mention > > these > > > examples using existing dialects in MLIR and also how the different > > > transformations / lowerings are planned. > > > > > > B. In latest RFC(3), it is mentioned that the initial OpenMP dialect > > > version will be as follows, > > > > > > omp.parallel { > > > > > > omp.do { > > > > > > fir.do %i = 0 to %ub3 : !fir.integer { > > > > > > ... > > > > > > } > > > > > > } > > > > > > } > > > > > > and then after the "LLVM conversion" it is converted as follows: > > > > > > omp.parallel { > > > > > > %ub3 > > > > > > omp.do %i = 0 to %ub3 : !llvm.integer { > > > > > > ... > > > > > > } > > > > > > } > > > > > > > > > a. Is it the same omp.do operation which now contains the bounds and > > > induction variables of the loop after the LLVM conversion? If so, will > > the > > > same operation have two different semantics during a single compilation? > > > > > > b. Will there be different lowerings for various loop operations from > > > different dialects? loop.for and affine.for under omp operations would > > need > > > different OpenMP / LLVM lowerings. Currently, both of them are lowered to > > > the CFG based loops during the LLVM dialect conversion (which is much > > > before the proposed OpenMP dialect lowering). > > > > > > There would be no standard way to represent OpenMP operations (especially > > > the ones which involve loops) in MLIR. This would drastically complicate > > > lowering. > > > > > > C. It is also not mentioned how clauses like firstprivate, shared, > > private, > > > reduce, map, etc are lowered to OpenMP dialect. The example in the RFC > > > contains FIR and LLVM types and nothing about std dialect types. Consider > > > the below example: > > > > > > #pragma omp parallel for reduction(+:x) > > > > > > for (int i = 0; i < N; ++i) > > > > > > x += a[i]; > > > > > > How would the above be represented in OpenMP dialect? and What type would > > > "x" be in MLIR? It is not mentioned in the design as to how the various > > > SSA values for various OpenMP clauses are passed around in OpenMP > > > operations. > > > > > > D. Because of (A), (B) and (C), it would be beneficial to have an omp. > > > parallel_do operation which has semantics similar to other loop > > structures > > > (may not be LoopLikeInterface) in MLIR. To me, it looks like having > > OpenMP > > > operations based on standard MLIR types and operations (scalars and > > memrefs > > > mainly) is the right way to go. > > > > > > Why not have omp.parallel_do operation with AffineMap based bounds, so as > > > to decouple it from Value/Type similar to affine.for? > > > > > > 1. With the current design, the number of transformations / optimizations > > > that one can write on OpenMP constructs would become limited as there can > > > be any custom loop structure with custom operations / types inside it. > > > > > > 2. 
It would also be easier to transform the Loop nests containing OpenMP > > > constructs if the body of the OpenMP operations is well defined (i.e., > > does > > > not accept arbitrary loop structures). Having nested redundant > > "parallel" , > > > "target" and "do" regions seems unnecessary. > > > > > > 3. There would also be new sets of loop structures in new dialects when > > > C/C++ is compiled to MLIR. It would complicate the number of possible > > > combinations inside the OpenMP region. > > > > > > E. Lowering of target constructs mentioned in ( 2(d) ) specifies direct > > > lowering to LLVM IR ignoring all the advantages that MLIR provides. Being > > > able to compile the code for heterogeneous hardware is one of the biggest > > > advantages that MLIR brings to the table. That is being completely missed > > > here. This also requires solving the problem of handling target > > information > > > in MLIR. But that is a problem which needs to be solved anyway. Using GPU > > > dialect also gives us an opportunity to represent offloading semantics in > > > MLIR. > > > > > > Given the ability to represent multiple ModuleOps and the existence of > > GPU > > > dialect, couldn't higher level optimizations on offloaded code be done at > > > MLIR level?. The proposed design would lead us to the same problems that > > we > > > are currently facing in LLVM IR. > > > > > > Also, OpenMP codegen will automatically benefit from the GPU dialect > > based > > > optimizations. For example, it would be way easier to hoist a memory > > > reference out of GPU kernel in MLIR than in LLVM IR. > >
-- Johannes Doerfert Researcher Argonne National Laboratory Lemont, IL 60439, USA jdoerfert at anl.gov
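For reference, a minimal compileable example of the nested-parallelism case mentioned in the message above (an inner `parallel` inside an outer one); the code is hypothetical, not from the thread.

```cpp
#include <cstdio>
#include <omp.h>

int main() {
  omp_set_max_active_levels(2);            // allow the inner region to be active
  #pragma omp parallel num_threads(4)      // outer team
  {
    #pragma omp parallel num_threads(2)    // nested team: legal OpenMP
    {
      std::printf("outer thread %d, inner thread %d\n",
                  omp_get_ancestor_thread_num(1), omp_get_thread_num());
    }
  }
  return 0;
}
```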
On Fri, Feb 14, 2020 at 12:21, Vinay Madhusudan via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> I would like to quote Chris here:
>
> “if you ignore the engineering expense, it would clearly make sense to reimplement the mid-level LLVM optimizers on top of MLIR and replace include/llvm/IR with a dialect definition in MLIR instead.“ -- http://lists.llvm.org/pipermail/llvm-dev/2020-January/138341.html

IMHO, it's not just the engineering expense, but also the additional overhead from having a more general data structure that clang does not need. In some sense, LLVM-IR has been designed to match the semantics of C, such that a more general representation makes less sense. There are still opportunities, e.g. representing C++ virtual methods instead of lowering them to a vtable lookup; this could make devirtualization easier. However, it seems nobody is currently pushing for this change to happen; there is not even an RFC on whether the community wants it. As such, I would not plan on using MLIR if your frontend language is C/C++/Objective-C.

Michael
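To illustrate the devirtualization point above with a hypothetical example: in C++ source the call below is a virtual call whose dynamic type is statically known, but by the time it reaches LLVM IR it is a vtable load plus an indirect call that analyses must see through again.

```cpp
#include <cstdio>

struct Shape {
  virtual ~Shape() = default;
  virtual double area() const = 0;
};

struct Square : Shape {
  double side;
  explicit Square(double s) : side(s) {}
  double area() const override { return side * side; }
};

double measure() {
  Square sq{2.0};
  const Shape &shape = sq;
  return shape.area();  // devirtualizable: the dynamic type is known here
}

int main() {
  std::printf("%f\n", measure());
  return 0;
}
```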
On Fri, Feb 14, 2020 at 11:21 AM Johannes Doerfert via llvm-dev < llvm-dev at lists.llvm.org> wrote:> On 02/14, Vinay Madhusudan wrote: > > It sounds like LLVM IR is being considered for optimizations in OpenMP > > constructs. There seems to be plans regarding improvement of LLVM IR > > Framework for providing things required for OpenMP / flang(?) > > LLVM has the OpenMPOpt pass now [0] in which we can put OpenMP specific > transformations. For now it is simple but we have some more downstream > patches, e.g., parallel region expansion [Section 5, 1]. Other > optimizations [Section 3 & 4, 1], will be performed by the Attributor > (see [4] after [2,3]) after one missing piece (basically [5] with some > more plumming) was put in place, see [2,3] for details on the idea. > > Please feel free to ask questions on any of this. > > [0] https://reviews.llvm.org/D69930 > [1] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf > [2] https://www.youtube.com/watch?v=zfiHaPaoQPc > [3] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt_lcpc18.pdf > [4] https://youtu.be/CzWkc_JcfS0 > [5] https://reviews.llvm.org/D71505 > > It might also worth looking into [6,7] mentioned below. > > > > > Are there any design considerations which contain pros and cons about > using > > the MLIR vs LLVM IR for various OpenMP related optimizations/ > > transformations? > > The biggest pro for LLVM-IR is that it works for C/C++ right now. In > addition, as I mentioned before, LLVM-IR has mature analysis and > transformation passes for real world programs and support things like > LTO out of the box. >+1: MLIR is awesome (purely unbiased opinion ;)) but it won't be in the C/C++ path anytime soon (I hope it'll be one day though). There are plenty of frontend targeting LLVM directly, and LLVM should continue to improve.> > > > The latest RFC [ (3) in my original post ] mentions that: > > > > > So there exist some questions regarding where the optimisations should > be > > carried out. > > > > Could you please provide more details on this? > > > > I would like to quote Chris here: > > > > “if you ignore the engineering expense, it would clearly make sense to > > reimplement the mid-level LLVM optimizers on top of MLIR and replace > > include/llvm/IR with a dialect definition in MLIR instead.“ -- > > http://lists.llvm.org/pipermail/llvm-dev/2020-January/138341.html > > > > *Rest of the comment are inlined.* > > > > On Thu, Feb 13, 2020 at 11:48 PM Johannes Doerfert <jdoerfert at anl.gov> > > wrote: > > > > > Hi Vinay, > > > > > > Thanks for taking an interest and the detailed discussion. > > > > > > To start by picking a few paragraph from your email to clarify a couple > > > of things that lead to the current design or that might otherwise need > > > clarification. We can talk about other points later as well. > > > > > > [ > > > Site notes: > > > 1) I'm not an MLIR person. > > > 2) It seems unfortnuate that we do not have a mlir-dev list. > > > ] > > > > > > > > > > 1. With the current design, the number of transformations / > optimizations > > > > that one can write on OpenMP constructs would become limited as > there can > > > > be any custom loop structure with custom operations / types inside > it. > > > > > > OpenMP, as an input language, does not make many assumptions about the > > > code inside of constructs*. 
> > > > > > This isn’t entirely correct because the current OpenMP API specification > ( > > https://www.openmp.org/spec-html/5.0/openmpch1.html) assumes that the > code > > inside the constructs belong to C, C++ and Fortran programs. > > (FWIW, my next sentence specified that I talk about the base language > but anyway.) > > While technically true, I will recommend not to make that assumption. We > do already allow non-base language constructs, e.g., CUDA intrinsics in > target regions, and that will not go away because it is required to > maximize performance. >+1 : I worked on a project where we added support of OpenMP directives to Python Numpy loops and compiling these!> > > > > So, inside a parallel can be almost anything > > > the base language has to offer, both lexically and dynamically. > > > > > > > > > I am mostly concerned with the MLIR side of things for OpenMP > > representation. > > > > MLIR can not only support operations for General Purpose languages like > > C,C++, Fortran, etc but also various Domain Specific Language > > representations as dialects (Example, ML, etc.). Note that there is also > > SPIR V dialect which is again meant for “Parallel Compute”. > > > > It becomes important to define the scope of the dialects / operations / > > types supported inside OpenMP operations in MLIR. > > Arguably, the OpenMP dialect in MLIR should match the OpenMP directives > and clauses as defined by the standard. Anything else is "not OpenMP". >+1> > > > > Assuming otherwise is not going to work. Analyzing a "generic" OpenMP > > > representation in order to determine if can be represented as a more > > > restricted "op" seems at least plausible. You will run into various > > > issue, some mentioned explicitly below. > > > > Isn’t it the other way around? For example, it doesn’t make much sense to > > wrap OpenMP operations for SPIR-V operations / types. > > I maybe misunderstanding but I thought you want to use something like > the GPU / Affine dialect to represent an OpenMP target region / loop. > That is plausible if you analyze the target region / loop and verify it > fits into the more generic dialect semantics. > > > > I think it is important to specify (in the design) which existing MLIR > > dialects are supported in this effort and the various lowerings / > > transformations / optimizations which are planned for them. >I agree, but I am puzzled why you bring this here instead of discussing this in the RFC thread. You seem to have good ideas overall, but if you'd like to discuss the development of MLIR and MLIR dialects I invite you to engage there instead, I suspect you'll have more impact.> > That I cannot really help you with. TBH, I don't even know what > transformations people plan to do on OpenMP MLIR (and why). > > > > > For starters, you still have to > > > generate proper OpenMP runtime calls, e.g., from your GPU dialect, even > > > if it is "just" to make sure the OMPD/OMPT interfaces expose useful > > > information. > > > > > > > > You can have a well-defined call-like mlir::Operation which calls the GPU > > kernel. Perform all cross-device transformations in an easier way. > > Then, this operation can be lowered to OpenMP runtime calls during LLVM > > dialect conversion. > > You missed my point I made in the other email. An OpenMP target region > is statically not a GPU offload so you should not model it as such "for > some time". 
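As a hypothetical illustration of the point just quoted: a `target` region is not statically a GPU kernel, since an `if` clause (or the absence of a device at run time) makes the same region execute on the host.

```cpp
#include <cstdio>

int main() {
  const int n = 1024;
  double a[1024];
  for (int i = 0; i < n; ++i) a[i] = i;

  // With the if() clause false (or no device available), this region
  // runs on the host rather than being offloaded.
  #pragma omp target map(tofrom: a[0:n]) if(n > (1 << 20))
  for (int i = 0; i < n; ++i)
    a[i] *= 2.0;

  std::printf("%f\n", a[n - 1]);
  return 0;
}
```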
> > > > I think this is much better than directly having calls > > to the OpenMP runtime library based on a kernel name mentioned in > > llvm::GlobalVariable. > > (Current) implementation is not semantics. There is no reason not to > change the way we lower OpenMP, e.g., by getting rid of the global > variables. They are present for a reason but not intrinsically required. > See the TRegions for example [6,7], they totally change the GPU lowering, > making it device agnostic and easy to analyze and optimize in the middle > end. Arguing the current encoding of OpenMP in LLVM-IR is problematic is > the same as arguing MLIR's LLVM dialect doesn't support atomic_rmw, it > might be true but its changeable. > > [6] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11 > [7] http://parallel.auckland.ac.nz/iwomp2019/slides_TRegion.pdf > > > > > * I preclude the `omp loop` construct here as it is not even > implemented > > > anywhere as far as I know. > > > > > > > > > > 2. It would also be easier to transform the Loop nests containing > OpenMP > > > > constructs if the body of the OpenMP operations is well defined > (i.e., > > > does > > > > not accept arbitrary loop structures). Having nested redundant > > > "parallel" , > > > > "target" and "do" regions seems unnecessary. > > > > > > As mentioned above, you cannot start with the assumption OpenMP input > is > > > structured this this way. You have to analyze it first. This is the > same > > > reason we cannot simply transform C/C++ `for loops` into `affine.for` > > > without proper analysis of the loop body. > > > > > > Now, more concrete. Nested parallel and target regions are not > > > necessarily redundant, nor can/should we require the user not to have > > > them. Nested parallelism can easily make sense, depending on the > problem > > > decomposition. Nested target will make a lot of sense with reverse > > > offload, which is already in the standard, and it also should be > allowed > > > for the sake of a modular (user) code base. > > > > > > > Just to be clear, having all three of “target”, “parallel” and “do” > doesn’t > > represent “Nested parallelism” at all in the proposed design! ( 2(d) ). > > > > omp.target { > > > > omp.parallel { > > > > omp.do { > > > > …... > > > > } > > > > } > > > > } > > > > Above invokes a call to the tgt_target() for the code inside omp.do as > > mentioned in the proposal. > > I do not follow. Just to make sure, the above should be roughly > equivalent to the code below, correct? There is no "nested" > parallelism, sure, but I thought you were talking about the case where > there is, e.g. add another `#pragma omp parallel` inside the one that > already is there. That is nested parallelism which can happen and make > total sense for the application. > > > #pragma omp target > { > #pragma omp parallel > { > #pragma omp for > for (...) > { > ... > } > } > } > > > > > > > > > > 3. There would also be new sets of loop structures in new dialects > when > > > > C/C++ is compiled to MLIR. It would complicate the number of possible > > > > combinations inside the OpenMP region. > > > > > > Is anyone working on this? If so, what is the timeline? I personally > was > > > not expecting Clang to switch over to MLIR any time soon but I am happy > > > if someone wants to correct me on this. I mention this only because it > > > interacts with the arguments I will make below. > > > > > > > > > > E. 
Lowering of target constructs mentioned in ( 2(d) ) specifies > direct > > > > lowering to LLVM IR ignoring all the advantages that MLIR provides. > Being > > > > able to compile the code for heterogeneous hardware is one of the > biggest > > > > advantages that MLIR brings to the table. That is being completely > missed > > > > here. This also requires solving the problem of handling target > > > information > > > > in MLIR. But that is a problem which needs to be solved anyway. > Using GPU > > > > dialect also gives us an opportunity to represent offloading > semantics in > > > > MLIR. > > > > > > I'm unsure what the problem with "handling target information in MLIR" > is > > > but > > > whatever design we end up with, we need to know about the target > > > (triple) in all stages of the pipeline, even if it is just to pass it > > > down. > > > > > > > > > > Given the ability to represent multiple ModuleOps and the existence > of > > > GPU > > > > dialect, couldn't higher level optimizations on offloaded code be > done at > > > > MLIR level?. The proposed design would lead us to the same problems > that > > > we > > > > are currently facing in LLVM IR. > > > > > > > > Also, OpenMP codegen will automatically benefit from the GPU dialect > > > based > > > > optimizations. For example, it would be way easier to hoist a memory > > > > reference out of GPU kernel in MLIR than in LLVM IR. > > > > > > While I agree with the premise that you can potentially reuse MLIR > > > transformations, it might not be as simple in practice. > > > > > > As mentioned above, you cannot assume much about OpenMP codes, almost > > > nothing for a lot of application codes I have seen. Some examples: > > > > > > If you have a function call, or any synchronization event for that > > > matter, located between two otherwise adjacent target regions (see > > > below), you cannot assume the two target regions will be offloaded to > > > the same device. > > > ``` > > > #omp target > > > {} > > > foo(); > > > #omp target > > > {} > > > ``` > > > > > > > These kinds of optimizations are much easier to write in MLIR: > > > > LLVM IR for the above code would contain a series of instructions of > OpenMP > > runtime call setup and foo() in the middle followed by another set of > > OpenMP runtime related instructions. The body of the two target > constructs > > would be in two different outlined functions (if not modules). > > > > It takes quite a bit of code to do analysis / transformation to write any > > optimization on the generated LLVM IR. >I agree that MLIR makes it easier, with the constraint that you have to express it at the right level of abstraction in the first place. But what is the suggestion here? Not doing it in LLVM at all? This does not seem quite reasonable with respect to all the LLVM users.> > You are right about the module's being a problem. As I mentioned in my > last email, we are working on that by not having them in different ones > during the optimization pipeline. If we make the `target` `parallel` > instead we can simulate that right now. The bodies are in different > functions, sure, but does it matter? Let's walk through parallel region > expansion (see above [Section 5, 1]) so you can judge for yourself: > > ``` > #omp parallel > { body0 } > some_code > #omp parallel > { body1 } > ``` > > will become > > ``` > __kmpc_fork_call(..., @body0_fn, ...) > some_code > __kmpc_fork_call(..., @body1_fn, ...) > ``` > > in IR. 
Simplified, there are 3 cases here: > 1) some_code is harmless, meaning all of it can be executed redundantly. > 2) parts of some_code need to be guarded to be sequential but > they can be executed in a parallel region otherwise, e.g., the code > will not observe the difference through runtime calls. > 3) parts of some_code cannot be executed in a parallel region as > they might observe the difference through runtime calls. > First note that you need to do the classification regardless of your > encoding (=IR). In case of 3) we are done and nothing is happening. > Let's consider case 2) as 1) is basically a special case of it. As shown > in the paper [1], you need to broadcast values created by some_code > across all threads and synchronize appropriately to preserve semantics. > Other than that, the transformation is straightforward: > > A) Create a function "@body01_fn" that is basically the outlined region > in which code is then guarded and __kmpc_fork_call are replaced by > direct calls. It looks like this: > ``` > call @body0_fn(...) > #omp master > some_code > #omp barrier > call @body1_fn(...) > ``` > > B) Replace the region you put in the new function with a > __kmpc_fork_call to it: > ``` > __kmpc_fork_call(..., @body01_fn, ...) > ``` > > C) Done. > > If you are interested in the implementation I'll add you as a reviewer > once I put it on Phab. I'm in the process of cleaning up my standalone > pass and moving it into the OpenMPOpt pass instead. > > > > vs. > > > > MLIR provides a way to represent the operations closer to the source. It > is > > as simple as checking the next operation(s) in the mlir::Block. OpenMP > > target operation contains an inlined region which can easily be fused/ > > split / or any other valid transformation for that matter. > > > > Note that you can also perform various Control Structure Analysis / > > Transformations much easier in MLIR. For example, you can decide to > execute > > foo() based on certain conditions, and you can merge the two target > regions > > in the else path. > > At the end, it's an encoding difference. Sure, the handling might be > easier in certain situations but all the validity checks, hence code > analyses, are still required. The actual "rewrite" is usually not the > hard part. >In general this is correct. One subtlety about the "validity checks" being easier on MLIR is that the "encoding" can include in the IR the guarantees you need: for example a parallel loop can be encoded directly in the IR, you don't have to recover the information with an analysis. Even if you do recover information with an analysis, encoding it in the IR makes it more likely to survive through other transformations (it is more robust than LLVM metadata, for example). But I'm off-topic here :)> > > > Similarly, you cannot assume a `omp parallel` is allowed to be executed > > > with more than a single thread, or that a `omp [parallel] for` does not > > > have loop carried data-dependences, ... > > > > > > > With multi-dimensional index support for arrays, wouldn’t it be better to > > do the data dependence analysis in MLIR? > > Yes, probably.> > > LLVM IR has linearized subscripts for multi-dimensional arrays. > > llvm::DependenceAnalysis tries to “guess” the indices based on different > > patterns in SCEV. It takes an intrinsic > > <http://llvm.org/devmtg/2020-04/talks.html#LightningTalk_88> or > metadata or > > some other mechanism of communication from the front end (not the > built-in > > set of instructions) to solve this problem.
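To make the subscript point concrete, here is a tiny example (plain C++, nothing OpenMP-specific; the array shape and names are made up purely for illustration):

```
// With a memref/affine-style representation the two subscripts stay
// separate, so a dependence analysis sees A[i][j] directly.  Once lowered
// to LLVM IR, the access is effectively the linearized address
// A + (i*128 + j)*sizeof(float), which llvm::DependenceAnalysis has to
// delinearize again by recognizing patterns in the SCEV expressions.
void scale(float A[][128], int n) {
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < 128; ++j)
      A[i][j] *= 2.0f;
}
```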
> > Not disagreeing with you on this one ;) > > The only caveat is that we still live in a world in which C/C++ is a > thing. >+1 -- Mehdi> > > > > > Data-sharing attributes are also something that has to be treated > > > carefully: > > > ``` > > > x = 5; > > > #omp task > > > x = 3; > > > print(x); > > > ``` > > > Should print 5, not 3. > > > > > > > You can have “x” as a locally defined variable inside the “task” > contained > > region in MLIR OR custom data-sharing attributes in OpenMP dialect. > > I'm not saying it is impossible or even hard, but maybe not as straight > forward as one might think. Your encoding is for example very reasonable. > > In the example below you need to print 3, not 5, e.g., constant prop on > the outer level should not happen. > > ``` > x = 5; > #omp task shared(x) > { > x = 3; > some_form_of_sync(); > ... > } > some_form_of_sync(); > print(x); > ``` > > > > > I hope I convinced you that OpenMP is not trivially mappable to > existing > > > dialects without proper analysis. If not, please let me know why you > > > expect it to be. > > > > > I do not see much reason why the issues you mentioned can’t trivially be > > mapped to the MLIR infrastructure. There is an easy way to define custom > > operations / types / attributes in OpenMP dialect and perform > optimizations > > based on the *IR that is created especially for OpenMP*. The analysis / > > transformations required can be easily written on the custom operations > > defined rather than having a lowered form in the LLVM IR. > > You can totally define your OpenMP dialect and map it to that. Mapping > to other dialects is the problematic part. As I mentioned, `omp > parallel` does not mean "parallel" or "dependence-free". > > > Since you mention it, why do you think it is conceptually or practically > harder to write an analysis/transformations on IR? I mean, you teach > your analysis what the op "omp.parallel" means, right? Why not teach an > (interprocedural) analysis what __kmpc_fork_call() does (see [2,3] above)? > > FWIW, there are LLVM analyses and transformations that already know > about the transitive call made by __kmpc_fork_call and pthread_create > (see [4] above). It is done in a way that you can easily annotate your > own C/C++ or IR to make use of it, e.g., for your own transitive > callbacks: > https://clang.llvm.org/docs/AttributeReference.html#callback > https://llvm.org/docs/LangRef.html#callback-metadata > > > > The various dialects / transformations in MLIR are in development / early > > phase (Example, GPU dialect) waiting to be improved with use cases such > as > > this! > > Great! I am eagerly looking forward to this. > > Cheers, > Johannes > > > > > > > Now when it comes to code analyses, LLVM-IR offers a variety of > > > interesting features, ranging from a mature set of passes to the > > > cross-language LTO capabilities. We are working on the missing parts, > > > e.g., heterogeneous llvm::Modules as we speak. Simple OpenMP > > > optimizations are already present in LLVM and interesting ones are > > > prototyped for a while now (let me know if you want to see more not-yet > > > merged patches/optimizations). I also have papers, results, and > > > talks that might be interesting here. Let me know if you need pointers > > > to them. 
> > > > > > > > > Cheers, > > > Johannes > > > > > > > > > > > > On 02/13, Vinay Madhusudan via llvm-dev wrote: > > > > Hi, > > > > > > > > I have few questions / concerns regarding the design of OpenMP > dialect in > > > > MLIR that is currently being implemented, mainly for the f18 > compiler. > > > > Below, I summarize the current state of various efforts in clang / > f18 / > > > > MLIR / LLVM regarding this. Feel free to add to the list in case I > have > > > > missed something. > > > > > > > > 1. [May 2019] An OpenMPIRBuilder in LLVM was proposed for flang and > clang > > > > frontends. Note that this proposal was before considering MLIR for > FIR. > > > > > > > > a. llvm-dev proposal : > > > > > > > > http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-May/000197.html > > > > > > > > b. Patches in review: https://reviews.llvm.org/D70290. This also > > > includes > > > > the clang codegen changes. > > > > > > > > 2. [July - September 2019] OpenMP dialect for MLIR was discussed / > > > > proposed with respect to the f18 compilation stack (keeping FIR in > mind). > > > > > > > > a. flang-dev discussion link: > > > > > https://lists.llvm.org/pipermail/flang-dev/2019-September/000020.html > > > > > > > > b. Design decisions captured in PPT: > > > > > https://drive.google.com/file/d/1vU6LsblsUYGA35B_3y9PmBvtKOTXj1Fu/view > > > > > > > > c. MLIR google groups discussion: > > > > > > > > https://groups.google.com/a/tensorflow.org/forum/#!topic/mlir/4Aj_eawdHiw > > > > > > > > d. Target constructs design: > > > > > > > > http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-September/000285.html > > > > > > > > e. SIMD constructs design: > > > > > > > > http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-September/000278.html > > > > > > > > 3. [Jan 2020] OpenMP dialect RFC in llvm discourse : > > > > https://llvm.discourse.group/t/rfc-openmp-dialect-in-mlir/397 > > > > > > > > 4. [Jan- Feb 2020] Implementation of OpenMP dialect in MLIR: > > > > > > > > a. The first patch which introduces the OpenMP dialect was pushed. > > > > > > > > b. Review of barrier construct is in progress: > > > > https://reviews.llvm.org/D72962 > > > > > > > > I have tried to list below different topics of interest (to different > > > > people) around this work. Most of these are in the design phase (or > very > > > > new) and multiple parties are interested with different sets of > goals in > > > > mind. > > > > > > > > I. Flang frontend and its integration > > > > > > > > II. Fortran representation in MLIR / FIR development > > > > > > > > III. OpenMP development for flang, OpenMP builder in LLVM. > > > > > > > > IV. Loop Transformations in MLIR / LLVM with respect to OpenMP. > > > > > > > > It looks like the design has evolved over time and there is no one > place > > > > which contains the latest design decisions that fits all the > different > > > > pieces of the puzzle. I will try to deduce it from the above > mentioned > > > > references. Please correct me If I am referring to anything which has > > > > changed. > > > > > > > > A. For most OpenMP design discussions, FIR examples are used (as > seen in > > > > (2) and (3)). The MLIR examples mentioned in the design only talks > about > > > > FIR dialect and LLVM dialect. > > > > > > > > This completely ignores the likes of standard, affine (where most > loop > > > > transformations are supposed to happen) and loop dialects. 
I think > it is > > > > critical to decouple the OpenMP dialect development in MLIR from the > > > > current flang / FIR effort. It would be useful if someone can mention > > > these > > > > examples using existing dialects in MLIR and also how the different > > > > transformations / lowerings are planned. > > > > > > > > B. In latest RFC(3), it is mentioned that the initial OpenMP dialect > > > > version will be as follows, > > > > > > > > omp.parallel { > > > > > > > > omp.do { > > > > > > > > fir.do %i = 0 to %ub3 : !fir.integer { > > > > > > > > ... > > > > > > > > } > > > > > > > > } > > > > > > > > } > > > > > > > > and then after the "LLVM conversion" it is converted as follows: > > > > > > > > omp.parallel { > > > > > > > > %ub3 > > > > > > > > omp.do %i = 0 to %ub3 : !llvm.integer { > > > > > > > > ... > > > > > > > > } > > > > > > > > } > > > > > > > > > > > > a. Is it the same omp.do operation which now contains the bounds and > > > > induction variables of the loop after the LLVM conversion? If so, > will > > > the > > > > same operation have two different semantics during a single > compilation? > > > > > > > > b. Will there be different lowerings for various loop operations from > > > > different dialects? loop.for and affine.for under omp operations > would > > > need > > > > different OpenMP / LLVM lowerings. Currently, both of them are > lowered to > > > > the CFG based loops during the LLVM dialect conversion (which is much > > > > before the proposed OpenMP dialect lowering). > > > > > > > > There would be no standard way to represent OpenMP operations > (especially > > > > the ones which involve loops) in MLIR. This would drastically > complicate > > > > lowering. > > > > > > > > C. It is also not mentioned how clauses like firstprivate, shared, > > > private, > > > > reduce, map, etc are lowered to OpenMP dialect. The example in the > RFC > > > > contains FIR and LLVM types and nothing about std dialect types. > Consider > > > > the below example: > > > > > > > > #pragma omp parallel for reduction(+:x) > > > > > > > > for (int i = 0; i < N; ++i) > > > > > > > > x += a[i]; > > > > > > > > How would the above be represented in OpenMP dialect? and What type > would > > > > "x" be in MLIR? It is not mentioned in the design as to how the > various > > > > SSA values for various OpenMP clauses are passed around in OpenMP > > > > operations. > > > > > > > > D. Because of (A), (B) and (C), it would be beneficial to have an > omp. > > > > parallel_do operation which has semantics similar to other loop > > > structures > > > > (may not be LoopLikeInterface) in MLIR. To me, it looks like having > > > OpenMP > > > > operations based on standard MLIR types and operations (scalars and > > > memrefs > > > > mainly) is the right way to go. > > > > > > > > Why not have omp.parallel_do operation with AffineMap based bounds, > so as > > > > to decouple it from Value/Type similar to affine.for? > > > > > > > > 1. With the current design, the number of transformations / > optimizations > > > > that one can write on OpenMP constructs would become limited as > there can > > > > be any custom loop structure with custom operations / types inside > it. > > > > > > > > 2. It would also be easier to transform the Loop nests containing > OpenMP > > > > constructs if the body of the OpenMP operations is well defined > (i.e., > > > does > > > > not accept arbitrary loop structures). Having nested redundant > > > "parallel" , > > > > "target" and "do" regions seems unnecessary. > > > > > > > > 3. 
There would also be new sets of loop structures in new dialects > when > > > > C/C++ is compiled to MLIR. It would complicate the number of possible > > > > combinations inside the OpenMP region. > > > > > > > > E. Lowering of target constructs mentioned in ( 2(d) ) specifies > direct > > > > lowering to LLVM IR ignoring all the advantages that MLIR provides. > Being > > > > able to compile the code for heterogeneous hardware is one of the > biggest > > > > advantages that MLIR brings to the table. That is being completely > missed > > > > here. This also requires solving the problem of handling target > > > information > > > > in MLIR. But that is a problem which needs to be solved anyway. > Using GPU > > > > dialect also gives us an opportunity to represent offloading > semantics in > > > > MLIR. > > > > > > > > Given the ability to represent multiple ModuleOps and the existence > of > > > GPU > > > > dialect, couldn't higher level optimizations on offloaded code be > done at > > > > MLIR level?. The proposed design would lead us to the same problems > that > > > we > > > > are currently facing in LLVM IR. > > > > > > > > Also, OpenMP codegen will automatically benefit from the GPU dialect > > > based > > > > optimizations. For example, it would be way easier to hoist a memory > > > > reference out of GPU kernel in MLIR than in LLVM IR. > > > > > -- > > Johannes Doerfert > Researcher > > Argonne National Laboratory > Lemont, IL 60439, USA > > jdoerfert at anl.gov
Vinay Madhusudan via llvm-dev
2020-Feb-17 17:58 UTC
[llvm-dev] About OpenMP dialect in MLIR
Please find the reply inline below: On Sat, Feb 15, 2020 at 12:51 AM Johannes Doerfert <jdoerfert at anl.gov> wrote:> On 02/14, Vinay Madhusudan wrote: > > It sounds like LLVM IR is being considered for optimizations in OpenMP > > constructs. There seems to be plans regarding improvement of LLVM IR > > Framework for providing things required for OpenMP / flang(?) > > LLVM has the OpenMPOpt pass now [0] in which we can put OpenMP specific > transformations. For now it is simple but we have some more downstream > patches, e.g., parallel region expansion [Section 5, 1]. Other > optimizations [Section 3 & 4, 1], will be performed by the Attributor > (see [4] after [2,3]) after one missing piece (basically [5] with some > more plumming) was put in place, see [2,3] for details on the idea. > > Please feel free to ask questions on any of this. > > [0] https://reviews.llvm.org/D69930 > [1] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf > [2] https://www.youtube.com/watch?v=zfiHaPaoQPc > [3] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt_lcpc18.pdf > [4] https://youtu.be/CzWkc_JcfS0 > [5] https://reviews.llvm.org/D71505 > > It might also worth looking into [6,7] mentioned below. > >Thanks for sharing all the interesting work going on LLVM IR side of things for OpenMP. I will take a look at it. As I have mentioned before, I am mostly concerned (in this thread) about the MLIR side of things for OpenMP and various optimizations/ transformations that are planned. I was surprised to see the plans for LLVM IR and not for MLIR.> > > > Are there any design considerations which contain pros and cons about > using > > the MLIR vs LLVM IR for various OpenMP related optimizations/ > > transformations? > > The biggest pro for LLVM-IR is that it works for C/C++ right now.> TBH, I don't even know what transformations people plan to do on OpenMPMLIR (and why). A good support for C/C++ in LLVM-IR shouldn’t be an impediment for considering OpenMP dialect transformations / optimizations for the ongoing *flang* effort. As we all agree, things like dependencies etc can be solved in an *easier way* in MLIR and C/C++ seems to be the “only caveat”.> In > addition, as I mentioned before, LLVM-IR has mature analysis and > transformation passes for real world programs and support things like > LTO out of the box. > > I think LTO for flang is planned to be based on LLVM. I do not see anyother option for now!> > The latest RFC [ (3) in my original post ] mentions that: > > > > > So there exist some questions regarding where the optimisations should > be > > carried out. > > > > Could you please provide more details on this? > > > > I would like to quote Chris here: > > > > “if you ignore the engineering expense, it would clearly make sense to > > reimplement the mid-level LLVM optimizers on top of MLIR and replace > > include/llvm/IR with a dialect definition in MLIR instead.“ -- > > http://lists.llvm.org/pipermail/llvm-dev/2020-January/138341.html > > > > *Rest of the comment are inlined.* > > > > On Thu, Feb 13, 2020 at 11:48 PM Johannes Doerfert <jdoerfert at anl.gov> > > wrote: > > > > > Hi Vinay, > > > > > > Thanks for taking an interest and the detailed discussion. > > > > > > To start by picking a few paragraph from your email to clarify a couple > > > of things that lead to the current design or that might otherwise need > > > clarification. We can talk about other points later as well. > > > > > > [ > > > Site notes: > > > 1) I'm not an MLIR person. 
> > > 2) It seems unfortnuate that we do not have a mlir-dev list. > > > ] > > > > > > > > > > 1. With the current design, the number of transformations / > optimizations > > > > that one can write on OpenMP constructs would become limited as > there can > > > > be any custom loop structure with custom operations / types inside > it. > > > > > > OpenMP, as an input language, does not make many assumptions about the > > > code inside of constructs*. > > > > > > This isn’t entirely correct because the current OpenMP API specification > ( > > https://www.openmp.org/spec-html/5.0/openmpch1.html) assumes that the > code > > inside the constructs belong to C, C++ and Fortran programs. > > (FWIW, my next sentence specified that I talk about the base language > but anyway.) > > While technically true, I will recommend not to make that assumption. We > do already allow non-base language constructs, e.g., CUDA intrinsics in > target regions, and that will not go away because it is required to > maximize performance. > > > > > So, inside a parallel can be almost anything > > > the base language has to offer, both lexically and dynamically. > > > > > > > > > I am mostly concerned with the MLIR side of things for OpenMP > > representation. > > > > MLIR can not only support operations for General Purpose languages like > > C,C++, Fortran, etc but also various Domain Specific Language > > representations as dialects (Example, ML, etc.). Note that there is also > > SPIR V dialect which is again meant for “Parallel Compute”. > > > > It becomes important to define the scope of the dialects / operations / > > types supported inside OpenMP operations in MLIR. > > Arguably, the OpenMP dialect in MLIR should match the OpenMP directives > and clauses as defined by the standard. Anything else is "not OpenMP". > >Yes! This is what needs to be “defined” for the various dialects...> > > > Assuming otherwise is not going to work. Analyzing a "generic" OpenMP > > > representation in order to determine if can be represented as a more > > > restricted "op" seems at least plausible. You will run into various > > > issue, some mentioned explicitly below. > > > > Isn’t it the other way around? For example, it doesn’t make much sense to > > wrap OpenMP operations for SPIR-V operations / types. > > I maybe misunderstanding but I thought you want to use something like > the GPU / Affine dialect to represent an OpenMP target region / loop. > That is plausible if you analyze the target region / loop and verify it > fits into the more generic dialect semantics. > > > > I think it is important to specify (in the design) which existing MLIR > > dialects are supported in this effort and the various lowerings / > > transformations / optimizations which are planned for them. > > That I cannot really help you with. TBH, I don't even know what > transformations people plan to do on OpenMP MLIR (and why). > > > > > For starters, you still have to > > > generate proper OpenMP runtime calls, e.g., from your GPU dialect, even > > > if it is "just" to make sure the OMPD/OMPT interfaces expose useful > > > information. > > > > > > > > You can have a well-defined call-like mlir::Operation which calls the GPU > > kernel. Perform all cross-device transformations in an easier way. > > Then, this operation can be lowered to OpenMP runtime calls during LLVM > > dialect conversion. > > You missed my point I made in the other email. An OpenMP target region > is statically not a GPU offload so you should not model it as such "for > some time". 
I did not get your point here. Why isn’t an OpenMP program compiled with a single target (say, nvptx) a GPU offload case? Are you saying that tgt_target might invoke some code other than the GPU kernel even if the user intends to run it on the GPU? Even when there are multiple targets, all the code should get “statically” compiled and kept ready. In any case, unless it is statically proven that the GPU kernel is never executed, the code inside the kernel still needs to be optimized, and my point was that it would be better to do that in MLIR.> > I think this is much better than directly having calls > > to the OpenMP runtime library based on a kernel name mentioned in > > llvm::GlobalVariable. > > (Current) implementation is not semantics. There is no reason not to > change the way we lower OpenMP, e.g., by getting rid of the global > variables. They are present for a reason but not intrinsically required. > See the TRegions for example [6,7], they totally change the GPU lowering, > making it device agnostic and easy to analyze and optimize in the middle > end. Arguing the current encoding of OpenMP in LLVM-IR is problematic is > the same as arguing MLIR's LLVM dialect doesn't support atomic_rmw, it > might be true but its changeable. > > [6] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11 > [7] http://parallel.auckland.ac.nz/iwomp2019/slides_TRegion.pdf > > > > > * I preclude the `omp loop` construct here as it is not even > implemented > > > anywhere as far as I know. > > > > > > > > > > 2. It would also be easier to transform the Loop nests containing > OpenMP > > > > constructs if the body of the OpenMP operations is well defined > (i.e., > > > does > > > > not accept arbitrary loop structures). Having nested redundant > > > "parallel" , > > > > "target" and "do" regions seems unnecessary. > > > > > > As mentioned above, you cannot start with the assumption OpenMP input > is > > > structured this this way. You have to analyze it first. This is the > same > > > reason we cannot simply transform C/C++ `for loops` into `affine.for` > > > without proper analysis of the loop body. > > > > > > Now, more concrete. Nested parallel and target regions are not > > > necessarily redundant, nor can/should we require the user not to have > > > them. Nested parallelism can easily make sense, depending on the > problem > > > decomposition. Nested target will make a lot of sense with reverse > > > offload, which is already in the standard, and it also should be > allowed > > > for the sake of a modular (user) code base. > > > > > > > Just to be clear, having all three of “target”, “parallel” and “do” > doesn’t > > represent “Nested parallelism” at all in the proposed design! ( 2(d) ). > > > > omp.target { > > > > omp.parallel { > > > > omp.do { > > > > …... > > > > } > > > > } > > > > } > > > > Above invokes a call to the tgt_target() for the code inside omp.do as > > mentioned in the proposal. > > I do not follow. Just to make sure, the above should be roughly > equivalent to the code below, correct? There is no "nested" > parallelism, sure, but I thought you were talking about the case where > there is, e.g. add another `#pragma omp parallel` inside the one that already is there. That is nested parallelism which can happen and make > total sense for the application. No, actually. 
I was just saying that, all three could have been merged to one OpenMP operation “target_parallel_for” in MLIR rather than having three and analyzing them.> > #pragma omp target > { > #pragma omp parallel > { > #pragma omp for > for (...) > { > ... > } > } > } >> > > > > > > > 3. There would also be new sets of loop structures in new dialects > when > > > > C/C++ is compiled to MLIR. It would complicate the number of possible > > > > combinations inside the OpenMP region. > > > > > > Is anyone working on this? If so, what is the timeline? I personally > was > > > not expecting Clang to switch over to MLIR any time soon but I am happy > > > if someone wants to correct me on this. I mention this only because it > > > interacts with the arguments I will make below. > > > > > > > > > > E. Lowering of target constructs mentioned in ( 2(d) ) specifies > direct > > > > lowering to LLVM IR ignoring all the advantages that MLIR provides. > Being > > > > able to compile the code for heterogeneous hardware is one of the > biggest > > > > advantages that MLIR brings to the table. That is being completely > missed > > > > here. This also requires solving the problem of handling target > > > information > > > > in MLIR. But that is a problem which needs to be solved anyway. > Using GPU > > > > dialect also gives us an opportunity to represent offloading > semantics in > > > > MLIR. > > > > > > I'm unsure what the problem with "handling target information in MLIR" > is > > > but > > > whatever design we end up with, we need to know about the target > > > (triple) in all stages of the pipeline, even if it is just to pass it > > > down. > > > > > > > > > > Given the ability to represent multiple ModuleOps and the existence > of > > > GPU > > > > dialect, couldn't higher level optimizations on offloaded code be > done at > > > > MLIR level?. The proposed design would lead us to the same problems > that > > > we > > > > are currently facing in LLVM IR. > > > > > > > > Also, OpenMP codegen will automatically benefit from the GPU dialect > > > based > > > > optimizations. For example, it would be way easier to hoist a memory > > > > reference out of GPU kernel in MLIR than in LLVM IR. > > > > > > While I agree with the premise that you can potentially reuse MLIR > > > transformations, it might not be as simple in practice. > > > > > > As mentioned above, you cannot assume much about OpenMP codes, almost > > > nothing for a lot of application codes I have seen. Some examples: > > > > > > If you have a function call, or any synchronization event for that > > > matter, located between two otherwise adjacent target regions (see > > > below), you cannot assume the two target regions will be offloaded to > > > the same device. > > > ``` > > > #omp target > > > {} > > > foo(); > > > #omp target > > > {} > > > ``` > > > > > > > These kinds of optimizations are much easier to write in MLIR: > > > > LLVM IR for the above code would contain a series of instructions of > OpenMP > > runtime call setup and foo() in the middle followed by another set of > > OpenMP runtime related instructions. The body of the two target > constructs > > would be in two different outlined functions (if not modules). > > > > It takes quite a bit of code to do analysis / transformation to write any > > optimization on the generated LLVM IR. > > You are right about the module's being a problem. As I mentioned in my > last email, we are working on that by not having them in different ones > during the optimization pipeline. 
If we make the `target` `parallel` > instead we can simulate that right now. The bodies are in different > functions, sure, but does it matter? Let's walk through parallel region > expansion (see above [Section 5, 1]) so you can judge for yourself: > > ``` > #omp parallel > { body0 } > some_code > #omp parallel > { body1 } > ``` > > will become > > ``` > __kmpc_fork_call(..., @body0_fn, ...) > some_code > __kmpc_fork_call(..., @body1_fn, ...) > ``` > > in IR. Simplified, there are 3 cases here: > 1) some_code is harmless, meaning all of it can be executed redundantly. > 2) parts of some some_code need to be guarded to be sequential but > they can be executed in a parallel region otherwise, e.g., the code > will not observe the difference through runtime calls. > 3) parts of some some_code cannot be executed in a parallel region as > they might observe the difference through runtime calls. > First note that you need to do the classification regardless of your > encoding (=IR). In case of 3) we are done and nothing is happening. > Let's consider case 2) as 1) is basically a special case of it. As shown > in the paper [1], you need to broadcast values created by some_code > across all threads and synchronize appropriately to preserve semantic. > Other than that, the transformation is straight forward: > > A) Create a function "@body01_fn" that is basically the outlined region > in which code is then guarded and __kmpc_fork_call are replaced by > direct calls. It looks like this: > ``` > call @body0_fn(...) > #omp master > some_code > #omp barrier > call @body1_fn(...) > ``` > > B) Replace the region you put in the new function with a > __kmpc_fork_call to it: > ``` > __kmpc_fork_call(..., @body01_fn, ...) > ``` > > C) Done. > > If you are interested in the implementation I'll add you as a reviewer > once I put it on Phab. I'm in the process of cleaning up my stand alone > pass and moving it into the OpenMPOpt pass instead. > > > > vs. > > > > MLIR provides a way to represent the operations closer to the source. It > is > > as simple as checking the next operation(s) in the mlir::Block. OpenMP > > target operation contains an inlined region which can easily be fused/ > > split / or any other valid transformation for that matter. > > > > Note that you can also perform various Control Structure Analysis / > > Transformations much easier in MLIR. For example, you can decide to > execute > > foo() based on certain conditions, and you can merge the two target > regions > > in the else path. > > At the end, it's an encoding difference. Sure, the handling might be > easier in certain situations but all the validity checks, hence code > analyses, are still required. The actual "rewrite" is usually not the > hard part. > > > > > Similarly, you cannot assume a `omp parallel` is allowed to be executed > > > with more than a single thread, or that a `omp [parallel] for` does not > > > have loop carried data-dependences, ... > > > > > > > With multi-dimensional index support for arrays, wouldn’t it be better to > > do the data dependence analysis in MLIR? > > Yes, probably. > > > > LLVM IR has linearized subscripts for multi-dimensional arrays. > > llvm::DependenceAnalysis tries to “guess” the indices based on different > > patterns in SCEV. It takes an intrinsic > > <http://llvm.org/devmtg/2020-04/talks.html#LightningTalk_88> or > metadata or > > some other mechanism of communication from the front end (not the > built-in > > set of instructions) to solve this problem. 
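Coming back to the earlier point about checking the next operation(s) on an mlir::Block, a minimal C++ sketch of what such an adjacency check could look like. The "omp.target" op name and the matching by name are placeholders, since the dialect's ops are still being defined, and all the legality checks are left out; this only shows the shape of the code, not an actual pass.

```
#include "mlir/IR/Block.h"
#include "mlir/IR/Operation.h"

// Returns true if `op` is an "omp.target" op that is immediately followed
// by another "omp.target" op in the same block, i.e., a candidate pair
// for merging the two regions.
static bool hasAdjacentTargetRegion(mlir::Operation *op) {
  if (!op || op->getName().getStringRef() != "omp.target")
    return false;
  mlir::Operation *next = op->getNextNode();
  return next && next->getName().getStringRef() == "omp.target";
}

// A real pass would still have to prove that merging is legal (no
// intervening calls or synchronization with observable effects) before
// splicing the second region's blocks into the first.
```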
> > Not disagreeing with you on this one ;) > > The only caveat is that we still live in a world in which C/C++ is a > thing. > >Doesn’t mean that we should not be having optimizations in MLIR for Fortran :-)> > > > > Data-sharing attributes are also something that has to be treated > > > carefully: > > > ``` > > > x = 5; > > > #omp task > > > x = 3; > > > print(x); > > > ``` > > > Should print 5, not 3. > > > > > > > You can have “x” as a locally defined variable inside the “task” > contained > > region in MLIR OR custom data-sharing attributes in OpenMP dialect. > > I'm not saying it is impossible or even hard, but maybe not as straight > forward as one might think. Your encoding is for example very reasonable. > > In the example below you need to print 3, not 5, e.g., constant prop on > the outer level should not happen. > >Yes, that is why I am concerned about the design considerations of clauses like shared, map, firstprivate, etc.> ``` > x = 5; > #omp task shared(x) > { > x = 3; > some_form_of_sync(); > ... > } > some_form_of_sync(); > print(x); > ``` > > > > > I hope I convinced you that OpenMP is not trivially mappable to > existing > > > dialects without proper analysis. If not, please let me know why you > > > expect it to be. > > > > > I do not see much reason why the issues you mentioned can’t trivially be > > mapped to the MLIR infrastructure. There is an easy way to define custom > > operations / types / attributes in OpenMP dialect and perform > optimizations > > based on the *IR that is created especially for OpenMP*. The analysis / > > transformations required can be easily written on the custom operations > > defined rather than having a lowered form in the LLVM IR. > > You can totally define your OpenMP dialect and map it to that. Mapping > to other dialects is the problematic part.Yes, this is why I have mentioned that RFC should talk about the following: "I think it is important to specify (in the design) which existing MLIR dialects are supported in this effort and the various lowerings / transformations / optimizations which are planned for them." As I mentioned, `omp> parallel` does not mean "parallel" or "dependence-free". > > > Since you mention it, why do you think it is conceptually or practically > harder to write an analysis/transformations on IR? I mean, you teach > your analysis what the op "omp.parallel" means, right? Why not teach an > (interprocedural) analysis what __kmpc_fork_call() does (see [2,3] above)? > > FWIW, there are LLVM analyses and transformations that already know > about the transitive call made by __kmpc_fork_call and pthread_create > (see [4] above). It is done in a way that you can easily annotate your > own C/C++ or IR to make use of it, e.g., for your own transitive > callbacks: > https://clang.llvm.org/docs/AttributeReference.html#callback > https://llvm.org/docs/LangRef.html#callback-metadata > > > > The various dialects / transformations in MLIR are in development / early > > phase (Example, GPU dialect) waiting to be improved with use cases such > as > > this! > > Great! I am eagerly looking forward to this. > > Cheers, > Johannes > > > > > > > Now when it comes to code analyses, LLVM-IR offers a variety of > > > interesting features, ranging from a mature set of passes to the > > > cross-language LTO capabilities. We are working on the missing parts, > > > e.g., heterogeneous llvm::Modules as we speak. 
Simple OpenMP > > > optimizations are already present in LLVM and interesting ones are > > > prototyped for a while now (let me know if you want to see more not-yet > > > merged patches/optimizations). I also have papers, results, and > > > talks that might be interesting here. Let me know if you need pointers > > > to them. > > > > > > > > > Cheers, > > > Johannes > > > > > > > > > > > > On 02/13, Vinay Madhusudan via llvm-dev wrote: > > > > Hi, > > > > > > > > I have few questions / concerns regarding the design of OpenMP > dialect in > > > > MLIR that is currently being implemented, mainly for the f18 > compiler. > > > > Below, I summarize the current state of various efforts in clang / > f18 / > > > > MLIR / LLVM regarding this. Feel free to add to the list in case I > have > > > > missed something. > > > > > > > > 1. [May 2019] An OpenMPIRBuilder in LLVM was proposed for flang and > clang > > > > frontends. Note that this proposal was before considering MLIR for > FIR. > > > > > > > > a. llvm-dev proposal : > > > > > > > > http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-May/000197.html > > > > > > > > b. Patches in review: https://reviews.llvm.org/D70290. This also > > > includes > > > > the clang codegen changes. > > > > > > > > 2. [July - September 2019] OpenMP dialect for MLIR was discussed / > > > > proposed with respect to the f18 compilation stack (keeping FIR in > mind). > > > > > > > > a. flang-dev discussion link: > > > > > https://lists.llvm.org/pipermail/flang-dev/2019-September/000020.html > > > > > > > > b. Design decisions captured in PPT: > > > > > https://drive.google.com/file/d/1vU6LsblsUYGA35B_3y9PmBvtKOTXj1Fu/view > > > > > > > > c. MLIR google groups discussion: > > > > > > > > https://groups.google.com/a/tensorflow.org/forum/#!topic/mlir/4Aj_eawdHiw > > > > > > > > d. Target constructs design: > > > > > > > > http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-September/000285.html > > > > > > > > e. SIMD constructs design: > > > > > > > > http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-September/000278.html > > > > > > > > 3. [Jan 2020] OpenMP dialect RFC in llvm discourse : > > > > https://llvm.discourse.group/t/rfc-openmp-dialect-in-mlir/397 > > > > > > > > 4. [Jan- Feb 2020] Implementation of OpenMP dialect in MLIR: > > > > > > > > a. The first patch which introduces the OpenMP dialect was pushed. > > > > > > > > b. Review of barrier construct is in progress: > > > > https://reviews.llvm.org/D72962 > > > > > > > > I have tried to list below different topics of interest (to different > > > > people) around this work. Most of these are in the design phase (or > very > > > > new) and multiple parties are interested with different sets of > goals in > > > > mind. > > > > > > > > I. Flang frontend and its integration > > > > > > > > II. Fortran representation in MLIR / FIR development > > > > > > > > III. OpenMP development for flang, OpenMP builder in LLVM. > > > > > > > > IV. Loop Transformations in MLIR / LLVM with respect to OpenMP. > > > > > > > > It looks like the design has evolved over time and there is no one > place > > > > which contains the latest design decisions that fits all the > different > > > > pieces of the puzzle. I will try to deduce it from the above > mentioned > > > > references. Please correct me If I am referring to anything which has > > > > changed. > > > > > > > > A. 
For most OpenMP design discussions, FIR examples are used (as > seen in > > > > (2) and (3)). The MLIR examples mentioned in the design only talks > about > > > > FIR dialect and LLVM dialect. > > > > > > > > This completely ignores the likes of standard, affine (where most > loop > > > > transformations are supposed to happen) and loop dialects. I think > it is > > > > critical to decouple the OpenMP dialect development in MLIR from the > > > > current flang / FIR effort. It would be useful if someone can mention > > > these > > > > examples using existing dialects in MLIR and also how the different > > > > transformations / lowerings are planned. > > > > > > > > B. In latest RFC(3), it is mentioned that the initial OpenMP dialect > > > > version will be as follows, > > > > > > > > omp.parallel { > > > > > > > > omp.do { > > > > > > > > fir.do %i = 0 to %ub3 : !fir.integer { > > > > > > > > ... > > > > > > > > } > > > > > > > > } > > > > > > > > } > > > > > > > > and then after the "LLVM conversion" it is converted as follows: > > > > > > > > omp.parallel { > > > > > > > > %ub3 > > > > > > > > omp.do %i = 0 to %ub3 : !llvm.integer { > > > > > > > > ... > > > > > > > > } > > > > > > > > } > > > > > > > > > > > > a. Is it the same omp.do operation which now contains the bounds and > > > > induction variables of the loop after the LLVM conversion? If so, > will > > > the > > > > same operation have two different semantics during a single > compilation? > > > > > > > > b. Will there be different lowerings for various loop operations from > > > > different dialects? loop.for and affine.for under omp operations > would > > > need > > > > different OpenMP / LLVM lowerings. Currently, both of them are > lowered to > > > > the CFG based loops during the LLVM dialect conversion (which is much > > > > before the proposed OpenMP dialect lowering). > > > > > > > > There would be no standard way to represent OpenMP operations > (especially > > > > the ones which involve loops) in MLIR. This would drastically > complicate > > > > lowering. > > > > > > > > C. It is also not mentioned how clauses like firstprivate, shared, > > > private, > > > > reduce, map, etc are lowered to OpenMP dialect. The example in the > RFC > > > > contains FIR and LLVM types and nothing about std dialect types. > Consider > > > > the below example: > > > > > > > > #pragma omp parallel for reduction(+:x) > > > > > > > > for (int i = 0; i < N; ++i) > > > > > > > > x += a[i]; > > > > > > > > How would the above be represented in OpenMP dialect? and What type > would > > > > "x" be in MLIR? It is not mentioned in the design as to how the > various > > > > SSA values for various OpenMP clauses are passed around in OpenMP > > > > operations. > > > > > > > > D. Because of (A), (B) and (C), it would be beneficial to have an > omp. > > > > parallel_do operation which has semantics similar to other loop > > > structures > > > > (may not be LoopLikeInterface) in MLIR. To me, it looks like having > > > OpenMP > > > > operations based on standard MLIR types and operations (scalars and > > > memrefs > > > > mainly) is the right way to go. > > > > > > > > Why not have omp.parallel_do operation with AffineMap based bounds, > so as > > > > to decouple it from Value/Type similar to affine.for? > > > > > > > > 1. With the current design, the number of transformations / > optimizations > > > > that one can write on OpenMP constructs would become limited as > there can > > > > be any custom loop structure with custom operations / types inside > it. 
> > > > > > > > 2. It would also be easier to transform the Loop nests containing > OpenMP > > > > constructs if the body of the OpenMP operations is well defined > (i.e., > > > does > > > > not accept arbitrary loop structures). Having nested redundant > > > "parallel" , > > > > "target" and "do" regions seems unnecessary. > > > > > > > > 3. There would also be new sets of loop structures in new dialects > when > > > > C/C++ is compiled to MLIR. It would complicate the number of possible > > > > combinations inside the OpenMP region. > > > > > > > > E. Lowering of target constructs mentioned in ( 2(d) ) specifies > direct > > > > lowering to LLVM IR ignoring all the advantages that MLIR provides. > Being > > > > able to compile the code for heterogeneous hardware is one of the > biggest > > > > advantages that MLIR brings to the table. That is being completely > missed > > > > here. This also requires solving the problem of handling target > > > information > > > > in MLIR. But that is a problem which needs to be solved anyway. > Using GPU > > > > dialect also gives us an opportunity to represent offloading > semantics in > > > > MLIR. > > > > > > > > Given the ability to represent multiple ModuleOps and the existence > of > > > GPU > > > > dialect, couldn't higher level optimizations on offloaded code be > done at > > > > MLIR level?. The proposed design would lead us to the same problems > that > > > we > > > > are currently facing in LLVM IR. Proposed design seems to allow such > optimizations. > > > > > > > > Also, OpenMP codegen will automatically benefit from the GPU dialect > > > based > > > > optimizations. For example, it would be way easier to hoist a memory > > > > reference out of GPU kernel in MLIR than in LLVM IR. > > > > > -- > > Johannes Doerfert > Researcher > > Argonne National Laboratory > Lemont, IL 60439, USA > > jdoerfert at anl.gov