Johannes Doerfert via llvm-dev
2018-Jun-07 10:25 UTC
[llvm-dev] [RFC] Abstract Parallel IR Optimizations
This is an RFC to add analyses and transformation passes into LLVM to
optimize programs based on an abstract notion of a parallel region.

  == this is _not_ a proposal to add a new encoding of parallelism ==

We currently perform poorly when it comes to optimizations for parallel
codes. In fact, parallelizing your loops might actually prevent various
optimizations that would have been applied otherwise. One solution to
this problem is to teach the compiler about the semantics of the
parallel representation in use. While this sounds tedious at first, it
turns out that we can perform key optimizations with reasonable
implementation effort (and thereby also reasonable maintenance costs).
However, we have various parallel representations that are already in
use (KMPC, GOMP, CILK runtime, ...) or proposed (Tapir, IntelPIR, ...).

Our proposal seeks to introduce parallelism-specific optimizations for
multiple representations while minimizing the implementation overhead.
This is done through an abstract notion of a parallel region which hides
the actual representation from the analysis and optimization passes. In
the schema below, our current five optimizations (described in detail
here [0]) are shown on the left, the abstract parallel IR interface is
in the middle, and the representation-specific implementations are on
the right.

Optimization                   (A)nalysis/(T)ransformation         Impl.
---------------------------------------------------------------------------
CodePlacementOpt      \  /---> ParallelRegionInfo (A) ---------|-> KMPCImpl (A)
RegionExpander        -\ |                                     |   GOMPImpl (A)
AttributeAnnotator    -|-|---> ParallelCommunicationInfo (A) --/   ...
BarrierElimination    -/ |
VariablePrivatization /  \---> ParallelIR/Builder (T) -----------> KMPCImpl (T)

In our setting, a parallel region can be an outlined function called
through a runtime library, but also a fork-join/attach-reattach region
embedded in otherwise sequential code.
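To make the "outlined function called through a runtime library" case concrete, here is a minimal C++ sketch of what such a lowering looks like from the caller's point of view. It is an illustration only: `mock_fork_call`, `Shared`, `outlined_region` and `scale_all` are hypothetical names, and the mock runs the "threads" one after another instead of calling a real runtime such as KMPC or GOMP.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a runtime entry point that starts a parallel
// region. A real runtime would run `body` on several threads; this mock
// invokes it once per "thread", sequentially, to stay self-contained.
using OutlinedFn = void (*)(int thread_id, int num_threads, void *shared);

void mock_fork_call(int num_threads, OutlinedFn body, void *shared) {
  for (int tid = 0; tid < num_threads; ++tid)
    body(tid, num_threads, shared);
}

// Variables the region communicates through. They are passed by pointer,
// so from the caller's side they escape into an opaque call.
struct Shared {
  std::vector<int> *data;
  const int *scale;
};

// The former loop body after outlining: each "thread" handles a strided
// slice of the iteration space.
void outlined_region(int tid, int num_threads, void *raw) {
  assert(raw != nullptr);
  auto *s = static_cast<Shared *>(raw);
  for (std::size_t i = static_cast<std::size_t>(tid); i < s->data->size();
       i += static_cast<std::size_t>(num_threads))
    (*s->data)[i] *= *s->scale;
}

int scale_all(std::vector<int> &data) {
  int scale = 3;
  Shared s{&data, &scale};
  mock_fork_call(4, outlined_region, &s);
  // A compiler that does not know what the fork call does has to treat the
  // memory reachable from `s` conservatively here, and it does not see the
  // constant value of `scale` inside `outlined_region`.
  return scale;
}
```

In the sequential version of the same loop, the value of `scale` is visible at every use; after outlining, that fact is only recoverable if the compiler understands the runtime call. This is the kind of information the proposed analyses are meant to expose.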
The new optimizations will provide parallelism-specific optimizations to
all of these forms (if applicable). There are various reasons why we
believe this is a worthwhile effort that belongs in the LLVM codebase,
including:

  1) We improve the performance of parallel programs, today.
  2) It serves as a meaningful baseline for future discussions on
     (optimized) parallel representations.
  3) It allows us to determine the pros and cons of the different
     schemes when it comes to actual optimizations and inputs.
  4) It helps to identify problems that might arise once we start to
     transform parallel programs but _before_ we commit to a specific
     representation.

Our prototype for the OpenMP KMPC library (used by clang) already shows
significant speedups for various benchmarks [0]. It also exposed a (to
me) previously unknown problem between restrict/noalias pointers and
(potential) barriers (see Section 3 in [0]).

We are currently in the process of cleaning up the code, extending the
support for OpenMP constructs, and adding a second implementation for
embedded parallel regions. A first horizontal prototype implementation
is, however, already available for review [1].

Inputs of any kind are welcome and reviewers are needed!

Cheers,
  Johannes

[0] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf
[1] https://reviews.llvm.org/D47300

P.S.
  Sorry if you received this message multiple times!
Roger Ferrer Ibáñez via llvm-dev
2018-Jun-12 11:56 UTC
[llvm-dev] [RFC] Abstract Parallel IR Optimizations
Hi Johannes,

apologies in advance if the following questions are silly or don't make
sense. I lack a bit of context here and I'm not sure I fully understand
your proposal.

Currently clang (and flang) are lowering OpenMP when building LLVM IR
(this is because LLVM IR can't express the parallel/concurrent concepts
of OpenMP, so they have to be lowered first). So, can I assume that your
proposal starts off in a context where that lowering is not happening
anymore in the front end but would happen later in an LLVM IR pass? If
so, then you'd be assuming that there is already a way of representing
OpenMP constructs in the LLVM IR. Is my understanding correct here? I
think that the Intel proposal [1] could be one way (not necessarily the
one) to do this (disregarding the fact that it is tailored for OpenMP).
Does this still make sense?

If this is the case, and given that you explicitly state that this is
not a Parallel IR of any sort, is your suggestion to improve
optimisation of OpenMP code based on a "side-car"/ancillary
representation built on top of the existing IR, which as I understand
should already be able to represent OpenMP? But then this looks a bit
redundant to me, so I'm pretty sure one of my assumptions is incorrect.
Unless your auxiliary representation is more of an alternative to the
W-regions [1].

Or, maybe I am completely wrong here: you didn't say anything about the
FE lowering, which would still happen, and then your proposal builds on
top of that. I don't think you meant that, given that your proposal
mentions KMP and GOMP (and the current lowering done by clang targets
only KMP).

Thank you very much,
Roger

[1] https://dl.acm.org/citation.cfm?id=3148191

-- 
Roger Ferrer Ibáñez
Johannes Doerfert via llvm-dev
2018-Jun-12 14:23 UTC
[llvm-dev] [RFC] Abstract Parallel IR Optimizations
Hi Roger,

On 06/12, Roger Ferrer Ibáñez wrote:
> apologies in advance if the questions following are silly or don't
> make sense. I lack a bit of context here and I'm not sure to fully
> understand your proposal.

No worries, I'm glad if people ask questions!

> Currently clang (and flang) are lowering OpenMP when building LLVM IR
> (this is because LLVM IR can't express the parallel/concurrent
> concepts of OpenMP so they have to be lowered first). So, can I assume
> that your proposal starts off in a context where that lowering is not
> happening anymore in the front end but it'd happen later in a LLVM IR
> pass? If so, then you'd be assuming that there is already a way of
> representing OpenMP constructs in the LLVM IR, is my understanding
> correct here? I think that the Intel proposal [1] could be one way
> (not necessarily the one) to do this (disregarding the fact that it is
> tailored for OpenMP), does this still make sense?

My proposal does _not_ assume that we change clang in any way, though it
does not depend on clang's current lowering either. However, the initial
patch [1] will only work with the OpenMP lowering used by clang right
now.

The idea is as follows:

We have different representations of parallelism in the IR, for example
the KMP runtime library calls emitted by clang or the Intel parallel IR
you mentioned. For each of them we write a piece of code that (1)
extracts domain-specific information and (2) allows modifying the
parallel representation. This is the only piece of code that has to be
adapted for each parallel representation we want to optimize. On top of
this are abstract interfaces that expose the information and
modification options to parallel optimization passes. The patch [1] only
contains the attribute annotator, but we have more, as explained in the
paper [0]. The analysis/optimization logic is part of these passes and
is not aware of the underlying representation.
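The layering described above can be sketched in a few lines of self-contained C++. This is a toy model under loud assumptions, not the API of the patch under review: `Module`, `Function`, `Call`, `ParallelRepresentation`, `RuntimeCallImpl` and `annotateRegions` are hypothetical names, the "IR" is a pair of vectors, and the runtime symbol names are used purely as string labels.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for IR (hypothetical, not LLVM's classes).
struct Function {
  std::string name;
  std::vector<std::string> attrs;
};
struct Call {
  std::string callee;
  std::string body_fn; // function passed as the parallel body, if any
};
struct Module {
  std::vector<Function> functions;
  std::vector<Call> calls;
};

// The representation-specific piece: (1) extract information,
// (2) modify the parallel representation.
class ParallelRepresentation {
public:
  virtual ~ParallelRepresentation() = default;
  virtual std::vector<std::string> regionBodies(const Module &m) const = 0;
  virtual void addBodyAttribute(Module &m, const std::string &body,
                                const std::string &attr) const = 0;
};

// One implementation for runtime-call based lowerings; the fork entry
// point is the only thing that differs between runtimes in this toy.
class RuntimeCallImpl : public ParallelRepresentation {
public:
  explicit RuntimeCallImpl(std::string fork_callee)
      : fork_callee_(std::move(fork_callee)) {
    assert(!fork_callee_.empty());
  }
  std::vector<std::string> regionBodies(const Module &m) const override {
    std::vector<std::string> bodies;
    for (const Call &c : m.calls)
      if (c.callee == fork_callee_)
        bodies.push_back(c.body_fn);
    return bodies;
  }
  void addBodyAttribute(Module &m, const std::string &body,
                        const std::string &attr) const override {
    for (Function &f : m.functions)
      if (f.name == body)
        f.attrs.push_back(attr);
  }

private:
  std::string fork_callee_;
};

// A representation-agnostic "pass": it only talks to the interface.
// Returns the number of region bodies it annotated.
int annotateRegions(Module &m, const ParallelRepresentation &rep) {
  int annotated = 0;
  for (const std::string &body : rep.regionBodies(m)) {
    rep.addBodyAttribute(m, body, "toy-parallel-body");
    ++annotated;
  }
  return annotated;
}
```

With this shape, `annotateRegions` can be run unchanged with `RuntimeCallImpl("__kmpc_fork_call")` or `RuntimeCallImpl("GOMP_parallel")`; supporting an embedded parallel IR would mean adding another `ParallelRepresentation` subclass rather than touching the pass.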
We can consequently use the same passes to optimize code that was
lowered to use different parallel runtime libraries (GOMP, KMP, Cilk
runtime, TBB, ...) or into a native parallel IR (of any shape). This is
especially useful as the native parallel IR might not always be usable.
If that happens, we have to fall back to early outlining, that is, to
runtime library calls emitted by the front-end. Even if we at some point
have a native parallel representation that is always used, we can simply
remove the abstraction introduced by this approach but keep the
analyses/optimizations around.

[0] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf
[1] https://reviews.llvm.org/D47300

> If this is the case, and given that you explicitly state that this is
> not a Parallel IR of any sort, is your suggestion to improve
> optimisation of OpenMP code, based on a "side-car"/ancillary
> representation built on top of the existing IR, which as I understand
> should already be able to represent OpenMP? But then this looks a bit
> redundant to me. So I'm pretty sure one of my assumptions is
> incorrect. Unless your auxiliar representation is more an alternative
> to the W-regions [1].
>
> Or, maybe I am completely wrong here: you didn't say anything about
> the FE lowering, which would still happen, and then your proposal
> builds on top of that. I don't think you meant that, given that your
> proposal mentions KMP and GOMP (and the current lowering done by clang
> targets only KMP).

I'm not sure if these paragraphs are still relevant. Does the above
"explanation" answer your questions already? If not, please continue
asking!

Cheers,
  Johannes

> Thank you very much,
> Roger
>
> [1] https://dl.acm.org/citation.cfm?id=3148191

-- 

Johannes Doerfert
PhD Student / Researcher

Compiler Design Lab (Professor Hack) / Argonne National Laboratory
Saarland Informatics Campus, Germany / Lemont, IL 60439, USA
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de / jdoerfert at anl.gov
Fax. +49 (0)681 302-3065  : http://www.cdl.uni-saarland.de/people/doerfert