<mehdi.amini at apple.com>, Bcc: Subject: Re: [llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 -- IR extension Reply-To: In-Reply-To: <20170224221713.GA931 at arch-linux-jd.home> Ping. PS. Are there actually people interested in this? We will continue working anyway but it might not make sense to put it on reviews and announce it on the ML if nobody cares. On 02/24, Johannes Doerfert via llvm-dev wrote:> To (re)start a discussion here we published two patches to the > phabricator: > > 1) A documentation draft (including LangRef) for the PIR instructions > and the parallel regions [0]. > > 2) A draft implementation for a ParallelRegionInfo pass that identifies > and maintains parallel regions in the CFG [1]. It also serves as a > verifier pass for now. > > Please take a look and comment the patches either in the phabricator or > here on the list. > > Thanks! > > [0] https://reviews.llvm.org/D30353 > [1] https://reviews.llvm.org/D30354 > > > On 01/28, Johannes Doerfert via llvm-dev wrote: > > Dear all, > > > > This RFC proposes three new LLVM IR instructions to express high-level > > parallel constructs in a simple, low-level fashion. For this first stage > > we prepared two commits that add the proposed instructions and a pass to > > lower them to obtain sequential IR. Both patches have be uploaded for > > review [1, 2]. The latter patch is very simple and the former consists > > of almost only mechanical changes needed to add new instructions. > > > > The rest of this email contains (1) an introduction of the IR extension > > (2) the reasoning behind this approach, (3) a comparison to other ideas > > proposed so far, (4) a validation of the feasibility and potential > > impact, and (5) an outlook on the next steps. > > > > (1) IR extension: > > Parallel IR adds three new terminator instructions that define the > > beginning and the end of parallel regions in the CFG. A parallel region > > is a connected subgraph of the CFG that is potentially executed by two > > threads in parallel. It can only be entered with a fork instruction and > > spreads till a join instruction is reached. Therefor parallel regions > > are single-entry-multiple-exit regions. Parallel regions can be nested > > and if they are, they form a parallel region tree similar to the loop > > tree maintained by the natural loop info pass. Each parallel region > > defines two independent “sibling” tasks, namely the forked and > > continuation task. > > > > The new instructions are defined as follows: > > > > 1. fork: marks the beginning of parallel region. Every fork has two > > successor blocks which represent two parallel tasks. We call > > these two “sibling” tasks the forked and continuation tasks. > > Nested forking is supported, meaning that another fork can be > > reach prior to the join. > > > > 2. halt: marks the end of a forked task. The "sibling" continuation block > > (see fork above) is the operand of the halt terminator. This > > represents the idea of asymmetric parallelism as introduced by > > [1]. One advantage of asymmetric parallelism is that sequential > > semantics of the program are clear from its CFG (ref. [1]). > > Note that the edge from a forked block to a continuation block > > (the one introduced by the halt) represents the control flow > > when the two successors of a fork execute sequentially, not > > when they execute in parallel. In the latter case there is no > > “control transfer” happening via this edge but only > > synchronization between the tasks. > > > > 3. join: marks a synchronization point and the end of a parallel region. > > Once a join terminator is reached by a thread, execution stops > > in that thread until all tasks spawned by that thread finish > > their work, thus reach their respective halt instruction. A > > join shall only be reached by the continuation task of a fork, > > the forked task shall reach a halt with the continuation as a > > successor. > > > > > > Here is an example of a parallel OpenMP loop and its idiomatic lowering > > to Parallel IR. We set up a wiki [0] with additional examples. > > > > #pragma omp parallel > > for(int i = 0; i < n; ++i) { > > A[i] = C[i]; > > } > > > > > > preheader: > > br label %header > > > > header: > > %i = phi [ i32 0, %preheader ], [ %inc, %latch ] > > %done = icmp ge %i, %n > > br i1 %done, label %exit, label %body > > > > body: > > fork label %task, label %latch > > > > task: > > %aptr = getelementptr i32, i32* %A, i32 0, i32 %i > > %aval = load i32* %aptr > > %cptr = getelementptr i32, i32* %C, i32 0, i32 %i > > store i32 %aval, i32* %aptr > > halt label %latch > > > > latch: > > %inc = add i32, i32 %i, i32 1 > > br label %header > > > > exit: > > join label %afterloop > > > > afterloop: > > ... > > > > > > > > (2) Reasoning: > > The proposed approach is crafted such that the semantics of the parallel > > program is represented correctly in almost native, low-level IR right > > after front-end and preserved at any point till the final lowering to > > sequential IR or parallel runtime library calls. To this end, asymmetric > > parallelism is employed, a concept that uses control flow and the common > > concept of dominance to represent parts of the parallel semantics. In > > this model the parallel tasks do not dominate each other and only one > > parallel task dominates the code after the parallel region. As a > > consequence, various transformations that would break assumptions we > > make about parallel regions cannot happen (see [3,4]). While the > > explicitly modeled control flow together with dominance prevents various > > code motion problems, the use of terminators helps to minimize the > > changes needed to educate passes about parallel regions. Only a fraction > > of analysis and transformation passes deal with terminators explicitly. > > Most passes either test for known terminators (like branches), rely on > > dominance information, or work on a basic block level. To even further > > reduce changes to the existing passes, high-level concepts are broken > > down to already available low-level concepts instead of introducing new, > > semantically rich instructions/intrinsics (see the last paragraph of [5] > > and section 4 in the PIR white paper [6] for examples). Finally, this > > scheme allows a pass to simply reason about the sequential semantics of > > a parallel region, transform it back to one if needed or deemed > > beneficial and employ existing tooling solutions to debug and analyze > > the code [7]. > > > > (3) Comparison: > > The BoF discussion sheet [8] and the recent “[RFC] on IR-level region > > annotations” [9] both list pros and cons of different proposed schemes > > and implementations. We summarize and comment the discussion on the ones > > listen in the recent RFC here: > > (a) Metadata: It seems a consensus has been reached that metadata is > > not the solution but only a means to enhance a different solution. > > (c) One Intrinsic per directive/clause: This approach basically embeds > > a high-level (parallel) language in LLVM IR using intrinsics. It > > seems there is little to no support for this approach at the > > moment. > > (d) Parallel loop/region annotations: Here, intrinsics enclosing a > > parallel loop/region are used to represent parallelism. > > High-level knowledge is represented as attached metadata or in > > separate intrinsics. For more details please see the original > > RFC [9]. In the discussion several potential drawbacks have been > > mentioned: > > - The annotations might be too general [10]. > > - The IR is not semantically correct (or ready for optimization) > > after the front-end and needs an additional “prepare phase for > > pre-privatization" [11]. > > - The currently available “potential side effect for intrinsic > > calls” seem not to suffice for the proposed intrinsics as they > > do not have "call semantics" [12]. > > (b) Parallel instructions (this approach): The table in the region RFC > > [9] lists two drawbacks with this approach, both of which have > > already been called into question [5]. The first drawback is the > > effort needed to implement this scheme which is discussed in > > more detail in section (4) of this mail. The second drawback is > > the need for additional representation of high-level information > > that is not part of the semantics of the new fork-join > > instructions. As mentioned above, the choice to keep the new > > instructions as simple as possible is deliberate. This parallel > > IR is intended to be extensible, and in particular, compatible > > with representations of high-level parallel concepts that might > > be developed in the future. For the time being, the parallel IR > > is compatible with approach taken today of lowering high-level > > parallel linguistics, such as reductions and private memory, to > > existing IR constructs, such as parallel-runtime calls, > > atomicrmw instructions, and well placed alloca’s [5,6]. Although > > other extensions to the IR might allow LLVM to compile these > > higher-level constructs more effectively, we see no reason the > > parallel IR would conflict with any such extensions. (On the > > contrary, the parallel IR would seem to help compiler analyses > > of higher-level parallel constructs by exposing logical > > parallelism.) > > > > > > (4) Feasibility and Impact: > > The Tapir and PIR prototypes demonstrate the feasibility of this > > approach. The Tapir prototype [13] has recently proven its robustness as > > the standard compiler in the MIT class on parallel programming. It was > > implemented in ~ 5k LOC. However, >1k are explicit parallel > > optimization, 1k is used to add new instructions (thus mechanical) and > > 2k are used to lower the parallelism (basically needed for any scheme). > > Only the rest is required to make it work with existing analysis and > > transformation passes. While Tapir added explicit optimization passes > > for parallel regions/loops, the representation allows for a variety of > > classic optimizations (CSE, GVN, LICM, loop unrolling, TRE) to work with > > little to no modifications. Potential speedups compared to a classic > > “early-outlining” approaches can also be seen in the Tapir paper [13]. > > For the PIR prototype [14] we modified only three transformation passes > > (<20 LOC) [15] before we could run the O3 pipeline successfully on a > > parallel matrix multiplication. > > > > Together, these prototypes show how little passes actually inspect new > > (or “unknown”) terminators. The default assumption passes have to make, > > namely that control might be transferred to any successors at runtime, > > has, in terms of potential compiler transformations, a similar effect as > > the parallel semantics we want to model, namely that control is > > transferred to all successors. > > > > > > (5) Outlook: > > This first stage will only introduce and test the new instructions and > > the sequentialization pass. Afterwards we intend to start additions in > > different, partially overlapping but often orthogonal directions. We do > > welcome comments as well as developers for each of them: > > - Analysis and optimization: > > * A “parallel region info” pass to keep track of parallel regions > > and their nesting. The information can be made accessible in a > > “parallel region tree” similar to the loop tree maintained by the > > loop info pass. [stage 1, immediate next goal] > > * Extension of the verifier that allow to check parallel IR for > > “well-formedness”. [stage 1, immediate next goal] > > * Documentation of the PIR instructions in the language reference. > > [stage 1, immediate next goal] > > * A cost analysis for parallel tasks that can be queried by > > optimizations. The cost model needs to take the hardware, the > > runtime library and the parallel tasks into account. > > * Vectorizer enhancements to enable the vectorization of parallel > > * loops and tasks. > > * Parallelization centric optimizations: > > a) Parallel tasks can be balanced, merged or split as well as created > > from and lowered to sequential code. > > b) Barriers can be eliminated. > > c) Parallel loops can be statically scheduled or created from > > parallel recursive calls [13] > > * Analysis to extract high-level information (reductions, private > > memory, ...) from the low-level representation. > > - Front-end: > > * Lowering of simple OpenMP and Cilk++ annotations to PIR, including > > parallel sections and parallel loops with limited support for > > clauses (at first) (examples can be found here [1]). [milestone 1] > > * Generation of PIR code through automatic parallelization. A > > patched version of Polly exists that emits parallel loops using PIR instead of > > OpenMP runtime calls or llvm.parallel.loop metadata. [milestone 1] > > * Representation of more evolved high-level features like assignment > > of computation units. > > - Back-end: > > * Lowering of PIR regions to calls to the OpenMP (GOMP) and Cilk++ > > runtime library. [milestone 1] > > * A simple parallel library, e.g., based on pthreads, to be shipped > > with LLVM as a fallback implementation for parallel regions. > > > > > > Thank you all for your time and hopefully constructive input on this proposal! > > > > Cheers, > > Johannes, on behalf of the PIR team > > > > > > Disclaimer: > > This RFC, the patches, the wiki, etc. are a joint effort by Tao B. > > Schardl (MIT), Charles E. Leiserson (MIT), Kareem Ergawy (Saarland > > University), Simon Moll (Saarland University) and myself. However, ideas > > and feedback came from many people, including the members of the > > LLVM-HPC IR Extensions working group (Hal Finkel, Xinmin Tian, ...), the > > participants in the BoF at the US Developers’ meeting, everybody that > > commented on the BoF discussion sheet [16] and the recent RFC on > > IR-level region annotations [9] (Mehdi Amini, Sanjoy Das, Daniel Berlin, > > ...). > > > > > > > > [0] https://github.com/Parallel-IR/llvm-pir/wiki > > [1] https://reviews.llvm.org/D29250 > > [2] https://reviews.llvm.org/D29251 > > [3] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.html > > [4] http://lists.llvm.org/pipermail/llvm-dev/2015-March/083348.html > > [5] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109264.html > > [6] http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pdf > > [7] http://supertech.csail.mit.edu/papers/spbags.pdf & www.cse.wustl.edu/~angelee/papers/cilkprof.pdf > > [8] https://goo.gl/Blp2Xr > > [9] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.html > > [10] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html > > [11] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109377.html > > [12] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109351.html > > [13] http://wsmoses.com/tapir.pdf > > [14] https://github.com/jdoerfert/llvm-pir/tree/feature/fork-join > > [15] https://github.com/jdoerfert/llvm-pir/commit/854259881d24d71f9f1f17e52547758c7be0618a > > [16] https://goo.gl/wKps3c > > > > > > -- > > > > Johannes Doerfert > > Researcher / PhD Student > > > > Compiler Design Lab (Prof. Hack) > > Saarland Informatics Campus, Germany > > Building E1.3, Room 4.31 > > > > Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de > > Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert > > > > > _______________________________________________ > > LLVM Developers mailing list > > llvm-dev at lists.llvm.org > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > > > -- > > Johannes Doerfert > Researcher / PhD Student > > Compiler Design Lab (Prof. Hack) > Saarland Informatics Campus, Germany > Building E1.3, Room 4.31 > > Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de > Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert> _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-- Johannes Doerfert Researcher / PhD Student Compiler Design Lab (Prof. Hack) Saarland Informatics Campus, Germany Building E1.3, Room 4.31 Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 228 bytes Desc: Digital signature URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170308/6f1feaed/attachment.sig>
On 03/08/2017 07:36 AM, Johannes Doerfert wrote:> <mehdi.amini at apple.com>, > Bcc: > Subject: Re: [llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 -- IR extension > Reply-To: > In-Reply-To: <20170224221713.GA931 at arch-linux-jd.home> > > Ping. > > > > > PS. > > Are there actually people interested in this? > > We will continue working anyway but it might not make sense to put it > on reviews and announce it on the ML if nobody cares.I certainly care, and will also respond to the RFC, this week or next. -Hal> > > > On 02/24, Johannes Doerfert via llvm-dev wrote: >> To (re)start a discussion here we published two patches to the >> phabricator: >> >> 1) A documentation draft (including LangRef) for the PIR instructions >> and the parallel regions [0]. >> >> 2) A draft implementation for a ParallelRegionInfo pass that identifies >> and maintains parallel regions in the CFG [1]. It also serves as a >> verifier pass for now. >> >> Please take a look and comment the patches either in the phabricator or >> here on the list. >> >> Thanks! >> >> [0] https://reviews.llvm.org/D30353 >> [1] https://reviews.llvm.org/D30354 >> >> >> On 01/28, Johannes Doerfert via llvm-dev wrote: >>> Dear all, >>> >>> This RFC proposes three new LLVM IR instructions to express high-level >>> parallel constructs in a simple, low-level fashion. For this first stage >>> we prepared two commits that add the proposed instructions and a pass to >>> lower them to obtain sequential IR. Both patches have be uploaded for >>> review [1, 2]. The latter patch is very simple and the former consists >>> of almost only mechanical changes needed to add new instructions. >>> >>> The rest of this email contains (1) an introduction of the IR extension >>> (2) the reasoning behind this approach, (3) a comparison to other ideas >>> proposed so far, (4) a validation of the feasibility and potential >>> impact, and (5) an outlook on the next steps. >>> >>> (1) IR extension: >>> Parallel IR adds three new terminator instructions that define the >>> beginning and the end of parallel regions in the CFG. A parallel region >>> is a connected subgraph of the CFG that is potentially executed by two >>> threads in parallel. It can only be entered with a fork instruction and >>> spreads till a join instruction is reached. Therefor parallel regions >>> are single-entry-multiple-exit regions. Parallel regions can be nested >>> and if they are, they form a parallel region tree similar to the loop >>> tree maintained by the natural loop info pass. Each parallel region >>> defines two independent “sibling” tasks, namely the forked and >>> continuation task. >>> >>> The new instructions are defined as follows: >>> >>> 1. fork: marks the beginning of parallel region. Every fork has two >>> successor blocks which represent two parallel tasks. We call >>> these two “sibling” tasks the forked and continuation tasks. >>> Nested forking is supported, meaning that another fork can be >>> reach prior to the join. >>> >>> 2. halt: marks the end of a forked task. The "sibling" continuation block >>> (see fork above) is the operand of the halt terminator. This >>> represents the idea of asymmetric parallelism as introduced by >>> [1]. One advantage of asymmetric parallelism is that sequential >>> semantics of the program are clear from its CFG (ref. [1]). >>> Note that the edge from a forked block to a continuation block >>> (the one introduced by the halt) represents the control flow >>> when the two successors of a fork execute sequentially, not >>> when they execute in parallel. In the latter case there is no >>> “control transfer” happening via this edge but only >>> synchronization between the tasks. >>> >>> 3. join: marks a synchronization point and the end of a parallel region. >>> Once a join terminator is reached by a thread, execution stops >>> in that thread until all tasks spawned by that thread finish >>> their work, thus reach their respective halt instruction. A >>> join shall only be reached by the continuation task of a fork, >>> the forked task shall reach a halt with the continuation as a >>> successor. >>> >>> >>> Here is an example of a parallel OpenMP loop and its idiomatic lowering >>> to Parallel IR. We set up a wiki [0] with additional examples. >>> >>> #pragma omp parallel >>> for(int i = 0; i < n; ++i) { >>> A[i] = C[i]; >>> } >>> >>> >>> preheader: >>> br label %header >>> >>> header: >>> %i = phi [ i32 0, %preheader ], [ %inc, %latch ] >>> %done = icmp ge %i, %n >>> br i1 %done, label %exit, label %body >>> >>> body: >>> fork label %task, label %latch >>> >>> task: >>> %aptr = getelementptr i32, i32* %A, i32 0, i32 %i >>> %aval = load i32* %aptr >>> %cptr = getelementptr i32, i32* %C, i32 0, i32 %i >>> store i32 %aval, i32* %aptr >>> halt label %latch >>> >>> latch: >>> %inc = add i32, i32 %i, i32 1 >>> br label %header >>> >>> exit: >>> join label %afterloop >>> >>> afterloop: >>> ... >>> >>> >>> >>> (2) Reasoning: >>> The proposed approach is crafted such that the semantics of the parallel >>> program is represented correctly in almost native, low-level IR right >>> after front-end and preserved at any point till the final lowering to >>> sequential IR or parallel runtime library calls. To this end, asymmetric >>> parallelism is employed, a concept that uses control flow and the common >>> concept of dominance to represent parts of the parallel semantics. In >>> this model the parallel tasks do not dominate each other and only one >>> parallel task dominates the code after the parallel region. As a >>> consequence, various transformations that would break assumptions we >>> make about parallel regions cannot happen (see [3,4]). While the >>> explicitly modeled control flow together with dominance prevents various >>> code motion problems, the use of terminators helps to minimize the >>> changes needed to educate passes about parallel regions. Only a fraction >>> of analysis and transformation passes deal with terminators explicitly. >>> Most passes either test for known terminators (like branches), rely on >>> dominance information, or work on a basic block level. To even further >>> reduce changes to the existing passes, high-level concepts are broken >>> down to already available low-level concepts instead of introducing new, >>> semantically rich instructions/intrinsics (see the last paragraph of [5] >>> and section 4 in the PIR white paper [6] for examples). Finally, this >>> scheme allows a pass to simply reason about the sequential semantics of >>> a parallel region, transform it back to one if needed or deemed >>> beneficial and employ existing tooling solutions to debug and analyze >>> the code [7]. >>> >>> (3) Comparison: >>> The BoF discussion sheet [8] and the recent “[RFC] on IR-level region >>> annotations” [9] both list pros and cons of different proposed schemes >>> and implementations. We summarize and comment the discussion on the ones >>> listen in the recent RFC here: >>> (a) Metadata: It seems a consensus has been reached that metadata is >>> not the solution but only a means to enhance a different solution. >>> (c) One Intrinsic per directive/clause: This approach basically embeds >>> a high-level (parallel) language in LLVM IR using intrinsics. It >>> seems there is little to no support for this approach at the >>> moment. >>> (d) Parallel loop/region annotations: Here, intrinsics enclosing a >>> parallel loop/region are used to represent parallelism. >>> High-level knowledge is represented as attached metadata or in >>> separate intrinsics. For more details please see the original >>> RFC [9]. In the discussion several potential drawbacks have been >>> mentioned: >>> - The annotations might be too general [10]. >>> - The IR is not semantically correct (or ready for optimization) >>> after the front-end and needs an additional “prepare phase for >>> pre-privatization" [11]. >>> - The currently available “potential side effect for intrinsic >>> calls” seem not to suffice for the proposed intrinsics as they >>> do not have "call semantics" [12]. >>> (b) Parallel instructions (this approach): The table in the region RFC >>> [9] lists two drawbacks with this approach, both of which have >>> already been called into question [5]. The first drawback is the >>> effort needed to implement this scheme which is discussed in >>> more detail in section (4) of this mail. The second drawback is >>> the need for additional representation of high-level information >>> that is not part of the semantics of the new fork-join >>> instructions. As mentioned above, the choice to keep the new >>> instructions as simple as possible is deliberate. This parallel >>> IR is intended to be extensible, and in particular, compatible >>> with representations of high-level parallel concepts that might >>> be developed in the future. For the time being, the parallel IR >>> is compatible with approach taken today of lowering high-level >>> parallel linguistics, such as reductions and private memory, to >>> existing IR constructs, such as parallel-runtime calls, >>> atomicrmw instructions, and well placed alloca’s [5,6]. Although >>> other extensions to the IR might allow LLVM to compile these >>> higher-level constructs more effectively, we see no reason the >>> parallel IR would conflict with any such extensions. (On the >>> contrary, the parallel IR would seem to help compiler analyses >>> of higher-level parallel constructs by exposing logical >>> parallelism.) >>> >>> >>> (4) Feasibility and Impact: >>> The Tapir and PIR prototypes demonstrate the feasibility of this >>> approach. The Tapir prototype [13] has recently proven its robustness as >>> the standard compiler in the MIT class on parallel programming. It was >>> implemented in ~ 5k LOC. However, >1k are explicit parallel >>> optimization, 1k is used to add new instructions (thus mechanical) and >>> 2k are used to lower the parallelism (basically needed for any scheme). >>> Only the rest is required to make it work with existing analysis and >>> transformation passes. While Tapir added explicit optimization passes >>> for parallel regions/loops, the representation allows for a variety of >>> classic optimizations (CSE, GVN, LICM, loop unrolling, TRE) to work with >>> little to no modifications. Potential speedups compared to a classic >>> “early-outlining” approaches can also be seen in the Tapir paper [13]. >>> For the PIR prototype [14] we modified only three transformation passes >>> (<20 LOC) [15] before we could run the O3 pipeline successfully on a >>> parallel matrix multiplication. >>> >>> Together, these prototypes show how little passes actually inspect new >>> (or “unknown”) terminators. The default assumption passes have to make, >>> namely that control might be transferred to any successors at runtime, >>> has, in terms of potential compiler transformations, a similar effect as >>> the parallel semantics we want to model, namely that control is >>> transferred to all successors. >>> >>> >>> (5) Outlook: >>> This first stage will only introduce and test the new instructions and >>> the sequentialization pass. Afterwards we intend to start additions in >>> different, partially overlapping but often orthogonal directions. We do >>> welcome comments as well as developers for each of them: >>> - Analysis and optimization: >>> * A “parallel region info” pass to keep track of parallel regions >>> and their nesting. The information can be made accessible in a >>> “parallel region tree” similar to the loop tree maintained by the >>> loop info pass. [stage 1, immediate next goal] >>> * Extension of the verifier that allow to check parallel IR for >>> “well-formedness”. [stage 1, immediate next goal] >>> * Documentation of the PIR instructions in the language reference. >>> [stage 1, immediate next goal] >>> * A cost analysis for parallel tasks that can be queried by >>> optimizations. The cost model needs to take the hardware, the >>> runtime library and the parallel tasks into account. >>> * Vectorizer enhancements to enable the vectorization of parallel >>> * loops and tasks. >>> * Parallelization centric optimizations: >>> a) Parallel tasks can be balanced, merged or split as well as created >>> from and lowered to sequential code. >>> b) Barriers can be eliminated. >>> c) Parallel loops can be statically scheduled or created from >>> parallel recursive calls [13] >>> * Analysis to extract high-level information (reductions, private >>> memory, ...) from the low-level representation. >>> - Front-end: >>> * Lowering of simple OpenMP and Cilk++ annotations to PIR, including >>> parallel sections and parallel loops with limited support for >>> clauses (at first) (examples can be found here [1]). [milestone 1] >>> * Generation of PIR code through automatic parallelization. A >>> patched version of Polly exists that emits parallel loops using PIR instead of >>> OpenMP runtime calls or llvm.parallel.loop metadata. [milestone 1] >>> * Representation of more evolved high-level features like assignment >>> of computation units. >>> - Back-end: >>> * Lowering of PIR regions to calls to the OpenMP (GOMP) and Cilk++ >>> runtime library. [milestone 1] >>> * A simple parallel library, e.g., based on pthreads, to be shipped >>> with LLVM as a fallback implementation for parallel regions. >>> >>> >>> Thank you all for your time and hopefully constructive input on this proposal! >>> >>> Cheers, >>> Johannes, on behalf of the PIR team >>> >>> >>> Disclaimer: >>> This RFC, the patches, the wiki, etc. are a joint effort by Tao B. >>> Schardl (MIT), Charles E. Leiserson (MIT), Kareem Ergawy (Saarland >>> University), Simon Moll (Saarland University) and myself. However, ideas >>> and feedback came from many people, including the members of the >>> LLVM-HPC IR Extensions working group (Hal Finkel, Xinmin Tian, ...), the >>> participants in the BoF at the US Developers’ meeting, everybody that >>> commented on the BoF discussion sheet [16] and the recent RFC on >>> IR-level region annotations [9] (Mehdi Amini, Sanjoy Das, Daniel Berlin, >>> ...). >>> >>> >>> >>> [0] https://github.com/Parallel-IR/llvm-pir/wiki >>> [1] https://reviews.llvm.org/D29250 >>> [2] https://reviews.llvm.org/D29251 >>> [3] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.html >>> [4] http://lists.llvm.org/pipermail/llvm-dev/2015-March/083348.html >>> [5] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109264.html >>> [6] http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pdf >>> [7] http://supertech.csail.mit.edu/papers/spbags.pdf & www.cse.wustl.edu/~angelee/papers/cilkprof.pdf >>> [8] https://goo.gl/Blp2Xr >>> [9] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.html >>> [10] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html >>> [11] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109377.html >>> [12] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109351.html >>> [13] http://wsmoses.com/tapir.pdf >>> [14] https://github.com/jdoerfert/llvm-pir/tree/feature/fork-join >>> [15] https://github.com/jdoerfert/llvm-pir/commit/854259881d24d71f9f1f17e52547758c7be0618a >>> [16] https://goo.gl/wKps3c >>> >>> >>> -- >>> >>> Johannes Doerfert >>> Researcher / PhD Student >>> >>> Compiler Design Lab (Prof. Hack) >>> Saarland Informatics Campus, Germany >>> Building E1.3, Room 4.31 >>> >>> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de >>> Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert >> >> >>> _______________________________________________ >>> LLVM Developers mailing list >>> llvm-dev at lists.llvm.org >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >> >> -- >> >> Johannes Doerfert >> Researcher / PhD Student >> >> Compiler Design Lab (Prof. Hack) >> Saarland Informatics Campus, Germany >> Building E1.3, Room 4.31 >> >> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de >> Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert > > >> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-- Hal Finkel Lead, Compiler Technology and Programming Languages Leadership Computing Facility Argonne National Laboratory
A quick update, we have been looking through all LLVM passes to identify the impact of "IR-region annotation", and interaction issues with the rest of LoopOpt and scalarOpt, e.g. interaction with vectorization when you have schedule(simd:guided: 64). What are the common properties for optimizer to know on IR-region annotations. We have our implementation working from O0, O1, O2 to O3. So far, the changes we made in the existing LLVM opt passes are < 200 LOC. ClangFE to executable end-to-end linked our omp library, and we also updated our implementation using token/tag on the intrinsics based on feedback from Google (David) and Xilinx (Hongbin). I am still owe Mehdi some answers. One feedback for PIR RFC, putting "fork" into loop body does not work for loop vectorizer, it shuts down vectorization, this is the same issue as MIT's scheme. Thanks, Xinmin -----Original Message----- From: Hal Finkel [mailto:hfinkel at anl.gov] Sent: Wednesday, March 8, 2017 5:51 AM To: Johannes Doerfert <doerfert at cs.uni-saarland.de>; LLVM-Dev <llvm-dev at lists.llvm.org> Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Mehdi Amini <mehdi.amini at apple.com>; Chandler Carruth <chandlerc at gmail.com>; Adve, Vikram Sadanand <vadve at illinois.edu>; Tian, Xinmin <xinmin.tian at intel.com>; TB Schardl <neboat at mit.edu>; acjacob at us.ibm.com Subject: Re: On 03/08/2017 07:36 AM, Johannes Doerfert wrote:> <mehdi.amini at apple.com>, > Bcc: > Subject: Re: [llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 -- IR > extension > Reply-To: > In-Reply-To: <20170224221713.GA931 at arch-linux-jd.home> > > Ping. > > > > > PS. > > Are there actually people interested in this? > > We will continue working anyway but it might not make sense to put it > on reviews and announce it on the ML if nobody cares.I certainly care, and will also respond to the RFC, this week or next. -Hal> > > > On 02/24, Johannes Doerfert via llvm-dev wrote: >> To (re)start a discussion here we published two patches to the >> phabricator: >> >> 1) A documentation draft (including LangRef) for the PIR instructions >> and the parallel regions [0]. >> >> 2) A draft implementation for a ParallelRegionInfo pass that identifies >> and maintains parallel regions in the CFG [1]. It also serves as a >> verifier pass for now. >> >> Please take a look and comment the patches either in the phabricator >> or here on the list. >> >> Thanks! >> >> [0] https://reviews.llvm.org/D30353 >> [1] https://reviews.llvm.org/D30354 >> >> >> On 01/28, Johannes Doerfert via llvm-dev wrote: >>> Dear all, >>> >>> This RFC proposes three new LLVM IR instructions to express >>> high-level parallel constructs in a simple, low-level fashion. For >>> this first stage we prepared two commits that add the proposed >>> instructions and a pass to lower them to obtain sequential IR. Both >>> patches have be uploaded for review [1, 2]. The latter patch is very >>> simple and the former consists of almost only mechanical changes needed to add new instructions. >>> >>> The rest of this email contains (1) an introduction of the IR >>> extension >>> (2) the reasoning behind this approach, (3) a comparison to other >>> ideas proposed so far, (4) a validation of the feasibility and >>> potential impact, and (5) an outlook on the next steps. >>> >>> (1) IR extension: >>> Parallel IR adds three new terminator instructions that define the >>> beginning and the end of parallel regions in the CFG. A parallel >>> region is a connected subgraph of the CFG that is potentially >>> executed by two threads in parallel. It can only be entered with a >>> fork instruction and spreads till a join instruction is reached. >>> Therefor parallel regions are single-entry-multiple-exit regions. >>> Parallel regions can be nested and if they are, they form a parallel >>> region tree similar to the loop tree maintained by the natural loop >>> info pass. Each parallel region defines two independent “sibling” >>> tasks, namely the forked and continuation task. >>> >>> The new instructions are defined as follows: >>> >>> 1. fork: marks the beginning of parallel region. Every fork has two >>> successor blocks which represent two parallel tasks. We call >>> these two “sibling” tasks the forked and continuation tasks. >>> Nested forking is supported, meaning that another fork can be >>> reach prior to the join. >>> >>> 2. halt: marks the end of a forked task. The "sibling" continuation block >>> (see fork above) is the operand of the halt terminator. This >>> represents the idea of asymmetric parallelism as introduced by >>> [1]. One advantage of asymmetric parallelism is that sequential >>> semantics of the program are clear from its CFG (ref. [1]). >>> Note that the edge from a forked block to a continuation block >>> (the one introduced by the halt) represents the control flow >>> when the two successors of a fork execute sequentially, not >>> when they execute in parallel. In the latter case there is no >>> “control transfer” happening via this edge but only >>> synchronization between the tasks. >>> >>> 3. join: marks a synchronization point and the end of a parallel region. >>> Once a join terminator is reached by a thread, execution stops >>> in that thread until all tasks spawned by that thread finish >>> their work, thus reach their respective halt instruction. A >>> join shall only be reached by the continuation task of a fork, >>> the forked task shall reach a halt with the continuation as a >>> successor. >>> >>> >>> Here is an example of a parallel OpenMP loop and its idiomatic >>> lowering to Parallel IR. We set up a wiki [0] with additional examples. >>> >>> #pragma omp parallel >>> for(int i = 0; i < n; ++i) { >>> A[i] = C[i]; >>> } >>> >>> >>> preheader: >>> br label %header >>> >>> header: >>> %i = phi [ i32 0, %preheader ], [ %inc, %latch ] >>> %done = icmp ge %i, %n >>> br i1 %done, label %exit, label %body >>> >>> body: >>> fork label %task, label %latch >>> >>> task: >>> %aptr = getelementptr i32, i32* %A, i32 0, i32 %i >>> %aval = load i32* %aptr >>> %cptr = getelementptr i32, i32* %C, i32 0, i32 %i >>> store i32 %aval, i32* %aptr >>> halt label %latch >>> >>> latch: >>> %inc = add i32, i32 %i, i32 1 >>> br label %header >>> >>> exit: >>> join label %afterloop >>> >>> afterloop: >>> ... >>> >>> >>> >>> (2) Reasoning: >>> The proposed approach is crafted such that the semantics of the >>> parallel program is represented correctly in almost native, >>> low-level IR right after front-end and preserved at any point till >>> the final lowering to sequential IR or parallel runtime library >>> calls. To this end, asymmetric parallelism is employed, a concept >>> that uses control flow and the common concept of dominance to >>> represent parts of the parallel semantics. In this model the >>> parallel tasks do not dominate each other and only one parallel task >>> dominates the code after the parallel region. As a consequence, >>> various transformations that would break assumptions we make about >>> parallel regions cannot happen (see [3,4]). While the explicitly >>> modeled control flow together with dominance prevents various code >>> motion problems, the use of terminators helps to minimize the >>> changes needed to educate passes about parallel regions. Only a fraction of analysis and transformation passes deal with terminators explicitly. >>> Most passes either test for known terminators (like branches), rely >>> on dominance information, or work on a basic block level. To even >>> further reduce changes to the existing passes, high-level concepts >>> are broken down to already available low-level concepts instead of >>> introducing new, semantically rich instructions/intrinsics (see the >>> last paragraph of [5] and section 4 in the PIR white paper [6] for >>> examples). Finally, this scheme allows a pass to simply reason about >>> the sequential semantics of a parallel region, transform it back to >>> one if needed or deemed beneficial and employ existing tooling >>> solutions to debug and analyze the code [7]. >>> >>> (3) Comparison: >>> The BoF discussion sheet [8] and the recent “[RFC] on IR-level >>> region annotations” [9] both list pros and cons of different >>> proposed schemes and implementations. We summarize and comment the >>> discussion on the ones listen in the recent RFC here: >>> (a) Metadata: It seems a consensus has been reached that metadata is >>> not the solution but only a means to enhance a different solution. >>> (c) One Intrinsic per directive/clause: This approach basically embeds >>> a high-level (parallel) language in LLVM IR using intrinsics. It >>> seems there is little to no support for this approach at the >>> moment. >>> (d) Parallel loop/region annotations: Here, intrinsics enclosing a >>> parallel loop/region are used to represent parallelism. >>> High-level knowledge is represented as attached metadata or in >>> separate intrinsics. For more details please see the original >>> RFC [9]. In the discussion several potential drawbacks have been >>> mentioned: >>> - The annotations might be too general [10]. >>> - The IR is not semantically correct (or ready for optimization) >>> after the front-end and needs an additional “prepare phase for >>> pre-privatization" [11]. >>> - The currently available “potential side effect for intrinsic >>> calls” seem not to suffice for the proposed intrinsics as they >>> do not have "call semantics" [12]. >>> (b) Parallel instructions (this approach): The table in the region RFC >>> [9] lists two drawbacks with this approach, both of which have >>> already been called into question [5]. The first drawback is the >>> effort needed to implement this scheme which is discussed in >>> more detail in section (4) of this mail. The second drawback is >>> the need for additional representation of high-level information >>> that is not part of the semantics of the new fork-join >>> instructions. As mentioned above, the choice to keep the new >>> instructions as simple as possible is deliberate. This parallel >>> IR is intended to be extensible, and in particular, compatible >>> with representations of high-level parallel concepts that might >>> be developed in the future. For the time being, the parallel IR >>> is compatible with approach taken today of lowering high-level >>> parallel linguistics, such as reductions and private memory, to >>> existing IR constructs, such as parallel-runtime calls, >>> atomicrmw instructions, and well placed alloca’s [5,6]. Although >>> other extensions to the IR might allow LLVM to compile these >>> higher-level constructs more effectively, we see no reason the >>> parallel IR would conflict with any such extensions. (On the >>> contrary, the parallel IR would seem to help compiler analyses >>> of higher-level parallel constructs by exposing logical >>> parallelism.) >>> >>> >>> (4) Feasibility and Impact: >>> The Tapir and PIR prototypes demonstrate the feasibility of this >>> approach. The Tapir prototype [13] has recently proven its >>> robustness as the standard compiler in the MIT class on parallel >>> programming. It was implemented in ~ 5k LOC. However, >1k are >>> explicit parallel optimization, 1k is used to add new instructions >>> (thus mechanical) and 2k are used to lower the parallelism (basically needed for any scheme). >>> Only the rest is required to make it work with existing analysis and >>> transformation passes. While Tapir added explicit optimization >>> passes for parallel regions/loops, the representation allows for a >>> variety of classic optimizations (CSE, GVN, LICM, loop unrolling, >>> TRE) to work with little to no modifications. Potential speedups >>> compared to a classic “early-outlining” approaches can also be seen in the Tapir paper [13]. >>> For the PIR prototype [14] we modified only three transformation >>> passes >>> (<20 LOC) [15] before we could run the O3 pipeline successfully on a >>> parallel matrix multiplication. >>> >>> Together, these prototypes show how little passes actually inspect >>> new (or “unknown”) terminators. The default assumption passes have >>> to make, namely that control might be transferred to any successors >>> at runtime, has, in terms of potential compiler transformations, a >>> similar effect as the parallel semantics we want to model, namely >>> that control is transferred to all successors. >>> >>> >>> (5) Outlook: >>> This first stage will only introduce and test the new instructions >>> and the sequentialization pass. Afterwards we intend to start >>> additions in different, partially overlapping but often orthogonal >>> directions. We do welcome comments as well as developers for each of them: >>> - Analysis and optimization: >>> * A “parallel region info” pass to keep track of parallel regions >>> and their nesting. The information can be made accessible in a >>> “parallel region tree” similar to the loop tree maintained by the >>> loop info pass. [stage 1, immediate next goal] >>> * Extension of the verifier that allow to check parallel IR for >>> “well-formedness”. [stage 1, immediate next goal] >>> * Documentation of the PIR instructions in the language reference. >>> [stage 1, immediate next goal] >>> * A cost analysis for parallel tasks that can be queried by >>> optimizations. The cost model needs to take the hardware, the >>> runtime library and the parallel tasks into account. >>> * Vectorizer enhancements to enable the vectorization of parallel >>> * loops and tasks. >>> * Parallelization centric optimizations: >>> a) Parallel tasks can be balanced, merged or split as well as created >>> from and lowered to sequential code. >>> b) Barriers can be eliminated. >>> c) Parallel loops can be statically scheduled or created from >>> parallel recursive calls [13] >>> * Analysis to extract high-level information (reductions, private >>> memory, ...) from the low-level representation. >>> - Front-end: >>> * Lowering of simple OpenMP and Cilk++ annotations to PIR, including >>> parallel sections and parallel loops with limited support for >>> clauses (at first) (examples can be found here [1]). [milestone 1] >>> * Generation of PIR code through automatic parallelization. A >>> patched version of Polly exists that emits parallel loops using PIR instead of >>> OpenMP runtime calls or llvm.parallel.loop metadata. [milestone 1] >>> * Representation of more evolved high-level features like assignment >>> of computation units. >>> - Back-end: >>> * Lowering of PIR regions to calls to the OpenMP (GOMP) and Cilk++ >>> runtime library. [milestone 1] >>> * A simple parallel library, e.g., based on pthreads, to be shipped >>> with LLVM as a fallback implementation for parallel regions. >>> >>> >>> Thank you all for your time and hopefully constructive input on this proposal! >>> >>> Cheers, >>> Johannes, on behalf of the PIR team >>> >>> >>> Disclaimer: >>> This RFC, the patches, the wiki, etc. are a joint effort by Tao B. >>> Schardl (MIT), Charles E. Leiserson (MIT), Kareem Ergawy (Saarland >>> University), Simon Moll (Saarland University) and myself. However, >>> ideas and feedback came from many people, including the members of >>> the LLVM-HPC IR Extensions working group (Hal Finkel, Xinmin Tian, >>> ...), the participants in the BoF at the US Developers’ meeting, >>> everybody that commented on the BoF discussion sheet [16] and the >>> recent RFC on IR-level region annotations [9] (Mehdi Amini, Sanjoy >>> Das, Daniel Berlin, ...). >>> >>> >>> >>> [0] https://github.com/Parallel-IR/llvm-pir/wiki >>> [1] https://reviews.llvm.org/D29250 >>> [2] https://reviews.llvm.org/D29251 >>> [3] >>> http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.html >>> [4] http://lists.llvm.org/pipermail/llvm-dev/2015-March/083348.html >>> [5] >>> http://lists.llvm.org/pipermail/llvm-dev/2017-January/109264.html >>> [6] >>> http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pdf >>> [7] http://supertech.csail.mit.edu/papers/spbags.pdf & >>> www.cse.wustl.edu/~angelee/papers/cilkprof.pdf >>> [8] https://goo.gl/Blp2Xr >>> [9] >>> http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.html >>> [10] >>> http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html >>> [11] >>> http://lists.llvm.org/pipermail/llvm-dev/2017-January/109377.html >>> [12] >>> http://lists.llvm.org/pipermail/llvm-dev/2017-January/109351.html >>> [13] http://wsmoses.com/tapir.pdf >>> [14] https://github.com/jdoerfert/llvm-pir/tree/feature/fork-join >>> [15] >>> https://github.com/jdoerfert/llvm-pir/commit/854259881d24d71f9f1f17e >>> 52547758c7be0618a >>> [16] https://goo.gl/wKps3c >>> >>> >>> -- >>> >>> Johannes Doerfert >>> Researcher / PhD Student >>> >>> Compiler Design Lab (Prof. Hack) >>> Saarland Informatics Campus, Germany Building E1.3, Room 4.31 >>> >>> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de Fax. +49 >>> (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert >> >> >>> _______________________________________________ >>> LLVM Developers mailing list >>> llvm-dev at lists.llvm.org >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >> >> -- >> >> Johannes Doerfert >> Researcher / PhD Student >> >> Compiler Design Lab (Prof. Hack) >> Saarland Informatics Campus, Germany >> Building E1.3, Room 4.31 >> >> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de Fax. +49 >> (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert > > >> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-- Hal Finkel Lead, Compiler Technology and Programming Languages Leadership Computing Facility Argonne National Laboratory
> On Mar 8, 2017, at 5:36 AM, Johannes Doerfert <doerfert at cs.uni-saarland.de> wrote: > > <mehdi.amini at apple.com>, > Bcc: > Subject: Re: [llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 -- IR extension > Reply-To: > In-Reply-To: <20170224221713.GA931 at arch-linux-jd.home> > > Ping. > > > > > PS. > > Are there actually people interested in this? > > We will continue working anyway but it might not make sense to put it > on reviews and announce it on the ML if nobody cares.I expect this to takes many months, unless it is on the critical path of some core devs, because it seems to me to be a huge feature to consider and quite a time sink (which is why even people that care haven’t been able to invest enough time into this). My impression is that this is major enough that it is unlikely to land in LLVM 5.0, and even before the next US dev meeting. Ideally we’d have enough discussions and review before then that we could have a BoF there to make sure we can move forward with the right representation. I hope the EuroLLVM meeting will be the opportunity for you to get some good feedback on this! Just my 2 cents :) — Mehdi> > > > On 02/24, Johannes Doerfert via llvm-dev wrote: >> To (re)start a discussion here we published two patches to the >> phabricator: >> >> 1) A documentation draft (including LangRef) for the PIR instructions >> and the parallel regions [0]. >> >> 2) A draft implementation for a ParallelRegionInfo pass that identifies >> and maintains parallel regions in the CFG [1]. It also serves as a >> verifier pass for now. >> >> Please take a look and comment the patches either in the phabricator or >> here on the list. >> >> Thanks! >> >> [0] https://reviews.llvm.org/D30353 >> [1] https://reviews.llvm.org/D30354 >> >> >> On 01/28, Johannes Doerfert via llvm-dev wrote: >>> Dear all, >>> >>> This RFC proposes three new LLVM IR instructions to express high-level >>> parallel constructs in a simple, low-level fashion. For this first stage >>> we prepared two commits that add the proposed instructions and a pass to >>> lower them to obtain sequential IR. Both patches have be uploaded for >>> review [1, 2]. The latter patch is very simple and the former consists >>> of almost only mechanical changes needed to add new instructions. >>> >>> The rest of this email contains (1) an introduction of the IR extension >>> (2) the reasoning behind this approach, (3) a comparison to other ideas >>> proposed so far, (4) a validation of the feasibility and potential >>> impact, and (5) an outlook on the next steps. >>> >>> (1) IR extension: >>> Parallel IR adds three new terminator instructions that define the >>> beginning and the end of parallel regions in the CFG. A parallel region >>> is a connected subgraph of the CFG that is potentially executed by two >>> threads in parallel. It can only be entered with a fork instruction and >>> spreads till a join instruction is reached. Therefor parallel regions >>> are single-entry-multiple-exit regions. Parallel regions can be nested >>> and if they are, they form a parallel region tree similar to the loop >>> tree maintained by the natural loop info pass. Each parallel region >>> defines two independent “sibling” tasks, namely the forked and >>> continuation task. >>> >>> The new instructions are defined as follows: >>> >>> 1. fork: marks the beginning of parallel region. Every fork has two >>> successor blocks which represent two parallel tasks. We call >>> these two “sibling” tasks the forked and continuation tasks. >>> Nested forking is supported, meaning that another fork can be >>> reach prior to the join. >>> >>> 2. halt: marks the end of a forked task. The "sibling" continuation block >>> (see fork above) is the operand of the halt terminator. This >>> represents the idea of asymmetric parallelism as introduced by >>> [1]. One advantage of asymmetric parallelism is that sequential >>> semantics of the program are clear from its CFG (ref. [1]). >>> Note that the edge from a forked block to a continuation block >>> (the one introduced by the halt) represents the control flow >>> when the two successors of a fork execute sequentially, not >>> when they execute in parallel. In the latter case there is no >>> “control transfer” happening via this edge but only >>> synchronization between the tasks. >>> >>> 3. join: marks a synchronization point and the end of a parallel region. >>> Once a join terminator is reached by a thread, execution stops >>> in that thread until all tasks spawned by that thread finish >>> their work, thus reach their respective halt instruction. A >>> join shall only be reached by the continuation task of a fork, >>> the forked task shall reach a halt with the continuation as a >>> successor. >>> >>> >>> Here is an example of a parallel OpenMP loop and its idiomatic lowering >>> to Parallel IR. We set up a wiki [0] with additional examples. >>> >>> #pragma omp parallel >>> for(int i = 0; i < n; ++i) { >>> A[i] = C[i]; >>> } >>> >>> >>> preheader: >>> br label %header >>> >>> header: >>> %i = phi [ i32 0, %preheader ], [ %inc, %latch ] >>> %done = icmp ge %i, %n >>> br i1 %done, label %exit, label %body >>> >>> body: >>> fork label %task, label %latch >>> >>> task: >>> %aptr = getelementptr i32, i32* %A, i32 0, i32 %i >>> %aval = load i32* %aptr >>> %cptr = getelementptr i32, i32* %C, i32 0, i32 %i >>> store i32 %aval, i32* %aptr >>> halt label %latch >>> >>> latch: >>> %inc = add i32, i32 %i, i32 1 >>> br label %header >>> >>> exit: >>> join label %afterloop >>> >>> afterloop: >>> ... >>> >>> >>> >>> (2) Reasoning: >>> The proposed approach is crafted such that the semantics of the parallel >>> program is represented correctly in almost native, low-level IR right >>> after front-end and preserved at any point till the final lowering to >>> sequential IR or parallel runtime library calls. To this end, asymmetric >>> parallelism is employed, a concept that uses control flow and the common >>> concept of dominance to represent parts of the parallel semantics. In >>> this model the parallel tasks do not dominate each other and only one >>> parallel task dominates the code after the parallel region. As a >>> consequence, various transformations that would break assumptions we >>> make about parallel regions cannot happen (see [3,4]). While the >>> explicitly modeled control flow together with dominance prevents various >>> code motion problems, the use of terminators helps to minimize the >>> changes needed to educate passes about parallel regions. Only a fraction >>> of analysis and transformation passes deal with terminators explicitly. >>> Most passes either test for known terminators (like branches), rely on >>> dominance information, or work on a basic block level. To even further >>> reduce changes to the existing passes, high-level concepts are broken >>> down to already available low-level concepts instead of introducing new, >>> semantically rich instructions/intrinsics (see the last paragraph of [5] >>> and section 4 in the PIR white paper [6] for examples). Finally, this >>> scheme allows a pass to simply reason about the sequential semantics of >>> a parallel region, transform it back to one if needed or deemed >>> beneficial and employ existing tooling solutions to debug and analyze >>> the code [7]. >>> >>> (3) Comparison: >>> The BoF discussion sheet [8] and the recent “[RFC] on IR-level region >>> annotations” [9] both list pros and cons of different proposed schemes >>> and implementations. We summarize and comment the discussion on the ones >>> listen in the recent RFC here: >>> (a) Metadata: It seems a consensus has been reached that metadata is >>> not the solution but only a means to enhance a different solution. >>> (c) One Intrinsic per directive/clause: This approach basically embeds >>> a high-level (parallel) language in LLVM IR using intrinsics. It >>> seems there is little to no support for this approach at the >>> moment. >>> (d) Parallel loop/region annotations: Here, intrinsics enclosing a >>> parallel loop/region are used to represent parallelism. >>> High-level knowledge is represented as attached metadata or in >>> separate intrinsics. For more details please see the original >>> RFC [9]. In the discussion several potential drawbacks have been >>> mentioned: >>> - The annotations might be too general [10]. >>> - The IR is not semantically correct (or ready for optimization) >>> after the front-end and needs an additional “prepare phase for >>> pre-privatization" [11]. >>> - The currently available “potential side effect for intrinsic >>> calls” seem not to suffice for the proposed intrinsics as they >>> do not have "call semantics" [12]. >>> (b) Parallel instructions (this approach): The table in the region RFC >>> [9] lists two drawbacks with this approach, both of which have >>> already been called into question [5]. The first drawback is the >>> effort needed to implement this scheme which is discussed in >>> more detail in section (4) of this mail. The second drawback is >>> the need for additional representation of high-level information >>> that is not part of the semantics of the new fork-join >>> instructions. As mentioned above, the choice to keep the new >>> instructions as simple as possible is deliberate. This parallel >>> IR is intended to be extensible, and in particular, compatible >>> with representations of high-level parallel concepts that might >>> be developed in the future. For the time being, the parallel IR >>> is compatible with approach taken today of lowering high-level >>> parallel linguistics, such as reductions and private memory, to >>> existing IR constructs, such as parallel-runtime calls, >>> atomicrmw instructions, and well placed alloca’s [5,6]. Although >>> other extensions to the IR might allow LLVM to compile these >>> higher-level constructs more effectively, we see no reason the >>> parallel IR would conflict with any such extensions. (On the >>> contrary, the parallel IR would seem to help compiler analyses >>> of higher-level parallel constructs by exposing logical >>> parallelism.) >>> >>> >>> (4) Feasibility and Impact: >>> The Tapir and PIR prototypes demonstrate the feasibility of this >>> approach. The Tapir prototype [13] has recently proven its robustness as >>> the standard compiler in the MIT class on parallel programming. It was >>> implemented in ~ 5k LOC. However, >1k are explicit parallel >>> optimization, 1k is used to add new instructions (thus mechanical) and >>> 2k are used to lower the parallelism (basically needed for any scheme). >>> Only the rest is required to make it work with existing analysis and >>> transformation passes. While Tapir added explicit optimization passes >>> for parallel regions/loops, the representation allows for a variety of >>> classic optimizations (CSE, GVN, LICM, loop unrolling, TRE) to work with >>> little to no modifications. Potential speedups compared to a classic >>> “early-outlining” approaches can also be seen in the Tapir paper [13]. >>> For the PIR prototype [14] we modified only three transformation passes >>> (<20 LOC) [15] before we could run the O3 pipeline successfully on a >>> parallel matrix multiplication. >>> >>> Together, these prototypes show how little passes actually inspect new >>> (or “unknown”) terminators. The default assumption passes have to make, >>> namely that control might be transferred to any successors at runtime, >>> has, in terms of potential compiler transformations, a similar effect as >>> the parallel semantics we want to model, namely that control is >>> transferred to all successors. >>> >>> >>> (5) Outlook: >>> This first stage will only introduce and test the new instructions and >>> the sequentialization pass. Afterwards we intend to start additions in >>> different, partially overlapping but often orthogonal directions. We do >>> welcome comments as well as developers for each of them: >>> - Analysis and optimization: >>> * A “parallel region info” pass to keep track of parallel regions >>> and their nesting. The information can be made accessible in a >>> “parallel region tree” similar to the loop tree maintained by the >>> loop info pass. [stage 1, immediate next goal] >>> * Extension of the verifier that allow to check parallel IR for >>> “well-formedness”. [stage 1, immediate next goal] >>> * Documentation of the PIR instructions in the language reference. >>> [stage 1, immediate next goal] >>> * A cost analysis for parallel tasks that can be queried by >>> optimizations. The cost model needs to take the hardware, the >>> runtime library and the parallel tasks into account. >>> * Vectorizer enhancements to enable the vectorization of parallel >>> * loops and tasks. >>> * Parallelization centric optimizations: >>> a) Parallel tasks can be balanced, merged or split as well as created >>> from and lowered to sequential code. >>> b) Barriers can be eliminated. >>> c) Parallel loops can be statically scheduled or created from >>> parallel recursive calls [13] >>> * Analysis to extract high-level information (reductions, private >>> memory, ...) from the low-level representation. >>> - Front-end: >>> * Lowering of simple OpenMP and Cilk++ annotations to PIR, including >>> parallel sections and parallel loops with limited support for >>> clauses (at first) (examples can be found here [1]). [milestone 1] >>> * Generation of PIR code through automatic parallelization. A >>> patched version of Polly exists that emits parallel loops using PIR instead of >>> OpenMP runtime calls or llvm.parallel.loop metadata. [milestone 1] >>> * Representation of more evolved high-level features like assignment >>> of computation units. >>> - Back-end: >>> * Lowering of PIR regions to calls to the OpenMP (GOMP) and Cilk++ >>> runtime library. [milestone 1] >>> * A simple parallel library, e.g., based on pthreads, to be shipped >>> with LLVM as a fallback implementation for parallel regions. >>> >>> >>> Thank you all for your time and hopefully constructive input on this proposal! >>> >>> Cheers, >>> Johannes, on behalf of the PIR team >>> >>> >>> Disclaimer: >>> This RFC, the patches, the wiki, etc. are a joint effort by Tao B. >>> Schardl (MIT), Charles E. Leiserson (MIT), Kareem Ergawy (Saarland >>> University), Simon Moll (Saarland University) and myself. However, ideas >>> and feedback came from many people, including the members of the >>> LLVM-HPC IR Extensions working group (Hal Finkel, Xinmin Tian, ...), the >>> participants in the BoF at the US Developers’ meeting, everybody that >>> commented on the BoF discussion sheet [16] and the recent RFC on >>> IR-level region annotations [9] (Mehdi Amini, Sanjoy Das, Daniel Berlin, >>> ...). >>> >>> >>> >>> [0] https://github.com/Parallel-IR/llvm-pir/wiki >>> [1] https://reviews.llvm.org/D29250 >>> [2] https://reviews.llvm.org/D29251 >>> [3] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.html >>> [4] http://lists.llvm.org/pipermail/llvm-dev/2015-March/083348.html >>> [5] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109264.html >>> [6] http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pdf >>> [7] http://supertech.csail.mit.edu/papers/spbags.pdf & www.cse.wustl.edu/~angelee/papers/cilkprof.pdf >>> [8] https://goo.gl/Blp2Xr >>> [9] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.html >>> [10] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html >>> [11] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109377.html >>> [12] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109351.html >>> [13] http://wsmoses.com/tapir.pdf >>> [14] https://github.com/jdoerfert/llvm-pir/tree/feature/fork-join >>> [15] https://github.com/jdoerfert/llvm-pir/commit/854259881d24d71f9f1f17e52547758c7be0618a >>> [16] https://goo.gl/wKps3c >>> >>> >>> -- >>> >>> Johannes Doerfert >>> Researcher / PhD Student >>> >>> Compiler Design Lab (Prof. Hack) >>> Saarland Informatics Campus, Germany >>> Building E1.3, Room 4.31 >>> >>> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de >>> Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert >> >> >> >>> _______________________________________________ >>> LLVM Developers mailing list >>> llvm-dev at lists.llvm.org >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >> >> >> -- >> >> Johannes Doerfert >> Researcher / PhD Student >> >> Compiler Design Lab (Prof. Hack) >> Saarland Informatics Campus, Germany >> Building E1.3, Room 4.31 >> >> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de >> Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert > > > >> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > > > -- > > Johannes Doerfert > Researcher / PhD Student > > Compiler Design Lab (Prof. Hack) > Saarland Informatics Campus, Germany > Building E1.3, Room 4.31 > > Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de > Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert
On 03/08, Mehdi Amini wrote:> > > On Mar 8, 2017, at 5:36 AM, Johannes Doerfert <doerfert at cs.uni-saarland.de> wrote: > > > > <mehdi.amini at apple.com>, > > Bcc: > > Subject: Re: [llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 -- IR extension > > Reply-To: > > In-Reply-To: <20170224221713.GA931 at arch-linux-jd.home> > > > > Ping. > > > > > > > > > > PS. > > > > Are there actually people interested in this? > > > > We will continue working anyway but it might not make sense to put it > > on reviews and announce it on the ML if nobody cares. > > I expect this to takes many months, unless it is on the critical path > of some core devs, because it seems to me to be a huge feature to > consider and quite a time sink (which is why even people that care > haven’t been able to invest enough time into this). > > My impression is that this is major enough that it is unlikely to land > in LLVM 5.0, and even before the next US dev meeting. Ideally we’d > have enough discussions and review before then that we could have a > BoF there to make sure we can move forward with the right > representation. I hope the EuroLLVM meeting will be the opportunity > for you to get some good feedback on this! > > Just my 2 cents :)Thanks Mehdi :) I get your points and I generally agree. My intention was to get some kind of feedback to determine if (*) and possibly when people are going to invest that time. Since I already have one answer in that regard all is good! Cheers, Johannes (*) The RFC and patches were basically uncommented for quite some time now. PS. Maybe we should have scheduled a BoF at EuroLLVM for this general topic. Are interested people coming?> > On 02/24, Johannes Doerfert via llvm-dev wrote: > >> To (re)start a discussion here we published two patches to the > >> phabricator: > >> > >> 1) A documentation draft (including LangRef) for the PIR instructions > >> and the parallel regions [0]. > >> > >> 2) A draft implementation for a ParallelRegionInfo pass that identifies > >> and maintains parallel regions in the CFG [1]. It also serves as a > >> verifier pass for now. > >> > >> Please take a look and comment the patches either in the phabricator or > >> here on the list. > >> > >> Thanks! > >> > >> [0] https://reviews.llvm.org/D30353 > >> [1] https://reviews.llvm.org/D30354 > >> > >> > >> On 01/28, Johannes Doerfert via llvm-dev wrote: > >>> Dear all, > >>> > >>> This RFC proposes three new LLVM IR instructions to express high-level > >>> parallel constructs in a simple, low-level fashion. For this first stage > >>> we prepared two commits that add the proposed instructions and a pass to > >>> lower them to obtain sequential IR. Both patches have be uploaded for > >>> review [1, 2]. The latter patch is very simple and the former consists > >>> of almost only mechanical changes needed to add new instructions. > >>> > >>> The rest of this email contains (1) an introduction of the IR extension > >>> (2) the reasoning behind this approach, (3) a comparison to other ideas > >>> proposed so far, (4) a validation of the feasibility and potential > >>> impact, and (5) an outlook on the next steps. > >>> > >>> (1) IR extension: > >>> Parallel IR adds three new terminator instructions that define the > >>> beginning and the end of parallel regions in the CFG. A parallel region > >>> is a connected subgraph of the CFG that is potentially executed by two > >>> threads in parallel. It can only be entered with a fork instruction and > >>> spreads till a join instruction is reached. Therefor parallel regions > >>> are single-entry-multiple-exit regions. Parallel regions can be nested > >>> and if they are, they form a parallel region tree similar to the loop > >>> tree maintained by the natural loop info pass. Each parallel region > >>> defines two independent “sibling” tasks, namely the forked and > >>> continuation task. > >>> > >>> The new instructions are defined as follows: > >>> > >>> 1. fork: marks the beginning of parallel region. Every fork has two > >>> successor blocks which represent two parallel tasks. We call > >>> these two “sibling” tasks the forked and continuation tasks. > >>> Nested forking is supported, meaning that another fork can be > >>> reach prior to the join. > >>> > >>> 2. halt: marks the end of a forked task. The "sibling" continuation block > >>> (see fork above) is the operand of the halt terminator. This > >>> represents the idea of asymmetric parallelism as introduced by > >>> [1]. One advantage of asymmetric parallelism is that sequential > >>> semantics of the program are clear from its CFG (ref. [1]). > >>> Note that the edge from a forked block to a continuation block > >>> (the one introduced by the halt) represents the control flow > >>> when the two successors of a fork execute sequentially, not > >>> when they execute in parallel. In the latter case there is no > >>> “control transfer” happening via this edge but only > >>> synchronization between the tasks. > >>> > >>> 3. join: marks a synchronization point and the end of a parallel region. > >>> Once a join terminator is reached by a thread, execution stops > >>> in that thread until all tasks spawned by that thread finish > >>> their work, thus reach their respective halt instruction. A > >>> join shall only be reached by the continuation task of a fork, > >>> the forked task shall reach a halt with the continuation as a > >>> successor. > >>> > >>> > >>> Here is an example of a parallel OpenMP loop and its idiomatic lowering > >>> to Parallel IR. We set up a wiki [0] with additional examples. > >>> > >>> #pragma omp parallel > >>> for(int i = 0; i < n; ++i) { > >>> A[i] = C[i]; > >>> } > >>> > >>> > >>> preheader: > >>> br label %header > >>> > >>> header: > >>> %i = phi [ i32 0, %preheader ], [ %inc, %latch ] > >>> %done = icmp ge %i, %n > >>> br i1 %done, label %exit, label %body > >>> > >>> body: > >>> fork label %task, label %latch > >>> > >>> task: > >>> %aptr = getelementptr i32, i32* %A, i32 0, i32 %i > >>> %aval = load i32* %aptr > >>> %cptr = getelementptr i32, i32* %C, i32 0, i32 %i > >>> store i32 %aval, i32* %aptr > >>> halt label %latch > >>> > >>> latch: > >>> %inc = add i32, i32 %i, i32 1 > >>> br label %header > >>> > >>> exit: > >>> join label %afterloop > >>> > >>> afterloop: > >>> ... > >>> > >>> > >>> > >>> (2) Reasoning: > >>> The proposed approach is crafted such that the semantics of the parallel > >>> program is represented correctly in almost native, low-level IR right > >>> after front-end and preserved at any point till the final lowering to > >>> sequential IR or parallel runtime library calls. To this end, asymmetric > >>> parallelism is employed, a concept that uses control flow and the common > >>> concept of dominance to represent parts of the parallel semantics. In > >>> this model the parallel tasks do not dominate each other and only one > >>> parallel task dominates the code after the parallel region. As a > >>> consequence, various transformations that would break assumptions we > >>> make about parallel regions cannot happen (see [3,4]). While the > >>> explicitly modeled control flow together with dominance prevents various > >>> code motion problems, the use of terminators helps to minimize the > >>> changes needed to educate passes about parallel regions. Only a fraction > >>> of analysis and transformation passes deal with terminators explicitly. > >>> Most passes either test for known terminators (like branches), rely on > >>> dominance information, or work on a basic block level. To even further > >>> reduce changes to the existing passes, high-level concepts are broken > >>> down to already available low-level concepts instead of introducing new, > >>> semantically rich instructions/intrinsics (see the last paragraph of [5] > >>> and section 4 in the PIR white paper [6] for examples). Finally, this > >>> scheme allows a pass to simply reason about the sequential semantics of > >>> a parallel region, transform it back to one if needed or deemed > >>> beneficial and employ existing tooling solutions to debug and analyze > >>> the code [7]. > >>> > >>> (3) Comparison: > >>> The BoF discussion sheet [8] and the recent “[RFC] on IR-level region > >>> annotations” [9] both list pros and cons of different proposed schemes > >>> and implementations. We summarize and comment the discussion on the ones > >>> listen in the recent RFC here: > >>> (a) Metadata: It seems a consensus has been reached that metadata is > >>> not the solution but only a means to enhance a different solution. > >>> (c) One Intrinsic per directive/clause: This approach basically embeds > >>> a high-level (parallel) language in LLVM IR using intrinsics. It > >>> seems there is little to no support for this approach at the > >>> moment. > >>> (d) Parallel loop/region annotations: Here, intrinsics enclosing a > >>> parallel loop/region are used to represent parallelism. > >>> High-level knowledge is represented as attached metadata or in > >>> separate intrinsics. For more details please see the original > >>> RFC [9]. In the discussion several potential drawbacks have been > >>> mentioned: > >>> - The annotations might be too general [10]. > >>> - The IR is not semantically correct (or ready for optimization) > >>> after the front-end and needs an additional “prepare phase for > >>> pre-privatization" [11]. > >>> - The currently available “potential side effect for intrinsic > >>> calls” seem not to suffice for the proposed intrinsics as they > >>> do not have "call semantics" [12]. > >>> (b) Parallel instructions (this approach): The table in the region RFC > >>> [9] lists two drawbacks with this approach, both of which have > >>> already been called into question [5]. The first drawback is the > >>> effort needed to implement this scheme which is discussed in > >>> more detail in section (4) of this mail. The second drawback is > >>> the need for additional representation of high-level information > >>> that is not part of the semantics of the new fork-join > >>> instructions. As mentioned above, the choice to keep the new > >>> instructions as simple as possible is deliberate. This parallel > >>> IR is intended to be extensible, and in particular, compatible > >>> with representations of high-level parallel concepts that might > >>> be developed in the future. For the time being, the parallel IR > >>> is compatible with approach taken today of lowering high-level > >>> parallel linguistics, such as reductions and private memory, to > >>> existing IR constructs, such as parallel-runtime calls, > >>> atomicrmw instructions, and well placed alloca’s [5,6]. Although > >>> other extensions to the IR might allow LLVM to compile these > >>> higher-level constructs more effectively, we see no reason the > >>> parallel IR would conflict with any such extensions. (On the > >>> contrary, the parallel IR would seem to help compiler analyses > >>> of higher-level parallel constructs by exposing logical > >>> parallelism.) > >>> > >>> > >>> (4) Feasibility and Impact: > >>> The Tapir and PIR prototypes demonstrate the feasibility of this > >>> approach. The Tapir prototype [13] has recently proven its robustness as > >>> the standard compiler in the MIT class on parallel programming. It was > >>> implemented in ~ 5k LOC. However, >1k are explicit parallel > >>> optimization, 1k is used to add new instructions (thus mechanical) and > >>> 2k are used to lower the parallelism (basically needed for any scheme). > >>> Only the rest is required to make it work with existing analysis and > >>> transformation passes. While Tapir added explicit optimization passes > >>> for parallel regions/loops, the representation allows for a variety of > >>> classic optimizations (CSE, GVN, LICM, loop unrolling, TRE) to work with > >>> little to no modifications. Potential speedups compared to a classic > >>> “early-outlining” approaches can also be seen in the Tapir paper [13]. > >>> For the PIR prototype [14] we modified only three transformation passes > >>> (<20 LOC) [15] before we could run the O3 pipeline successfully on a > >>> parallel matrix multiplication. > >>> > >>> Together, these prototypes show how little passes actually inspect new > >>> (or “unknown”) terminators. The default assumption passes have to make, > >>> namely that control might be transferred to any successors at runtime, > >>> has, in terms of potential compiler transformations, a similar effect as > >>> the parallel semantics we want to model, namely that control is > >>> transferred to all successors. > >>> > >>> > >>> (5) Outlook: > >>> This first stage will only introduce and test the new instructions and > >>> the sequentialization pass. Afterwards we intend to start additions in > >>> different, partially overlapping but often orthogonal directions. We do > >>> welcome comments as well as developers for each of them: > >>> - Analysis and optimization: > >>> * A “parallel region info” pass to keep track of parallel regions > >>> and their nesting. The information can be made accessible in a > >>> “parallel region tree” similar to the loop tree maintained by the > >>> loop info pass. [stage 1, immediate next goal] > >>> * Extension of the verifier that allow to check parallel IR for > >>> “well-formedness”. [stage 1, immediate next goal] > >>> * Documentation of the PIR instructions in the language reference. > >>> [stage 1, immediate next goal] > >>> * A cost analysis for parallel tasks that can be queried by > >>> optimizations. The cost model needs to take the hardware, the > >>> runtime library and the parallel tasks into account. > >>> * Vectorizer enhancements to enable the vectorization of parallel > >>> * loops and tasks. > >>> * Parallelization centric optimizations: > >>> a) Parallel tasks can be balanced, merged or split as well as created > >>> from and lowered to sequential code. > >>> b) Barriers can be eliminated. > >>> c) Parallel loops can be statically scheduled or created from > >>> parallel recursive calls [13] > >>> * Analysis to extract high-level information (reductions, private > >>> memory, ...) from the low-level representation. > >>> - Front-end: > >>> * Lowering of simple OpenMP and Cilk++ annotations to PIR, including > >>> parallel sections and parallel loops with limited support for > >>> clauses (at first) (examples can be found here [1]). [milestone 1] > >>> * Generation of PIR code through automatic parallelization. A > >>> patched version of Polly exists that emits parallel loops using PIR instead of > >>> OpenMP runtime calls or llvm.parallel.loop metadata. [milestone 1] > >>> * Representation of more evolved high-level features like assignment > >>> of computation units. > >>> - Back-end: > >>> * Lowering of PIR regions to calls to the OpenMP (GOMP) and Cilk++ > >>> runtime library. [milestone 1] > >>> * A simple parallel library, e.g., based on pthreads, to be shipped > >>> with LLVM as a fallback implementation for parallel regions. > >>> > >>> > >>> Thank you all for your time and hopefully constructive input on this proposal! > >>> > >>> Cheers, > >>> Johannes, on behalf of the PIR team > >>> > >>> > >>> Disclaimer: > >>> This RFC, the patches, the wiki, etc. are a joint effort by Tao B. > >>> Schardl (MIT), Charles E. Leiserson (MIT), Kareem Ergawy (Saarland > >>> University), Simon Moll (Saarland University) and myself. However, ideas > >>> and feedback came from many people, including the members of the > >>> LLVM-HPC IR Extensions working group (Hal Finkel, Xinmin Tian, ...), the > >>> participants in the BoF at the US Developers’ meeting, everybody that > >>> commented on the BoF discussion sheet [16] and the recent RFC on > >>> IR-level region annotations [9] (Mehdi Amini, Sanjoy Das, Daniel Berlin, > >>> ...). > >>> > >>> > >>> > >>> [0] https://github.com/Parallel-IR/llvm-pir/wiki > >>> [1] https://reviews.llvm.org/D29250 > >>> [2] https://reviews.llvm.org/D29251 > >>> [3] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.html > >>> [4] http://lists.llvm.org/pipermail/llvm-dev/2015-March/083348.html > >>> [5] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109264.html > >>> [6] http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pdf > >>> [7] http://supertech.csail.mit.edu/papers/spbags.pdf & www.cse.wustl.edu/~angelee/papers/cilkprof.pdf > >>> [8] https://goo.gl/Blp2Xr > >>> [9] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.html > >>> [10] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html > >>> [11] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109377.html > >>> [12] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109351.html > >>> [13] http://wsmoses.com/tapir.pdf > >>> [14] https://github.com/jdoerfert/llvm-pir/tree/feature/fork-join > >>> [15] https://github.com/jdoerfert/llvm-pir/commit/854259881d24d71f9f1f17e52547758c7be0618a > >>> [16] https://goo.gl/wKps3c > >>> > >>> > >>> -- > >>> > >>> Johannes Doerfert > >>> Researcher / PhD Student > >>> > >>> Compiler Design Lab (Prof. Hack) > >>> Saarland Informatics Campus, Germany > >>> Building E1.3, Room 4.31 > >>> > >>> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de > >>> Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert > >> > >> > >> > >>> _______________________________________________ > >>> LLVM Developers mailing list > >>> llvm-dev at lists.llvm.org > >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > >> > >> > >> -- > >> > >> Johannes Doerfert > >> Researcher / PhD Student > >> > >> Compiler Design Lab (Prof. Hack) > >> Saarland Informatics Campus, Germany > >> Building E1.3, Room 4.31 > >> > >> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de > >> Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert > > > > > > > >> _______________________________________________ > >> LLVM Developers mailing list > >> llvm-dev at lists.llvm.org > >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > > > > > > -- > > > > Johannes Doerfert > > Researcher / PhD Student > > > > Compiler Design Lab (Prof. Hack) > > Saarland Informatics Campus, Germany > > Building E1.3, Room 4.31 > > > > Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de > > Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert >-- Johannes Doerfert Researcher / PhD Student Compiler Design Lab (Prof. Hack) Saarland Informatics Campus, Germany Building E1.3, Room 4.31 Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 228 bytes Desc: Digital signature URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170308/aee7fbc4/attachment.sig>
> On Mar 8, 2017, at 10:55 AM, Mehdi Amini <mehdi.amini at apple.com> wrote: > >> >> On Mar 8, 2017, at 5:36 AM, Johannes Doerfert <doerfert at cs.uni-saarland.de> wrote: >> >> <mehdi.amini at apple.com>, >> Bcc: >> Subject: Re: [llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 -- IR extension >> Reply-To: >> In-Reply-To: <20170224221713.GA931 at arch-linux-jd.home> >> >> Ping. >> >> PS. >> >> Are there actually people interested in this?I’m definitely interested too. I will have some high-level comments after next week.>> We will continue working anyway but it might not make sense to put it >> on reviews and announce it on the ML if nobody cares. > > I expect this to takes many months, unless it is on the critical path of some core devs, because it seems to me to be a huge feature to consider and quite a time sink (which is why even people that care haven’t been able to invest enough time into this).There are enough commercial and open-source projects working actively on this that I think the issue is reaching critical mass. A few of us — Hal, Jeff (Vetter), Johannes, TB Schardl, Xinmin, I, and a few others — have been discussing design ideas, building on lessons from some of these ongoing projects: * Intel’s new OpenMP compiler * ARES from Los Alamos and Oak Ridge * PIR from Saarland. * TAPIR from MIT * HPVM from Illinois Our goal is to put together a design for how to add representations for parallel programs that can leverage LLVM capabilities and features without requiring excessive changes to the existing infrastructure. We’ll submit design docs, patches for review, and one or more prototype compilers built on the design. We’re having a conference call this week or early next week, and a face-to-face meeting at the end of next week to work on the details.> > My impression is that this is major enough that it is unlikely to land in LLVM 5.0,That’s probably true.> and even before the next US dev meeting.I’m hoping we’ll have info submitted for review well before that.> Ideally we’d have enough discussions and review before then that we could have a BoF there to make sure we can move forward with the right representation. I hope the EuroLLVM meeting will be the opportunity for you to get some good feedback on this!+1 on both. —Vikram // Vikram S. Adve // Professor, Department of Computer Science // University of Illinois at Urbana-Champaign // vadve at illinois.edu // http://llvm.org> > Just my 2 cents :) > > — > Mehdi > > > > > >> >> >> >> On 02/24, Johannes Doerfert via llvm-dev wrote: >>> To (re)start a discussion here we published two patches to the >>> phabricator: >>> >>> 1) A documentation draft (including LangRef) for the PIR instructions >>> and the parallel regions [0]. >>> >>> 2) A draft implementation for a ParallelRegionInfo pass that identifies >>> and maintains parallel regions in the CFG [1]. It also serves as a >>> verifier pass for now. >>> >>> Please take a look and comment the patches either in the phabricator or >>> here on the list. >>> >>> Thanks! >>> >>> [0] https://reviews.llvm.org/D30353 >>> [1] https://reviews.llvm.org/D30354 >>> >>> >>> On 01/28, Johannes Doerfert via llvm-dev wrote: >>>> Dear all, >>>> >>>> This RFC proposes three new LLVM IR instructions to express high-level >>>> parallel constructs in a simple, low-level fashion. For this first stage >>>> we prepared two commits that add the proposed instructions and a pass to >>>> lower them to obtain sequential IR. Both patches have be uploaded for >>>> review [1, 2]. The latter patch is very simple and the former consists >>>> of almost only mechanical changes needed to add new instructions. >>>> >>>> The rest of this email contains (1) an introduction of the IR extension >>>> (2) the reasoning behind this approach, (3) a comparison to other ideas >>>> proposed so far, (4) a validation of the feasibility and potential >>>> impact, and (5) an outlook on the next steps. >>>> >>>> (1) IR extension: >>>> Parallel IR adds three new terminator instructions that define the >>>> beginning and the end of parallel regions in the CFG. A parallel region >>>> is a connected subgraph of the CFG that is potentially executed by two >>>> threads in parallel. It can only be entered with a fork instruction and >>>> spreads till a join instruction is reached. Therefor parallel regions >>>> are single-entry-multiple-exit regions. Parallel regions can be nested >>>> and if they are, they form a parallel region tree similar to the loop >>>> tree maintained by the natural loop info pass. Each parallel region >>>> defines two independent “sibling” tasks, namely the forked and >>>> continuation task. >>>> >>>> The new instructions are defined as follows: >>>> >>>> 1. fork: marks the beginning of parallel region. Every fork has two >>>> successor blocks which represent two parallel tasks. We call >>>> these two “sibling” tasks the forked and continuation tasks. >>>> Nested forking is supported, meaning that another fork can be >>>> reach prior to the join. >>>> >>>> 2. halt: marks the end of a forked task. The "sibling" continuation block >>>> (see fork above) is the operand of the halt terminator. This >>>> represents the idea of asymmetric parallelism as introduced by >>>> [1]. One advantage of asymmetric parallelism is that sequential >>>> semantics of the program are clear from its CFG (ref. [1]). >>>> Note that the edge from a forked block to a continuation block >>>> (the one introduced by the halt) represents the control flow >>>> when the two successors of a fork execute sequentially, not >>>> when they execute in parallel. In the latter case there is no >>>> “control transfer” happening via this edge but only >>>> synchronization between the tasks. >>>> >>>> 3. join: marks a synchronization point and the end of a parallel region. >>>> Once a join terminator is reached by a thread, execution stops >>>> in that thread until all tasks spawned by that thread finish >>>> their work, thus reach their respective halt instruction. A >>>> join shall only be reached by the continuation task of a fork, >>>> the forked task shall reach a halt with the continuation as a >>>> successor. >>>> >>>> >>>> Here is an example of a parallel OpenMP loop and its idiomatic lowering >>>> to Parallel IR. We set up a wiki [0] with additional examples. >>>> >>>> #pragma omp parallel >>>> for(int i = 0; i < n; ++i) { >>>> A[i] = C[i]; >>>> } >>>> >>>> >>>> preheader: >>>> br label %header >>>> >>>> header: >>>> %i = phi [ i32 0, %preheader ], [ %inc, %latch ] >>>> %done = icmp ge %i, %n >>>> br i1 %done, label %exit, label %body >>>> >>>> body: >>>> fork label %task, label %latch >>>> >>>> task: >>>> %aptr = getelementptr i32, i32* %A, i32 0, i32 %i >>>> %aval = load i32* %aptr >>>> %cptr = getelementptr i32, i32* %C, i32 0, i32 %i >>>> store i32 %aval, i32* %aptr >>>> halt label %latch >>>> >>>> latch: >>>> %inc = add i32, i32 %i, i32 1 >>>> br label %header >>>> >>>> exit: >>>> join label %afterloop >>>> >>>> afterloop: >>>> ... >>>> >>>> >>>> >>>> (2) Reasoning: >>>> The proposed approach is crafted such that the semantics of the parallel >>>> program is represented correctly in almost native, low-level IR right >>>> after front-end and preserved at any point till the final lowering to >>>> sequential IR or parallel runtime library calls. To this end, asymmetric >>>> parallelism is employed, a concept that uses control flow and the common >>>> concept of dominance to represent parts of the parallel semantics. In >>>> this model the parallel tasks do not dominate each other and only one >>>> parallel task dominates the code after the parallel region. As a >>>> consequence, various transformations that would break assumptions we >>>> make about parallel regions cannot happen (see [3,4]). While the >>>> explicitly modeled control flow together with dominance prevents various >>>> code motion problems, the use of terminators helps to minimize the >>>> changes needed to educate passes about parallel regions. Only a fraction >>>> of analysis and transformation passes deal with terminators explicitly. >>>> Most passes either test for known terminators (like branches), rely on >>>> dominance information, or work on a basic block level. To even further >>>> reduce changes to the existing passes, high-level concepts are broken >>>> down to already available low-level concepts instead of introducing new, >>>> semantically rich instructions/intrinsics (see the last paragraph of [5] >>>> and section 4 in the PIR white paper [6] for examples). Finally, this >>>> scheme allows a pass to simply reason about the sequential semantics of >>>> a parallel region, transform it back to one if needed or deemed >>>> beneficial and employ existing tooling solutions to debug and analyze >>>> the code [7]. >>>> >>>> (3) Comparison: >>>> The BoF discussion sheet [8] and the recent “[RFC] on IR-level region >>>> annotations” [9] both list pros and cons of different proposed schemes >>>> and implementations. We summarize and comment the discussion on the ones >>>> listen in the recent RFC here: >>>> (a) Metadata: It seems a consensus has been reached that metadata is >>>> not the solution but only a means to enhance a different solution. >>>> (c) One Intrinsic per directive/clause: This approach basically embeds >>>> a high-level (parallel) language in LLVM IR using intrinsics. It >>>> seems there is little to no support for this approach at the >>>> moment. >>>> (d) Parallel loop/region annotations: Here, intrinsics enclosing a >>>> parallel loop/region are used to represent parallelism. >>>> High-level knowledge is represented as attached metadata or in >>>> separate intrinsics. For more details please see the original >>>> RFC [9]. In the discussion several potential drawbacks have been >>>> mentioned: >>>> - The annotations might be too general [10]. >>>> - The IR is not semantically correct (or ready for optimization) >>>> after the front-end and needs an additional “prepare phase for >>>> pre-privatization" [11]. >>>> - The currently available “potential side effect for intrinsic >>>> calls” seem not to suffice for the proposed intrinsics as they >>>> do not have "call semantics" [12]. >>>> (b) Parallel instructions (this approach): The table in the region RFC >>>> [9] lists two drawbacks with this approach, both of which have >>>> already been called into question [5]. The first drawback is the >>>> effort needed to implement this scheme which is discussed in >>>> more detail in section (4) of this mail. The second drawback is >>>> the need for additional representation of high-level information >>>> that is not part of the semantics of the new fork-join >>>> instructions. As mentioned above, the choice to keep the new >>>> instructions as simple as possible is deliberate. This parallel >>>> IR is intended to be extensible, and in particular, compatible >>>> with representations of high-level parallel concepts that might >>>> be developed in the future. For the time being, the parallel IR >>>> is compatible with approach taken today of lowering high-level >>>> parallel linguistics, such as reductions and private memory, to >>>> existing IR constructs, such as parallel-runtime calls, >>>> atomicrmw instructions, and well placed alloca’s [5,6]. Although >>>> other extensions to the IR might allow LLVM to compile these >>>> higher-level constructs more effectively, we see no reason the >>>> parallel IR would conflict with any such extensions. (On the >>>> contrary, the parallel IR would seem to help compiler analyses >>>> of higher-level parallel constructs by exposing logical >>>> parallelism.) >>>> >>>> >>>> (4) Feasibility and Impact: >>>> The Tapir and PIR prototypes demonstrate the feasibility of this >>>> approach. The Tapir prototype [13] has recently proven its robustness as >>>> the standard compiler in the MIT class on parallel programming. It was >>>> implemented in ~ 5k LOC. However, >1k are explicit parallel >>>> optimization, 1k is used to add new instructions (thus mechanical) and >>>> 2k are used to lower the parallelism (basically needed for any scheme). >>>> Only the rest is required to make it work with existing analysis and >>>> transformation passes. While Tapir added explicit optimization passes >>>> for parallel regions/loops, the representation allows for a variety of >>>> classic optimizations (CSE, GVN, LICM, loop unrolling, TRE) to work with >>>> little to no modifications. Potential speedups compared to a classic >>>> “early-outlining” approaches can also be seen in the Tapir paper [13]. >>>> For the PIR prototype [14] we modified only three transformation passes >>>> (<20 LOC) [15] before we could run the O3 pipeline successfully on a >>>> parallel matrix multiplication. >>>> >>>> Together, these prototypes show how little passes actually inspect new >>>> (or “unknown”) terminators. The default assumption passes have to make, >>>> namely that control might be transferred to any successors at runtime, >>>> has, in terms of potential compiler transformations, a similar effect as >>>> the parallel semantics we want to model, namely that control is >>>> transferred to all successors. >>>> >>>> >>>> (5) Outlook: >>>> This first stage will only introduce and test the new instructions and >>>> the sequentialization pass. Afterwards we intend to start additions in >>>> different, partially overlapping but often orthogonal directions. We do >>>> welcome comments as well as developers for each of them: >>>> - Analysis and optimization: >>>> * A “parallel region info” pass to keep track of parallel regions >>>> and their nesting. The information can be made accessible in a >>>> “parallel region tree” similar to the loop tree maintained by the >>>> loop info pass. [stage 1, immediate next goal] >>>> * Extension of the verifier that allow to check parallel IR for >>>> “well-formedness”. [stage 1, immediate next goal] >>>> * Documentation of the PIR instructions in the language reference. >>>> [stage 1, immediate next goal] >>>> * A cost analysis for parallel tasks that can be queried by >>>> optimizations. The cost model needs to take the hardware, the >>>> runtime library and the parallel tasks into account. >>>> * Vectorizer enhancements to enable the vectorization of parallel >>>> * loops and tasks. >>>> * Parallelization centric optimizations: >>>> a) Parallel tasks can be balanced, merged or split as well as created >>>> from and lowered to sequential code. >>>> b) Barriers can be eliminated. >>>> c) Parallel loops can be statically scheduled or created from >>>> parallel recursive calls [13] >>>> * Analysis to extract high-level information (reductions, private >>>> memory, ...) from the low-level representation. >>>> - Front-end: >>>> * Lowering of simple OpenMP and Cilk++ annotations to PIR, including >>>> parallel sections and parallel loops with limited support for >>>> clauses (at first) (examples can be found here [1]). [milestone 1] >>>> * Generation of PIR code through automatic parallelization. A >>>> patched version of Polly exists that emits parallel loops using PIR instead of >>>> OpenMP runtime calls or llvm.parallel.loop metadata. [milestone 1] >>>> * Representation of more evolved high-level features like assignment >>>> of computation units. >>>> - Back-end: >>>> * Lowering of PIR regions to calls to the OpenMP (GOMP) and Cilk++ >>>> runtime library. [milestone 1] >>>> * A simple parallel library, e.g., based on pthreads, to be shipped >>>> with LLVM as a fallback implementation for parallel regions. >>>> >>>> >>>> Thank you all for your time and hopefully constructive input on this proposal! >>>> >>>> Cheers, >>>> Johannes, on behalf of the PIR team >>>> >>>> >>>> Disclaimer: >>>> This RFC, the patches, the wiki, etc. are a joint effort by Tao B. >>>> Schardl (MIT), Charles E. Leiserson (MIT), Kareem Ergawy (Saarland >>>> University), Simon Moll (Saarland University) and myself. However, ideas >>>> and feedback came from many people, including the members of the >>>> LLVM-HPC IR Extensions working group (Hal Finkel, Xinmin Tian, ...), the >>>> participants in the BoF at the US Developers’ meeting, everybody that >>>> commented on the BoF discussion sheet [16] and the recent RFC on >>>> IR-level region annotations [9] (Mehdi Amini, Sanjoy Das, Daniel Berlin, >>>> ...). >>>> >>>> >>>> >>>> [0] https://github.com/Parallel-IR/llvm-pir/wiki >>>> [1] https://reviews.llvm.org/D29250 >>>> [2] https://reviews.llvm.org/D29251 >>>> [3] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.html >>>> [4] http://lists.llvm.org/pipermail/llvm-dev/2015-March/083348.html >>>> [5] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109264.html >>>> [6] http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pdf >>>> [7] http://supertech.csail.mit.edu/papers/spbags.pdf & www.cse.wustl.edu/~angelee/papers/cilkprof.pdf >>>> [8] https://goo.gl/Blp2Xr >>>> [9] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.html >>>> [10] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html >>>> [11] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109377.html >>>> [12] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109351.html >>>> [13] http://wsmoses.com/tapir.pdf >>>> [14] https://github.com/jdoerfert/llvm-pir/tree/feature/fork-join >>>> [15] https://github.com/jdoerfert/llvm-pir/commit/854259881d24d71f9f1f17e52547758c7be0618a >>>> [16] https://goo.gl/wKps3c >>>> >>>> >>>> -- >>>> >>>> Johannes Doerfert >>>> Researcher / PhD Student >>>> >>>> Compiler Design Lab (Prof. Hack) >>>> Saarland Informatics Campus, Germany >>>> Building E1.3, Room 4.31 >>>> >>>> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de >>>> Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert >>> >>> >>> >>>> _______________________________________________ >>>> LLVM Developers mailing list >>>> llvm-dev at lists.llvm.org >>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>> >>> >>> -- >>> >>> Johannes Doerfert >>> Researcher / PhD Student >>> >>> Compiler Design Lab (Prof. Hack) >>> Saarland Informatics Campus, Germany >>> Building E1.3, Room 4.31 >>> >>> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de >>> Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert >> >> >> >>> _______________________________________________ >>> LLVM Developers mailing list >>> llvm-dev at lists.llvm.org >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >> >> >> -- >> >> Johannes Doerfert >> Researcher / PhD Student >> >> Compiler Design Lab (Prof. Hack) >> Saarland Informatics Campus, Germany >> Building E1.3, Room 4.31 >> >> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de >> Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert