On 03/08/2017 12:44 PM, Johannes Doerfert wrote:> I don't know who pointed it out first but Mehdi made me aware of it at > CGO. I try to explain it shortly. > > Given the following situation (in pseudo code): > > alloc A[100]; > parallel_for(i = 0; i < 100; i++) > A[i] = f(i); > > acc = 1; > for(i = 0; i < 100; i++) > acc = acc * A[i]; > > Afaik, with your parallel regions there won't be a CFG loop for the > parallel initialization, right? Instead some intrinsics that annotate > the parallel region and then one initialization. I imagine it somewhat > like this: > > token = llvm.parallel.for.start(0, 100, 1) > i = llvm.parallel.for.iterator(token) > A[i] = f(i) > llvm.parallel.for.end(token) > > Which would (in serial semantics) allow an optimization to remove the > second loop as it uses uninitialized (undef) values. In other words, it > is a problem if we assume only one A[i] was initialized but the rest was > not. > > @Mehdi Does that reflect what you explained to some degree?Yes, this is essentially the issue that Sanjoy pointed out on the mailing list (as well). We definitely need semantics that do not have this problem. -Hal> > On 03/08, Tian, Xinmin wrote: >> ".... the problem Mehdi pointed out regarding the missed initializations of array elements, did you comment on that one yet?" >> >> What is the initializations of array elements question? I don't remember this question. Please refresh my memory. Thanks. >> I thought Mehdi's question is more about what are attributes needed for these IR-annotation for other LLVM pass to understand and make sense out of them. >> >> Xinmin >> >> -----Original Message----- >> From: Johannes Doerfert [mailto:doerfert at cs.uni-saarland.de] >> Sent: Wednesday, March 8, 2017 9:58 AM >> To: Tian, Xinmin <xinmin.tian at intel.com> >> Cc: Hal Finkel <hfinkel at anl.gov>; LLVM-Dev <llvm-dev at lists.llvm.org>; Sanjoy Das <sanjoy at playingwithpointers.com>; Mehdi Amini <mehdi.amini at apple.com>; Chandler Carruth <chandlerc at gmail.com>; Adve, Vikram Sadanand <vadve at illinois.edu>; TB Schardl <neboat at mit.edu>; acjacob at us.ibm.com >> Subject: Re: >> >> On 03/08, Tian, Xinmin wrote: >>> A quick update, we have been looking through all LLVM passes to >>> identify the impact of "IR-region annotation", and interaction issues >>> with the rest of LoopOpt and scalarOpt, e.g. interaction with >>> vectorization when you have schedule(simd:guided: 64). What are the >>> common properties for optimizer to know on IR-region annotations. We >>> have our implementation working from O0, O1, O2 to O3. So far, the >>> changes we made in the existing LLVM opt passes are < 200 LOC. ClangFE >>> to executable end-to-end linked our omp library, and we also updated >>> our implementation using token/tag on the intrinsics based on feedback >>> from Google (David) and Xilinx (Hongbin). I am still owe Mehdi some >>> answers. >> Thanks for the update. I will look into the discussion thread again soon. I am especially interested in the problem Mehdi pointed out regarding the missed initializations of array elements, did you comment on that one yet? >> >>> One feedback for PIR RFC, putting "fork" into loop body does not work >>> for loop vectorizer, it shuts down vectorization, this is the same >>> issue as MIT's scheme. >> Totally true. The vectorizer needs to be thought about the fork-join at some point. However, you can for now sequentialize and introduce the llvm.parallel.loop annotations it already knows about. We (Simon Moll and me) are currently discussing how to teach the RV vectorizer to handle fork-join parallel regions in loops but also in otherwise straight line code. Once we make progress we'll let you know! >> >> Thanks! >> Johannes >> >>> -----Original Message----- >>> From: Hal Finkel [mailto:hfinkel at anl.gov] >>> Sent: Wednesday, March 8, 2017 5:51 AM >>> To: Johannes Doerfert <doerfert at cs.uni-saarland.de>; LLVM-Dev >>> <llvm-dev at lists.llvm.org> >>> Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Mehdi Amini >>> <mehdi.amini at apple.com>; Chandler Carruth <chandlerc at gmail.com>; Adve, >>> Vikram Sadanand <vadve at illinois.edu>; Tian, Xinmin >>> <xinmin.tian at intel.com>; TB Schardl <neboat at mit.edu>; >>> acjacob at us.ibm.com >>> Subject: Re: >>> >>> >>> On 03/08/2017 07:36 AM, Johannes Doerfert wrote: >>>> <mehdi.amini at apple.com>, >>>> Bcc: >>>> Subject: Re: [llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 -- IR >>>> extension >>>> Reply-To: >>>> In-Reply-To: <20170224221713.GA931 at arch-linux-jd.home> >>>> >>>> Ping. >>>> >>>> >>>> >>>> >>>> PS. >>>> >>>> Are there actually people interested in this? >>>> >>>> We will continue working anyway but it might not make sense to put it >>>> on reviews and announce it on the ML if nobody cares. >>> I certainly care, and will also respond to the RFC, this week or next. >>> >>> -Hal >>> >>>> >>>> >>>> On 02/24, Johannes Doerfert via llvm-dev wrote: >>>>> To (re)start a discussion here we published two patches to the >>>>> phabricator: >>>>> >>>>> 1) A documentation draft (including LangRef) for the PIR instructions >>>>> and the parallel regions [0]. >>>>> >>>>> 2) A draft implementation for a ParallelRegionInfo pass that identifies >>>>> and maintains parallel regions in the CFG [1]. It also serves as a >>>>> verifier pass for now. >>>>> >>>>> Please take a look and comment the patches either in the >>>>> phabricator or here on the list. >>>>> >>>>> Thanks! >>>>> >>>>> [0] https://reviews.llvm.org/D30353 [1] >>>>> https://reviews.llvm.org/D30354 >>>>> >>>>> >>>>> On 01/28, Johannes Doerfert via llvm-dev wrote: >>>>>> Dear all, >>>>>> >>>>>> This RFC proposes three new LLVM IR instructions to express >>>>>> high-level parallel constructs in a simple, low-level fashion. For >>>>>> this first stage we prepared two commits that add the proposed >>>>>> instructions and a pass to lower them to obtain sequential IR. >>>>>> Both patches have be uploaded for review [1, 2]. The latter patch >>>>>> is very simple and the former consists of almost only mechanical changes needed to add new instructions. >>>>>> >>>>>> The rest of this email contains (1) an introduction of the IR >>>>>> extension >>>>>> (2) the reasoning behind this approach, (3) a comparison to other >>>>>> ideas proposed so far, (4) a validation of the feasibility and >>>>>> potential impact, and (5) an outlook on the next steps. >>>>>> >>>>>> (1) IR extension: >>>>>> Parallel IR adds three new terminator instructions that define the >>>>>> beginning and the end of parallel regions in the CFG. A parallel >>>>>> region is a connected subgraph of the CFG that is potentially >>>>>> executed by two threads in parallel. It can only be entered with a >>>>>> fork instruction and spreads till a join instruction is reached. >>>>>> Therefor parallel regions are single-entry-multiple-exit regions. >>>>>> Parallel regions can be nested and if they are, they form a >>>>>> parallel region tree similar to the loop tree maintained by the >>>>>> natural loop info pass. Each parallel region defines two independent “sibling” >>>>>> tasks, namely the forked and continuation task. >>>>>> >>>>>> The new instructions are defined as follows: >>>>>> >>>>>> 1. fork: marks the beginning of parallel region. Every fork has two >>>>>> successor blocks which represent two parallel tasks. We call >>>>>> these two “sibling” tasks the forked and continuation tasks. >>>>>> Nested forking is supported, meaning that another fork can be >>>>>> reach prior to the join. >>>>>> >>>>>> 2. halt: marks the end of a forked task. The "sibling" continuation block >>>>>> (see fork above) is the operand of the halt terminator. This >>>>>> represents the idea of asymmetric parallelism as introduced by >>>>>> [1]. One advantage of asymmetric parallelism is that sequential >>>>>> semantics of the program are clear from its CFG (ref. [1]). >>>>>> Note that the edge from a forked block to a continuation block >>>>>> (the one introduced by the halt) represents the control flow >>>>>> when the two successors of a fork execute sequentially, not >>>>>> when they execute in parallel. In the latter case there is no >>>>>> “control transfer” happening via this edge but only >>>>>> synchronization between the tasks. >>>>>> >>>>>> 3. join: marks a synchronization point and the end of a parallel region. >>>>>> Once a join terminator is reached by a thread, execution stops >>>>>> in that thread until all tasks spawned by that thread finish >>>>>> their work, thus reach their respective halt instruction. A >>>>>> join shall only be reached by the continuation task of a fork, >>>>>> the forked task shall reach a halt with the continuation as a >>>>>> successor. >>>>>> >>>>>> >>>>>> Here is an example of a parallel OpenMP loop and its idiomatic >>>>>> lowering to Parallel IR. We set up a wiki [0] with additional examples. >>>>>> >>>>>> #pragma omp parallel >>>>>> for(int i = 0; i < n; ++i) { >>>>>> A[i] = C[i]; >>>>>> } >>>>>> >>>>>> >>>>>> preheader: >>>>>> br label %header >>>>>> >>>>>> header: >>>>>> %i = phi [ i32 0, %preheader ], [ %inc, %latch ] >>>>>> %done = icmp ge %i, %n >>>>>> br i1 %done, label %exit, label %body >>>>>> >>>>>> body: >>>>>> fork label %task, label %latch >>>>>> >>>>>> task: >>>>>> %aptr = getelementptr i32, i32* %A, i32 0, i32 %i >>>>>> %aval = load i32* %aptr >>>>>> %cptr = getelementptr i32, i32* %C, i32 0, i32 %i >>>>>> store i32 %aval, i32* %aptr >>>>>> halt label %latch >>>>>> >>>>>> latch: >>>>>> %inc = add i32, i32 %i, i32 1 >>>>>> br label %header >>>>>> >>>>>> exit: >>>>>> join label %afterloop >>>>>> >>>>>> afterloop: >>>>>> ... >>>>>> >>>>>> >>>>>> >>>>>> (2) Reasoning: >>>>>> The proposed approach is crafted such that the semantics of the >>>>>> parallel program is represented correctly in almost native, >>>>>> low-level IR right after front-end and preserved at any point till >>>>>> the final lowering to sequential IR or parallel runtime library >>>>>> calls. To this end, asymmetric parallelism is employed, a concept >>>>>> that uses control flow and the common concept of dominance to >>>>>> represent parts of the parallel semantics. In this model the >>>>>> parallel tasks do not dominate each other and only one parallel >>>>>> task dominates the code after the parallel region. As a >>>>>> consequence, various transformations that would break assumptions >>>>>> we make about parallel regions cannot happen (see [3,4]). While >>>>>> the explicitly modeled control flow together with dominance >>>>>> prevents various code motion problems, the use of terminators >>>>>> helps to minimize the changes needed to educate passes about parallel regions. Only a fraction of analysis and transformation passes deal with terminators explicitly. >>>>>> Most passes either test for known terminators (like branches), >>>>>> rely on dominance information, or work on a basic block level. To >>>>>> even further reduce changes to the existing passes, high-level >>>>>> concepts are broken down to already available low-level concepts >>>>>> instead of introducing new, semantically rich >>>>>> instructions/intrinsics (see the last paragraph of [5] and section >>>>>> 4 in the PIR white paper [6] for examples). Finally, this scheme >>>>>> allows a pass to simply reason about the sequential semantics of a >>>>>> parallel region, transform it back to one if needed or deemed >>>>>> beneficial and employ existing tooling solutions to debug and analyze the code [7]. >>>>>> >>>>>> (3) Comparison: >>>>>> The BoF discussion sheet [8] and the recent “[RFC] on IR-level >>>>>> region annotations” [9] both list pros and cons of different >>>>>> proposed schemes and implementations. We summarize and comment the >>>>>> discussion on the ones listen in the recent RFC here: >>>>>> (a) Metadata: It seems a consensus has been reached that metadata is >>>>>> not the solution but only a means to enhance a different solution. >>>>>> (c) One Intrinsic per directive/clause: This approach basically embeds >>>>>> a high-level (parallel) language in LLVM IR using intrinsics. It >>>>>> seems there is little to no support for this approach at the >>>>>> moment. >>>>>> (d) Parallel loop/region annotations: Here, intrinsics enclosing a >>>>>> parallel loop/region are used to represent parallelism. >>>>>> High-level knowledge is represented as attached metadata or in >>>>>> separate intrinsics. For more details please see the original >>>>>> RFC [9]. In the discussion several potential drawbacks have been >>>>>> mentioned: >>>>>> - The annotations might be too general [10]. >>>>>> - The IR is not semantically correct (or ready for optimization) >>>>>> after the front-end and needs an additional “prepare phase for >>>>>> pre-privatization" [11]. >>>>>> - The currently available “potential side effect for intrinsic >>>>>> calls” seem not to suffice for the proposed intrinsics as they >>>>>> do not have "call semantics" [12]. >>>>>> (b) Parallel instructions (this approach): The table in the region RFC >>>>>> [9] lists two drawbacks with this approach, both of which have >>>>>> already been called into question [5]. The first drawback is the >>>>>> effort needed to implement this scheme which is discussed in >>>>>> more detail in section (4) of this mail. The second drawback is >>>>>> the need for additional representation of high-level information >>>>>> that is not part of the semantics of the new fork-join >>>>>> instructions. As mentioned above, the choice to keep the new >>>>>> instructions as simple as possible is deliberate. This parallel >>>>>> IR is intended to be extensible, and in particular, compatible >>>>>> with representations of high-level parallel concepts that might >>>>>> be developed in the future. For the time being, the parallel IR >>>>>> is compatible with approach taken today of lowering high-level >>>>>> parallel linguistics, such as reductions and private memory, to >>>>>> existing IR constructs, such as parallel-runtime calls, >>>>>> atomicrmw instructions, and well placed alloca’s [5,6]. Although >>>>>> other extensions to the IR might allow LLVM to compile these >>>>>> higher-level constructs more effectively, we see no reason the >>>>>> parallel IR would conflict with any such extensions. (On the >>>>>> contrary, the parallel IR would seem to help compiler analyses >>>>>> of higher-level parallel constructs by exposing logical >>>>>> parallelism.) >>>>>> >>>>>> >>>>>> (4) Feasibility and Impact: >>>>>> The Tapir and PIR prototypes demonstrate the feasibility of this >>>>>> approach. The Tapir prototype [13] has recently proven its >>>>>> robustness as the standard compiler in the MIT class on parallel >>>>>> programming. It was implemented in ~ 5k LOC. However, >1k are >>>>>> explicit parallel optimization, 1k is used to add new instructions >>>>>> (thus mechanical) and 2k are used to lower the parallelism (basically needed for any scheme). >>>>>> Only the rest is required to make it work with existing analysis >>>>>> and transformation passes. While Tapir added explicit optimization >>>>>> passes for parallel regions/loops, the representation allows for a >>>>>> variety of classic optimizations (CSE, GVN, LICM, loop unrolling, >>>>>> TRE) to work with little to no modifications. Potential speedups >>>>>> compared to a classic “early-outlining” approaches can also be seen in the Tapir paper [13]. >>>>>> For the PIR prototype [14] we modified only three transformation >>>>>> passes >>>>>> (<20 LOC) [15] before we could run the O3 pipeline successfully on >>>>>> a parallel matrix multiplication. >>>>>> >>>>>> Together, these prototypes show how little passes actually inspect >>>>>> new (or “unknown”) terminators. The default assumption passes have >>>>>> to make, namely that control might be transferred to any >>>>>> successors at runtime, has, in terms of potential compiler >>>>>> transformations, a similar effect as the parallel semantics we >>>>>> want to model, namely that control is transferred to all successors. >>>>>> >>>>>> >>>>>> (5) Outlook: >>>>>> This first stage will only introduce and test the new instructions >>>>>> and the sequentialization pass. Afterwards we intend to start >>>>>> additions in different, partially overlapping but often orthogonal >>>>>> directions. We do welcome comments as well as developers for each of them: >>>>>> - Analysis and optimization: >>>>>> * A “parallel region info” pass to keep track of parallel regions >>>>>> and their nesting. The information can be made accessible in a >>>>>> “parallel region tree” similar to the loop tree maintained by the >>>>>> loop info pass. [stage 1, immediate next goal] >>>>>> * Extension of the verifier that allow to check parallel IR for >>>>>> “well-formedness”. [stage 1, immediate next goal] >>>>>> * Documentation of the PIR instructions in the language reference. >>>>>> [stage 1, immediate next goal] >>>>>> * A cost analysis for parallel tasks that can be queried by >>>>>> optimizations. The cost model needs to take the hardware, the >>>>>> runtime library and the parallel tasks into account. >>>>>> * Vectorizer enhancements to enable the vectorization of parallel >>>>>> * loops and tasks. >>>>>> * Parallelization centric optimizations: >>>>>> a) Parallel tasks can be balanced, merged or split as well as created >>>>>> from and lowered to sequential code. >>>>>> b) Barriers can be eliminated. >>>>>> c) Parallel loops can be statically scheduled or created from >>>>>> parallel recursive calls [13] >>>>>> * Analysis to extract high-level information (reductions, private >>>>>> memory, ...) from the low-level representation. >>>>>> - Front-end: >>>>>> * Lowering of simple OpenMP and Cilk++ annotations to PIR, including >>>>>> parallel sections and parallel loops with limited support for >>>>>> clauses (at first) (examples can be found here [1]). [milestone 1] >>>>>> * Generation of PIR code through automatic parallelization. A >>>>>> patched version of Polly exists that emits parallel loops using PIR instead of >>>>>> OpenMP runtime calls or llvm.parallel.loop metadata. [milestone 1] >>>>>> * Representation of more evolved high-level features like assignment >>>>>> of computation units. >>>>>> - Back-end: >>>>>> * Lowering of PIR regions to calls to the OpenMP (GOMP) and Cilk++ >>>>>> runtime library. [milestone 1] >>>>>> * A simple parallel library, e.g., based on pthreads, to be shipped >>>>>> with LLVM as a fallback implementation for parallel regions. >>>>>> >>>>>> >>>>>> Thank you all for your time and hopefully constructive input on this proposal! >>>>>> >>>>>> Cheers, >>>>>> Johannes, on behalf of the PIR team >>>>>> >>>>>> >>>>>> Disclaimer: >>>>>> This RFC, the patches, the wiki, etc. are a joint effort by Tao B. >>>>>> Schardl (MIT), Charles E. Leiserson (MIT), Kareem Ergawy (Saarland >>>>>> University), Simon Moll (Saarland University) and myself. However, >>>>>> ideas and feedback came from many people, including the members of >>>>>> the LLVM-HPC IR Extensions working group (Hal Finkel, Xinmin Tian, >>>>>> ...), the participants in the BoF at the US Developers’ meeting, >>>>>> everybody that commented on the BoF discussion sheet [16] and the >>>>>> recent RFC on IR-level region annotations [9] (Mehdi Amini, Sanjoy >>>>>> Das, Daniel Berlin, ...). >>>>>> >>>>>> >>>>>> >>>>>> [0] https://github.com/Parallel-IR/llvm-pir/wiki >>>>>> [1] https://reviews.llvm.org/D29250 [2] >>>>>> https://reviews.llvm.org/D29251 [3] >>>>>> http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.html >>>>>> [4] >>>>>> http://lists.llvm.org/pipermail/llvm-dev/2015-March/083348.html >>>>>> [5] >>>>>> http://lists.llvm.org/pipermail/llvm-dev/2017-January/109264.html >>>>>> [6] >>>>>> http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pd >>>>>> f [7] http://supertech.csail.mit.edu/papers/spbags.pdf & >>>>>> www.cse.wustl.edu/~angelee/papers/cilkprof.pdf >>>>>> [8] https://goo.gl/Blp2Xr >>>>>> [9] >>>>>> http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.html >>>>>> [10] >>>>>> http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html >>>>>> [11] >>>>>> http://lists.llvm.org/pipermail/llvm-dev/2017-January/109377.html >>>>>> [12] >>>>>> http://lists.llvm.org/pipermail/llvm-dev/2017-January/109351.html >>>>>> [13] http://wsmoses.com/tapir.pdf >>>>>> [14] https://github.com/jdoerfert/llvm-pir/tree/feature/fork-join >>>>>> [15] >>>>>> https://github.com/jdoerfert/llvm-pir/commit/854259881d24d71f9f1f1 >>>>>> 7e >>>>>> 52547758c7be0618a >>>>>> [16] https://goo.gl/wKps3c >>>>>> >>>>>> >>>>>> -- >>>>>> >>>>>> Johannes Doerfert >>>>>> Researcher / PhD Student >>>>>> >>>>>> Compiler Design Lab (Prof. Hack) >>>>>> Saarland Informatics Campus, Germany Building E1.3, Room 4.31 >>>>>> >>>>>> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de Fax. +49 >>>>>> (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert >>>>> >>>>>> _______________________________________________ >>>>>> LLVM Developers mailing list >>>>>> llvm-dev at lists.llvm.org >>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>>>> -- >>>>> >>>>> Johannes Doerfert >>>>> Researcher / PhD Student >>>>> >>>>> Compiler Design Lab (Prof. Hack) >>>>> Saarland Informatics Campus, Germany Building E1.3, Room 4.31 >>>>> >>>>> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de Fax. +49 >>>>> (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert >>>> >>>>> _______________________________________________ >>>>> LLVM Developers mailing list >>>>> llvm-dev at lists.llvm.org >>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>> -- >>> Hal Finkel >>> Lead, Compiler Technology and Programming Languages Leadership >>> Computing Facility Argonne National Laboratory >>> >> -- >> >> Johannes Doerfert >> Researcher / PhD Student >> >> Compiler Design Lab (Prof. Hack) >> Saarland Informatics Campus, Germany >> Building E1.3, Room 4.31 >> >> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert-- Hal Finkel Lead, Compiler Technology and Programming Languages Leadership Computing Facility Argonne National Laboratory
On 03/08, Hal Finkel wrote:> > On 03/08/2017 12:44 PM, Johannes Doerfert wrote: > >I don't know who pointed it out first but Mehdi made me aware of it at > >CGO. I try to explain it shortly. > > > >Given the following situation (in pseudo code): > > > > alloc A[100]; > > parallel_for(i = 0; i < 100; i++) > > A[i] = f(i); > > > > acc = 1; > > for(i = 0; i < 100; i++) > > acc = acc * A[i]; > > > >Afaik, with your parallel regions there won't be a CFG loop for the > >parallel initialization, right? Instead some intrinsics that annotate > >the parallel region and then one initialization. I imagine it somewhat > >like this: > > > > token = llvm.parallel.for.start(0, 100, 1) > > i = llvm.parallel.for.iterator(token) > > A[i] = f(i) > > llvm.parallel.for.end(token) > > > >Which would (in serial semantics) allow an optimization to remove the > >second loop as it uses uninitialized (undef) values. In other words, it > >is a problem if we assume only one A[i] was initialized but the rest was > >not. > > > >@Mehdi Does that reflect what you explained to some degree? > > Yes, this is essentially the issue that Sanjoy pointed out on the mailing > list (as well). We definitely need semantics that do not have this problem.Xinmin said there is a loop so why does this problem exists? I did assume there is none (as shown above).> > > >On 03/08, Tian, Xinmin wrote: > >>".... the problem Mehdi pointed out regarding the missed initializations of array elements, did you comment on that one yet?" > >> > >>What is the initializations of array elements question? I don't remember this question. Please refresh my memory. Thanks. > >>I thought Mehdi's question is more about what are attributes needed for these IR-annotation for other LLVM pass to understand and make sense out of them. > >> > >>Xinmin > >> > >>-----Original Message----- > >>From: Johannes Doerfert [mailto:doerfert at cs.uni-saarland.de] > >>Sent: Wednesday, March 8, 2017 9:58 AM > >>To: Tian, Xinmin <xinmin.tian at intel.com> > >>Cc: Hal Finkel <hfinkel at anl.gov>; LLVM-Dev <llvm-dev at lists.llvm.org>; Sanjoy Das <sanjoy at playingwithpointers.com>; Mehdi Amini <mehdi.amini at apple.com>; Chandler Carruth <chandlerc at gmail.com>; Adve, Vikram Sadanand <vadve at illinois.edu>; TB Schardl <neboat at mit.edu>; acjacob at us.ibm.com > >>Subject: Re: > >> > >>On 03/08, Tian, Xinmin wrote: > >>>A quick update, we have been looking through all LLVM passes to > >>>identify the impact of "IR-region annotation", and interaction issues > >>>with the rest of LoopOpt and scalarOpt, e.g. interaction with > >>>vectorization when you have schedule(simd:guided: 64). What are the > >>>common properties for optimizer to know on IR-region annotations. We > >>>have our implementation working from O0, O1, O2 to O3. So far, the > >>>changes we made in the existing LLVM opt passes are < 200 LOC. ClangFE > >>>to executable end-to-end linked our omp library, and we also updated > >>>our implementation using token/tag on the intrinsics based on feedback > >>>from Google (David) and Xilinx (Hongbin). I am still owe Mehdi some > >>>answers. > >>Thanks for the update. I will look into the discussion thread again soon. I am especially interested in the problem Mehdi pointed out regarding the missed initializations of array elements, did you comment on that one yet? > >> > >>>One feedback for PIR RFC, putting "fork" into loop body does not work > >>>for loop vectorizer, it shuts down vectorization, this is the same > >>>issue as MIT's scheme. > >>Totally true. The vectorizer needs to be thought about the fork-join at some point. However, you can for now sequentialize and introduce the llvm.parallel.loop annotations it already knows about. We (Simon Moll and me) are currently discussing how to teach the RV vectorizer to handle fork-join parallel regions in loops but also in otherwise straight line code. Once we make progress we'll let you know! > >> > >>Thanks! > >> Johannes > >> > >>>-----Original Message----- > >>>From: Hal Finkel [mailto:hfinkel at anl.gov] > >>>Sent: Wednesday, March 8, 2017 5:51 AM > >>>To: Johannes Doerfert <doerfert at cs.uni-saarland.de>; LLVM-Dev > >>><llvm-dev at lists.llvm.org> > >>>Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Mehdi Amini > >>><mehdi.amini at apple.com>; Chandler Carruth <chandlerc at gmail.com>; Adve, > >>>Vikram Sadanand <vadve at illinois.edu>; Tian, Xinmin > >>><xinmin.tian at intel.com>; TB Schardl <neboat at mit.edu>; > >>>acjacob at us.ibm.com > >>>Subject: Re: > >>> > >>> > >>>On 03/08/2017 07:36 AM, Johannes Doerfert wrote: > >>>><mehdi.amini at apple.com>, > >>>>Bcc: > >>>>Subject: Re: [llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 -- IR > >>>>extension > >>>>Reply-To: > >>>>In-Reply-To: <20170224221713.GA931 at arch-linux-jd.home> > >>>> > >>>>Ping. > >>>> > >>>> > >>>> > >>>> > >>>>PS. > >>>> > >>>> Are there actually people interested in this? > >>>> > >>>> We will continue working anyway but it might not make sense to put it > >>>> on reviews and announce it on the ML if nobody cares. > >>>I certainly care, and will also respond to the RFC, this week or next. > >>> > >>> -Hal > >>> > >>>> > >>>> > >>>>On 02/24, Johannes Doerfert via llvm-dev wrote: > >>>>>To (re)start a discussion here we published two patches to the > >>>>>phabricator: > >>>>> > >>>>>1) A documentation draft (including LangRef) for the PIR instructions > >>>>> and the parallel regions [0]. > >>>>> > >>>>>2) A draft implementation for a ParallelRegionInfo pass that identifies > >>>>> and maintains parallel regions in the CFG [1]. It also serves as a > >>>>> verifier pass for now. > >>>>> > >>>>>Please take a look and comment the patches either in the > >>>>>phabricator or here on the list. > >>>>> > >>>>>Thanks! > >>>>> > >>>>>[0] https://reviews.llvm.org/D30353 [1] > >>>>>https://reviews.llvm.org/D30354 > >>>>> > >>>>> > >>>>>On 01/28, Johannes Doerfert via llvm-dev wrote: > >>>>>>Dear all, > >>>>>> > >>>>>>This RFC proposes three new LLVM IR instructions to express > >>>>>>high-level parallel constructs in a simple, low-level fashion. For > >>>>>>this first stage we prepared two commits that add the proposed > >>>>>>instructions and a pass to lower them to obtain sequential IR. > >>>>>>Both patches have be uploaded for review [1, 2]. The latter patch > >>>>>>is very simple and the former consists of almost only mechanical changes needed to add new instructions. > >>>>>> > >>>>>>The rest of this email contains (1) an introduction of the IR > >>>>>>extension > >>>>>>(2) the reasoning behind this approach, (3) a comparison to other > >>>>>>ideas proposed so far, (4) a validation of the feasibility and > >>>>>>potential impact, and (5) an outlook on the next steps. > >>>>>> > >>>>>>(1) IR extension: > >>>>>>Parallel IR adds three new terminator instructions that define the > >>>>>>beginning and the end of parallel regions in the CFG. A parallel > >>>>>>region is a connected subgraph of the CFG that is potentially > >>>>>>executed by two threads in parallel. It can only be entered with a > >>>>>>fork instruction and spreads till a join instruction is reached. > >>>>>>Therefor parallel regions are single-entry-multiple-exit regions. > >>>>>>Parallel regions can be nested and if they are, they form a > >>>>>>parallel region tree similar to the loop tree maintained by the > >>>>>>natural loop info pass. Each parallel region defines two independent “sibling” > >>>>>>tasks, namely the forked and continuation task. > >>>>>> > >>>>>>The new instructions are defined as follows: > >>>>>> > >>>>>>1. fork: marks the beginning of parallel region. Every fork has two > >>>>>> successor blocks which represent two parallel tasks. We call > >>>>>> these two “sibling” tasks the forked and continuation tasks. > >>>>>> Nested forking is supported, meaning that another fork can be > >>>>>> reach prior to the join. > >>>>>> > >>>>>>2. halt: marks the end of a forked task. The "sibling" continuation block > >>>>>> (see fork above) is the operand of the halt terminator. This > >>>>>> represents the idea of asymmetric parallelism as introduced by > >>>>>> [1]. One advantage of asymmetric parallelism is that sequential > >>>>>> semantics of the program are clear from its CFG (ref. [1]). > >>>>>> Note that the edge from a forked block to a continuation block > >>>>>> (the one introduced by the halt) represents the control flow > >>>>>> when the two successors of a fork execute sequentially, not > >>>>>> when they execute in parallel. In the latter case there is no > >>>>>> “control transfer” happening via this edge but only > >>>>>> synchronization between the tasks. > >>>>>> > >>>>>>3. join: marks a synchronization point and the end of a parallel region. > >>>>>> Once a join terminator is reached by a thread, execution stops > >>>>>> in that thread until all tasks spawned by that thread finish > >>>>>> their work, thus reach their respective halt instruction. A > >>>>>> join shall only be reached by the continuation task of a fork, > >>>>>> the forked task shall reach a halt with the continuation as a > >>>>>> successor. > >>>>>> > >>>>>> > >>>>>>Here is an example of a parallel OpenMP loop and its idiomatic > >>>>>>lowering to Parallel IR. We set up a wiki [0] with additional examples. > >>>>>> > >>>>>>#pragma omp parallel > >>>>>>for(int i = 0; i < n; ++i) { > >>>>>> A[i] = C[i]; > >>>>>>} > >>>>>> > >>>>>> > >>>>>>preheader: > >>>>>> br label %header > >>>>>> > >>>>>>header: > >>>>>> %i = phi [ i32 0, %preheader ], [ %inc, %latch ] > >>>>>> %done = icmp ge %i, %n > >>>>>> br i1 %done, label %exit, label %body > >>>>>> > >>>>>>body: > >>>>>> fork label %task, label %latch > >>>>>> > >>>>>>task: > >>>>>> %aptr = getelementptr i32, i32* %A, i32 0, i32 %i > >>>>>> %aval = load i32* %aptr > >>>>>> %cptr = getelementptr i32, i32* %C, i32 0, i32 %i > >>>>>> store i32 %aval, i32* %aptr > >>>>>> halt label %latch > >>>>>> > >>>>>>latch: > >>>>>> %inc = add i32, i32 %i, i32 1 > >>>>>> br label %header > >>>>>> > >>>>>>exit: > >>>>>> join label %afterloop > >>>>>> > >>>>>>afterloop: > >>>>>>... > >>>>>> > >>>>>> > >>>>>> > >>>>>>(2) Reasoning: > >>>>>>The proposed approach is crafted such that the semantics of the > >>>>>>parallel program is represented correctly in almost native, > >>>>>>low-level IR right after front-end and preserved at any point till > >>>>>>the final lowering to sequential IR or parallel runtime library > >>>>>>calls. To this end, asymmetric parallelism is employed, a concept > >>>>>>that uses control flow and the common concept of dominance to > >>>>>>represent parts of the parallel semantics. In this model the > >>>>>>parallel tasks do not dominate each other and only one parallel > >>>>>>task dominates the code after the parallel region. As a > >>>>>>consequence, various transformations that would break assumptions > >>>>>>we make about parallel regions cannot happen (see [3,4]). While > >>>>>>the explicitly modeled control flow together with dominance > >>>>>>prevents various code motion problems, the use of terminators > >>>>>>helps to minimize the changes needed to educate passes about parallel regions. Only a fraction of analysis and transformation passes deal with terminators explicitly. > >>>>>>Most passes either test for known terminators (like branches), > >>>>>>rely on dominance information, or work on a basic block level. To > >>>>>>even further reduce changes to the existing passes, high-level > >>>>>>concepts are broken down to already available low-level concepts > >>>>>>instead of introducing new, semantically rich > >>>>>>instructions/intrinsics (see the last paragraph of [5] and section > >>>>>>4 in the PIR white paper [6] for examples). Finally, this scheme > >>>>>>allows a pass to simply reason about the sequential semantics of a > >>>>>>parallel region, transform it back to one if needed or deemed > >>>>>>beneficial and employ existing tooling solutions to debug and analyze the code [7]. > >>>>>> > >>>>>>(3) Comparison: > >>>>>>The BoF discussion sheet [8] and the recent “[RFC] on IR-level > >>>>>>region annotations” [9] both list pros and cons of different > >>>>>>proposed schemes and implementations. We summarize and comment the > >>>>>>discussion on the ones listen in the recent RFC here: > >>>>>> (a) Metadata: It seems a consensus has been reached that metadata is > >>>>>> not the solution but only a means to enhance a different solution. > >>>>>> (c) One Intrinsic per directive/clause: This approach basically embeds > >>>>>> a high-level (parallel) language in LLVM IR using intrinsics. It > >>>>>> seems there is little to no support for this approach at the > >>>>>> moment. > >>>>>> (d) Parallel loop/region annotations: Here, intrinsics enclosing a > >>>>>> parallel loop/region are used to represent parallelism. > >>>>>> High-level knowledge is represented as attached metadata or in > >>>>>> separate intrinsics. For more details please see the original > >>>>>> RFC [9]. In the discussion several potential drawbacks have been > >>>>>> mentioned: > >>>>>> - The annotations might be too general [10]. > >>>>>> - The IR is not semantically correct (or ready for optimization) > >>>>>> after the front-end and needs an additional “prepare phase for > >>>>>> pre-privatization" [11]. > >>>>>> - The currently available “potential side effect for intrinsic > >>>>>> calls” seem not to suffice for the proposed intrinsics as they > >>>>>> do not have "call semantics" [12]. > >>>>>> (b) Parallel instructions (this approach): The table in the region RFC > >>>>>> [9] lists two drawbacks with this approach, both of which have > >>>>>> already been called into question [5]. The first drawback is the > >>>>>> effort needed to implement this scheme which is discussed in > >>>>>> more detail in section (4) of this mail. The second drawback is > >>>>>> the need for additional representation of high-level information > >>>>>> that is not part of the semantics of the new fork-join > >>>>>> instructions. As mentioned above, the choice to keep the new > >>>>>> instructions as simple as possible is deliberate. This parallel > >>>>>> IR is intended to be extensible, and in particular, compatible > >>>>>> with representations of high-level parallel concepts that might > >>>>>> be developed in the future. For the time being, the parallel IR > >>>>>> is compatible with approach taken today of lowering high-level > >>>>>> parallel linguistics, such as reductions and private memory, to > >>>>>> existing IR constructs, such as parallel-runtime calls, > >>>>>> atomicrmw instructions, and well placed alloca’s [5,6]. Although > >>>>>> other extensions to the IR might allow LLVM to compile these > >>>>>> higher-level constructs more effectively, we see no reason the > >>>>>> parallel IR would conflict with any such extensions. (On the > >>>>>> contrary, the parallel IR would seem to help compiler analyses > >>>>>> of higher-level parallel constructs by exposing logical > >>>>>> parallelism.) > >>>>>> > >>>>>> > >>>>>>(4) Feasibility and Impact: > >>>>>>The Tapir and PIR prototypes demonstrate the feasibility of this > >>>>>>approach. The Tapir prototype [13] has recently proven its > >>>>>>robustness as the standard compiler in the MIT class on parallel > >>>>>>programming. It was implemented in ~ 5k LOC. However, >1k are > >>>>>>explicit parallel optimization, 1k is used to add new instructions > >>>>>>(thus mechanical) and 2k are used to lower the parallelism (basically needed for any scheme). > >>>>>>Only the rest is required to make it work with existing analysis > >>>>>>and transformation passes. While Tapir added explicit optimization > >>>>>>passes for parallel regions/loops, the representation allows for a > >>>>>>variety of classic optimizations (CSE, GVN, LICM, loop unrolling, > >>>>>>TRE) to work with little to no modifications. Potential speedups > >>>>>>compared to a classic “early-outlining” approaches can also be seen in the Tapir paper [13]. > >>>>>>For the PIR prototype [14] we modified only three transformation > >>>>>>passes > >>>>>>(<20 LOC) [15] before we could run the O3 pipeline successfully on > >>>>>>a parallel matrix multiplication. > >>>>>> > >>>>>>Together, these prototypes show how little passes actually inspect > >>>>>>new (or “unknown”) terminators. The default assumption passes have > >>>>>>to make, namely that control might be transferred to any > >>>>>>successors at runtime, has, in terms of potential compiler > >>>>>>transformations, a similar effect as the parallel semantics we > >>>>>>want to model, namely that control is transferred to all successors. > >>>>>> > >>>>>> > >>>>>>(5) Outlook: > >>>>>>This first stage will only introduce and test the new instructions > >>>>>>and the sequentialization pass. Afterwards we intend to start > >>>>>>additions in different, partially overlapping but often orthogonal > >>>>>>directions. We do welcome comments as well as developers for each of them: > >>>>>> - Analysis and optimization: > >>>>>> * A “parallel region info” pass to keep track of parallel regions > >>>>>> and their nesting. The information can be made accessible in a > >>>>>> “parallel region tree” similar to the loop tree maintained by the > >>>>>> loop info pass. [stage 1, immediate next goal] > >>>>>> * Extension of the verifier that allow to check parallel IR for > >>>>>> “well-formedness”. [stage 1, immediate next goal] > >>>>>> * Documentation of the PIR instructions in the language reference. > >>>>>> [stage 1, immediate next goal] > >>>>>> * A cost analysis for parallel tasks that can be queried by > >>>>>> optimizations. The cost model needs to take the hardware, the > >>>>>> runtime library and the parallel tasks into account. > >>>>>> * Vectorizer enhancements to enable the vectorization of parallel > >>>>>> * loops and tasks. > >>>>>> * Parallelization centric optimizations: > >>>>>> a) Parallel tasks can be balanced, merged or split as well as created > >>>>>> from and lowered to sequential code. > >>>>>> b) Barriers can be eliminated. > >>>>>> c) Parallel loops can be statically scheduled or created from > >>>>>> parallel recursive calls [13] > >>>>>> * Analysis to extract high-level information (reductions, private > >>>>>> memory, ...) from the low-level representation. > >>>>>> - Front-end: > >>>>>> * Lowering of simple OpenMP and Cilk++ annotations to PIR, including > >>>>>> parallel sections and parallel loops with limited support for > >>>>>> clauses (at first) (examples can be found here [1]). [milestone 1] > >>>>>> * Generation of PIR code through automatic parallelization. A > >>>>>> patched version of Polly exists that emits parallel loops using PIR instead of > >>>>>> OpenMP runtime calls or llvm.parallel.loop metadata. [milestone 1] > >>>>>> * Representation of more evolved high-level features like assignment > >>>>>> of computation units. > >>>>>> - Back-end: > >>>>>> * Lowering of PIR regions to calls to the OpenMP (GOMP) and Cilk++ > >>>>>> runtime library. [milestone 1] > >>>>>> * A simple parallel library, e.g., based on pthreads, to be shipped > >>>>>> with LLVM as a fallback implementation for parallel regions. > >>>>>> > >>>>>> > >>>>>>Thank you all for your time and hopefully constructive input on this proposal! > >>>>>> > >>>>>>Cheers, > >>>>>> Johannes, on behalf of the PIR team > >>>>>> > >>>>>> > >>>>>>Disclaimer: > >>>>>>This RFC, the patches, the wiki, etc. are a joint effort by Tao B. > >>>>>>Schardl (MIT), Charles E. Leiserson (MIT), Kareem Ergawy (Saarland > >>>>>>University), Simon Moll (Saarland University) and myself. However, > >>>>>>ideas and feedback came from many people, including the members of > >>>>>>the LLVM-HPC IR Extensions working group (Hal Finkel, Xinmin Tian, > >>>>>>...), the participants in the BoF at the US Developers’ meeting, > >>>>>>everybody that commented on the BoF discussion sheet [16] and the > >>>>>>recent RFC on IR-level region annotations [9] (Mehdi Amini, Sanjoy > >>>>>>Das, Daniel Berlin, ...). > >>>>>> > >>>>>> > >>>>>>[0] https://github.com/Parallel-IR/llvm-pir/wiki > >>>>>>[1] https://reviews.llvm.org/D29250 [2] > >>>>>>https://reviews.llvm.org/D29251 [3] > >>>>>>http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.html > >>>>>>[4] > >>>>>>http://lists.llvm.org/pipermail/llvm-dev/2015-March/083348.html > >>>>>>[5] > >>>>>>http://lists.llvm.org/pipermail/llvm-dev/2017-January/109264.html > >>>>>>[6] > >>>>>>http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pd > >>>>>>f [7] http://supertech.csail.mit.edu/papers/spbags.pdf & > >>>>>>www.cse.wustl.edu/~angelee/papers/cilkprof.pdf > >>>>>>[8] https://goo.gl/Blp2Xr > >>>>>>[9] > >>>>>>http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.html > >>>>>>[10] > >>>>>>http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html > >>>>>>[11] > >>>>>>http://lists.llvm.org/pipermail/llvm-dev/2017-January/109377.html > >>>>>>[12] > >>>>>>http://lists.llvm.org/pipermail/llvm-dev/2017-January/109351.html > >>>>>>[13] http://wsmoses.com/tapir.pdf > >>>>>>[14] https://github.com/jdoerfert/llvm-pir/tree/feature/fork-join > >>>>>>[15] > >>>>>>https://github.com/jdoerfert/llvm-pir/commit/854259881d24d71f9f1f1 > >>>>>>7e > >>>>>>52547758c7be0618a > >>>>>>[16] https://goo.gl/wKps3c > >>>>>> > >>>>>> > >>>>>>-- > >>>>>> > >>>>>>Johannes Doerfert > >>>>>>Researcher / PhD Student > >>>>>> > >>>>>>Compiler Design Lab (Prof. Hack) > >>>>>>Saarland Informatics Campus, Germany Building E1.3, Room 4.31 > >>>>>> > >>>>>>Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de Fax. +49 > >>>>>>(0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert > >>>>> > >>>>>>_______________________________________________ > >>>>>>LLVM Developers mailing list > >>>>>>llvm-dev at lists.llvm.org > >>>>>>http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > >>>>>-- > >>>>> > >>>>>Johannes Doerfert > >>>>>Researcher / PhD Student > >>>>> > >>>>>Compiler Design Lab (Prof. Hack) > >>>>>Saarland Informatics Campus, Germany Building E1.3, Room 4.31 > >>>>> > >>>>>Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de Fax. +49 > >>>>>(0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert > >>>> > >>>>>_______________________________________________ > >>>>>LLVM Developers mailing list > >>>>>llvm-dev at lists.llvm.org > >>>>>http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > >>>-- > >>>Hal Finkel > >>>Lead, Compiler Technology and Programming Languages Leadership > >>>Computing Facility Argonne National Laboratory > >>> > >>-- > >> > >>Johannes Doerfert > >>Researcher / PhD Student > >> > >>Compiler Design Lab (Prof. Hack) > >>Saarland Informatics Campus, Germany > >>Building E1.3, Room 4.31 > >> > >>Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert > > -- > Hal Finkel > Lead, Compiler Technology and Programming Languages > Leadership Computing Facility > Argonne National Laboratory >-- Johannes Doerfert Researcher / PhD Student Compiler Design Lab (Prof. Hack) Saarland Informatics Campus, Germany Building E1.3, Room 4.31 Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 228 bytes Desc: Digital signature URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170308/cee7be80/attachment-0001.sig>
On 03/08/2017 04:50 PM, Johannes Doerfert wrote:> On 03/08, Hal Finkel wrote: >> On 03/08/2017 12:44 PM, Johannes Doerfert wrote: >>> I don't know who pointed it out first but Mehdi made me aware of it at >>> CGO. I try to explain it shortly. >>> >>> Given the following situation (in pseudo code): >>> >>> alloc A[100]; >>> parallel_for(i = 0; i < 100; i++) >>> A[i] = f(i); >>> >>> acc = 1; >>> for(i = 0; i < 100; i++) >>> acc = acc * A[i]; >>> >>> Afaik, with your parallel regions there won't be a CFG loop for the >>> parallel initialization, right? Instead some intrinsics that annotate >>> the parallel region and then one initialization. I imagine it somewhat >>> like this: >>> >>> token = llvm.parallel.for.start(0, 100, 1) >>> i = llvm.parallel.for.iterator(token) >>> A[i] = f(i) >>> llvm.parallel.for.end(token) >>> >>> Which would (in serial semantics) allow an optimization to remove the >>> second loop as it uses uninitialized (undef) values. In other words, it >>> is a problem if we assume only one A[i] was initialized but the rest was >>> not. >>> >>> @Mehdi Does that reflect what you explained to some degree? >> Yes, this is essentially the issue that Sanjoy pointed out on the mailing >> list (as well). We definitely need semantics that do not have this problem. > Xinmin said there is a loop so why does this problem exists? I did > assume there is none (as shown above).Regardless of whether there is a loop for a loop or not, there is no loop for a pure parallel region and the problem still occurs there. -Hal> >>> On 03/08, Tian, Xinmin wrote: >>>> ".... the problem Mehdi pointed out regarding the missed initializations of array elements, did you comment on that one yet?" >>>> >>>> What is the initializations of array elements question? I don't remember this question. Please refresh my memory. Thanks. >>>> I thought Mehdi's question is more about what are attributes needed for these IR-annotation for other LLVM pass to understand and make sense out of them. >>>> >>>> Xinmin >>>> >>>> -----Original Message----- >>>> From: Johannes Doerfert [mailto:doerfert at cs.uni-saarland.de] >>>> Sent: Wednesday, March 8, 2017 9:58 AM >>>> To: Tian, Xinmin <xinmin.tian at intel.com> >>>> Cc: Hal Finkel <hfinkel at anl.gov>; LLVM-Dev <llvm-dev at lists.llvm.org>; Sanjoy Das <sanjoy at playingwithpointers.com>; Mehdi Amini <mehdi.amini at apple.com>; Chandler Carruth <chandlerc at gmail.com>; Adve, Vikram Sadanand <vadve at illinois.edu>; TB Schardl <neboat at mit.edu>; acjacob at us.ibm.com >>>> Subject: Re: >>>> >>>> On 03/08, Tian, Xinmin wrote: >>>>> A quick update, we have been looking through all LLVM passes to >>>>> identify the impact of "IR-region annotation", and interaction issues >>>>> with the rest of LoopOpt and scalarOpt, e.g. interaction with >>>>> vectorization when you have schedule(simd:guided: 64). What are the >>>>> common properties for optimizer to know on IR-region annotations. We >>>>> have our implementation working from O0, O1, O2 to O3. So far, the >>>>> changes we made in the existing LLVM opt passes are < 200 LOC. ClangFE >>>>> to executable end-to-end linked our omp library, and we also updated >>>>> our implementation using token/tag on the intrinsics based on feedback >>>> >from Google (David) and Xilinx (Hongbin). I am still owe Mehdi some >>>>> answers. >>>> Thanks for the update. I will look into the discussion thread again soon. I am especially interested in the problem Mehdi pointed out regarding the missed initializations of array elements, did you comment on that one yet? >>>> >>>>> One feedback for PIR RFC, putting "fork" into loop body does not work >>>>> for loop vectorizer, it shuts down vectorization, this is the same >>>>> issue as MIT's scheme. >>>> Totally true. The vectorizer needs to be thought about the fork-join at some point. However, you can for now sequentialize and introduce the llvm.parallel.loop annotations it already knows about. We (Simon Moll and me) are currently discussing how to teach the RV vectorizer to handle fork-join parallel regions in loops but also in otherwise straight line code. Once we make progress we'll let you know! >>>> >>>> Thanks! >>>> Johannes >>>> >>>>> -----Original Message----- >>>>> From: Hal Finkel [mailto:hfinkel at anl.gov] >>>>> Sent: Wednesday, March 8, 2017 5:51 AM >>>>> To: Johannes Doerfert <doerfert at cs.uni-saarland.de>; LLVM-Dev >>>>> <llvm-dev at lists.llvm.org> >>>>> Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Mehdi Amini >>>>> <mehdi.amini at apple.com>; Chandler Carruth <chandlerc at gmail.com>; Adve, >>>>> Vikram Sadanand <vadve at illinois.edu>; Tian, Xinmin >>>>> <xinmin.tian at intel.com>; TB Schardl <neboat at mit.edu>; >>>>> acjacob at us.ibm.com >>>>> Subject: Re: >>>>> >>>>> >>>>> On 03/08/2017 07:36 AM, Johannes Doerfert wrote: >>>>>> <mehdi.amini at apple.com>, >>>>>> Bcc: >>>>>> Subject: Re: [llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 -- IR >>>>>> extension >>>>>> Reply-To: >>>>>> In-Reply-To: <20170224221713.GA931 at arch-linux-jd.home> >>>>>> >>>>>> Ping. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> PS. >>>>>> >>>>>> Are there actually people interested in this? >>>>>> >>>>>> We will continue working anyway but it might not make sense to put it >>>>>> on reviews and announce it on the ML if nobody cares. >>>>> I certainly care, and will also respond to the RFC, this week or next. >>>>> >>>>> -Hal >>>>> >>>>>> >>>>>> On 02/24, Johannes Doerfert via llvm-dev wrote: >>>>>>> To (re)start a discussion here we published two patches to the >>>>>>> phabricator: >>>>>>> >>>>>>> 1) A documentation draft (including LangRef) for the PIR instructions >>>>>>> and the parallel regions [0]. >>>>>>> >>>>>>> 2) A draft implementation for a ParallelRegionInfo pass that identifies >>>>>>> and maintains parallel regions in the CFG [1]. It also serves as a >>>>>>> verifier pass for now. >>>>>>> >>>>>>> Please take a look and comment the patches either in the >>>>>>> phabricator or here on the list. >>>>>>> >>>>>>> Thanks! >>>>>>> >>>>>>> [0] https://reviews.llvm.org/D30353 [1] >>>>>>> https://reviews.llvm.org/D30354 >>>>>>> >>>>>>> >>>>>>> On 01/28, Johannes Doerfert via llvm-dev wrote: >>>>>>>> Dear all, >>>>>>>> >>>>>>>> This RFC proposes three new LLVM IR instructions to express >>>>>>>> high-level parallel constructs in a simple, low-level fashion. For >>>>>>>> this first stage we prepared two commits that add the proposed >>>>>>>> instructions and a pass to lower them to obtain sequential IR. >>>>>>>> Both patches have be uploaded for review [1, 2]. The latter patch >>>>>>>> is very simple and the former consists of almost only mechanical changes needed to add new instructions. >>>>>>>> >>>>>>>> The rest of this email contains (1) an introduction of the IR >>>>>>>> extension >>>>>>>> (2) the reasoning behind this approach, (3) a comparison to other >>>>>>>> ideas proposed so far, (4) a validation of the feasibility and >>>>>>>> potential impact, and (5) an outlook on the next steps. >>>>>>>> >>>>>>>> (1) IR extension: >>>>>>>> Parallel IR adds three new terminator instructions that define the >>>>>>>> beginning and the end of parallel regions in the CFG. A parallel >>>>>>>> region is a connected subgraph of the CFG that is potentially >>>>>>>> executed by two threads in parallel. It can only be entered with a >>>>>>>> fork instruction and spreads till a join instruction is reached. >>>>>>>> Therefor parallel regions are single-entry-multiple-exit regions. >>>>>>>> Parallel regions can be nested and if they are, they form a >>>>>>>> parallel region tree similar to the loop tree maintained by the >>>>>>>> natural loop info pass. Each parallel region defines two independent “sibling” >>>>>>>> tasks, namely the forked and continuation task. >>>>>>>> >>>>>>>> The new instructions are defined as follows: >>>>>>>> >>>>>>>> 1. fork: marks the beginning of parallel region. Every fork has two >>>>>>>> successor blocks which represent two parallel tasks. We call >>>>>>>> these two “sibling” tasks the forked and continuation tasks. >>>>>>>> Nested forking is supported, meaning that another fork can be >>>>>>>> reach prior to the join. >>>>>>>> >>>>>>>> 2. halt: marks the end of a forked task. The "sibling" continuation block >>>>>>>> (see fork above) is the operand of the halt terminator. This >>>>>>>> represents the idea of asymmetric parallelism as introduced by >>>>>>>> [1]. One advantage of asymmetric parallelism is that sequential >>>>>>>> semantics of the program are clear from its CFG (ref. [1]). >>>>>>>> Note that the edge from a forked block to a continuation block >>>>>>>> (the one introduced by the halt) represents the control flow >>>>>>>> when the two successors of a fork execute sequentially, not >>>>>>>> when they execute in parallel. In the latter case there is no >>>>>>>> “control transfer” happening via this edge but only >>>>>>>> synchronization between the tasks. >>>>>>>> >>>>>>>> 3. join: marks a synchronization point and the end of a parallel region. >>>>>>>> Once a join terminator is reached by a thread, execution stops >>>>>>>> in that thread until all tasks spawned by that thread finish >>>>>>>> their work, thus reach their respective halt instruction. A >>>>>>>> join shall only be reached by the continuation task of a fork, >>>>>>>> the forked task shall reach a halt with the continuation as a >>>>>>>> successor. >>>>>>>> >>>>>>>> >>>>>>>> Here is an example of a parallel OpenMP loop and its idiomatic >>>>>>>> lowering to Parallel IR. We set up a wiki [0] with additional examples. >>>>>>>> >>>>>>>> #pragma omp parallel >>>>>>>> for(int i = 0; i < n; ++i) { >>>>>>>> A[i] = C[i]; >>>>>>>> } >>>>>>>> >>>>>>>> >>>>>>>> preheader: >>>>>>>> br label %header >>>>>>>> >>>>>>>> header: >>>>>>>> %i = phi [ i32 0, %preheader ], [ %inc, %latch ] >>>>>>>> %done = icmp ge %i, %n >>>>>>>> br i1 %done, label %exit, label %body >>>>>>>> >>>>>>>> body: >>>>>>>> fork label %task, label %latch >>>>>>>> >>>>>>>> task: >>>>>>>> %aptr = getelementptr i32, i32* %A, i32 0, i32 %i >>>>>>>> %aval = load i32* %aptr >>>>>>>> %cptr = getelementptr i32, i32* %C, i32 0, i32 %i >>>>>>>> store i32 %aval, i32* %aptr >>>>>>>> halt label %latch >>>>>>>> >>>>>>>> latch: >>>>>>>> %inc = add i32, i32 %i, i32 1 >>>>>>>> br label %header >>>>>>>> >>>>>>>> exit: >>>>>>>> join label %afterloop >>>>>>>> >>>>>>>> afterloop: >>>>>>>> ... >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> (2) Reasoning: >>>>>>>> The proposed approach is crafted such that the semantics of the >>>>>>>> parallel program is represented correctly in almost native, >>>>>>>> low-level IR right after front-end and preserved at any point till >>>>>>>> the final lowering to sequential IR or parallel runtime library >>>>>>>> calls. To this end, asymmetric parallelism is employed, a concept >>>>>>>> that uses control flow and the common concept of dominance to >>>>>>>> represent parts of the parallel semantics. In this model the >>>>>>>> parallel tasks do not dominate each other and only one parallel >>>>>>>> task dominates the code after the parallel region. As a >>>>>>>> consequence, various transformations that would break assumptions >>>>>>>> we make about parallel regions cannot happen (see [3,4]). While >>>>>>>> the explicitly modeled control flow together with dominance >>>>>>>> prevents various code motion problems, the use of terminators >>>>>>>> helps to minimize the changes needed to educate passes about parallel regions. Only a fraction of analysis and transformation passes deal with terminators explicitly. >>>>>>>> Most passes either test for known terminators (like branches), >>>>>>>> rely on dominance information, or work on a basic block level. To >>>>>>>> even further reduce changes to the existing passes, high-level >>>>>>>> concepts are broken down to already available low-level concepts >>>>>>>> instead of introducing new, semantically rich >>>>>>>> instructions/intrinsics (see the last paragraph of [5] and section >>>>>>>> 4 in the PIR white paper [6] for examples). Finally, this scheme >>>>>>>> allows a pass to simply reason about the sequential semantics of a >>>>>>>> parallel region, transform it back to one if needed or deemed >>>>>>>> beneficial and employ existing tooling solutions to debug and analyze the code [7]. >>>>>>>> >>>>>>>> (3) Comparison: >>>>>>>> The BoF discussion sheet [8] and the recent “[RFC] on IR-level >>>>>>>> region annotations” [9] both list pros and cons of different >>>>>>>> proposed schemes and implementations. We summarize and comment the >>>>>>>> discussion on the ones listen in the recent RFC here: >>>>>>>> (a) Metadata: It seems a consensus has been reached that metadata is >>>>>>>> not the solution but only a means to enhance a different solution. >>>>>>>> (c) One Intrinsic per directive/clause: This approach basically embeds >>>>>>>> a high-level (parallel) language in LLVM IR using intrinsics. It >>>>>>>> seems there is little to no support for this approach at the >>>>>>>> moment. >>>>>>>> (d) Parallel loop/region annotations: Here, intrinsics enclosing a >>>>>>>> parallel loop/region are used to represent parallelism. >>>>>>>> High-level knowledge is represented as attached metadata or in >>>>>>>> separate intrinsics. For more details please see the original >>>>>>>> RFC [9]. In the discussion several potential drawbacks have been >>>>>>>> mentioned: >>>>>>>> - The annotations might be too general [10]. >>>>>>>> - The IR is not semantically correct (or ready for optimization) >>>>>>>> after the front-end and needs an additional “prepare phase for >>>>>>>> pre-privatization" [11]. >>>>>>>> - The currently available “potential side effect for intrinsic >>>>>>>> calls” seem not to suffice for the proposed intrinsics as they >>>>>>>> do not have "call semantics" [12]. >>>>>>>> (b) Parallel instructions (this approach): The table in the region RFC >>>>>>>> [9] lists two drawbacks with this approach, both of which have >>>>>>>> already been called into question [5]. The first drawback is the >>>>>>>> effort needed to implement this scheme which is discussed in >>>>>>>> more detail in section (4) of this mail. The second drawback is >>>>>>>> the need for additional representation of high-level information >>>>>>>> that is not part of the semantics of the new fork-join >>>>>>>> instructions. As mentioned above, the choice to keep the new >>>>>>>> instructions as simple as possible is deliberate. This parallel >>>>>>>> IR is intended to be extensible, and in particular, compatible >>>>>>>> with representations of high-level parallel concepts that might >>>>>>>> be developed in the future. For the time being, the parallel IR >>>>>>>> is compatible with approach taken today of lowering high-level >>>>>>>> parallel linguistics, such as reductions and private memory, to >>>>>>>> existing IR constructs, such as parallel-runtime calls, >>>>>>>> atomicrmw instructions, and well placed alloca’s [5,6]. Although >>>>>>>> other extensions to the IR might allow LLVM to compile these >>>>>>>> higher-level constructs more effectively, we see no reason the >>>>>>>> parallel IR would conflict with any such extensions. (On the >>>>>>>> contrary, the parallel IR would seem to help compiler analyses >>>>>>>> of higher-level parallel constructs by exposing logical >>>>>>>> parallelism.) >>>>>>>> >>>>>>>> >>>>>>>> (4) Feasibility and Impact: >>>>>>>> The Tapir and PIR prototypes demonstrate the feasibility of this >>>>>>>> approach. The Tapir prototype [13] has recently proven its >>>>>>>> robustness as the standard compiler in the MIT class on parallel >>>>>>>> programming. It was implemented in ~ 5k LOC. However, >1k are >>>>>>>> explicit parallel optimization, 1k is used to add new instructions >>>>>>>> (thus mechanical) and 2k are used to lower the parallelism (basically needed for any scheme). >>>>>>>> Only the rest is required to make it work with existing analysis >>>>>>>> and transformation passes. While Tapir added explicit optimization >>>>>>>> passes for parallel regions/loops, the representation allows for a >>>>>>>> variety of classic optimizations (CSE, GVN, LICM, loop unrolling, >>>>>>>> TRE) to work with little to no modifications. Potential speedups >>>>>>>> compared to a classic “early-outlining” approaches can also be seen in the Tapir paper [13]. >>>>>>>> For the PIR prototype [14] we modified only three transformation >>>>>>>> passes >>>>>>>> (<20 LOC) [15] before we could run the O3 pipeline successfully on >>>>>>>> a parallel matrix multiplication. >>>>>>>> >>>>>>>> Together, these prototypes show how little passes actually inspect >>>>>>>> new (or “unknown”) terminators. The default assumption passes have >>>>>>>> to make, namely that control might be transferred to any >>>>>>>> successors at runtime, has, in terms of potential compiler >>>>>>>> transformations, a similar effect as the parallel semantics we >>>>>>>> want to model, namely that control is transferred to all successors. >>>>>>>> >>>>>>>> >>>>>>>> (5) Outlook: >>>>>>>> This first stage will only introduce and test the new instructions >>>>>>>> and the sequentialization pass. Afterwards we intend to start >>>>>>>> additions in different, partially overlapping but often orthogonal >>>>>>>> directions. We do welcome comments as well as developers for each of them: >>>>>>>> - Analysis and optimization: >>>>>>>> * A “parallel region info” pass to keep track of parallel regions >>>>>>>> and their nesting. The information can be made accessible in a >>>>>>>> “parallel region tree” similar to the loop tree maintained by the >>>>>>>> loop info pass. [stage 1, immediate next goal] >>>>>>>> * Extension of the verifier that allow to check parallel IR for >>>>>>>> “well-formedness”. [stage 1, immediate next goal] >>>>>>>> * Documentation of the PIR instructions in the language reference. >>>>>>>> [stage 1, immediate next goal] >>>>>>>> * A cost analysis for parallel tasks that can be queried by >>>>>>>> optimizations. The cost model needs to take the hardware, the >>>>>>>> runtime library and the parallel tasks into account. >>>>>>>> * Vectorizer enhancements to enable the vectorization of parallel >>>>>>>> * loops and tasks. >>>>>>>> * Parallelization centric optimizations: >>>>>>>> a) Parallel tasks can be balanced, merged or split as well as created >>>>>>>> from and lowered to sequential code. >>>>>>>> b) Barriers can be eliminated. >>>>>>>> c) Parallel loops can be statically scheduled or created from >>>>>>>> parallel recursive calls [13] >>>>>>>> * Analysis to extract high-level information (reductions, private >>>>>>>> memory, ...) from the low-level representation. >>>>>>>> - Front-end: >>>>>>>> * Lowering of simple OpenMP and Cilk++ annotations to PIR, including >>>>>>>> parallel sections and parallel loops with limited support for >>>>>>>> clauses (at first) (examples can be found here [1]). [milestone 1] >>>>>>>> * Generation of PIR code through automatic parallelization. A >>>>>>>> patched version of Polly exists that emits parallel loops using PIR instead of >>>>>>>> OpenMP runtime calls or llvm.parallel.loop metadata. [milestone 1] >>>>>>>> * Representation of more evolved high-level features like assignment >>>>>>>> of computation units. >>>>>>>> - Back-end: >>>>>>>> * Lowering of PIR regions to calls to the OpenMP (GOMP) and Cilk++ >>>>>>>> runtime library. [milestone 1] >>>>>>>> * A simple parallel library, e.g., based on pthreads, to be shipped >>>>>>>> with LLVM as a fallback implementation for parallel regions. >>>>>>>> >>>>>>>> >>>>>>>> Thank you all for your time and hopefully constructive input on this proposal! >>>>>>>> >>>>>>>> Cheers, >>>>>>>> Johannes, on behalf of the PIR team >>>>>>>> >>>>>>>> >>>>>>>> Disclaimer: >>>>>>>> This RFC, the patches, the wiki, etc. are a joint effort by Tao B. >>>>>>>> Schardl (MIT), Charles E. Leiserson (MIT), Kareem Ergawy (Saarland >>>>>>>> University), Simon Moll (Saarland University) and myself. However, >>>>>>>> ideas and feedback came from many people, including the members of >>>>>>>> the LLVM-HPC IR Extensions working group (Hal Finkel, Xinmin Tian, >>>>>>>> ...), the participants in the BoF at the US Developers’ meeting, >>>>>>>> everybody that commented on the BoF discussion sheet [16] and the >>>>>>>> recent RFC on IR-level region annotations [9] (Mehdi Amini, Sanjoy >>>>>>>> Das, Daniel Berlin, ...). >>>>>>>> >>>>>>>> >>>>>>>> [0] https://github.com/Parallel-IR/llvm-pir/wiki >>>>>>>> [1] https://reviews.llvm.org/D29250 [2] >>>>>>>> https://reviews.llvm.org/D29251 [3] >>>>>>>> http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.html >>>>>>>> [4] >>>>>>>> http://lists.llvm.org/pipermail/llvm-dev/2015-March/083348.html >>>>>>>> [5] >>>>>>>> http://lists.llvm.org/pipermail/llvm-dev/2017-January/109264.html >>>>>>>> [6] >>>>>>>> http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pd >>>>>>>> f [7] http://supertech.csail.mit.edu/papers/spbags.pdf & >>>>>>>> www.cse.wustl.edu/~angelee/papers/cilkprof.pdf >>>>>>>> [8] https://goo.gl/Blp2Xr >>>>>>>> [9] >>>>>>>> http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.html >>>>>>>> [10] >>>>>>>> http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html >>>>>>>> [11] >>>>>>>> http://lists.llvm.org/pipermail/llvm-dev/2017-January/109377.html >>>>>>>> [12] >>>>>>>> http://lists.llvm.org/pipermail/llvm-dev/2017-January/109351.html >>>>>>>> [13] http://wsmoses.com/tapir.pdf >>>>>>>> [14] https://github.com/jdoerfert/llvm-pir/tree/feature/fork-join >>>>>>>> [15] >>>>>>>> https://github.com/jdoerfert/llvm-pir/commit/854259881d24d71f9f1f1 >>>>>>>> 7e >>>>>>>> 52547758c7be0618a >>>>>>>> [16] https://goo.gl/wKps3c >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> >>>>>>>> Johannes Doerfert >>>>>>>> Researcher / PhD Student >>>>>>>> >>>>>>>> Compiler Design Lab (Prof. Hack) >>>>>>>> Saarland Informatics Campus, Germany Building E1.3, Room 4.31 >>>>>>>> >>>>>>>> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de Fax. +49 >>>>>>>> (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert >>>>>>>> _______________________________________________ >>>>>>>> LLVM Developers mailing list >>>>>>>> llvm-dev at lists.llvm.org >>>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>>>>>> -- >>>>>>> >>>>>>> Johannes Doerfert >>>>>>> Researcher / PhD Student >>>>>>> >>>>>>> Compiler Design Lab (Prof. Hack) >>>>>>> Saarland Informatics Campus, Germany Building E1.3, Room 4.31 >>>>>>> >>>>>>> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de Fax. +49 >>>>>>> (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert >>>>>>> _______________________________________________ >>>>>>> LLVM Developers mailing list >>>>>>> llvm-dev at lists.llvm.org >>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>>>> -- >>>>> Hal Finkel >>>>> Lead, Compiler Technology and Programming Languages Leadership >>>>> Computing Facility Argonne National Laboratory >>>>> >>>> -- >>>> >>>> Johannes Doerfert >>>> Researcher / PhD Student >>>> >>>> Compiler Design Lab (Prof. Hack) >>>> Saarland Informatics Campus, Germany >>>> Building E1.3, Room 4.31 >>>> >>>> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert >> -- >> Hal Finkel >> Lead, Compiler Technology and Programming Languages >> Leadership Computing Facility >> Argonne National Laboratory >>-- Hal Finkel Lead, Compiler Technology and Programming Languages Leadership Computing Facility Argonne National Laboratory