thr3ads.net - llvm dev - [llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 -- [Mar 2017]

If this information is useful, please help other people find it:
Share via:

Tian, Xinmin via llvm-dev

2017-Mar-08 19:03 UTC

[llvm-dev] (no subject)

The IR-region annotation we proposed is as below, there is no
@llvm.parallel.for.iterator().....  There is no change to loop CFG.

   alloc A[100];
   %t = call token @llvm.region.entry()["parallel.for"()]
    for(i = 0; i < 100; i++) {   
       a[i] = f(i);
   }
   @llvm.region.exit(%t)() ["end.parallel.for"()]

Xinmin

-----Original Message-----
From: Johannes Doerfert [mailto:doerfert at cs.uni-saarland.de] 
Sent: Wednesday, March 8, 2017 10:44 AM
To: Tian, Xinmin <xinmin.tian at intel.com>
Cc: Hal Finkel <hfinkel at anl.gov>; LLVM-Dev <llvm-dev at
lists.llvm.org>; Sanjoy Das <sanjoy at playingwithpointers.com>; Mehdi
Amini <mehdi.amini at apple.com>; Chandler Carruth <chandlerc at
gmail.com>; Adve, Vikram Sadanand <vadve at illinois.edu>; TB Schardl
<neboat at mit.edu>; acjacob at us.ibm.com
Subject: Re:

I don't know who pointed it out first but Mehdi made me aware of it at CGO.
I try to explain it shortly.

Given the following situation (in pseudo code):

  alloc A[100];
  parallel_for(i = 0; i < 100; i++)
    A[i] = f(i);

  acc = 1;
  for(i = 0; i < 100; i++)
    acc = acc * A[i];

Afaik, with your parallel regions there won't be a CFG loop for the parallel
initialization, right? Instead some intrinsics that annotate the parallel region
and then one initialization. I imagine it somewhat like this:

  token = llvm.parallel.for.start(0, 100, 1)
  i = llvm.parallel.for.iterator(token)
  A[i] = f(i)
  llvm.parallel.for.end(token)

Which would (in serial semantics) allow an optimization to remove the second
loop as it uses uninitialized (undef) values. In other words, it is a problem if
we assume only one A[i] was initialized but the rest was not.

@Mehdi Does that reflect what you explained to some degree?

On 03/08, Tian, Xinmin wrote:> ".... the problem Mehdi pointed out regarding the missed
initializations of array elements, did you comment on that one yet?"
> 
> What is the initializations of array elements question?  I don't
remember this question. Please refresh my memory. Thanks.
> I thought Mehdi's question is more about what are attributes needed for
these IR-annotation for other LLVM pass to understand and make sense out of
them.
> 
> Xinmin
> 
> -----Original Message-----
> From: Johannes Doerfert [mailto:doerfert at cs.uni-saarland.de]
> Sent: Wednesday, March 8, 2017 9:58 AM
> To: Tian, Xinmin <xinmin.tian at intel.com>
> Cc: Hal Finkel <hfinkel at anl.gov>; LLVM-Dev <llvm-dev at
lists.llvm.org>;
> Sanjoy Das <sanjoy at playingwithpointers.com>; Mehdi Amini 
> <mehdi.amini at apple.com>; Chandler Carruth <chandlerc at
gmail.com>; Adve,
> Vikram Sadanand <vadve at illinois.edu>; TB Schardl <neboat at
mit.edu>;
> acjacob at us.ibm.com
> Subject: Re:
> 
> On 03/08, Tian, Xinmin wrote:
> > A quick update, we have been looking through all LLVM passes to 
> > identify the impact of "IR-region annotation", and
interaction
> > issues with the rest of LoopOpt and scalarOpt, e.g.  interaction 
> > with vectorization when you have schedule(simd:guided: 64). What are 
> > the common properties for optimizer to know on IR-region 
> > annotations. We have our implementation working from O0, O1, O2 to 
> > O3. So far, the changes we made in the existing LLVM opt passes are 
> > < 200 LOC. ClangFE to executable end-to-end linked our omp library,
> > and we also updated our implementation using token/tag on the 
> > intrinsics based on feedback from Google (David) and Xilinx 
> > (Hongbin). I am still owe Mehdi some answers.
> Thanks for the update. I will look into the discussion thread again soon. I
am especially interested in the problem Mehdi pointed out regarding the missed
initializations of array elements, did you comment on that one yet?
> 
> > One feedback for PIR RFC, putting "fork" into loop body does
not
> > work for loop vectorizer, it shuts down vectorization, this is the 
> > same issue as MIT's scheme.
> Totally true. The vectorizer needs to be thought about the fork-join at
some point. However, you can for now sequentialize and introduce the
llvm.parallel.loop annotations it already knows about. We (Simon Moll and me)
are currently discussing how to teach the RV vectorizer to handle fork-join
parallel regions in loops but also in otherwise straight line code. Once we make
progress we'll let you know!
> 
> Thanks!
>   Johannes
> 
> > -----Original Message-----
> > From: Hal Finkel [mailto:hfinkel at anl.gov]
> > Sent: Wednesday, March 8, 2017 5:51 AM
> > To: Johannes Doerfert <doerfert at cs.uni-saarland.de>; LLVM-Dev
> > <llvm-dev at lists.llvm.org>
> > Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Mehdi Amini 
> > <mehdi.amini at apple.com>; Chandler Carruth <chandlerc at
gmail.com>;
> > Adve, Vikram Sadanand <vadve at illinois.edu>; Tian, Xinmin 
> > <xinmin.tian at intel.com>; TB Schardl <neboat at
mit.edu>;
> > acjacob at us.ibm.com
> > Subject: Re:
> > 
> > 
> > On 03/08/2017 07:36 AM, Johannes Doerfert wrote:
> > > <mehdi.amini at apple.com>,
> > > Bcc:
> > > Subject: Re: [llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 --
> > > IR extension
> > > Reply-To:
> > > In-Reply-To: <20170224221713.GA931 at arch-linux-jd.home>
> > >
> > > Ping.
> > >
> > >
> > >
> > >
> > > PS.
> > >
> > >    Are there actually people interested in this?
> > >
> > >    We will continue working anyway but it might not make sense to
put it
> > >    on reviews and announce it on the ML if nobody cares.
> > 
> > I certainly care, and will also respond to the RFC, this week or next.
> > 
> >   -Hal
> > 
> > >
> > >
> > >
> > > On 02/24, Johannes Doerfert via llvm-dev wrote:
> > >> To (re)start a discussion here we published two patches to
the
> > >> phabricator:
> > >>
> > >> 1) A documentation draft (including LangRef) for the PIR
instructions
> > >>     and the parallel regions [0].
> > >>
> > >> 2) A draft implementation for a ParallelRegionInfo pass that
identifies
> > >>     and maintains parallel regions in the CFG [1]. It also
serves as a
> > >>     verifier pass for now.
> > >>
> > >> Please take a look and comment the patches either in the 
> > >> phabricator or here on the list.
> > >>
> > >> Thanks!
> > >>
> > >> [0] https://reviews.llvm.org/D30353 [1]
> > >> https://reviews.llvm.org/D30354
> > >>
> > >>
> > >> On 01/28, Johannes Doerfert via llvm-dev wrote:
> > >>> Dear all,
> > >>>
> > >>> This RFC proposes three new LLVM IR instructions to
express
> > >>> high-level parallel constructs in a simple, low-level
fashion.
> > >>> For this first stage we prepared two commits that add the
> > >>> proposed instructions and a pass to lower them to obtain
sequential IR.
> > >>> Both patches have be uploaded for review [1, 2]. The
latter
> > >>> patch is very simple and the former consists of almost
only mechanical changes needed to add new instructions.
> > >>>
> > >>> The rest of this email contains (1) an introduction of
the IR
> > >>> extension
> > >>> (2) the reasoning behind this approach, (3) a comparison
to
> > >>> other ideas proposed so far, (4) a validation of the
feasibility
> > >>> and potential impact, and (5) an outlook on the next
steps.
> > >>>
> > >>> (1) IR extension:
> > >>> Parallel IR adds three new terminator instructions that
define
> > >>> the beginning and the end of parallel regions in the CFG.
A
> > >>> parallel region is a connected subgraph of the CFG that
is
> > >>> potentially executed by two threads in parallel. It can
only be
> > >>> entered with a fork instruction and spreads till a join
instruction is reached.
> > >>> Therefor parallel regions are single-entry-multiple-exit
regions.
> > >>> Parallel regions can be nested and if they are, they form
a
> > >>> parallel region tree similar to the loop tree maintained
by the
> > >>> natural loop info pass. Each parallel region defines two
independent “sibling”
> > >>> tasks, namely the forked and continuation task.
> > >>>
> > >>> The new instructions are defined as follows:
> > >>>
> > >>> 1. fork: marks the beginning of parallel region. Every
fork has two
> > >>>           successor blocks which represent two parallel
tasks. We call
> > >>>           these two “sibling” tasks the forked and
continuation tasks.
> > >>>           Nested forking is supported, meaning that
another fork can be
> > >>>           reach prior to the join.
> > >>>
> > >>> 2. halt: marks the end of a forked task. The
"sibling" continuation block
> > >>>           (see fork above) is the operand of the halt
terminator. This
> > >>>           represents the idea of asymmetric parallelism
as introduced by
> > >>>           [1]. One advantage of asymmetric parallelism is
that sequential
> > >>>           semantics of the program are clear from its CFG
(ref. [1]).
> > >>>           Note that the edge from a forked block to a
continuation block
> > >>>           (the one introduced by the halt) represents the
control flow
> > >>>           when the two successors of a fork execute
sequentially, not
> > >>>           when they execute in parallel. In the latter
case there is no
> > >>>           “control transfer” happening via this edge but
only
> > >>>           synchronization between the tasks.
> > >>>
> > >>> 3. join: marks a synchronization point and the end of a
parallel region.
> > >>>           Once a join terminator is reached by a thread,
execution stops
> > >>>           in that thread until all tasks spawned by that
thread finish
> > >>>           their work, thus reach their respective halt
instruction.  A
> > >>>           join shall only be reached by the continuation
task of a fork,
> > >>>           the forked task shall reach a halt with the
continuation as a
> > >>>           successor.
> > >>>
> > >>>
> > >>> Here is an example of a parallel OpenMP loop and its
idiomatic
> > >>> lowering to Parallel IR. We set up a wiki [0] with
additional examples.
> > >>>
> > >>> #pragma omp parallel
> > >>> for(int i = 0; i < n; ++i) {
> > >>>   A[i] = C[i];
> > >>> }
> > >>>
> > >>>
> > >>> preheader:
> > >>>    br label %header
> > >>>
> > >>> header:
> > >>>    %i = phi [ i32 0, %preheader ], [ %inc, %latch ]
> > >>>    %done = icmp ge %i, %n
> > >>>    br i1 %done, label %exit, label %body
> > >>>
> > >>> body:
> > >>>    fork label %task, label %latch
> > >>>
> > >>> task:
> > >>>    %aptr = getelementptr i32, i32* %A, i32 0, i32 %i
> > >>>    %aval = load i32* %aptr
> > >>>    %cptr = getelementptr i32, i32* %C, i32 0, i32 %i
> > >>>    store i32 %aval, i32* %aptr
> > >>>    halt label %latch
> > >>>
> > >>> latch:
> > >>>    %inc = add i32, i32 %i, i32 1
> > >>>    br label %header
> > >>>
> > >>> exit:
> > >>>    join label %afterloop
> > >>>
> > >>> afterloop:
> > >>> ...
> > >>>
> > >>>
> > >>>
> > >>> (2) Reasoning:
> > >>> The proposed approach is crafted such that the semantics
of the
> > >>> parallel program is represented correctly in almost
native,
> > >>> low-level IR right after front-end and preserved at any
point
> > >>> till the final lowering to sequential IR or parallel
runtime
> > >>> library calls. To this end, asymmetric parallelism is
employed,
> > >>> a concept that uses control flow and the common concept
of
> > >>> dominance to represent parts of the parallel semantics.
In this
> > >>> model the parallel tasks do not dominate each other and
only one
> > >>> parallel task dominates the code after the parallel
region. As a
> > >>> consequence, various transformations that would break 
> > >>> assumptions we make about parallel regions cannot happen
(see
> > >>> [3,4]). While the explicitly modeled control flow
together with
> > >>> dominance prevents various code motion problems, the use
of
> > >>> terminators helps to minimize the changes needed to
educate passes about parallel regions. Only a fraction of analysis and
transformation passes deal with terminators explicitly.
> > >>> Most passes either test for known terminators (like
branches),
> > >>> rely on dominance information, or work on a basic block
level.
> > >>> To even further reduce changes to the existing passes, 
> > >>> high-level concepts are broken down to already available 
> > >>> low-level concepts instead of introducing new,
semantically rich
> > >>> instructions/intrinsics (see the last paragraph of [5]
and
> > >>> section
> > >>> 4 in the PIR white paper [6] for examples). Finally, this
scheme
> > >>> allows a pass to simply reason about the sequential
semantics of
> > >>> a parallel region, transform it back to one if needed or
deemed
> > >>> beneficial and employ existing tooling solutions to debug
and analyze the code [7].
> > >>>
> > >>> (3) Comparison:
> > >>> The BoF discussion sheet [8] and the recent “[RFC] on
IR-level
> > >>> region annotations” [9] both list pros and cons of
different
> > >>> proposed schemes and implementations. We summarize and
comment
> > >>> the discussion on the ones listen in the recent RFC here:
> > >>>    (a) Metadata: It seems a consensus has been reached
that metadata is
> > >>>          not the solution but only a means to enhance a
different solution.
> > >>>    (c) One Intrinsic per directive/clause: This approach
basically embeds
> > >>>          a high-level (parallel) language in LLVM IR
using intrinsics. It
> > >>>          seems there is little to no support for this
approach at the
> > >>>          moment.
> > >>>    (d) Parallel loop/region annotations: Here, intrinsics
enclosing a
> > >>>          parallel loop/region are used to represent
parallelism.
> > >>>          High-level knowledge is represented as attached
metadata or in
> > >>>          separate intrinsics. For more details please see
the original
> > >>>          RFC [9]. In the discussion several potential
drawbacks have been
> > >>>          mentioned:
> > >>>          - The annotations might be too general [10].
> > >>>          - The IR is not semantically correct (or ready
for optimization)
> > >>>            after the front-end and needs an additional
“prepare phase for
> > >>>            pre-privatization" [11].
> > >>>          - The currently available “potential side effect
for intrinsic
> > >>>            calls” seem not to suffice for the proposed
intrinsics as they
> > >>>            do not have "call semantics" [12].
> > >>>    (b) Parallel instructions (this approach): The table
in the region RFC
> > >>>          [9] lists two drawbacks with this approach, both
of which have
> > >>>          already been called into question [5]. The first
drawback is the
> > >>>          effort needed to implement this scheme which is
discussed in
> > >>>          more detail in section (4) of this mail. The
second drawback is
> > >>>          the need for additional representation of
high-level information
> > >>>          that is not part of the semantics of the new
fork-join
> > >>>          instructions. As mentioned above, the choice to
keep the new
> > >>>          instructions as simple as possible is
deliberate. This parallel
> > >>>          IR is intended to be extensible, and in
particular, compatible
> > >>>          with representations of high-level parallel
concepts that might
> > >>>          be developed in the future. For the time being,
the parallel IR
> > >>>          is compatible with approach taken today of
lowering high-level
> > >>>          parallel linguistics, such as reductions and
private memory, to
> > >>>          existing IR constructs, such as parallel-runtime
calls,
> > >>>          atomicrmw instructions, and well placed alloca’s
[5,6]. Although
> > >>>          other extensions to the IR might allow LLVM to
compile these
> > >>>          higher-level constructs more effectively, we see
no reason the
> > >>>          parallel IR would conflict with any such
extensions. (On the
> > >>>          contrary, the parallel IR would seem to help
compiler analyses
> > >>>          of higher-level parallel constructs by exposing
logical
> > >>>          parallelism.)
> > >>>
> > >>>
> > >>> (4) Feasibility and Impact:
> > >>> The Tapir and PIR prototypes demonstrate the feasibility
of this
> > >>> approach. The Tapir prototype [13] has recently proven
its
> > >>> robustness as the standard compiler in the MIT class on
parallel
> > >>> programming. It was implemented in ~ 5k LOC. However,
>1k are
> > >>> explicit parallel optimization, 1k is used to add new 
> > >>> instructions (thus mechanical) and 2k are used to lower
the parallelism (basically needed for any scheme).
> > >>> Only the rest is required to make it work with existing
analysis
> > >>> and transformation passes. While Tapir added explicit 
> > >>> optimization passes for parallel regions/loops, the 
> > >>> representation allows for a variety of classic
optimizations
> > >>> (CSE, GVN, LICM, loop unrolling,
> > >>> TRE) to work with little to no modifications. Potential
speedups
> > >>> compared to a classic “early-outlining” approaches can
also be seen in the Tapir paper [13].
> > >>> For the PIR prototype [14] we modified only three
transformation
> > >>> passes
> > >>> (<20 LOC) [15] before we could run the O3 pipeline
successfully
> > >>> on a parallel matrix multiplication.
> > >>>
> > >>> Together, these prototypes show how little passes
actually
> > >>> inspect new (or “unknown”) terminators. The default
assumption
> > >>> passes have to make, namely that control might be
transferred to
> > >>> any successors at runtime, has, in terms of potential
compiler
> > >>> transformations, a similar effect as the parallel
semantics we
> > >>> want to model, namely that control is transferred to all
successors.
> > >>>
> > >>>
> > >>> (5) Outlook:
> > >>> This first stage will only introduce and test the new 
> > >>> instructions and the sequentialization pass. Afterwards
we
> > >>> intend to start additions in different, partially
overlapping
> > >>> but often orthogonal directions. We do welcome comments
as well as developers for each of them:
> > >>>   - Analysis and optimization:
> > >>>      * A “parallel region info” pass to keep track of
parallel regions
> > >>>        and their nesting. The information can be made
accessible in a
> > >>>        “parallel region tree” similar to the loop tree
maintained by the
> > >>>        loop info pass. [stage 1, immediate next goal]
> > >>>      * Extension of the verifier that allow to check
parallel IR for
> > >>>        “well-formedness”. [stage 1, immediate next goal]
> > >>>      * Documentation of the PIR instructions in the
language reference.
> > >>>        [stage 1, immediate next goal]
> > >>>      * A cost analysis for parallel tasks that can be
queried by
> > >>>        optimizations. The cost model needs to take the
hardware, the
> > >>>        runtime library and the parallel tasks into
account.
> > >>>      * Vectorizer enhancements to enable the
vectorization of parallel
> > >>>      * loops and tasks.
> > >>>      * Parallelization centric optimizations:
> > >>>        a) Parallel tasks can be balanced, merged or split
as well as created
> > >>>           from and lowered to sequential code.
> > >>>        b) Barriers can be eliminated.
> > >>>        c) Parallel loops can be statically scheduled or
created from
> > >>>           parallel recursive calls [13]
> > >>>      * Analysis to extract high-level information
(reductions, private
> > >>>        memory, ...) from the low-level representation.
> > >>>   - Front-end:
> > >>>      * Lowering of simple OpenMP and Cilk++ annotations
to PIR, including
> > >>>        parallel sections and parallel loops with limited
support for
> > >>>        clauses (at first) (examples can be found here
[1]). [milestone 1]
> > >>>      * Generation of PIR code through automatic
parallelization. A
> > >>>        patched version of Polly exists that emits
parallel loops using PIR instead of
> > >>>        OpenMP runtime calls or llvm.parallel.loop
metadata. [milestone 1]
> > >>>      * Representation of more evolved high-level features
like assignment
> > >>>        of computation units.
> > >>>   - Back-end:
> > >>>      * Lowering of PIR regions to calls to the OpenMP
(GOMP) and Cilk++
> > >>>        runtime library. [milestone 1]
> > >>>      * A simple parallel library, e.g., based on
pthreads, to be shipped
> > >>>        with LLVM as a fallback implementation for
parallel regions.
> > >>>
> > >>>
> > >>> Thank you all for your time and hopefully constructive
input on this proposal!
> > >>>
> > >>> Cheers,
> > >>>    Johannes, on behalf of the PIR team
> > >>>
> > >>>
> > >>> Disclaimer:
> > >>> This RFC, the patches, the wiki, etc. are a joint effort
by Tao B.
> > >>> Schardl (MIT), Charles E. Leiserson (MIT), Kareem Ergawy 
> > >>> (Saarland University), Simon Moll (Saarland University)
and
> > >>> myself. However, ideas and feedback came from many
people,
> > >>> including the members of the LLVM-HPC IR Extensions
working
> > >>> group (Hal Finkel, Xinmin Tian, ...), the participants in
the
> > >>> BoF at the US Developers’ meeting, everybody that
commented on
> > >>> the BoF discussion sheet [16] and the recent RFC on
IR-level
> > >>> region annotations [9] (Mehdi Amini, Sanjoy Das, Daniel
Berlin, ...).
> > >>>   
> > >>>
> > >>>
> > >>> [0] https://github.com/Parallel-IR/llvm-pir/wiki
> > >>> [1] https://reviews.llvm.org/D29250 [2]
> > >>> https://reviews.llvm.org/D29251 [3] 
> > >>>
http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.htm
> > >>> l
> > >>> [4]
> > >>>
http://lists.llvm.org/pipermail/llvm-dev/2015-March/083348.html
> > >>> [5]
> > >>>
http://lists.llvm.org/pipermail/llvm-dev/2017-January/109264.htm
> > >>> l
> > >>> [6]
> > >>>
http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.
> > >>> pd f [7] http://supertech.csail.mit.edu/papers/spbags.pdf
&
> > >>> www.cse.wustl.edu/~angelee/papers/cilkprof.pdf
> > >>> [8] https://goo.gl/Blp2Xr
> > >>> [9]
> > >>>
http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.htm
> > >>> l
> > >>> [10]
> > >>>
http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.htm
> > >>> l
> > >>> [11]
> > >>>
http://lists.llvm.org/pipermail/llvm-dev/2017-January/109377.htm
> > >>> l
> > >>> [12]
> > >>>
http://lists.llvm.org/pipermail/llvm-dev/2017-January/109351.htm
> > >>> l [13] http://wsmoses.com/tapir.pdf [14] 
> > >>>
https://github.com/jdoerfert/llvm-pir/tree/feature/fork-join
> > >>> [15]
> > >>>
https://github.com/jdoerfert/llvm-pir/commit/854259881d24d71f9f1
> > >>> f1
> > >>> 7e
> > >>> 52547758c7be0618a
> > >>> [16] https://goo.gl/wKps3c
> > >>>
> > >>>
> > >>> --
> > >>>
> > >>> Johannes Doerfert
> > >>> Researcher / PhD Student
> > >>>
> > >>> Compiler Design Lab (Prof. Hack) Saarland Informatics
Campus,
> > >>> Germany Building E1.3, Room 4.31
> > >>>
> > >>> Tel. +49 (0)681 302-57521 : doerfert at
cs.uni-saarland.de Fax. +49
> > >>> (0)681 302-3065  : 
> > >>> http://www.cdl.uni-saarland.de/people/doerfert
> > >>
> > >>
> > >>> _______________________________________________
> > >>> LLVM Developers mailing list
> > >>> llvm-dev at lists.llvm.org
> > >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > >>
> > >> --
> > >>
> > >> Johannes Doerfert
> > >> Researcher / PhD Student
> > >>
> > >> Compiler Design Lab (Prof. Hack)
> > >> Saarland Informatics Campus, Germany Building E1.3, Room 4.31
> > >>
> > >> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de
Fax. +49
> > >> (0)681 302-3065  :
http://www.cdl.uni-saarland.de/people/doerfert
> > >
> > >
> > >> _______________________________________________
> > >> LLVM Developers mailing list
> > >> llvm-dev at lists.llvm.org
> > >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > >
> > 
> > --
> > Hal Finkel
> > Lead, Compiler Technology and Programming Languages Leadership 
> > Computing Facility Argonne National Laboratory
> > 
> 
> --
> 
> Johannes Doerfert
> Researcher / PhD Student
> 
> Compiler Design Lab (Prof. Hack)
> Saarland Informatics Campus, Germany
> Building E1.3, Room 4.31
> 
> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de Fax. +49 
> (0)681 302-3065  : http://www.cdl.uni-saarland.de/people/doerfert
-- 

Johannes Doerfert
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland Informatics Campus, Germany
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de Fax. +49 (0)681
302-3065  : http://www.cdl.uni-saarland.de/people/doerfert

Hal Finkel via llvm-dev

2017-Mar-08 19:10 UTC

head link

[llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 --

On 03/08/2017 01:03 PM, Tian, Xinmin wrote:> The IR-region annotation we proposed is as below, there is no
@llvm.parallel.for.iterator().....  There is no change to loop CFG.
>
>     alloc A[100];
>     %t = call token @llvm.region.entry()["parallel.for"()]
>      for(i = 0; i < 100; i++) {
>         a[i] = f(i);
>     }
>     @llvm.region.exit(%t)() ["end.parallel.for"()]
Yes, although as Sanjoy pointed out, the problem occurs when just using 
a parallel region and omp_get_thread_num().

  -Hal
>
> Xinmin
>
> -----Original Message-----
> From: Johannes Doerfert [mailto:doerfert at cs.uni-saarland.de]
> Sent: Wednesday, March 8, 2017 10:44 AM
> To: Tian, Xinmin <xinmin.tian at intel.com>
> Cc: Hal Finkel <hfinkel at anl.gov>; LLVM-Dev <llvm-dev at
lists.llvm.org>; Sanjoy Das <sanjoy at playingwithpointers.com>; Mehdi
Amini <mehdi.amini at apple.com>; Chandler Carruth <chandlerc at
gmail.com>; Adve, Vikram Sadanand <vadve at illinois.edu>; TB Schardl
<neboat at mit.edu>; acjacob at us.ibm.com
> Subject: Re:
>
> I don't know who pointed it out first but Mehdi made me aware of it at
CGO. I try to explain it shortly.
>
> Given the following situation (in pseudo code):
>
>    alloc A[100];
>    parallel_for(i = 0; i < 100; i++)
>      A[i] = f(i);
>
>    acc = 1;
>    for(i = 0; i < 100; i++)
>      acc = acc * A[i];
>
> Afaik, with your parallel regions there won't be a CFG loop for the
parallel initialization, right? Instead some intrinsics that annotate the
parallel region and then one initialization. I imagine it somewhat like this:
>
>    token = llvm.parallel.for.start(0, 100, 1)
>    i = llvm.parallel.for.iterator(token)
>    A[i] = f(i)
>    llvm.parallel.for.end(token)
>
> Which would (in serial semantics) allow an optimization to remove the
second loop as it uses uninitialized (undef) values. In other words, it is a
problem if we assume only one A[i] was initialized but the rest was not.
>
> @Mehdi Does that reflect what you explained to some degree?
>
> On 03/08, Tian, Xinmin wrote:
>> ".... the problem Mehdi pointed out regarding the missed
initializations of array elements, did you comment on that one yet?"
>>
>> What is the initializations of array elements question?  I don't
remember this question. Please refresh my memory. Thanks.
>> I thought Mehdi's question is more about what are attributes needed
for these IR-annotation for other LLVM pass to understand and make sense out of
them.
>>
>> Xinmin
>>
>> -----Original Message-----
>> From: Johannes Doerfert [mailto:doerfert at cs.uni-saarland.de]
>> Sent: Wednesday, March 8, 2017 9:58 AM
>> To: Tian, Xinmin <xinmin.tian at intel.com>
>> Cc: Hal Finkel <hfinkel at anl.gov>; LLVM-Dev <llvm-dev at
lists.llvm.org>;
>> Sanjoy Das <sanjoy at playingwithpointers.com>; Mehdi Amini
>> <mehdi.amini at apple.com>; Chandler Carruth <chandlerc at
gmail.com>; Adve,
>> Vikram Sadanand <vadve at illinois.edu>; TB Schardl <neboat at
mit.edu>;
>> acjacob at us.ibm.com
>> Subject: Re:
>>
>> On 03/08, Tian, Xinmin wrote:
>>> A quick update, we have been looking through all LLVM passes to
>>> identify the impact of "IR-region annotation", and
interaction
>>> issues with the rest of LoopOpt and scalarOpt, e.g.  interaction
>>> with vectorization when you have schedule(simd:guided: 64). What
are
>>> the common properties for optimizer to know on IR-region
>>> annotations. We have our implementation working from O0, O1, O2 to
>>> O3. So far, the changes we made in the existing LLVM opt passes are
>>> < 200 LOC. ClangFE to executable end-to-end linked our omp
library,
>>> and we also updated our implementation using token/tag on the
>>> intrinsics based on feedback from Google (David) and Xilinx
>>> (Hongbin). I am still owe Mehdi some answers.
>> Thanks for the update. I will look into the discussion thread again
soon. I am especially interested in the problem Mehdi pointed out regarding the
missed initializations of array elements, did you comment on that one yet?
>>
>>> One feedback for PIR RFC, putting "fork" into loop body
does not
>>> work for loop vectorizer, it shuts down vectorization, this is the
>>> same issue as MIT's scheme.
>> Totally true. The vectorizer needs to be thought about the fork-join at
some point. However, you can for now sequentialize and introduce the
llvm.parallel.loop annotations it already knows about. We (Simon Moll and me)
are currently discussing how to teach the RV vectorizer to handle fork-join
parallel regions in loops but also in otherwise straight line code. Once we make
progress we'll let you know!
>>
>> Thanks!
>>    Johannes
>>
>>> -----Original Message-----
>>> From: Hal Finkel [mailto:hfinkel at anl.gov]
>>> Sent: Wednesday, March 8, 2017 5:51 AM
>>> To: Johannes Doerfert <doerfert at cs.uni-saarland.de>;
LLVM-Dev
>>> <llvm-dev at lists.llvm.org>
>>> Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Mehdi
Amini
>>> <mehdi.amini at apple.com>; Chandler Carruth <chandlerc at
gmail.com>;
>>> Adve, Vikram Sadanand <vadve at illinois.edu>; Tian, Xinmin
>>> <xinmin.tian at intel.com>; TB Schardl <neboat at
mit.edu>;
>>> acjacob at us.ibm.com
>>> Subject: Re:
>>>
>>>
>>> On 03/08/2017 07:36 AM, Johannes Doerfert wrote:
>>>> <mehdi.amini at apple.com>,
>>>> Bcc:
>>>> Subject: Re: [llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0
--
>>>> IR extension
>>>> Reply-To:
>>>> In-Reply-To: <20170224221713.GA931 at arch-linux-jd.home>
>>>>
>>>> Ping.
>>>>
>>>>
>>>>
>>>>
>>>> PS.
>>>>
>>>>     Are there actually people interested in this?
>>>>
>>>>     We will continue working anyway but it might not make sense
to put it
>>>>     on reviews and announce it on the ML if nobody cares.
>>> I certainly care, and will also respond to the RFC, this week or
next.
>>>
>>>    -Hal
>>>
>>>>
>>>>
>>>> On 02/24, Johannes Doerfert via llvm-dev wrote:
>>>>> To (re)start a discussion here we published two patches to
the
>>>>> phabricator:
>>>>>
>>>>> 1) A documentation draft (including LangRef) for the PIR
instructions
>>>>>      and the parallel regions [0].
>>>>>
>>>>> 2) A draft implementation for a ParallelRegionInfo pass
that identifies
>>>>>      and maintains parallel regions in the CFG [1]. It also
serves as a
>>>>>      verifier pass for now.
>>>>>
>>>>> Please take a look and comment the patches either in the
>>>>> phabricator or here on the list.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> [0] https://reviews.llvm.org/D30353 [1]
>>>>> https://reviews.llvm.org/D30354
>>>>>
>>>>>
>>>>> On 01/28, Johannes Doerfert via llvm-dev wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> This RFC proposes three new LLVM IR instructions to
express
>>>>>> high-level parallel constructs in a simple, low-level
fashion.
>>>>>> For this first stage we prepared two commits that add
the
>>>>>> proposed instructions and a pass to lower them to
obtain sequential IR.
>>>>>> Both patches have be uploaded for review [1, 2]. The
latter
>>>>>> patch is very simple and the former consists of almost
only mechanical changes needed to add new instructions.
>>>>>>
>>>>>> The rest of this email contains (1) an introduction of
the IR
>>>>>> extension
>>>>>> (2) the reasoning behind this approach, (3) a
comparison to
>>>>>> other ideas proposed so far, (4) a validation of the
feasibility
>>>>>> and potential impact, and (5) an outlook on the next
steps.
>>>>>>
>>>>>> (1) IR extension:
>>>>>> Parallel IR adds three new terminator instructions that
define
>>>>>> the beginning and the end of parallel regions in the
CFG. A
>>>>>> parallel region is a connected subgraph of the CFG that
is
>>>>>> potentially executed by two threads in parallel. It can
only be
>>>>>> entered with a fork instruction and spreads till a join
instruction is reached.
>>>>>> Therefor parallel regions are
single-entry-multiple-exit regions.
>>>>>> Parallel regions can be nested and if they are, they
form a
>>>>>> parallel region tree similar to the loop tree
maintained by the
>>>>>> natural loop info pass. Each parallel region defines
two independent “sibling”
>>>>>> tasks, namely the forked and continuation task.
>>>>>>
>>>>>> The new instructions are defined as follows:
>>>>>>
>>>>>> 1. fork: marks the beginning of parallel region. Every
fork has two
>>>>>>            successor blocks which represent two
parallel tasks. We call
>>>>>>            these two “sibling” tasks the forked and
continuation tasks.
>>>>>>            Nested forking is supported, meaning that
another fork can be
>>>>>>            reach prior to the join.
>>>>>>
>>>>>> 2. halt: marks the end of a forked task. The
"sibling" continuation block
>>>>>>            (see fork above) is the operand of the halt
terminator. This
>>>>>>            represents the idea of asymmetric
parallelism as introduced by
>>>>>>            [1]. One advantage of asymmetric parallelism
is that sequential
>>>>>>            semantics of the program are clear from its
CFG (ref. [1]).
>>>>>>            Note that the edge from a forked block to a
continuation block
>>>>>>            (the one introduced by the halt) represents
the control flow
>>>>>>            when the two successors of a fork execute
sequentially, not
>>>>>>            when they execute in parallel. In the latter
case there is no
>>>>>>            “control transfer” happening via this edge
but only
>>>>>>            synchronization between the tasks.
>>>>>>
>>>>>> 3. join: marks a synchronization point and the end of a
parallel region.
>>>>>>            Once a join terminator is reached by a
thread, execution stops
>>>>>>            in that thread until all tasks spawned by
that thread finish
>>>>>>            their work, thus reach their respective halt
instruction.  A
>>>>>>            join shall only be reached by the
continuation task of a fork,
>>>>>>            the forked task shall reach a halt with the
continuation as a
>>>>>>            successor.
>>>>>>
>>>>>>
>>>>>> Here is an example of a parallel OpenMP loop and its
idiomatic
>>>>>> lowering to Parallel IR. We set up a wiki [0] with
additional examples.
>>>>>>
>>>>>> #pragma omp parallel
>>>>>> for(int i = 0; i < n; ++i) {
>>>>>>    A[i] = C[i];
>>>>>> }
>>>>>>
>>>>>>
>>>>>> preheader:
>>>>>>     br label %header
>>>>>>
>>>>>> header:
>>>>>>     %i = phi [ i32 0, %preheader ], [ %inc, %latch ]
>>>>>>     %done = icmp ge %i, %n
>>>>>>     br i1 %done, label %exit, label %body
>>>>>>
>>>>>> body:
>>>>>>     fork label %task, label %latch
>>>>>>
>>>>>> task:
>>>>>>     %aptr = getelementptr i32, i32* %A, i32 0, i32 %i
>>>>>>     %aval = load i32* %aptr
>>>>>>     %cptr = getelementptr i32, i32* %C, i32 0, i32 %i
>>>>>>     store i32 %aval, i32* %aptr
>>>>>>     halt label %latch
>>>>>>
>>>>>> latch:
>>>>>>     %inc = add i32, i32 %i, i32 1
>>>>>>     br label %header
>>>>>>
>>>>>> exit:
>>>>>>     join label %afterloop
>>>>>>
>>>>>> afterloop:
>>>>>> ...
>>>>>>
>>>>>>
>>>>>>
>>>>>> (2) Reasoning:
>>>>>> The proposed approach is crafted such that the
semantics of the
>>>>>> parallel program is represented correctly in almost
native,
>>>>>> low-level IR right after front-end and preserved at any
point
>>>>>> till the final lowering to sequential IR or parallel
runtime
>>>>>> library calls. To this end, asymmetric parallelism is
employed,
>>>>>> a concept that uses control flow and the common concept
of
>>>>>> dominance to represent parts of the parallel semantics.
In this
>>>>>> model the parallel tasks do not dominate each other and
only one
>>>>>> parallel task dominates the code after the parallel
region. As a
>>>>>> consequence, various transformations that would break
>>>>>> assumptions we make about parallel regions cannot
happen (see
>>>>>> [3,4]). While the explicitly modeled control flow
together with
>>>>>> dominance prevents various code motion problems, the
use of
>>>>>> terminators helps to minimize the changes needed to
educate passes about parallel regions. Only a fraction of analysis and
transformation passes deal with terminators explicitly.
>>>>>> Most passes either test for known terminators (like
branches),
>>>>>> rely on dominance information, or work on a basic block
level.
>>>>>> To even further reduce changes to the existing passes,
>>>>>> high-level concepts are broken down to already
available
>>>>>> low-level concepts instead of introducing new,
semantically rich
>>>>>> instructions/intrinsics (see the last paragraph of [5]
and
>>>>>> section
>>>>>> 4 in the PIR white paper [6] for examples). Finally,
this scheme
>>>>>> allows a pass to simply reason about the sequential
semantics of
>>>>>> a parallel region, transform it back to one if needed
or deemed
>>>>>> beneficial and employ existing tooling solutions to
debug and analyze the code [7].
>>>>>>
>>>>>> (3) Comparison:
>>>>>> The BoF discussion sheet [8] and the recent “[RFC] on
IR-level
>>>>>> region annotations” [9] both list pros and cons of
different
>>>>>> proposed schemes and implementations. We summarize and
comment
>>>>>> the discussion on the ones listen in the recent RFC
here:
>>>>>>     (a) Metadata: It seems a consensus has been reached
that metadata is
>>>>>>           not the solution but only a means to enhance
a different solution.
>>>>>>     (c) One Intrinsic per directive/clause: This
approach basically embeds
>>>>>>           a high-level (parallel) language in LLVM IR
using intrinsics. It
>>>>>>           seems there is little to no support for this
approach at the
>>>>>>           moment.
>>>>>>     (d) Parallel loop/region annotations: Here,
intrinsics enclosing a
>>>>>>           parallel loop/region are used to represent
parallelism.
>>>>>>           High-level knowledge is represented as
attached metadata or in
>>>>>>           separate intrinsics. For more details please
see the original
>>>>>>           RFC [9]. In the discussion several potential
drawbacks have been
>>>>>>           mentioned:
>>>>>>           - The annotations might be too general [10].
>>>>>>           - The IR is not semantically correct (or
ready for optimization)
>>>>>>             after the front-end and needs an additional
“prepare phase for
>>>>>>             pre-privatization" [11].
>>>>>>           - The currently available “potential side
effect for intrinsic
>>>>>>             calls” seem not to suffice for the proposed
intrinsics as they
>>>>>>             do not have "call semantics"
[12].
>>>>>>     (b) Parallel instructions (this approach): The
table in the region RFC
>>>>>>           [9] lists two drawbacks with this approach,
both of which have
>>>>>>           already been called into question [5]. The
first drawback is the
>>>>>>           effort needed to implement this scheme which
is discussed in
>>>>>>           more detail in section (4) of this mail. The
second drawback is
>>>>>>           the need for additional representation of
high-level information
>>>>>>           that is not part of the semantics of the new
fork-join
>>>>>>           instructions. As mentioned above, the choice
to keep the new
>>>>>>           instructions as simple as possible is
deliberate. This parallel
>>>>>>           IR is intended to be extensible, and in
particular, compatible
>>>>>>           with representations of high-level parallel
concepts that might
>>>>>>           be developed in the future. For the time
being, the parallel IR
>>>>>>           is compatible with approach taken today of
lowering high-level
>>>>>>           parallel linguistics, such as reductions and
private memory, to
>>>>>>           existing IR constructs, such as
parallel-runtime calls,
>>>>>>           atomicrmw instructions, and well placed
alloca’s [5,6]. Although
>>>>>>           other extensions to the IR might allow LLVM
to compile these
>>>>>>           higher-level constructs more effectively, we
see no reason the
>>>>>>           parallel IR would conflict with any such
extensions. (On the
>>>>>>           contrary, the parallel IR would seem to help
compiler analyses
>>>>>>           of higher-level parallel constructs by
exposing logical
>>>>>>           parallelism.)
>>>>>>
>>>>>>
>>>>>> (4) Feasibility and Impact:
>>>>>> The Tapir and PIR prototypes demonstrate the
feasibility of this
>>>>>> approach. The Tapir prototype [13] has recently proven
its
>>>>>> robustness as the standard compiler in the MIT class on
parallel
>>>>>> programming. It was implemented in ~ 5k LOC. However,
>1k are
>>>>>> explicit parallel optimization, 1k is used to add new
>>>>>> instructions (thus mechanical) and 2k are used to lower
the parallelism (basically needed for any scheme).
>>>>>> Only the rest is required to make it work with existing
analysis
>>>>>> and transformation passes. While Tapir added explicit
>>>>>> optimization passes for parallel regions/loops, the
>>>>>> representation allows for a variety of classic
optimizations
>>>>>> (CSE, GVN, LICM, loop unrolling,
>>>>>> TRE) to work with little to no modifications. Potential
speedups
>>>>>> compared to a classic “early-outlining” approaches can
also be seen in the Tapir paper [13].
>>>>>> For the PIR prototype [14] we modified only three
transformation
>>>>>> passes
>>>>>> (<20 LOC) [15] before we could run the O3 pipeline
successfully
>>>>>> on a parallel matrix multiplication.
>>>>>>
>>>>>> Together, these prototypes show how little passes
actually
>>>>>> inspect new (or “unknown”) terminators. The default
assumption
>>>>>> passes have to make, namely that control might be
transferred to
>>>>>> any successors at runtime, has, in terms of potential
compiler
>>>>>> transformations, a similar effect as the parallel
semantics we
>>>>>> want to model, namely that control is transferred to
all successors.
>>>>>>
>>>>>>
>>>>>> (5) Outlook:
>>>>>> This first stage will only introduce and test the new
>>>>>> instructions and the sequentialization pass. Afterwards
we
>>>>>> intend to start additions in different, partially
overlapping
>>>>>> but often orthogonal directions. We do welcome comments
as well as developers for each of them:
>>>>>>    - Analysis and optimization:
>>>>>>       * A “parallel region info” pass to keep track of
parallel regions
>>>>>>         and their nesting. The information can be made
accessible in a
>>>>>>         “parallel region tree” similar to the loop tree
maintained by the
>>>>>>         loop info pass. [stage 1, immediate next goal]
>>>>>>       * Extension of the verifier that allow to check
parallel IR for
>>>>>>         “well-formedness”. [stage 1, immediate next
goal]
>>>>>>       * Documentation of the PIR instructions in the
language reference.
>>>>>>         [stage 1, immediate next goal]
>>>>>>       * A cost analysis for parallel tasks that can be
queried by
>>>>>>         optimizations. The cost model needs to take the
hardware, the
>>>>>>         runtime library and the parallel tasks into
account.
>>>>>>       * Vectorizer enhancements to enable the
vectorization of parallel
>>>>>>       * loops and tasks.
>>>>>>       * Parallelization centric optimizations:
>>>>>>         a) Parallel tasks can be balanced, merged or
split as well as created
>>>>>>            from and lowered to sequential code.
>>>>>>         b) Barriers can be eliminated.
>>>>>>         c) Parallel loops can be statically scheduled
or created from
>>>>>>            parallel recursive calls [13]
>>>>>>       * Analysis to extract high-level information
(reductions, private
>>>>>>         memory, ...) from the low-level representation.
>>>>>>    - Front-end:
>>>>>>       * Lowering of simple OpenMP and Cilk++
annotations to PIR, including
>>>>>>         parallel sections and parallel loops with
limited support for
>>>>>>         clauses (at first) (examples can be found here
[1]). [milestone 1]
>>>>>>       * Generation of PIR code through automatic
parallelization. A
>>>>>>         patched version of Polly exists that emits
parallel loops using PIR instead of
>>>>>>         OpenMP runtime calls or llvm.parallel.loop
metadata. [milestone 1]
>>>>>>       * Representation of more evolved high-level
features like assignment
>>>>>>         of computation units.
>>>>>>    - Back-end:
>>>>>>       * Lowering of PIR regions to calls to the OpenMP
(GOMP) and Cilk++
>>>>>>         runtime library. [milestone 1]
>>>>>>       * A simple parallel library, e.g., based on
pthreads, to be shipped
>>>>>>         with LLVM as a fallback implementation for
parallel regions.
>>>>>>
>>>>>>
>>>>>> Thank you all for your time and hopefully constructive
input on this proposal!
>>>>>>
>>>>>> Cheers,
>>>>>>     Johannes, on behalf of the PIR team
>>>>>>
>>>>>>
>>>>>> Disclaimer:
>>>>>> This RFC, the patches, the wiki, etc. are a joint
effort by Tao B.
>>>>>> Schardl (MIT), Charles E. Leiserson (MIT), Kareem
Ergawy
>>>>>> (Saarland University), Simon Moll (Saarland University)
and
>>>>>> myself. However, ideas and feedback came from many
people,
>>>>>> including the members of the LLVM-HPC IR Extensions
working
>>>>>> group (Hal Finkel, Xinmin Tian, ...), the participants
in the
>>>>>> BoF at the US Developers’ meeting, everybody that
commented on
>>>>>> the BoF discussion sheet [16] and the recent RFC on
IR-level
>>>>>> region annotations [9] (Mehdi Amini, Sanjoy Das, Daniel
Berlin, ...).
>>>>>>    
>>>>>>
>>>>>>
>>>>>> [0] https://github.com/Parallel-IR/llvm-pir/wiki
>>>>>> [1] https://reviews.llvm.org/D29250 [2]
>>>>>> https://reviews.llvm.org/D29251 [3]
>>>>>>
http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.htm
>>>>>> l
>>>>>> [4]
>>>>>>
http://lists.llvm.org/pipermail/llvm-dev/2015-March/083348.html
>>>>>> [5]
>>>>>>
http://lists.llvm.org/pipermail/llvm-dev/2017-January/109264.htm
>>>>>> l
>>>>>> [6]
>>>>>>
http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.
>>>>>> pd f [7]
http://supertech.csail.mit.edu/papers/spbags.pdf  &
>>>>>> www.cse.wustl.edu/~angelee/papers/cilkprof.pdf
>>>>>> [8] https://goo.gl/Blp2Xr
>>>>>> [9]
>>>>>>
http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.htm
>>>>>> l
>>>>>> [10]
>>>>>>
http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.htm
>>>>>> l
>>>>>> [11]
>>>>>>
http://lists.llvm.org/pipermail/llvm-dev/2017-January/109377.htm
>>>>>> l
>>>>>> [12]
>>>>>>
http://lists.llvm.org/pipermail/llvm-dev/2017-January/109351.htm
>>>>>> l [13] http://wsmoses.com/tapir.pdf [14]
>>>>>>
https://github.com/jdoerfert/llvm-pir/tree/feature/fork-join
>>>>>> [15]
>>>>>>
https://github.com/jdoerfert/llvm-pir/commit/854259881d24d71f9f1
>>>>>> f1
>>>>>> 7e
>>>>>> 52547758c7be0618a
>>>>>> [16] https://goo.gl/wKps3c
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Johannes Doerfert
>>>>>> Researcher / PhD Student
>>>>>>
>>>>>> Compiler Design Lab (Prof. Hack) Saarland Informatics
Campus,
>>>>>> Germany Building E1.3, Room 4.31
>>>>>>
>>>>>> Tel. +49 (0)681 302-57521 : doerfert at
cs.uni-saarland.de Fax. +49
>>>>>> (0)681 302-3065  :
>>>>>> http://www.cdl.uni-saarland.de/people/doerfert
>>>>>
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> llvm-dev at lists.llvm.org
>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>> --
>>>>>
>>>>> Johannes Doerfert
>>>>> Researcher / PhD Student
>>>>>
>>>>> Compiler Design Lab (Prof. Hack)
>>>>> Saarland Informatics Campus, Germany Building E1.3, Room
4.31
>>>>>
>>>>> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de
Fax. +49
>>>>> (0)681 302-3065  :
http://www.cdl.uni-saarland.de/people/doerfert
>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>> --
>>> Hal Finkel
>>> Lead, Compiler Technology and Programming Languages Leadership
>>> Computing Facility Argonne National Laboratory
>>>
>> --
>>
>> Johannes Doerfert
>> Researcher / PhD Student
>>
>> Compiler Design Lab (Prof. Hack)
>> Saarland Informatics Campus, Germany
>> Building E1.3, Room 4.31
>>
>> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de Fax. +49
>> (0)681 302-3065  : http://www.cdl.uni-saarland.de/people/doerfert
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

Tian, Xinmin via llvm-dev

2017-Mar-08 19:24 UTC

head link

[llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 --

I assume the referring case is something like below, right?

 #pragma omp parallel num_threads(n) 
 {
        #pragma omp critical
        { 
             x = x + 1;
        }
  }

If that is the case, the programmer is already writing the code that is not
"serial equivalent".
Our representation for parallelizer is 

%t = @llvm.region.entry()["omp.parallel"(),
"num_threads"(int32 %n)]
 {
        __kmpc_critical()
             x = x + 1;
        __kmpc_end_critical()
  }
@llvm.region.exit()["end.omp.parallel"()]

Xinmin  

-----Original Message-----
From: Hal Finkel [mailto:hfinkel at anl.gov] 
Sent: Wednesday, March 8, 2017 11:11 AM
To: Tian, Xinmin <xinmin.tian at intel.com>; Johannes Doerfert
<doerfert at cs.uni-saarland.de>
Cc: LLVM-Dev <llvm-dev at lists.llvm.org>; Sanjoy Das <sanjoy at
playingwithpointers.com>; Mehdi Amini <mehdi.amini at apple.com>;
Chandler Carruth <chandlerc at gmail.com>; Adve, Vikram Sadanand <vadve
at illinois.edu>; TB Schardl <neboat at mit.edu>; acjacob at us.ibm.com
Subject: Re: [llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 --



On 03/08/2017 01:03 PM, Tian, Xinmin wrote:> The IR-region annotation we proposed is as below, there is no
@llvm.parallel.for.iterator().....  There is no change to loop CFG.
>
>     alloc A[100];
>     %t = call token @llvm.region.entry()["parallel.for"()]
>      for(i = 0; i < 100; i++) {
>         a[i] = f(i);
>     }
>     @llvm.region.exit(%t)() ["end.parallel.for"()]
Yes, although as Sanjoy pointed out, the problem occurs when just using a
parallel region and omp_get_thread_num().

  -Hal
>
> Xinmin
>
> -----Original Message-----
> From: Johannes Doerfert [mailto:doerfert at cs.uni-saarland.de]
> Sent: Wednesday, March 8, 2017 10:44 AM
> To: Tian, Xinmin <xinmin.tian at intel.com>
> Cc: Hal Finkel <hfinkel at anl.gov>; LLVM-Dev <llvm-dev at
lists.llvm.org>;
> Sanjoy Das <sanjoy at playingwithpointers.com>; Mehdi Amini 
> <mehdi.amini at apple.com>; Chandler Carruth <chandlerc at
gmail.com>; Adve,
> Vikram Sadanand <vadve at illinois.edu>; TB Schardl <neboat at
mit.edu>;
> acjacob at us.ibm.com
> Subject: Re:
>
> I don't know who pointed it out first but Mehdi made me aware of it at
CGO. I try to explain it shortly.
>
> Given the following situation (in pseudo code):
>
>    alloc A[100];
>    parallel_for(i = 0; i < 100; i++)
>      A[i] = f(i);
>
>    acc = 1;
>    for(i = 0; i < 100; i++)
>      acc = acc * A[i];
>
> Afaik, with your parallel regions there won't be a CFG loop for the
parallel initialization, right? Instead some intrinsics that annotate the
parallel region and then one initialization. I imagine it somewhat like this:
>
>    token = llvm.parallel.for.start(0, 100, 1)
>    i = llvm.parallel.for.iterator(token)
>    A[i] = f(i)
>    llvm.parallel.for.end(token)
>
> Which would (in serial semantics) allow an optimization to remove the
second loop as it uses uninitialized (undef) values. In other words, it is a
problem if we assume only one A[i] was initialized but the rest was not.
>
> @Mehdi Does that reflect what you explained to some degree?
>
> On 03/08, Tian, Xinmin wrote:
>> ".... the problem Mehdi pointed out regarding the missed
initializations of array elements, did you comment on that one yet?"
>>
>> What is the initializations of array elements question?  I don't
remember this question. Please refresh my memory. Thanks.
>> I thought Mehdi's question is more about what are attributes needed
for these IR-annotation for other LLVM pass to understand and make sense out of
them.
>>
>> Xinmin
>>
>> -----Original Message-----
>> From: Johannes Doerfert [mailto:doerfert at cs.uni-saarland.de]
>> Sent: Wednesday, March 8, 2017 9:58 AM
>> To: Tian, Xinmin <xinmin.tian at intel.com>
>> Cc: Hal Finkel <hfinkel at anl.gov>; LLVM-Dev <llvm-dev at
lists.llvm.org>;
>> Sanjoy Das <sanjoy at playingwithpointers.com>; Mehdi Amini 
>> <mehdi.amini at apple.com>; Chandler Carruth <chandlerc at
gmail.com>;
>> Adve, Vikram Sadanand <vadve at illinois.edu>; TB Schardl 
>> <neboat at mit.edu>; acjacob at us.ibm.com
>> Subject: Re:
>>
>> On 03/08, Tian, Xinmin wrote:
>>> A quick update, we have been looking through all LLVM passes to 
>>> identify the impact of "IR-region annotation", and
interaction
>>> issues with the rest of LoopOpt and scalarOpt, e.g.  interaction 
>>> with vectorization when you have schedule(simd:guided: 64). What
are
>>> the common properties for optimizer to know on IR-region 
>>> annotations. We have our implementation working from O0, O1, O2 to 
>>> O3. So far, the changes we made in the existing LLVM opt passes are
>>> < 200 LOC. ClangFE to executable end-to-end linked our omp
library,
>>> and we also updated our implementation using token/tag on the 
>>> intrinsics based on feedback from Google (David) and Xilinx 
>>> (Hongbin). I am still owe Mehdi some answers.
>> Thanks for the update. I will look into the discussion thread again
soon. I am especially interested in the problem Mehdi pointed out regarding the
missed initializations of array elements, did you comment on that one yet?
>>
>>> One feedback for PIR RFC, putting "fork" into loop body
does not
>>> work for loop vectorizer, it shuts down vectorization, this is the 
>>> same issue as MIT's scheme.
>> Totally true. The vectorizer needs to be thought about the fork-join at
some point. However, you can for now sequentialize and introduce the
llvm.parallel.loop annotations it already knows about. We (Simon Moll and me)
are currently discussing how to teach the RV vectorizer to handle fork-join
parallel regions in loops but also in otherwise straight line code. Once we make
progress we'll let you know!
>>
>> Thanks!
>>    Johannes
>>
>>> -----Original Message-----
>>> From: Hal Finkel [mailto:hfinkel at anl.gov]
>>> Sent: Wednesday, March 8, 2017 5:51 AM
>>> To: Johannes Doerfert <doerfert at cs.uni-saarland.de>;
LLVM-Dev
>>> <llvm-dev at lists.llvm.org>
>>> Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Mehdi
Amini
>>> <mehdi.amini at apple.com>; Chandler Carruth <chandlerc at
gmail.com>;
>>> Adve, Vikram Sadanand <vadve at illinois.edu>; Tian, Xinmin 
>>> <xinmin.tian at intel.com>; TB Schardl <neboat at
mit.edu>;
>>> acjacob at us.ibm.com
>>> Subject: Re:
>>>
>>>
>>> On 03/08/2017 07:36 AM, Johannes Doerfert wrote:
>>>> <mehdi.amini at apple.com>,
>>>> Bcc:
>>>> Subject: Re: [llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0
--
>>>> IR extension
>>>> Reply-To:
>>>> In-Reply-To: <20170224221713.GA931 at arch-linux-jd.home>
>>>>
>>>> Ping.
>>>>
>>>>
>>>>
>>>>
>>>> PS.
>>>>
>>>>     Are there actually people interested in this?
>>>>
>>>>     We will continue working anyway but it might not make sense
to put it
>>>>     on reviews and announce it on the ML if nobody cares.
>>> I certainly care, and will also respond to the RFC, this week or
next.
>>>
>>>    -Hal
>>>
>>>>
>>>>
>>>> On 02/24, Johannes Doerfert via llvm-dev wrote:
>>>>> To (re)start a discussion here we published two patches to
the
>>>>> phabricator:
>>>>>
>>>>> 1) A documentation draft (including LangRef) for the PIR
instructions
>>>>>      and the parallel regions [0].
>>>>>
>>>>> 2) A draft implementation for a ParallelRegionInfo pass
that identifies
>>>>>      and maintains parallel regions in the CFG [1]. It also
serves as a
>>>>>      verifier pass for now.
>>>>>
>>>>> Please take a look and comment the patches either in the 
>>>>> phabricator or here on the list.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> [0] https://reviews.llvm.org/D30353 [1]
>>>>> https://reviews.llvm.org/D30354
>>>>>
>>>>>
>>>>> On 01/28, Johannes Doerfert via llvm-dev wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> This RFC proposes three new LLVM IR instructions to
express
>>>>>> high-level parallel constructs in a simple, low-level
fashion.
>>>>>> For this first stage we prepared two commits that add
the
>>>>>> proposed instructions and a pass to lower them to
obtain sequential IR.
>>>>>> Both patches have be uploaded for review [1, 2]. The
latter patch
>>>>>> is very simple and the former consists of almost only
mechanical changes needed to add new instructions.
>>>>>>
>>>>>> The rest of this email contains (1) an introduction of
the IR
>>>>>> extension
>>>>>> (2) the reasoning behind this approach, (3) a
comparison to other
>>>>>> ideas proposed so far, (4) a validation of the
feasibility and
>>>>>> potential impact, and (5) an outlook on the next steps.
>>>>>>
>>>>>> (1) IR extension:
>>>>>> Parallel IR adds three new terminator instructions that
define
>>>>>> the beginning and the end of parallel regions in the
CFG. A
>>>>>> parallel region is a connected subgraph of the CFG that
is
>>>>>> potentially executed by two threads in parallel. It can
only be
>>>>>> entered with a fork instruction and spreads till a join
instruction is reached.
>>>>>> Therefor parallel regions are
single-entry-multiple-exit regions.
>>>>>> Parallel regions can be nested and if they are, they
form a
>>>>>> parallel region tree similar to the loop tree
maintained by the
>>>>>> natural loop info pass. Each parallel region defines
two independent “sibling”
>>>>>> tasks, namely the forked and continuation task.
>>>>>>
>>>>>> The new instructions are defined as follows:
>>>>>>
>>>>>> 1. fork: marks the beginning of parallel region. Every
fork has two
>>>>>>            successor blocks which represent two
parallel tasks. We call
>>>>>>            these two “sibling” tasks the forked and
continuation tasks.
>>>>>>            Nested forking is supported, meaning that
another fork can be
>>>>>>            reach prior to the join.
>>>>>>
>>>>>> 2. halt: marks the end of a forked task. The
"sibling" continuation block
>>>>>>            (see fork above) is the operand of the halt
terminator. This
>>>>>>            represents the idea of asymmetric
parallelism as introduced by
>>>>>>            [1]. One advantage of asymmetric parallelism
is that sequential
>>>>>>            semantics of the program are clear from its
CFG (ref. [1]).
>>>>>>            Note that the edge from a forked block to a
continuation block
>>>>>>            (the one introduced by the halt) represents
the control flow
>>>>>>            when the two successors of a fork execute
sequentially, not
>>>>>>            when they execute in parallel. In the latter
case there is no
>>>>>>            “control transfer” happening via this edge
but only
>>>>>>            synchronization between the tasks.
>>>>>>
>>>>>> 3. join: marks a synchronization point and the end of a
parallel region.
>>>>>>            Once a join terminator is reached by a
thread, execution stops
>>>>>>            in that thread until all tasks spawned by
that thread finish
>>>>>>            their work, thus reach their respective halt
instruction.  A
>>>>>>            join shall only be reached by the
continuation task of a fork,
>>>>>>            the forked task shall reach a halt with the
continuation as a
>>>>>>            successor.
>>>>>>
>>>>>>
>>>>>> Here is an example of a parallel OpenMP loop and its
idiomatic
>>>>>> lowering to Parallel IR. We set up a wiki [0] with
additional examples.
>>>>>>
>>>>>> #pragma omp parallel
>>>>>> for(int i = 0; i < n; ++i) {
>>>>>>    A[i] = C[i];
>>>>>> }
>>>>>>
>>>>>>
>>>>>> preheader:
>>>>>>     br label %header
>>>>>>
>>>>>> header:
>>>>>>     %i = phi [ i32 0, %preheader ], [ %inc, %latch ]
>>>>>>     %done = icmp ge %i, %n
>>>>>>     br i1 %done, label %exit, label %body
>>>>>>
>>>>>> body:
>>>>>>     fork label %task, label %latch
>>>>>>
>>>>>> task:
>>>>>>     %aptr = getelementptr i32, i32* %A, i32 0, i32 %i
>>>>>>     %aval = load i32* %aptr
>>>>>>     %cptr = getelementptr i32, i32* %C, i32 0, i32 %i
>>>>>>     store i32 %aval, i32* %aptr
>>>>>>     halt label %latch
>>>>>>
>>>>>> latch:
>>>>>>     %inc = add i32, i32 %i, i32 1
>>>>>>     br label %header
>>>>>>
>>>>>> exit:
>>>>>>     join label %afterloop
>>>>>>
>>>>>> afterloop:
>>>>>> ...
>>>>>>
>>>>>>
>>>>>>
>>>>>> (2) Reasoning:
>>>>>> The proposed approach is crafted such that the
semantics of the
>>>>>> parallel program is represented correctly in almost
native,
>>>>>> low-level IR right after front-end and preserved at any
point
>>>>>> till the final lowering to sequential IR or parallel
runtime
>>>>>> library calls. To this end, asymmetric parallelism is
employed, a
>>>>>> concept that uses control flow and the common concept
of
>>>>>> dominance to represent parts of the parallel semantics.
In this
>>>>>> model the parallel tasks do not dominate each other and
only one
>>>>>> parallel task dominates the code after the parallel
region. As a
>>>>>> consequence, various transformations that would break
assumptions
>>>>>> we make about parallel regions cannot happen (see
[3,4]). While
>>>>>> the explicitly modeled control flow together with
dominance
>>>>>> prevents various code motion problems, the use of
terminators
>>>>>> helps to minimize the changes needed to educate passes
about parallel regions. Only a fraction of analysis and transformation passes
deal with terminators explicitly.
>>>>>> Most passes either test for known terminators (like
branches),
>>>>>> rely on dominance information, or work on a basic block
level.
>>>>>> To even further reduce changes to the existing passes,
high-level
>>>>>> concepts are broken down to already available low-level
concepts
>>>>>> instead of introducing new, semantically rich 
>>>>>> instructions/intrinsics (see the last paragraph of [5]
and
>>>>>> section
>>>>>> 4 in the PIR white paper [6] for examples). Finally,
this scheme
>>>>>> allows a pass to simply reason about the sequential
semantics of
>>>>>> a parallel region, transform it back to one if needed
or deemed
>>>>>> beneficial and employ existing tooling solutions to
debug and analyze the code [7].
>>>>>>
>>>>>> (3) Comparison:
>>>>>> The BoF discussion sheet [8] and the recent “[RFC] on
IR-level
>>>>>> region annotations” [9] both list pros and cons of
different
>>>>>> proposed schemes and implementations. We summarize and
comment
>>>>>> the discussion on the ones listen in the recent RFC
here:
>>>>>>     (a) Metadata: It seems a consensus has been reached
that metadata is
>>>>>>           not the solution but only a means to enhance
a different solution.
>>>>>>     (c) One Intrinsic per directive/clause: This
approach basically embeds
>>>>>>           a high-level (parallel) language in LLVM IR
using intrinsics. It
>>>>>>           seems there is little to no support for this
approach at the
>>>>>>           moment.
>>>>>>     (d) Parallel loop/region annotations: Here,
intrinsics enclosing a
>>>>>>           parallel loop/region are used to represent
parallelism.
>>>>>>           High-level knowledge is represented as
attached metadata or in
>>>>>>           separate intrinsics. For more details please
see the original
>>>>>>           RFC [9]. In the discussion several potential
drawbacks have been
>>>>>>           mentioned:
>>>>>>           - The annotations might be too general [10].
>>>>>>           - The IR is not semantically correct (or
ready for optimization)
>>>>>>             after the front-end and needs an additional
“prepare phase for
>>>>>>             pre-privatization" [11].
>>>>>>           - The currently available “potential side
effect for intrinsic
>>>>>>             calls” seem not to suffice for the proposed
intrinsics as they
>>>>>>             do not have "call semantics"
[12].
>>>>>>     (b) Parallel instructions (this approach): The
table in the region RFC
>>>>>>           [9] lists two drawbacks with this approach,
both of which have
>>>>>>           already been called into question [5]. The
first drawback is the
>>>>>>           effort needed to implement this scheme which
is discussed in
>>>>>>           more detail in section (4) of this mail. The
second drawback is
>>>>>>           the need for additional representation of
high-level information
>>>>>>           that is not part of the semantics of the new
fork-join
>>>>>>           instructions. As mentioned above, the choice
to keep the new
>>>>>>           instructions as simple as possible is
deliberate. This parallel
>>>>>>           IR is intended to be extensible, and in
particular, compatible
>>>>>>           with representations of high-level parallel
concepts that might
>>>>>>           be developed in the future. For the time
being, the parallel IR
>>>>>>           is compatible with approach taken today of
lowering high-level
>>>>>>           parallel linguistics, such as reductions and
private memory, to
>>>>>>           existing IR constructs, such as
parallel-runtime calls,
>>>>>>           atomicrmw instructions, and well placed
alloca’s [5,6]. Although
>>>>>>           other extensions to the IR might allow LLVM
to compile these
>>>>>>           higher-level constructs more effectively, we
see no reason the
>>>>>>           parallel IR would conflict with any such
extensions. (On the
>>>>>>           contrary, the parallel IR would seem to help
compiler analyses
>>>>>>           of higher-level parallel constructs by
exposing logical
>>>>>>           parallelism.)
>>>>>>
>>>>>>
>>>>>> (4) Feasibility and Impact:
>>>>>> The Tapir and PIR prototypes demonstrate the
feasibility of this
>>>>>> approach. The Tapir prototype [13] has recently proven
its
>>>>>> robustness as the standard compiler in the MIT class on
parallel
>>>>>> programming. It was implemented in ~ 5k LOC. However,
>1k are
>>>>>> explicit parallel optimization, 1k is used to add new 
>>>>>> instructions (thus mechanical) and 2k are used to lower
the parallelism (basically needed for any scheme).
>>>>>> Only the rest is required to make it work with existing
analysis
>>>>>> and transformation passes. While Tapir added explicit 
>>>>>> optimization passes for parallel regions/loops, the 
>>>>>> representation allows for a variety of classic
optimizations
>>>>>> (CSE, GVN, LICM, loop unrolling,
>>>>>> TRE) to work with little to no modifications. Potential
speedups
>>>>>> compared to a classic “early-outlining” approaches can
also be seen in the Tapir paper [13].
>>>>>> For the PIR prototype [14] we modified only three
transformation
>>>>>> passes
>>>>>> (<20 LOC) [15] before we could run the O3 pipeline
successfully
>>>>>> on a parallel matrix multiplication.
>>>>>>
>>>>>> Together, these prototypes show how little passes
actually
>>>>>> inspect new (or “unknown”) terminators. The default
assumption
>>>>>> passes have to make, namely that control might be
transferred to
>>>>>> any successors at runtime, has, in terms of potential
compiler
>>>>>> transformations, a similar effect as the parallel
semantics we
>>>>>> want to model, namely that control is transferred to
all successors.
>>>>>>
>>>>>>
>>>>>> (5) Outlook:
>>>>>> This first stage will only introduce and test the new 
>>>>>> instructions and the sequentialization pass. Afterwards
we intend
>>>>>> to start additions in different, partially overlapping
but often
>>>>>> orthogonal directions. We do welcome comments as well
as developers for each of them:
>>>>>>    - Analysis and optimization:
>>>>>>       * A “parallel region info” pass to keep track of
parallel regions
>>>>>>         and their nesting. The information can be made
accessible in a
>>>>>>         “parallel region tree” similar to the loop tree
maintained by the
>>>>>>         loop info pass. [stage 1, immediate next goal]
>>>>>>       * Extension of the verifier that allow to check
parallel IR for
>>>>>>         “well-formedness”. [stage 1, immediate next
goal]
>>>>>>       * Documentation of the PIR instructions in the
language reference.
>>>>>>         [stage 1, immediate next goal]
>>>>>>       * A cost analysis for parallel tasks that can be
queried by
>>>>>>         optimizations. The cost model needs to take the
hardware, the
>>>>>>         runtime library and the parallel tasks into
account.
>>>>>>       * Vectorizer enhancements to enable the
vectorization of parallel
>>>>>>       * loops and tasks.
>>>>>>       * Parallelization centric optimizations:
>>>>>>         a) Parallel tasks can be balanced, merged or
split as well as created
>>>>>>            from and lowered to sequential code.
>>>>>>         b) Barriers can be eliminated.
>>>>>>         c) Parallel loops can be statically scheduled
or created from
>>>>>>            parallel recursive calls [13]
>>>>>>       * Analysis to extract high-level information
(reductions, private
>>>>>>         memory, ...) from the low-level representation.
>>>>>>    - Front-end:
>>>>>>       * Lowering of simple OpenMP and Cilk++
annotations to PIR, including
>>>>>>         parallel sections and parallel loops with
limited support for
>>>>>>         clauses (at first) (examples can be found here
[1]). [milestone 1]
>>>>>>       * Generation of PIR code through automatic
parallelization. A
>>>>>>         patched version of Polly exists that emits
parallel loops using PIR instead of
>>>>>>         OpenMP runtime calls or llvm.parallel.loop
metadata. [milestone 1]
>>>>>>       * Representation of more evolved high-level
features like assignment
>>>>>>         of computation units.
>>>>>>    - Back-end:
>>>>>>       * Lowering of PIR regions to calls to the OpenMP
(GOMP) and Cilk++
>>>>>>         runtime library. [milestone 1]
>>>>>>       * A simple parallel library, e.g., based on
pthreads, to be shipped
>>>>>>         with LLVM as a fallback implementation for
parallel regions.
>>>>>>
>>>>>>
>>>>>> Thank you all for your time and hopefully constructive
input on this proposal!
>>>>>>
>>>>>> Cheers,
>>>>>>     Johannes, on behalf of the PIR team
>>>>>>
>>>>>>
>>>>>> Disclaimer:
>>>>>> This RFC, the patches, the wiki, etc. are a joint
effort by Tao B.
>>>>>> Schardl (MIT), Charles E. Leiserson (MIT), Kareem
Ergawy
>>>>>> (Saarland University), Simon Moll (Saarland University)
and
>>>>>> myself. However, ideas and feedback came from many
people,
>>>>>> including the members of the LLVM-HPC IR Extensions
working group
>>>>>> (Hal Finkel, Xinmin Tian, ...), the participants in the
BoF at
>>>>>> the US Developers’ meeting, everybody that commented on
the BoF
>>>>>> discussion sheet [16] and the recent RFC on IR-level
region
>>>>>> annotations [9] (Mehdi Amini, Sanjoy Das, Daniel
Berlin, ...).
>>>>>>    
>>>>>>
>>>>>>
>>>>>> [0] https://github.com/Parallel-IR/llvm-pir/wiki
>>>>>> [1] https://reviews.llvm.org/D29250 [2]
>>>>>> https://reviews.llvm.org/D29251 [3] 
>>>>>>
http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.htm
>>>>>> l
>>>>>> [4]
>>>>>>
http://lists.llvm.org/pipermail/llvm-dev/2015-March/083348.html
>>>>>> [5]
>>>>>>
http://lists.llvm.org/pipermail/llvm-dev/2017-January/109264.htm
>>>>>> l
>>>>>> [6]
>>>>>>
http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.
>>>>>> pd f [7]
http://supertech.csail.mit.edu/papers/spbags.pdf  &
>>>>>> www.cse.wustl.edu/~angelee/papers/cilkprof.pdf
>>>>>> [8] https://goo.gl/Blp2Xr
>>>>>> [9]
>>>>>>
http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.htm
>>>>>> l
>>>>>> [10]
>>>>>>
http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.htm
>>>>>> l
>>>>>> [11]
>>>>>>
http://lists.llvm.org/pipermail/llvm-dev/2017-January/109377.htm
>>>>>> l
>>>>>> [12]
>>>>>>
http://lists.llvm.org/pipermail/llvm-dev/2017-January/109351.htm
>>>>>> l [13] http://wsmoses.com/tapir.pdf [14] 
>>>>>>
https://github.com/jdoerfert/llvm-pir/tree/feature/fork-join
>>>>>> [15]
>>>>>>
https://github.com/jdoerfert/llvm-pir/commit/854259881d24d71f9f1
>>>>>> f1
>>>>>> 7e
>>>>>> 52547758c7be0618a
>>>>>> [16] https://goo.gl/wKps3c
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Johannes Doerfert
>>>>>> Researcher / PhD Student
>>>>>>
>>>>>> Compiler Design Lab (Prof. Hack) Saarland Informatics
Campus,
>>>>>> Germany Building E1.3, Room 4.31
>>>>>>
>>>>>> Tel. +49 (0)681 302-57521 : doerfert at
cs.uni-saarland.de Fax. +49
>>>>>> (0)681 302-3065  :
>>>>>> http://www.cdl.uni-saarland.de/people/doerfert
>>>>>
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> llvm-dev at lists.llvm.org
>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>> --
>>>>>
>>>>> Johannes Doerfert
>>>>> Researcher / PhD Student
>>>>>
>>>>> Compiler Design Lab (Prof. Hack)
>>>>> Saarland Informatics Campus, Germany Building E1.3, Room
4.31
>>>>>
>>>>> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de
Fax. +49
>>>>> (0)681 302-3065  :
http://www.cdl.uni-saarland.de/people/doerfert
>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>> --
>>> Hal Finkel
>>> Lead, Compiler Technology and Programming Languages Leadership 
>>> Computing Facility Argonne National Laboratory
>>>
>> --
>>
>> Johannes Doerfert
>> Researcher / PhD Student
>>
>> Compiler Design Lab (Prof. Hack)
>> Saarland Informatics Campus, Germany
>> Building E1.3, Room 4.31
>>
>> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de Fax. +49
>> (0)681 302-3065  : http://www.cdl.uni-saarland.de/people/doerfert
--
Hal Finkel
Lead, Compiler Technology and Programming Languages Leadership Computing
Facility Argonne National Laboratory

Seemingly Similar Threads

Search for more reasonably related threads

llvm dev - Mar 2017 - [RFC][PIR] Parallel LLVM IR -- Stage 0 --

[llvm-dev] (no subject)

[llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 --

[llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 --

Seemingly Similar Threads