thr3ads.net - llvm dev - [llvm-dev] FW: [RFC] Abstract Parallel IR Optimizations [Jun 2018]

If this information is useful, please help other people find it:
Share via:

Adve, Vikram Sadanand via llvm-dev

2018-Jun-07 05:52 UTC

[llvm-dev] FW: [RFC] Abstract Parallel IR Optimizations

Johannes,

I definitely agree that LLVM needs better support for optimizing parallel
programs.  Comments inline:

From: Johannes Doerfert <jdoerfert at anl.gov>
Date: Wednesday, June 6, 2018 at 10:52 AM
To: LLVM-Dev <llvm-dev at lists.llvm.org>
Cc: Hal Finkel <hfinkel at anl.gov>, TB Schardl <neboat at mit.edu>,
"Adve, Vikram Sadanand" <vadve at illinois.edu>, "Tian,
Xinmin" <xinmin.tian at intel.com>, "llvmpar at
lists.cs.illinois.edu" <llvmpar at lists.cs.illinois.edu>
Subject: [RFC] Abstract Parallel IR Optimizations

This is an RFC to add analyses and transformation passes into LLVM to
optimize programs based on an abstract notion of a parallel region.

  == this is _not_ a proposal to add a new encoding of parallelism =
Actually, I suspect that your three “abstract” interfaces below
(ParallelRegionInfo, ParallelCommunicationInfo, ParallelIR/Builder) *are* a
parallel IR definition, and what you call the “implementations” essentially
encapsulate *target-specific* variations.  IOW, this looks like a
perhaps-very-preliminary parallel IR, not some generic abstract interface.  I
can only suspect this because I don’t have any information about what info or
operations these interfaces provide.

In fact, I think it *is* a good idea to have a proper parallel IR to support the
kinds of optimizations you’re describing.  I just don’t think you can define a
universal parallel IR for all the parallel languages and targets that LLVM may
want to support.  At a minimum, I think we need to experiment more before
committing to one (*if* one proved enough).

Instead of defining a specific parallel IR or interface, as you’re trying to do,
it would be much better to provide IR-independent hooks in LLVM so that
different parallel IRs can sit “above” the LLVM representation and support these
and other optimizations.  Your IR (or IR implementations, whatever you call
them) could be layered in this way.

The earlier RFC that Hal Finkel and Xinmin Tian circulated were for such a set
of hooks.  In fact, these hooks also work on regions, but are much more limited
and are *not* meant to support parallel passes themselves: those are left to any
IR built above them.

As you know, we (Hal, Xinmin, TB Schardl, George Stelle, Hashim Sharif, I, and a
few others) are trying to refine that proposal to provide examples of multiple
IRs the hooks can support, and to make a clearer argument for the soundness and
correctness impact of the hooks on LLVM passes.  You’ve been invited to our
conference calls and to edit the Google Doc.  It would be valuable if you could
add examples of how your IR information could be connected to LLVM via these
hooks.

More specific questions below.

We currently perform poorly when it comes to optimizations for parallel
codes. In fact, parallelizing your loops might actually prevent various
optimizations that would have been applied otherwise. One solution to
this problem is to teach the compiler about the semantics of the used
parallel representation. While this sounds tedious at first, it turns
out that we can perform key optimizations with reasonable implementation
effort (and thereby also reasonable maintenance costs). However, we have
various parallel representations that are already in use (KMPC,
GOMP, CILK runtime, ...) or proposed (Tapir, IntelPIR, ...).

Our proposal seeks to introduce parallelism specific optimizations for
multiple representations while minimizing the implementation overhead.
This is done through an abstract notion of a parallel region which hides
the actual representation from the analysis and optimization passes.

In general, for analysis and optimization passes to work with multiple
representations, you’d need to have a pretty rich abstract notion of a parallel
region.  I suspect that your design is pretty OpenMP-specific.  Even within that
context, it would have to capture quite a rich set of parallel constructs.  Can
you describe what exactly is your abstract notion of a parallel region, and how
is it different from a concrete representation?

In
the schemata below, our current five optimizations (described in detail
here [0]) are shown on the left, the abstract parallel IR interface is
is in the middle, and the representation specific implementations is on
the right.

         Optimization          (A)nalysis/(T)ransformation         Impl.
   ---------------------------------------------------------------------------
     CodePlacementOpt \  /---> ParallelRegionInfo (A) ---------|->
KMPCImpl (A)
       RegionExpander -\ |                                     |   GOMPImpl (A)
   AttributeAnnotator -|-|---> ParallelCommunicationInfo (A) --/   ...
   BarrierElimination -/ |
VariablePrivatization /  \---> ParallelIR/Builder (T) ----------->
KMPCImpl (T)

I wasn’t able to understand parts of this figure.  What info do
ParallelRegionInfo() and ParallelCommunicationInfo() provide?   What operations
does ParallelIR/Builder provide?

In our setting, a parallel region can be an outlined function called
through a runtime library but also a fork-join/attach-reattach region
embedded in an otherwise sequential code. The new optimizations will
provide parallelism specific optimizations to all of them (if
applicable). There are various reasons why we believe this is a
worthwhile effort that belongs into the LLVM codebase, including:

Before adding something this broad into the LLVM code base, I think we need to
understand a whole slew of things about it, starting with the questions above.

  1.  We improve the performance of parallel programs, today.

There are a number of important parallel languages and both language-specific
and target-specific parallel optimizations.  Just because these five
optimizations improve OpenMP program performance isn’t enough to justify adding
a new parallel IR (or interface) to LLVM, without knowing whether other
languages, targets and optimizations could be correctly and effectively
implemented using this approach.

  2) It serves as a meaningful baseline for future discussions on
     (optimized) parallel representations.

We can’t just add a new parallel IR design (abstract or concrete) to mainline
just to serve as a baseline for future discussions.

  3) It allows to determine the pros and cons of the different schemes
     when it comes to actual optimizations and inputs.

Same problem.

  4) It helps to identify problems that might arise once we start to
     transform parallel programs but _before_ we commit to a specific
     representation.

I suspect you *are* committing to a specific representation, although we can’t
be sure until you provide more details.

Our prototypes for the OpenMP KMPC library (used by clang) already shows
significant speedups for various benchmarks [0]. It also exposed a (to
me) prior unknown problem between restrict/noalias pointers and
(potential) barriers (see Section 3 in [0]).

We are currently in the process of cleaning the code, extending the
support for OpenMP constructs and adding a second implementation for a
embedded parallel regions. Though, a first horizontal prototype
implementation is already available for review [1].

Inputs of any kind are welcome and reviewers are needed!

Cheers,
  Johannes

[0] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf
[1] https://reviews.llvm.org/D47300

--

Johannes Doerfert
PhD Student / Researcher

Compiler Design Lab (Professor Hack) / Argonne National Laboratory
Saarland Informatics Campus, Germany / Lemont, IL 60439, USA
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de<mailto:doerfert at
cs.uni-saarland.de> / jdoerfert at anl.gov<mailto:jdoerfert at anl.gov>
Fax. +49 (0)681 302-3065  : http://www.cdl.uni-saarland.de/people/doerfert

  *

—Vikram Adve

// Interim Head, Department of Computer Science
// Donald B. Gillies Professor of Computer Science
// University of Illinois at Urbana-Champaign
// Admin Assistant: Amanda Foley - ajfoley2 at illinois.edu<mailto:ajfoley2
at illinois.edu>
// Google Hangouts: vikram.s.adve at gmail.com<mailto:vikram.s.adve at
gmail.com> || Skype: vikramsadve
// Research page:
http://vikram.cs.illinois.edu<http://vikram.cs.illinois.edu/>

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180607/d229b195/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 228 bytes
Desc: signature.asc
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180607/d229b195/attachment-0001.sig>

Johannes Doerfert via llvm-dev

2018-Jun-07 11:19 UTC

head link

[llvm-dev] FW: [RFC] Abstract Parallel IR Optimizations

Hi Vikram, 

Thanks for the quick reply. I put some comments below.

On 06/07, Adve, Vikram Sadanand wrote:> Johannes,
> 
> I definitely agree that LLVM needs better support for optimizing parallel
programs.  Comments inline:
> 
>
> > From: Johannes Doerfert <jdoerfert at anl.gov>
> > Date: Wednesday, June 6, 2018 at 10:52 AM
> > To: LLVM-Dev <llvm-dev at lists.llvm.org>
> > Cc: Hal Finkel <hfinkel at anl.gov>, TB Schardl <neboat at
mit.edu>, "Adve, Vikram Sadanand" <vadve at illinois.edu>,
"Tian, Xinmin" <xinmin.tian at intel.com>, "llvmpar at
lists.cs.illinois.edu" <llvmpar at lists.cs.illinois.edu>
> > Subject: [RFC] Abstract Parallel IR Optimizations
> > 
> > This is an RFC to add analyses and transformation passes into LLVM to
> > optimize programs based on an abstract notion of a parallel region.
> > 
> >   == this is _not_ a proposal to add a new encoding of parallelism
=>
> Actually, I suspect that your three “abstract” interfaces below
> (ParallelRegionInfo, ParallelCommunicationInfo, ParallelIR/Builder)
> *are* a parallel IR definition, and what you call the
> “implementations” essentially encapsulate *target-specific*
> variations. 
It is not an IR (encoding) in the sense that there are no new
instructions, intrinsics, or anything else that is materialized. The
interface provides a unified view of existing parallel representations
and it is as general as possible.

I am unsure what you mean by target-specific here but our
implementations are the actual parallel IR representations as we
discussed them several times on other occasions.
> IOW, this looks like a perhaps-very-preliminary parallel IR, not some
> generic abstract interface.  I can only suspect this because I don’t
> have any information about what info or operations these interfaces
> provide.
I disagree. Please check out the operations and info for the interface
here
 
https://reviews.llvm.org/differential/changeset/?ref=1086909&whitespace=ignore-most
and for the builder here
 
https://reviews.llvm.org/differential/changeset/?ref=1086915&whitespace=ignore-most
so we talk about the same thing.
> In fact, I think it *is* a good idea to have a proper parallel IR to
> support the kinds of optimizations you’re describing.  I just don’t
> think you can define a universal parallel IR for all the parallel
> languages and targets that LLVM may want to support.  At a minimum, I
> think we need to experiment more before committing to one (*if* one
> proved enough).
I think you misunderstood our intention here. This patch is in support
of such experiments as it allows to apply the optimizations we have and
might come up with to different parallel IR representations. Not all
optimizations will be applicable to all representations but I don't see
a reason why this is a problem. I also don't think this is a universal
parallel IR of sorts but it is a way to extract critical information
necessary for optimizations out of parallel representations.

> Instead of defining a specific parallel IR or interface, as you’re
> trying to do, it would be much better to provide IR-independent hooks
> in LLVM so that different parallel IRs can sit “above” the LLVM
> representation and support these and other optimizations.  Your IR (or
> IR implementations, whatever you call them) could be layered in this
> way.
I do not understand what you mean here. The "implementations" (as I
called them so recklessly) lift _some_ information from the actual IR
representation to the high level abstract parallel IR interface level
for consummation by the optimization passes. Different parallel IRs can
come with their own implementation and provide more or less information
to enable optimizations.
> The earlier RFC that Hal Finkel and Xinmin Tian circulated were for
> such a set of hooks.  In fact, these hooks also work on regions, but
> are much more limited and are *not* meant to support parallel passes
> themselves: those are left to any IR built above them.
>
> As you know, we (Hal, Xinmin, TB Schardl, George Stelle, Hashim
> Sharif, I, and a few others) are trying to refine that proposal to
> provide examples of multiple IRs the hooks can support, and to make a
> clearer argument for the soundness and correctness impact of the hooks
> on LLVM passes.  You’ve been invited to our conference calls and to
> edit the Google Doc.  It would be valuable if you could add examples
> of how your IR information could be connected to LLVM via these hooks.
I'm still confused about the hooks.
> More specific questions below.
> 
> 
> > We currently perform poorly when it comes to optimizations for
parallel
> > codes. In fact, parallelizing your loops might actually prevent
various
> > optimizations that would have been applied otherwise. One solution to
> > this problem is to teach the compiler about the semantics of the used
> > parallel representation. While this sounds tedious at first, it turns
> > out that we can perform key optimizations with reasonable
implementation
> > effort (and thereby also reasonable maintenance costs). However, we
have
> > various parallel representations that are already in use (KMPC,
> > GOMP, CILK runtime, ...) or proposed (Tapir, IntelPIR, ...).
> > 
> > Our proposal seeks to introduce parallelism specific optimizations for
> > multiple representations while minimizing the implementation overhead.
> > This is done through an abstract notion of a parallel region which
hides
> > the actual representation from the analysis and optimization passes.
> 
> 
> In general, for analysis and optimization passes to work with multiple
> representations, you’d need to have a pretty rich abstract notion of a
> parallel region.  I suspect that your design is pretty
> OpenMP-specific.  Even within that context, it would have to capture
> quite a rich set of parallel constructs.  Can you describe what
> exactly is your abstract notion of a parallel region, and how is it
> different from a concrete representation?
Good question. The interface implementations will abstract away of the
representation details that the optimization should not worry about such
as:

  - runtime library calls
  - variable passing via structures
  - intrinsic calls to encode information
  - ...

If you look at our variable privatization pass (which was not included
in this commit) it does not need to know if variable capturing is done
through a runtime library call (OpenMP), if captured variables are
handed to a special intrinsic to mark them (IntelPIR), or if capturing
is just a def-use chain crossing the parallel region boundaries (Tapir).
The pass requires to know the mapping (variable in sequential code ->
variable in parallel code) as well as the start/end instruction of the
parallel region. Then it can test for conditions that would allow to
privatize the variable, pass it by-value instead of by-reference, etc.

Similarly, other passes do not need to know the details of the encoding.
You can probably see that for barrier elimination but also parallel
region expansion is actually only looking at the surrounding code to
determine if expansion (and region merging) might make sense.
> > In the schemata below, our current five optimizations (described in
> > detail here [0]) are shown on the left, the abstract parallel IR
> > interface is is in the middle, and the representation specific
> > implementations is on the right.
> > 
> >          Optimization          (A)nalysis/(T)ransformation        
Impl.
> >   
---------------------------------------------------------------------------
> >      CodePlacementOpt \  /---> ParallelRegionInfo (A)
---------|-> KMPCImpl (A)
> >        RegionExpander -\ |                                     |  
GOMPImpl (A)
> >    AttributeAnnotator -|-|---> ParallelCommunicationInfo (A) --/  
...
> >    BarrierElimination -/ |
> > VariablePrivatization /  \---> ParallelIR/Builder (T)
-----------> KMPCImpl (T)
> 
> 
> I wasn’t able to understand parts of this figure.  What info do
> ParallelRegionInfo() and ParallelCommunicationInfo() provide?   What
> operations does ParallelIR/Builder provide?
So far not much. ParallelRegionInfo just collects and manages the
parallel regions. These know about their extend (start/end), provide a
way to visit the blocks and instructions and have some helper functions
to reason about (potential) barriers. Nothing is specific to OpenMP.
ParallelCommunicationInfo provides the mapping of variables outside the
parallel region to the ones inside and tells you about the way they are
used (e.g., the pointer is actually a read-only container). Finally, the
builder is the only way to modify the parallel regions (as we don't know
the actual representation in the optimization). Currently it can only
add/remove attributes but my local version can create parallel regions,
build barriers, etc.

For the current code check the two links from above.
 > > In our setting, a parallel region can be an outlined function called
> > through a runtime library but also a fork-join/attach-reattach region
> > embedded in an otherwise sequential code. The new optimizations will
> > provide parallelism specific optimizations to all of them (if
> > applicable). There are various reasons why we believe this is a
> > worthwhile effort that belongs into the LLVM codebase, including:
> 
> 
> Before adding something this broad into the LLVM code base, I think we
> need to understand a whole slew of things about it, starting with the
> questions above.
I'm not sure about broad. Do you mean because of the commit size? Most
of it is boilerplate and it is a fully working horizontal prototype.
Actually, there are only 1091 lines (all types included) added to
llvm/lib/.
> >   1.  We improve the performance of parallel programs, today.
> 
> There are a number of important parallel languages and both
> language-specific and target-specific parallel optimizations.  Just
> because these five optimizations improve OpenMP program performance
> isn’t enough to justify adding a new parallel IR (or interface) to
> LLVM, without knowing whether other languages, targets and
> optimizations could be correctly and effectively implemented using
> this approach.
This will not prevent us from testing other languages and optimizations,
actually it should enable us to do so. If it somehow turns out that this
abstraction makes it impossible to implement an optimization correctly
and effectively and that optimization is very important to us we can
still get rid of the abstraction and keep the rest of the code. The
other optimization we have will still work even without the abstraction.
The same holds true if we decide on a parallel IR representation and
do not need to support multiple ones anymore.
> >    2) It serves as a meaningful baseline for future discussions on
> >    (optimized) parallel representations.
> 
> We can’t just add a new parallel IR design (abstract or concrete) to
> mainline just to serve as a baseline for future discussions.
Not _just_! Adding something beneficial to mainline that makes sense (at
least for me but let's see what others think) does not strike me as a
problem. If that would happen, we would finally have a baseline for
parallel optimizations instead of comparing unoptimized programs to
optimized ones encoded in the new IR we came up with.
> >   3) It allows to determine the pros and cons of the different schemes
> >      when it comes to actual optimizations and inputs.
> 
> Same problem.
This is not the same problem. When you design your IR and you have to
make trade offs you would probably perform experiments and decide.
However, for that you need a test bed to perform the experiments in, an
implementation of both choices, etc. If we combine optimizations we
think are meaningful with representations we think are reasonable we can
actually learn if and how they impact each other for benchmarks we care
about.
> >   4) It helps to identify problems that might arise once we start to
> >      transform parallel programs but _before_ we commit to a specific
> >      representation.
> 
> I suspect you *are* committing to a specific representation, although
> we can’t be sure until you provide more details.
Please check out the links to the interface codes above.
> > Our prototypes for the OpenMP KMPC library (used by clang) already
shows
> > significant speedups for various benchmarks [0]. It also exposed a (to
> > me) prior unknown problem between restrict/noalias pointers and
> > (potential) barriers (see Section 3 in [0]).
> > 
> > We are currently in the process of cleaning the code, extending the
> > support for OpenMP constructs and adding a second implementation for a
> > embedded parallel regions. Though, a first horizontal prototype
> > implementation is already available for review [1].
> > 
> > Inputs of any kind are welcome and reviewers are needed!
> > 
> > Cheers,
> >   Johannes
> 
> 
> [0] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf
> [1] https://reviews.llvm.org/D47300
> 
> --
> 
> Johannes Doerfert
> PhD Student / Researcher
> 
> Compiler Design Lab (Professor Hack) / Argonne National Laboratory
> Saarland Informatics Campus, Germany / Lemont, IL 60439, USA
> Building E1.3, Room 4.31
> 
> Tel. +49 (0)681 302-57521 : doerfert at
cs.uni-saarland.de<mailto:doerfert at cs.uni-saarland.de> / jdoerfert at
anl.gov<mailto:jdoerfert at anl.gov>
> Fax. +49 (0)681 302-3065  : http://www.cdl.uni-saarland.de/people/doerfert
> 
> 
> 
>   *
> 
> —Vikram Adve
> 
> // Interim Head, Department of Computer Science
> // Donald B. Gillies Professor of Computer Science
> // University of Illinois at Urbana-Champaign
> // Admin Assistant: Amanda Foley - ajfoley2 at
illinois.edu<mailto:ajfoley2 at illinois.edu>
> // Google Hangouts: vikram.s.adve at gmail.com<mailto:vikram.s.adve at
gmail.com> || Skype: vikramsadve
> // Research page:
http://vikram.cs.illinois.edu<http://vikram.cs.illinois.edu/>
> 
> 


-- 

Johannes Doerfert
PhD Student / Researcher

Compiler Design Lab (Professor Hack) / Argonne National Laboratory
Saarland Informatics Campus, Germany / Lemont, IL 60439, USA
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de / jdoerfert at
anl.gov
Fax. +49 (0)681 302-3065  : http://www.cdl.uni-saarland.de/people/doerfert
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 228 bytes
Desc: Digital signature
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180607/31d9fc5a/attachment.sig>

llvm dev - Jun 2018 - FW: [RFC] Abstract Parallel IR Optimizations

[llvm-dev] FW: [RFC] Abstract Parallel IR Optimizations

[llvm-dev] FW: [RFC] Abstract Parallel IR Optimizations