Michael Kruse via llvm-dev
2020-Jan-15 04:39 UTC
[llvm-dev] [RFC] Writing loop transformations on the right representation is more productive
On Sun, Jan 12, 2020 at 8:07 PM Chris Lattner <clattner at nondot.org> wrote:

>> The central idea is to use a modifiable loop tree -- similar to LoopInfo -- as the primary representation. LLVM-IR is converted to a loop tree, then optimized, and finally LLVM-IR is generated again for subtrees that are considered profitable. This is not a new concept; it has already been used in compilers such as IBM XL Fortran (called ASTI [4]) and Silicon Graphics/Open64 (called LNO [10]), and in research such as the Value State Dependence Graph [3].
>
> Ignoring the details of its representation, this is also conceptually how Polly works: code is lifted into its representation, transformed, then lowered back down.

Indeed, I tried to improve on Polly's internal representation, and on the issue that Polly can only represent a subset of LLVM-IR code.

> Overall, I think that this discussion would be easier to process if we broke it into a few pieces. There seems to be consensus that LLVM IR (as is) is not the right representation for aggressive loop transformations. If we don’t have consensus on this, then I’d make sure to start there.
>
> Once that is established, there is a question of “what is the right representation to use”? This question has two subcomponents: what data structure should we use, and what is the IR within it.
>
> If you propose introducing a brand new data structure, please expect me to push back on that pretty hard.

Which I think is a good thing, since I also do not want too many data structures being more-or-less well maintained. But I also think there is a good argument for a loop-centric data structure.

> This is a perfect application of MLIR, and using it provides a lot of value (including amazing testing tools, round-tripping, location tracking, etc) which would otherwise have to be reinvented, and doesn’t dictate the IR structure otherwise.
> MLIR absolutely supports nested loop structures etc, as is seen in the affine dialect.
>
> The MLIR community also is highly invested in HPC-style transformations on this, and a lot of thought has gone into it. You can learn more about this in the slides and videos from the MLIR open design meetings <https://docs.google.com/document/d/1y_9f1AbfgcoVdJh4_aM6-BaSHvrHl8zuA5G4jv_94K8/edit#heading=h.cite1kolful9>.

I have been following the development of MLIR.

> Once you achieve consensus on data structure, there is the question of what IR to use within it. I would recommend starting with some combination of “existing LLVM IR operations + high level control flow representation”, e.g. parallel and affine loops. The key here is that you need to always be able to lower in a simple and predictable way to LLVM IR (this is one of the things that classic polyhedral systems did suboptimally, making it difficult to reason about the cost model of various transformations), and this is a natural incremental starting point anyway. Over time, more high level concepts can be gradually introduced. FYI, MLIR already has a reasonable LLVM dialect <https://mlir.llvm.org/docs/Dialects/LLVM/> and can generate LLVM IR from it, so we’d just need an “LLVM IR -> MLIR LLVM dialect” conversion, which should be straightforward to build.

Adding an LLVM-IR -> MLIR -> LLVM-IR round-trip would at the beginning just introduce compile-time overhead and what Renato described as brittleness. I fear this hurts adoption.

> Once you have the data structure and the dialect within it decided, you have the set of transformations. Again, you’ve given a lot of thought to this, and that all sounds great to me, but the priorities can be driven by whoever wants to contribute and what concrete problems they have to solve.
> Once the infra for “raising to this representation and lowering back down” is figured out, we can open the box of having clang and other front ends directly generate it.

This suggestion would also apply to VPlan. Ignoring that work on VPlan started before MLIR, would you have suggested implementing VPlan on MLIR as well? Would you maybe even advise retargeting VPlan to MLIR now?

>> Q: Relation to MLIR?
>> A: MLIR is more similar to LLVM-IR than a loop hierarchy.
>
> This is not true, MLIR is great for dialects that want to model loop hierarchies naturally, this is a major focus of the affine dialect <https://mlir.llvm.org/docs/Dialects/Affine/> (e.g. see affine.for on that page). MLIR is not limited to affine loops, that is just a choice made by the affine dialect - the loop dialect <https://mlir.llvm.org/docs/Dialects/LoopOps/> has more general constructs that are less developed.

This is definitely a subjective question. I think that MLIR is closer to LLVM-IR in how it is processed. Both have a sequence of passes running over a single source of truth. Both allow walking the entire structure from every instruction/operation/block. Analyses are on function or module level. Both have CFGs (I think for a certain kind of transformations it is an advantage that control flow is handled implicitly).

>> For instance, it also does not feature cheap copies.
>
> I’m not sure what this means.

The possibility to make local changes speculatively without copying the entire data structure. IMHO this is a central idea that allows applying a transformation speculatively to pass it to a legality check and cost heuristic without committing to it. As a consequence, passes do not need to implement these in a transformation-specific manner, drastically reducing the burden of implementation.

For instance, more loop transformations are feasible if instructions are moved into the innermost loops.
With speculative transformations, we can canonicalize the representation to sink computations into loops -- the opposite of what LICM does -- and then see whether a transformation can be applied. If not, the speculative representation is discarded without having an effect on the original representation (and without needing to hoist those computations again). Because the MLIR classes have many references to related objects (such as pointers to parents and use-def chains), I don't think this is feasible to implement on top of MLIR.

>> An advantage is that loops and multi-dimensional arrays can be represented in the language without the need of being rediscovered, but have to be inserted by a front-end.
>
> This is correct, but I don’t see how this helps if your focus is raising code that has already been lowered to LLVM IR, e.g. by Clang or some other frontend that generates LLVM IR today.

Indeed, I would hope that LLVM-IR can preserve multi-dimensional array accesses in some fashion as well (https://lists.llvm.org/pipermail/llvm-dev/2019-July/134063.html). However, currently MLIR has the advantage of being able to represent them.

>> That is, if Clang was generating MLIR, loops and arrays still have to be rediscovered.
>
> This isn’t true, it would be perfectly sensible to lower C control flow structures directly to LLVM. The primary concern are things like unstructured switches (think Duff’s device) and unstructured gotos, but these occur rarely: they need to be handled correctly, but can do so with a traditionally lowered CFG and other “best effort” attempts to raise them.

Moreover, syntactical loop structures are not a reliable indicator that there is a loop. Often enough, do, for, and while are used for syntactical reasons (`do { } while(0)`). Yes, you could eliminate them if a non-loop is detected, but handling break, continue, etc. correctly is a lot of effort.
Another case are coroutines that are lowered with gotos into loops, unless you think loop optimizers should handle coroutines directly. On the other side, natural loop detection on CFGs is quite mature (with a remaining issue of irreducible loops that might appear, but can also be eliminated again). As a plus, optimization depends less on how the source code is written.

Thanks for the productive discussion,
Michael

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200114/2d6cceb8/attachment-0001.html>
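The natural-loop detection mentioned above can be sketched in a few lines. This is a minimal illustration of the classic construction, not LLVM's actual LoopInfo analysis: given a back edge latch->header, the loop body is collected by walking predecessors backward from the latch (block names and the toy `CFG` type are invented here; the sketch assumes a reducible CFG where the header dominates the latch).

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy CFG as adjacency (successor) lists keyed by block name.
using CFG = std::map<std::string, std::vector<std::string>>;

// Collect the natural loop of the back edge latch->header: start from the
// latch and add CFG predecessors transitively, stopping at the header.
std::set<std::string> naturalLoop(const CFG &succ, const std::string &header,
                                  const std::string &latch) {
  // Invert the successor lists into predecessor lists.
  std::map<std::string, std::vector<std::string>> pred;
  for (const auto &node : succ)
    for (const auto &s : node.second)
      pred[s].push_back(node.first);

  std::set<std::string> body{header, latch};
  std::vector<std::string> work{latch};
  while (!work.empty()) {
    std::string n = work.back();
    work.pop_back();
    if (n == header)
      continue; // do not walk past the loop header
    for (const auto &p : pred[n])
      if (body.insert(p).second)
        work.push_back(p);
  }
  return body;
}
```

For the CFG entry -> header -> body -> latch -> header (with header also branching to exit), this yields exactly {header, body, latch}, independent of how the source loop was spelled.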
Chris Lattner via llvm-dev
2020-Jan-16 06:27 UTC
[llvm-dev] [RFC] Writing loop transformations on the right representation is more productive
On Jan 14, 2020, at 8:39 PM, Michael Kruse <llvmdev at meinersbur.de> wrote:

>> Once that is established, there is a question of “what is the right representation to use”? This question has two subcomponents: what data structure should we use, and what is the IR within it.
>>
>> If you propose introducing a brand new data structure, please expect me to push back on that pretty hard.
>
> Which I think is a good thing since I also do not want too many data structures being more-or-less well maintained. But I also think there is a good argument for a loop-centric data structure.

Agreed, I think it is incredibly important for a first class loop optimizer to have first class structured loops, parallel loops etc.

>> Once you achieve consensus on data structure, there is the question of what IR to use within it. I would recommend starting with some combination of “existing LLVM IR operations + high level control flow representation”, e.g. parallel and affine loops. The key here is that you need to always be able to lower in a simple and predictable way to LLVM IR (this is one of the things that classic polyhedral systems did suboptimally, making it difficult to reason about the cost model of various transformations), and this is a natural incremental starting point anyway. Over time, more high level concepts can be gradually introduced. FYI, MLIR already has a reasonable LLVM dialect <https://mlir.llvm.org/docs/Dialects/LLVM/> and can generate LLVM IR from it, so we’d just need an “LLVM IR -> MLIR LLVM dialect” conversion, which should be straightforward to build.
>
> Adding an LLVM-IR -> MLIR -> LLVM-IR round-trip would at the beginning just introduce compile-time overhead and what Renato described as brittleness. I fear this hurts adoption.

Isn’t this true of *any* higher level IR? Unless I’m missing something big, this seems inherent to your proposal.

>> Once you have the data structure and the dialect within it decided, you have the set of transformations.
>> Again, you’ve given a lot of thought to this, and that all sounds great to me, but the priorities can be driven by whoever wants to contribute and what concrete problems they have to solve.
>>
>> Once the infra for “raising to this representation and lowering back down” is figured out, we can open the box of having clang and other front ends directly generate it.
>
> This suggestion would also apply to VPlan. Ignoring that work on VPlan started before MLIR, would you have suggested implementing VPlan on MLIR as well? Would you maybe even advise retargeting VPlan to MLIR now?

I don’t know enough to say: the tradeoffs depend a lot on where VPlan is, the challenges it faces, etc. I don’t know much about VPlan or the engineering priorities behind it.

Here’s an observation though: if you ignore the engineering expense, it would clearly make sense to reimplement the mid-level LLVM optimizers on top of MLIR and replace include/llvm/IR with a dialect definition in MLIR instead. MLIR as an IR is strictly equal to or better than the LLVM IR data structures in all ways that I’m aware of. In addition to representational flexibility, MLIR allows (and provides) a multithreaded pass manager (function passes run in parallel), has a better representation of PHI nodes, allows better terminators (eliminating the need for the really ugly/unfortunate landingpad <http://llvm.org/docs/LangRef.html#landingpad-instruction>, catchpad <http://llvm.org/docs/LangRef.html#catchpad-instruction> etc hacks), has a better representation for “operands that must be constants” (immarg etc), provides a better representation for location information (important for debugging optimized code and diagnostic emission from the optimizer), and better testing tools (by building on the better location info).
The additional representational flexibility would allow a much more flexible compiler design - one where you could do progressive lowering of high level loops and OpenMP, separate out ABI lowering from Clang IRGen, etc. I’m very fond of LLVM IR obviously, but a lot has been learned in the nearly 20 years <https://github.com/llvm/llvm-project/commit/2f7c963559dbc6dbda43df77a090a93f94c6625e> since it was designed and implemented, and MLIR was implemented with a superset of the experience that built LLVM :)

>>> Q: Relation to MLIR?
>>> A: MLIR is more similar to LLVM-IR than a loop hierarchy.
>>
>> This is not true, MLIR is great for dialects that want to model loop hierarchies naturally, this is a major focus of the affine dialect <https://mlir.llvm.org/docs/Dialects/Affine/> (e.g. see affine.for on that page). MLIR is not limited to affine loops, that is just a choice made by the affine dialect - the loop dialect <https://mlir.llvm.org/docs/Dialects/LoopOps/> has more general constructs that are less developed.
>
> This is definitely a subjective question. I think that MLIR is closer to LLVM-IR in how it is processed. Both have a sequence of passes running over a single source of truth. Both allow walking the entire structure from every instruction/operation/block. Analyses are on function or module level. Both have CFGs (I think for a certain kind of transformations it is an advantage that control flow is handled implicitly).

Right, but a frequent way that MLIR is used is without its CFG: most machine learning kernels use nests of loops and ifs, not CFGs. CFGs are exposed when those are lowered out. See some simple examples like:
https://github.com/llvm/llvm-project/blob/master/mlir/test/Transforms/affine-data-copy.mlir

>>> For instance, it also does not feature cheap copies.
>>
>> I’m not sure what this means.
> The possibility to make local changes speculatively without copying the entire data structure. IMHO this is a central idea that allows applying a transformation speculatively to pass it to a legality check and cost heuristic without committing to it. As a consequence, passes do not need to implement these in a transformation-specific manner, drastically reducing the burden of implementation.
>
> For instance, more loop transformations are feasible if instructions are moved into the innermost loops. With speculative transformations, we can canonicalize the representation to sink computations into loops -- the opposite of what LICM does -- and then see whether a transformation can be applied. If not, the speculative representation is discarded without having an effect on the original representation (and without needing to hoist those computations again).
>
> Because the MLIR classes have many references to related objects (such as pointers to parents and use-def chains), I don't think it is feasible to implement on top of MLIR.

Ah yes, I see what you mean. One way to do that is to represent multiple options as an op with a region for each option. This means you only fork the part of the IR that you’re producing variants of. I think this is the red/green tree technique you mentioned, but I’m not sure.

>>> An advantage is that loops and multi-dimensional arrays can be represented in the language without the need of being rediscovered, but have to be inserted by a front-end.
>>
>> This is correct, but I don’t see how this helps if your focus is raising code that has already been lowered to LLVM IR, e.g. by Clang or some other frontend that generates LLVM IR today.
>
> Indeed, I would hope that LLVM-IR can preserve multi-dimensional array accesses in some fashion as well (https://lists.llvm.org/pipermail/llvm-dev/2019-July/134063.html).
> However, currently MLIR has the advantage of being able to represent it.

I don’t think LLVM IR will ever get there without a massive design change. It is possible that it will support static shaped accesses in limited ways though.

>>> That is, if Clang was generating MLIR, loops and arrays still have to be rediscovered.
>>
>> This isn’t true, it would be perfectly sensible to lower C control flow structures directly to LLVM. The primary concern are things like unstructured switches (think Duff’s device) and unstructured gotos, but these occur rarely: they need to be handled correctly, but can do so with a traditionally lowered CFG and other “best effort” attempts to raise them.
>
> Moreover, syntactical loop structures are not a reliable indicator that there is a loop. Often enough, do, for, and while are used for syntactical reasons (`do { } while(0)`). Yes, you could eliminate them if a non-loop is detected, but handling break, continue, etc. correctly is a lot of effort. Another case are coroutines that are lowered with gotos into loops, unless you think loop optimizers should handle coroutines directly.

Yes, you’d want to canonicalize the form of course.

> On the other side, natural loop detection on CFGs is quite mature (with a remaining issue of irreducible loops that might appear, but can also be eliminated again). As a plus, optimization depends less on how the source code is written.

Yep, totally. The question is whether you lose semantic information from lowering to a CFG and reconstructing back up. This can affect you when you have higher level language semantics (e.g. Fortran parallel loops, OpenMP or other concurrency constructs etc). This is where MLIR excels of course.

-Chris

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200115/fb641a7b/attachment-0001.html>
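The "cheap copy" being debated here -- producing a speculative variant while sharing every untouched subtree with the original -- can be sketched with an immutable tree. This is a hypothetical illustration only (the `LoopNode` type and labels are invented for the sketch; it is not how MLIR or any existing LLVM data structure works): replacing one node copies just the spine from that node up to the root.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct LoopNode;
using NodeRef = std::shared_ptr<const LoopNode>;

// Immutable node: children are shared pointers that are never mutated,
// so unchanged subtrees are shared between the original tree and any
// speculative variant.
struct LoopNode {
  std::string label;          // e.g. "loop.i" or "stmt" (made-up names)
  std::vector<NodeRef> kids;
};

NodeRef makeNode(std::string label, std::vector<NodeRef> kids = {}) {
  return std::make_shared<const LoopNode>(
      LoopNode{std::move(label), std::move(kids)});
}

// Return a new root in which the node reached via `path` (child indices)
// is replaced. Only the spine from root to the replacement is copied;
// every other subtree is shared with the original -- the "cheap copy".
NodeRef replaceAt(const NodeRef &root, const std::vector<size_t> &path,
                  const NodeRef &replacement, size_t depth = 0) {
  if (depth == path.size())
    return replacement;
  std::vector<NodeRef> kids = root->kids; // copies pointers, not subtrees
  kids[path[depth]] =
      replaceAt(kids[path[depth]], path, replacement, depth + 1);
  return makeNode(root->label, std::move(kids));
}
```

A legality check or cost heuristic can then inspect the variant root; discarding it simply drops the few copied spine nodes, leaving the original tree untouched.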
Renato Golin via llvm-dev
2020-Jan-16 09:55 UTC
[llvm-dev] [RFC] Writing loop transformations on the right representation is more productive
On Thu, 16 Jan 2020 at 06:27, Chris Lattner via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> Ah yes, I see what you mean. One way to do that is to represent multiple options as an op with a region for each option. This means you only fork the part of the IR that you’re producing variants of. I think this is the red/green tree technique you mentioned, but I’m not sure.

IIUC, the green tree is a lighter version of the tree (leaner memory footprint) but still the entire tree. It's OK to lose that info because you don't usually need it for your transformations, and you can always go back to the red tree (via a pointer in the green tree) to ask harder questions. Managing the semantics of the two becomes non-trivial when you start adding and replacing nodes; there is a point where you can no longer go back to the red tree in the same way.

What I referred to as "journalling" is what you propose here. Add metadata to the actual graph and, during the heuristic search, only clone those. If you make sure you can append those nodes to the graph, and guarantee that the extra nodes are composable (i.e. semantically valid in any order they may be applied), then the original graph plus any intermediate state is valid. Therefore, keeping only the extra nodes, and copying them along to try different routes, becomes even cheaper than a green tree.

If those extra nodes are an MLIR dialect, with defined semantics and structured composition, then using them in a heuristic search produces semantically valid intermediate states and lightens the burden of proof for every little step.
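The journalling scheme described above -- an untouched base graph plus an ordered list of composable edits, where forking a search branch copies only the journal -- might look like the following. This is a hypothetical sketch: the map-based `IR` stand-in and all names are invented to show the mechanism, not any real LLVM or MLIR API.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Stand-in for the real IR graph; a map is enough to show the mechanism.
using IR = std::map<std::string, std::string>;
// An edit must be composable: semantically valid whenever it is applied.
using Edit = std::function<void(IR &)>;

struct Candidate {
  const IR *base;              // shared, read-only original graph
  std::vector<Edit> journal;   // the only per-candidate state

  // Forking a search branch copies the journal, never the IR itself.
  Candidate fork() const { return *this; }

  // Apply the journal only once a candidate has been selected.
  IR materialize() const {
    IR out = *base;
    for (const Edit &e : journal)
      e(out);
    return out;
  }
};
```

Each branch of the heuristic search carries only its journal; the base IR is materialized once, for the winning candidate, so discarded routes cost almost nothing.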
Michael Kruse via llvm-dev
2020-Jan-22 08:58 UTC
[llvm-dev] [RFC] Writing loop transformations on the right representation is more productive
On Wed, Jan 15, 2020 at 8:27 PM Chris Lattner <clattner at nondot.org> wrote:

>>> Once you achieve consensus on data structure, there is the question of what IR to use within it. I would recommend starting with some combination of “existing LLVM IR operations + high level control flow representation”, e.g. parallel and affine loops. The key here is that you need to always be able to lower in a simple and predictable way to LLVM IR (this is one of the things that classic polyhedral systems did suboptimally, making it difficult to reason about the cost model of various transformations), and this is a natural incremental starting point anyway. Over time, more high level concepts can be gradually introduced. FYI, MLIR already has a reasonable LLVM dialect <https://mlir.llvm.org/docs/Dialects/LLVM/> and can generate LLVM IR from it, so we’d just need an “LLVM IR -> MLIR LLVM dialect” conversion, which should be straightforward to build.
>>
>> Adding an LLVM-IR -> MLIR -> LLVM-IR round-trip would at the beginning just introduce compile-time overhead and what Renato described as brittleness. I fear this hurts adoption.
>
> Isn’t this true of *any* higher level IR? Unless I’m missing something big, this seems inherent to your proposal.

No. A loop hierarchy may be created on demand and can be skipped if, e.g., the function does not contain a loop. For IRs that are translation-unit based, the entire module will have to do a round-trip whether changed or not. To improve the situation, one could e.g. add a "has been changed" flag to each function, but it has to be added somewhere into the MLIR data structure and kept up to date on modifications. In a loop-hierarchical structure, only the node(s) that have been changed need to be lowered (e.g. an innermost loop) and versioned against the original IR depending on the assumptions taken.

>> This is definitely a subjective question. I think that MLIR is closer to LLVM-IR in how it is processed.
>> Both have a sequence of passes running over a single source of truth. Both allow walking the entire structure from every instruction/operation/block. Analyses are on function or module level. Both have CFGs (I think for a certain kind of transformations it is an advantage that control flow is handled implicitly).
>
> Right, but a frequent way that MLIR is used is without its CFG: most machine learning kernels use nests of loops and ifs, not CFGs. CFGs are exposed when those are lowered out. See some simple examples like:
> https://github.com/llvm/llvm-project/blob/master/mlir/test/Transforms/affine-data-copy.mlir

I agree that a loop nest can be represented in MLIR. What is missing IMHO is being able to have multiple versions of the same code. For instance, raising emitted C++ to such a representation to make it more optimizable may only be possible under preconditions, and by itself it makes the code slower. If the raised representation cannot be optimized, we will want to use the original one.

>> The possibility to make local changes speculatively without copying the entire data structure. IMHO this is a central idea that allows applying a transformation speculatively to pass it to a legality check and cost heuristic without committing to it. As a consequence, passes do not need to implement these in a transformation-specific manner, drastically reducing the burden of implementation.
>>
>> For instance, more loop transformations are feasible if instructions are moved into the innermost loops. With speculative transformations, we can canonicalize the representation to sink computations into loops -- the opposite of what LICM does -- and then see whether a transformation can be applied. If not, the speculative representation is discarded without having an effect on the original representation (and without needing to hoist those computations again).
>> Because the MLIR classes have many references to related objects (such as pointers to parents and use-def chains), I don't think it is feasible to implement on top of MLIR.
>
> Ah yes, I see what you mean. One way to do that is to represent multiple options as an op with a region for each option. This means you only fork the part of the IR that you’re producing variants of. I think this is the red/green tree technique you mentioned, but I’m not sure.

The red-green tree technique even allows re-inserting entire unchanged subtrees (e.g. loop bodies after an interchange). If an op takes multiple regions, each region still must be a deep copy.

>>>> An advantage is that loops and multi-dimensional arrays can be represented in the language without the need of being rediscovered, but have to be inserted by a front-end.
>>>
>>> This is correct, but I don’t see how this helps if your focus is raising code that has already been lowered to LLVM IR, e.g. by Clang or some other frontend that generates LLVM IR today.
>>
>> Indeed, I would hope that LLVM-IR can preserve multi-dimensional array accesses in some fashion as well (https://lists.llvm.org/pipermail/llvm-dev/2019-July/134063.html). However, currently MLIR has the advantage of being able to represent it.
>
> I don’t think LLVM IR will ever get there without a massive design change. It is possible that it will support static shaped accesses in limited ways though.

Static-sized rectangular multi-dimensional arrays are already possible using a standard GetElementPtr and its inrange qualifier. For dynamically sized multi-dimensional arrays, what is needed is to convey the dimensions of the array in the form of an llvm::Value. In the RFC we discussed an intrinsic and operand bundles; neither looks like a massive design change to me.
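As a sketch of the static case, an access A[i][j] into a `double A[256][128]` can already be expressed as a single multi-dimensional GEP that keeps both dimensions visible to analyses (typed-pointer syntax of that era; the inrange qualifier is omitted here, since at the time it was only accepted on constant-expression GEPs):

```llvm
; %A points to a 'double A[256][128]'; both index dimensions remain
; explicit instead of being folded into flat pointer arithmetic.
%p = getelementptr inbounds [256 x [128 x double]], [256 x [128 x double]]* %A, i64 0, i64 %i, i64 %j
%v = load double, double* %p
```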
>> On the other side, natural loop detection on CFGs is quite mature (with a remaining issue of irreducible loops that might appear, but can also be eliminated again). As a plus, optimization depends less on how the source code is written.
>
> Yep, totally. The question is whether you lose semantic information from lowering to a CFG and reconstructing back up. This can affect you when you have higher level language semantics (e.g. Fortran parallel loops, OpenMP or other concurrency constructs etc). This is where MLIR excels of course.

Indeed, it is easier not to lower these constructs, but not impossible (as shown in https://reviews.llvm.org/D69930). I think the relevant difference is that these constructs come with additional guarantees (e.g. Single-Entry-Single-Exit regions) and optimization hurdles (e.g. thread synchronization, where programmers do not expect the compiler to do a lot of things) compared to C++ loop constructs.

Michael

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200121/93b37eab/attachment.html>