thr3ads.net - llvm dev - [llvm-dev] [RFC] Writing loop transformations on the right representation is more productive [Jan 2020]

If this information is useful, please help other people find it:
Share via:

Michael Kruse via llvm-dev

2020-Jan-22 08:58 UTC

[llvm-dev] [RFC] Writing loop transformations on the right representation is more productive

Am Mi., 15. Jan. 2020 um 20:27 Uhr schrieb Chris Lattner <
clattner at nondot.org>:
> One you achieve consensus on data structure, there is the question of what
>> IR to use within it.  I would recommend starting with some combination
of
>> “existing LLVM IR operations + high level control flow representation”,
>> e.g. parallel and affine loops.  The key here is that you need to
always be
>> able to lower in a simple and predictable way to LLVM IR (this is one
of
>> the things that classic polyhedral systems did sub optimally, making it
>> difficult to reason about the cost model of various transformations),
and
>> this is a natural incremental starting point anyway.  Over time, more
high
>> level concepts can be gradually introduced.  FYI, MLIR already has a
>> reasonable LLVM dialect
<https://mlir.llvm.org/docs/Dialects/LLVM/> and
>> can generate LLVM IR from it, so we’d just need an “LLVM IR -> MLIR
LLVM
>> dialect” conversion, which should be straightforward to build.
>>
>
> Adding a LLVM-IR -> MLIR -> LLVM-IR round-trip would at the beginning
just
> introduce compile-time overhead and what Renato described as brittleness. I
> fear this hurts adaption.
>
>
> Isn’t this true of *any* higher level IR?  Unless I’m missing something
> big, this seems inherent to your proposal.
>
No. A loop hierarchy may be created on-demand and can be skipped if, e.g.,
the function does not contain a loop. For IRs that are translation-unit
based, the entire module will have to do a round-trip whether changed or
not. To improve the situation, one could e.g. add a "has been changed"
flag
to each function. But it has to be added somewhere into the MLIR data
structure and kept up-to-date on modifications. In a
loop-hierarchical structure only the node(s) that has been changed needs to
be lowered (e.g. an innermost loop) and versioned with the original IR
depending on taken assumptions.


This is definitely subjective question. I think that MLIR is closer
to> LLVM-IR for how it is processed. Both have a sequence of passes running
> over a single source of truth. Both allow walking the entire structure from
> every instruction/operation/block. Analyses are on function or module
> level. Both have CFGs (I think for a certain kind of transformations it is
> an advantage that control flow is handled implicitly).
>
>
> Right, but a frequent way that MLIR is used is without its CFG: most
> machine learning kernels use nests of loops and ifs, not CFGs.  CFGs are
> exposed when those are lowered out.  See some simple examples like:
>
>
https://github.com/llvm/llvm-project/blob/master/mlir/test/Transforms/affine-data-copy.mlir
>
>I agree that a loop nest can be represented in MLIR. What is missing IMHO
is being able to have multiple versions of the same code. For instance,
raising emitted C++ to such representation to make it more optimizable may
only be possible under preconditions and by itself making the code slower.
If the raised representation cannot be optimized, we will want to use the
original one.

> The possibility to make local changes speculatively without copying the
> entire data structure. IMHO this is a central idea that allows applying a
> transformations speculatively to pass it to a legality check and cost
> heuristic without committing to apply it. As a consequence, passes do not
> need to implement to implement these in a transformation-specific manner,
> drastically reducing the burden of implementation.
>
> For instance, more loop transformations are feasible if instructions are
> moved into the innermost loops. With speculative transformations, we can
> canonicalize the representation to sink computations into loops -- the
> opposite of what LICM does -- and then see whether a transformation can
> applied. If not, the speculative representation is discarded without having
> an effect on the original representation (and not needing to hoist those
> computations again).
>
> Because the MLIR classes have many references to related objects (such as
> pointer to parents and use-def chains), I don't think it is feasible to
> implement on top of MLIR.
>
>
> Ah yes, I see what you mean.  One way to do that is to represent multiple
> options as an op with region for each option.  This means you only fork the
> part of the IR that you’re producing variants of.  I think this is the
> red/green tree technique you mentioned, but I’m not sure.
>
>The red-green tree technique even allows re-inserting entire unchanged
subtrees (e.g. loop bodies after an interchange). If op takes multiple
regions, each region still must be deep copies.

> An advantage is that
>> loops and multi-dimensional arrays can be represented in the language
>> without the need of being rediscovered, but have to be inserted by a
>> front-end.
>>
>>
>> This is correct, but I don’t see how this helps if your focus is
raising
>> code that has already been lowered to LLVM IR, e.g. by Clang or some
other
>> frontend that generates LLVM IR today.
>>
>
> Indeed, I would hope that LLVM-IR can preserve multi-dimensional array
> accesses in some fashion as well (
> https://lists.llvm.org/pipermail/llvm-dev/2019-July/134063.html).
> However, currently MLIR has the advantage of being able represent it.
>
>
> I don’t think LLVM IR will ever get there without a massive design
> change.  It is possible that it will support static shaped accesses in
> limited ways though.
>
>Static sized rectangular multi-dimensional arrays are already possible
using a standard GetElementPtr and its inrange qualifier. For dynamic sized
multi-dimensional sized arrays what is needed is to convey the dimensions
of the array in form of an llvm::Value. In the RFC we discussed an
intrinsic and operand bundles, neither looks like massive design changes to
me.

On the other side, natural loop detection on CFGs is quite mature (with
a> remaining issue of irreducible loops that might appear, but can also be
> eliminated again). As a plus, optimization does depend less on how the
> source code is written.
>
>
> Yep totally.  The question is whether you lose semantic information from
> lowering to a CFG and reconstructing back up.  This can affect you when you
> have higher level language semantics (e.g. Fortran parallel loops, openmp
> or other concurrency constructs etc).  This is where MLIR excels of course.
>
>Indeed it is easier to not lower these constructs, but not impossible (as
shown in https://reviews.llvm.org/D69930). I think the relevant difference
is that these constructs come with additional guarantees (e.g.
Single-Entry-Single-Exit regions) and optimization hurdles (e.g. thread
synchronization; where programmers do not expect the compiler to do a lot
of things) compared to C++ loop constructs.


Michael
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200121/93b37eab/attachment.html>

Chris Lattner via llvm-dev

2020-Jan-27 03:04 UTC

head link

[llvm-dev] [RFC] Writing loop transformations on the right representation is more productive

> On Jan 22, 2020, at 12:58 AM, Michael Kruse <llvmdev at
meinersbur.de> wrote:
> Am Mi., 15. Jan. 2020 um 20:27 Uhr schrieb Chris Lattner <clattner at
nondot.org <mailto:clattner at nondot.org>>:
>> One you achieve consensus on data structure, there is the question of
what IR to use within it.  I would recommend starting with some combination of
“existing LLVM IR operations + high level control flow representation”, e.g.
parallel and affine loops.  The key here is that you need to always be able to
lower in a simple and predictable way to LLVM IR (this is one of the things that
classic polyhedral systems did sub optimally, making it difficult to reason
about the cost model of various transformations), and this is a natural
incremental starting point anyway.  Over time, more high level concepts can be
gradually introduced.  FYI, MLIR already has a reasonable LLVM dialect
<https://mlir.llvm.org/docs/Dialects/LLVM/> and can generate LLVM IR from
it, so we’d just need an “LLVM IR -> MLIR LLVM dialect” conversion, which
should be straightforward to build.
>> 
>> Adding a LLVM-IR -> MLIR -> LLVM-IR round-trip would at the
beginning just introduce compile-time overhead and what Renato described as
brittleness. I fear this hurts adaption.
> 
> Isn’t this true of *any* higher level IR?  Unless I’m missing something
big, this seems inherent to your proposal.
> 
> No. A loop hierarchy may be created on-demand and can be skipped if, e.g.,
the function does not contain a loop.
I don’t see how this is specific to a loop IR.  Any “LLVMIR -> X ->
LLVMIR” system has the behavior you describe, whether X is polly, mlir, or some
other loop IR; any decision about skipping the round trip could be applied to
any of them.

The advantage of MLIR in this discussion is that it has the opportunity to
subsume LLVMIR at some point in the future, eliminating the round trip at that
point.
> For IRs that are translation-unit based, the entire module will have to do
a round-trip whether changed or not.
I think that this must be the misunderstanding.  There is no requirement to do a
“Full LLVM IR module to MLIR module” conversion, you can convert one function,
one loop nest, one basic block or whatever you else you’d want to do.
>> This is definitely subjective question. I think that MLIR is closer to
LLVM-IR for how it is processed. Both have a sequence of passes running over a
single source of truth. Both allow walking the entire structure from every
instruction/operation/block. Analyses are on function or module level. Both have
CFGs (I think for a certain kind of transformations it is an advantage that
control flow is handled implicitly).
> 
> Right, but a frequent way that MLIR is used is without its CFG: most
machine learning kernels use nests of loops and ifs, not CFGs.  CFGs are exposed
when those are lowered out.  See some simple examples like:
>
https://github.com/llvm/llvm-project/blob/master/mlir/test/Transforms/affine-data-copy.mlir
<https://github.com/llvm/llvm-project/blob/master/mlir/test/Transforms/affine-data-copy.mlir>
> 
> 
> I agree that a loop nest can be represented in MLIR. What is missing IMHO
is being able to have multiple versions of the same code. For instance, raising
emitted C++ to such representation to make it more optimizable may only be
possible under preconditions and by itself making the code slower. If the raised
representation cannot be optimized, we will want to use the original one.
This is pretty straight-forward (in principle, I haven’t actually built a system
to prove this), because you can just have an operation with regions for each
version, e.g.:

for {
  op1
  versioned {
    stuff
  } {
    other stuff
  }
  op2
}

Then transform the code and select which version you want later.

MLIR doesn’t magically make the algorithms happen for you, but it does handle
the representational issues, making it super flexible and easy to support things
like this.
> 
>> On the other side, natural loop detection on CFGs is quite mature (with
a remaining issue of irreducible loops that might appear, but can also be
eliminated again). As a plus, optimization does depend less on how the source
code is written.
> 
> Yep totally.  The question is whether you lose semantic information from
lowering to a CFG and reconstructing back up.  This can affect you when you have
higher level language semantics (e.g. Fortran parallel loops, openmp or other
concurrency constructs etc).  This is where MLIR excels of course.
> 
> 
> Indeed it is easier to not lower these constructs, but not impossible (as
shown in https://reviews.llvm.org/D69930
<https://reviews.llvm.org/D69930>). I think the relevant difference is
that these constructs come with additional guarantees (e.g.
Single-Entry-Single-Exit regions) and optimization hurdles (e.g. thread
synchronization; where programmers do not expect the compiler to do a lot of
things) compared to C++ loop constructs.
There is a dangerous implication in the way you phrased this, and while it may
not have been intentional,  is something I think is important to point out.  
The whole point of good IR design is to make certain things *easy*.  IR design
rarely makes things “possible”, and so saying something is “not impossible” is a
dangerous ground to stand on.

To reduce the point to absurdity, you could implement loop transformations on
machine code: you can reconstruct a CFG, discover natural loops, raise the
machine code to something like LLVM IR, etc.  The problem with this is that it
is extremely cumbersome, fragile, and would lose the ability to use high level
information that cannot be persisted in machine code.  The same thing is true
about LLVM IR, just less absurdly so.

-Chris


-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200126/96ecb1b0/attachment.html>

Michael Kruse via llvm-dev

2020-Jan-30 09:04 UTC

head link

[llvm-dev] [RFC] Writing loop transformations on the right representation is more productive

Am So., 26. Jan. 2020 um 21:04 Uhr schrieb Chris Lattner <
clattner at nondot.org>:
>
> On Jan 22, 2020, at 12:58 AM, Michael Kruse <llvmdev at
meinersbur.de> wrote:
> Am Mi., 15. Jan. 2020 um 20:27 Uhr schrieb Chris Lattner <
> clattner at nondot.org>:
>
>> One you achieve consensus on data structure, there is the question of
>>> what IR to use within it.  I would recommend starting with some
combination
>>> of “existing LLVM IR operations + high level control flow
representation”,
>>> e.g. parallel and affine loops.  The key here is that you need to
always be
>>> able to lower in a simple and predictable way to LLVM IR (this is
one of
>>> the things that classic polyhedral systems did sub optimally,
making it
>>> difficult to reason about the cost model of various
transformations), and
>>> this is a natural incremental starting point anyway.  Over time,
more high
>>> level concepts can be gradually introduced.  FYI, MLIR already has
a
>>> reasonable LLVM dialect
<https://mlir.llvm.org/docs/Dialects/LLVM/> and
>>> can generate LLVM IR from it, so we’d just need an “LLVM IR ->
MLIR LLVM
>>> dialect” conversion, which should be straightforward to build.
>>>
>>
>> Adding a LLVM-IR -> MLIR -> LLVM-IR round-trip would at the
beginning
>> just introduce compile-time overhead and what Renato described as
>> brittleness. I fear this hurts adaption.
>>
>>
>> Isn’t this true of *any* higher level IR?  Unless I’m missing something
>> big, this seems inherent to your proposal.
>>
>
> No. A loop hierarchy may be created on-demand and can be skipped if, e.g.,
> the function does not contain a loop.
>
>
> I don’t see how this is specific to a loop IR.  Any “LLVMIR -> X ->
> LLVMIR” system has the behavior you describe, whether X is polly, mlir, or
> some other loop IR; any decision about skipping the round trip could be
> applied to any of them.
>
> The advantage of MLIR in this discussion is that it has the opportunity to
> subsume LLVMIR at some point in the future, eliminating the round trip at
> that point.
>
>As long as there is no plan for this to happen, this is speculative.

> For IRs that are translation-unit based, the entire module will have to do
> a round-trip whether changed or not.
>
>
> I think that this must be the misunderstanding.  There is no requirement
> to do a “Full LLVM IR module to MLIR module” conversion, you can convert
> one function, one loop nest, one basic block or whatever you else you’d
> want to do.
>
Either the goal is for clang to generate MLIR translation units, or to have
per-function round-trips. However, I do see that per-function round-trips
could be an intermediate state.

However, for sub-function units, one somehow has to keep the reference to
the llvm::Value objects declared outside the extracted union so they can be
referenced again when generating LLVM-IR again. Possible with a
DenseMap<mlir::Operation,llvm::Instruction>, but sounds fragile. Also,
these llvm::Instructions must be represented as an mlir::Operation somehow.

>
> This is pretty straight-forward (in principle, I haven’t actually built a
> system to prove this), because you can just have an operation with regions
> for each version, e.g.:
>
> for {
>   op1
>   versioned {
>     stuff
>   } {
>     other stuff
>   }
>   op2
> }
>
> Then transform the code and select which version you want later.
>
> MLIR doesn’t magically make the algorithms happen for you, but it does
> handle the representational issues, making it super flexible and easy to
> support things like this.
>
>What I proposed is much more than the possibility to represent multiple
versions in the same IR (assuming that we want to keep just one of the
versions, these additional ones are visible in use-def chains and possibly
confusing some analyses. It might also be a problem for convergent
instructions).

I would like to stress that multi-versioning is just one of the
possibilities enabled by cheap copies (i.e. Red/Green trees) and I remain
convinced that cheap copies are not possible with a
densely-coupled representation such as MLIR, LLVM-IR and VPlan (unless we
separate the in-memory representation from the logical data structure).

Other thinks enabled by cheap copies:
 * Apply transformation first, then check legality and profitability
 * Speculative transformations / Profitability can be checked of a chain of
operations
 * Canonicalization that remove dependencies (e.g. moving computations into
the loops)
 * Lazy expansion of nodes (e.g. instructions outside of loops do not need
to be represented)
 * Cheap representation of unrolling
 * Code versioning

Your suggestion above still requires copying the entire "stuff" and
"other
stuff" subtrees. Not rarely do e.g. stencil-loops have thousands of
instructions and all would need to be copied for each version. This is a
practical barrier for the applications above because of the compile time
increase.

There is an analogy in file-systems: Btrfs and ZFS support copy-on-write of
files and directory, enabling new applications such as snapshots. Without
the copy-on-write feature, a snapshot would have to copy the entire
filesystem. Possible, but not really feasible.

> Indeed it is easier to not lower these constructs, but not impossible (as
> shown in https://reviews.llvm.org/D69930). I think the relevant
> difference is that these constructs come with additional guarantees (e.g.
> Single-Entry-Single-Exit regions) and optimization hurdles (e.g. thread
> synchronization; where programmers do not expect the compiler to do a lot
> of things) compared to C++ loop constructs.
>
>
> There is a dangerous implication in the way you phrased this, and while it
> may not have been intentional,  is something I think is important to point
> out.   The whole point of good IR design is to make certain things *easy*.
> IR design rarely makes things “possible”, and so saying something is “not
> impossible” is a dangerous ground to stand on.
>
I cannot follow the deduction from "less easy" to
"dangerous".

Anyway, this is also the argument I want to make: A loop-only
representation can simplify implementing a loop transformations by a lot.
Instead of each transformation having to implement its own legality check
and profitability heuristics (often requiring the most lines of code), they
all can share the same implementation.

Michael
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200130/170947ea/attachment.html>

llvm dev - Jan 2020 - [RFC] Writing loop transformations on the right representation is more productive

[llvm-dev] [RFC] Writing loop transformations on the right representation is more productive

[llvm-dev] [RFC] Writing loop transformations on the right representation is more productive

[llvm-dev] [RFC] Writing loop transformations on the right representation is more productive