thr3ads.net - llvm dev - [llvm-dev] [RFC] Writing loop transformations on the right representation is more productive [Jan 2020]

If this information is useful, please help other people find it:
Share via:

Michael Kruse via llvm-dev

2020-Jan-03 11:26 UTC

[llvm-dev] Writing loop transformations on the right representation is more productive

In the 2018 LLVM DevMtg [1], I presented some shortcomings of how LLVM
optimizes loops. In summary, the biggest issues are (a) the complexity
of writing a new loop optimization pass (including needing to deal
with a variety of low-level issues, a significant amount of required
boilerplate, the difficulty of analysis preservation, etc.), (b)
independent optimization heuristics and a fixed pass ordering and (c)
code explosion due to versioning. Currently, different people are
working on improving different loop optimization passes such as
LoopInterchange, LoopFuse and LoopVectorize. Even if LLVM had
'perfect' individual loop passes, they would still have the
aforementioned problems due to still being independent and the system
as a whole will be suboptimal.

Instead of each loop pass being a major component, the heavy lifting
could be done by a framework of which each transformation itself is a
small part. In this RFC, I would like to work towards a consensus on
how such a framework could look. I already outlined a possible
solution in the same presentation [1] and a publication [7], which is
partially based on a previous RFC [8]. All feedback is welcome,
including a simple '+1'.

The central idea is to use a modifiable loop tree -- similar to
LoopInfo -- as the primary representation. LLVM-IR is converted to a
loop tree, then optimized and finally LLVM-IR is generated again for
subtrees that are considered profitable. This is not a new concept, it
has already been used in compilers such as IBM XL Fortran (called ASTI
[4]) and Silicon Graphics/Open64 (called LNO [10]), and in research
such as the Value State Dependence Graph [3].

Other features include the following:

1. Candidate selection through cost functions
2. Cheap copies using Red/Green Trees
3. Application of transformations from high-level to low-level
4. Represents loops and predicates instead of CFGs
5. Data and control dependencies in one graph
6. Late fallback versioning at IR regeneration
7. Standardized API for analyses with multiple implementations
8. Abstract representation of statements
9. Expansion of use-def chains to arrays when spanning loops

To elaborate on each of these:

1. Candidate selection through cost function
--------------------------------------------
Instead of needing to know which transformation is profitable before
applying it, create a copy of the data structure, modify it, and
compare it to the original. Moreover, holding multiple, differently
optimized, copies allows evaluating each variant using a cost function
and select the 'best' when re-generating LLVM-IR again (or re-use the
original LLVM-IR).

Otherwise, each transformation will have to implement its own
profitability heuristic but cannot know what other passes will do. For
instance, the motivation for LLVM's LoopDistribution is to enable
vectorization of loops that can be separated from other loops.
However, it does not know whether LoopVectorize will actually
vectorize the loop, or whether two vectorizable loops are better
vectorized together.

Instantiating every possible sequence of transformations of course is
not feasible, so the search space needs to be pruned. This could be
made dependent on the optimization level.


2. Cheap copies using Red/Green Trees
-------------------------------------
A red/green trees [4] is a data structure for cheap copies of DAGs.
Basically, a green tree node references only the nodes children but
not the parent. A node from the red tree references its corresponding
green node and its parent (red) node. When modifying the represented
node through copy-on-write, this allows reusing all green nodes except
the ones from the modified node to the tree's root. The same node can
even be used multiple times in the same copy, which can be useful for
e.g. loop unrolling and code versioning. As a result, storing many
optimization variants of the same code only requires copies of the
changed parts.

By contrast, the red node is created on-demand. Traversing the entire
tree should rarely be necessary (instead, use summary information in
the green node) and when limited to subtrees that are considered
interesting for optimization.


3. Apply transformations from high-level to low-level
-----------------------------------------------------
Optimization should be applied from very specialized to very general
(potentially after some canonicalization). For instance, the first
step could be detecting common idioms such as gemm and replace them
with either a BLAS function call or apply well-studied optimizations
like BLIS to them. After such an idiom has been detected, no other
transformation should be applied to them.

Mid-level transformations may try to map entire loop nests to cache-
and compute hierarchies (SIMT threads, multiprocessors, offloading,
etc) by applying transformations such as tiling, loop interchange and
array packing.

Low-level transformations commonly look at only one loop to improve
processor pipelining and predication: Unrolling, software pipelining,
etc.


4. Represents loops and predicates instead of CFGs
-------------------------------------------------
Control flow graphs make loop optimizations difficult, but is also not
something a loop optimizer typically cares about. For instance, in the
example below, loop distribution would need to understand the
condition `c` and reconstruct the branch to fission

    for (int i = 0; i < n; i += 1)
      if (c) {
        foo(i);
        bar(i);
      }

into

    for (int i = 0; i < n; i += 1)
      if (c)
        foo(i);
    for (int i = 0; i < n; i += 1)
      if (c)
        bar(i);

In the loop tree, control flow is represented through predicates of
the statements.  In a first step, it is represented as

    for (int i = 0; i < n; i += 1) {
      bool __BB1_executed = 0;
      { if (c) foo(i); __BB1_executed = 1; }
      { if (__BB1_executed) bar(i); }
    }

The predicates (`c` and `__BB1_executed`) can be separated into a
"necessary condition" (`c` must be true before `foo` can be executed)
and a "sufficient condition" (if `c` is true, `foo` must be executed).
These can be different to allow speculative execution of the
statement. For instance, a necessary condition of `true` means that
the statement can always be executed speculatively.

The advantage is that control flow becomes a data-dependency problem.
It can trivially be converted back to its CFG representation because
`__BB1_executed` directly corresponds to the CFG edge to which it was
connected. Data flow optimizations such as value propagation can be
applied as well (here: `__BB1_executed == c`), such that we get

    for (int i = 0; i < n; i += 1) {
      { if (c) foo(i); }
      { if (c) bar(i); }
    }

Which can be distributed trivially (assuming foo() and bar() do not
cause dependencies). Whether the copy propagation is beneficial can be
decided by a heuristic (considering whether the original flow can
still be reconstructed) or just thrown away if the cost function does
not select a loop tree based on the copy propagation.

As a result of the predicated representation the loop tree has a
regular pattern: sequence -> loop -> sequence -> loop -> ... . A
loop
here is abstract in that it is something that is executed multiple
times. This can be a sequential loop, but also a collection of SPMD
processes, threads, vector lanes, etc.

At IR regeneration, we have multiple choices of how to lower
conditions. If we lower to LLVM-IR CFG, it may lower to
 * Branch instructions, reusing branch conditions from previous statements
 * Select instructions

If lowering to SIMD instructions, we can generate
 * Branch instructions for uniform predicates
 * Execution masks for varying predicates
 * A mix of both, since predicates can be uniform relative to previous
taken conditions

If lowering to NVPTX-style SIMT
 * @p per-instruction execution masks
 * Divergent branches; Every statement can be made a re-convergence
point (by re-evaluating the condition), but will happen, at the
latest, at the end of the loop
 * Uniform predicates can be given the ".uni" suffix

When lowering to AMD-style SIMT
 * S_CBRANCH (explicitly managed execution mask)
 * S_CBRANCH_I/G_FORK (stack of execution masks)


5. Data and control dependencies in one graph
---------------------------------------------
Also as a result of the predicated from, during dependency analysis,
control dependencies and data dependencies have the same
representation, like in a program dependency graph. In LLVM-IR, passes
such as AggressiveDeadCodeElimination use DominanceFrontier for this
purpose. This would not be necessary.

When re-generating IR, control dependencies can either be lowered to
branching or as a boolean variables. For the purpose of loop
optimizations, the choice of representation does not (and should not)
make a difference.


6. Late fallback versioning at IR regeneration
------------------------------------------
When a transformation is applied, it can attach conditions (no
aliasing, no integer wrap, value restrictions, etc.) under which the
transformation is valid. During LLVM-IR generation, these conditions
are collected and emitted as run-time conditions. If the condition
fails, the original code is executed.

This avoids having every transformation emit its own fallback code.
The fallback should never execute, or only in cases that do not do
significant work (e.g. loop trip count is 0), such that optimizing the
fallback itself is not necessary. In other cases, the transformation
should implement proper loop versioning by adding the original
statement with a `!rtc` condition.


7. Standardized API for analyses with multiple implementations
--------------------------------------------------------------
Common analyses use an interface that can be fulfilled by different
implementations. Among these are:

 * Closed form expressions
   - None (expressions are always represented through dependencies
     between statements)
   - LLVM ScalarEvolution
   - PolyhedralValueAnalysis [6]

Closed form expressions can be used to remap values (typically array
subscripts, loop lower/upper bounds), simplify expression computations
(including predicates => is a predicate always true?), determine loop
trip counts, etc. Most importantly, it identifies a statement
execution (instance) relative to the loop counters.


 * Array accesses / access range / provenance
   - LLVM AliasAnalysis
   - One-dimensional (always linearized)
   - Multi-dimensional, delinearized if necessary
   - From metadata (that a front-end emits for arrays of pointers to
     arrays of data)

Array access analysis determines whether two accesses at a specific
index can alias each other, access ranges and connectivity.


 * Dependencies
   - none (cannot reorder statements)
   - flat (depends on last execution of another statement)
   - LLVM DependendencyAnalysis
   - Polyhedral analysis

There is intentionally no "does X depend on Y?" API as this would make
any analysis based on it quadratic in runtime. Instead, it has to list
all statements another statement depends on such that a graph can be
built and e.g. sorted topologically.

Dependencies include: use-def chains via registers, memory
flow/anti/output, general side-effects and control dependencies.


 * Control flow
   - Extracted from original CFG
   - Evaluate statement conditions

Control flow analysis answers questions about (post-)dominance and
mutual exclusivity within the acyclic sequence/body of a loop.


 * Cost heuristic
   - TargetTransformationInfo

Estimate the execution time of a (sub-)tree. Average trip counts from
profiling [11] could be useful.


8. Abstract representation of statements
----------------------------------------
There is no direct relationship between LLVM instructions and loop DAG
statements. If possible, loop tree optimizations should only use the
properties of the statement node, not its contents. Typically, a
statement consists of multiple instructions that do not make sense to
split. For instance, assuming that %add is not used a second time, in
the example below

    %add = add i64 %arg, 2
    %mul = shl i64 %add, 1

the two instructions should always be computed together in the same
loop iteration. Combining instructions to statements reduces the
computational complexity of loop optimizations and might be controlled
by the optimization level (at lower optimization level more
instructions would be combined). A loop node is itself a statement in
it's parent loop node's acyclic body.

Moreover, there is no reason to use only LLVM-IR to carry the
semantics. We could also use e.g. Machine-IR or MLIR operations.


9. Expansion of use-def chains to arrays when spanning loops
------------------------------------------------------------
Consider we would like to fission the following loop:

    for (int i = 0; i < n; i += 1) {
      { if(true) double a = f(i); }
      { if(true) g(a); }
    }

The problem is that the split cuts the use-def dependency over `a`.
However, the fission can still be done and results in:

    double A[n];
    for (int i = 0; i < n; i += 1)
      { if(true) A[i] = f(i); }
    for (int i = 0; i < n; i += 1) {
      { if(true) g(A[i]); }

The transforming pass has to consider this during its profitability
model. The big advantage is that in terms of correctness, use-def
chains do not manifest false dependencies.


Possible Transformations
------------------------
In principle, any transformation could be applied on this
representation, but instruction-centric transformations such as
InstCombine are less suitable. Obviously, loop-centric transformations
are a better fit. Ideas for loop transformations include:

 * Unroll(-and-jam)
 * Peeling
 * Iteration range split, concatenation
 * Interchange
 * Loop fusion and distribution(/fission)
 * Tiling/stripmining/stripemining/blocking
 * Code versioning (including loop unswitching)
 * Maybe vectorization
 * Accelerator offloading
 * Parallelization
 * Space-filling curve iteration [12]
 * Array packing (e.g. for scratch pads), transposition
 * Statement reordering
 * SIMT-ization
 * Detect and optimize reductions
 * Loop skewing/wavefronting
 * Loop reversal
 * Solver-driven (polyhedral) optimization
 * Loop idiom recognition


Optimization Pipeline
---------------------
The optimization pipeline for the hierarchy DAG is straightforward:

    void optimizeLoopNest() {
      RedOriginal = generateLoopHierarchy();
      List = createCandidates(RedOriginal);
      sort(List, [](Cand1,Cand2) { return estimateExecutionTime(Cand2)
                                                   -
estimateExecutionTime(Cand1); } );
      if (List[0] != RedOriginal)
        emit(List[0]);
    }

Subtree exploration is done recursively, optimizing inner nodes which
can be (re-)used when optimizing outer nodes.

The current node to optimize is compared a selection of known
heuristics and its transformation is added to the list of candidates.
That is, an optimization routine resemble the following, with more
specific heuristics checked before more general ones.

void exploreOptimizations(Root, Worklist) {
   Matches = 0;

   // High level: user annotations
   if (Root.hasHint()) {
     // Hint is mandatory: do not explore any other possibility
     Worklist.replace(Root, applyHint(Root));
     return;
   }

    // High level: Idiom recognition
    if (isBLAS(Root)) {
      Optimized = LinkAgainstBLASLibary ? convertToBlasLibCall(Root) :

applyBLIS(Root):
      Worklist.replace(Root, Optimized);
      return;
    }

    // Mid level: recognize stencils
    if (isStencil(Root))
      ...

    // Mid level: Map to compute hierarchy (here: simd -> SMT -> core
-> socket for CPUs)
    if (isParallel(Root) fitsWorkingSet(Root.getBody(), LLCSize) &&
&&
      isParallel(Root.getPerfectlyNestedLoop(1)) &&
       fitsWorkingSet(Root.getPerfectlyNestedLoop(2).getBody(), L2Size)
&&
      isParallel(Root.getPerfectlyNestedLoop(2)) &&
       fitsWorkingSet(Root.getPerfectlyNestedLoop(3).getBody(), L1Size)
&&
      isVectorizable(Root.getPerfectlyNestedLoop(3))) {
        Optimized = applyVectorization(Root.getPerfectlyNestedLoop(3));
        Optimized = applyParallel(Optimized.getParent(), /*places=*/SMT);
        Optimized = applyParallel(Optimized.getParent(), /*places=*/Cores);
        Optimized = applyParallel(Optimized.getParent(), /*places=*/Sockets);

        // Still explore CPU pipeline optimizations of innermost
        Worklist.insert(Root,Optimized);
        Matches += 1;
    }

    if (Matches >= Threshold)
      return;

    // Low level: optimize working set for L1 cache size using tiling
    Band = maximumTilableNest(Root);
    WorkingSetSizePerIteration = estimatWorkingSet(Band.back().getBody());
    TileSize = floor(nthroot(L1CacheSize / WorkingSetSizePerIteration,
Band.size()));
    if (TileSize > 1) {
       Worklist.insert(applyTiling(Root, TilesSize.splat(Band.size()));
       Matches += 1;
    }

    if (Matches >= Threshold)
      return;

    // Low level: vectorization for each SIMD level (potentially
    // carried-out by VPlan)
    foreach (Width : promisingSimdWidths(Root)) {
      Vectorized = applyVectorization(Root, Width);
      if (Vectorized.isSatisfiable()) {
        Worklist.insert(Vectorized);
        Matches += 1;
      }
    }

    if (Matches >= Threshold)
      return;

    // Low level: rely on heuristic to find best unrolling factor
    if (Factor = unrollFactorHeuristic(Root))  {
      Worklist.insert(applyUnroll(Root, Factor).getRoot());
      Matches += 1;
    }

    if (Matches >= Threshold)
      return;

    ...
}


The "levels" encode precedence optimizations. Depending on the
optimization level, we might only apply the most promising
optimization.

Instead of this brute force implementation, for predefined
optimization levels, the search space exploration should be guided:
Encode the order of the most useful transformations in the search
space exploration. For instance, find all idioms first, then apply
memory hierarchy optimizations, then compute hierarchy, then CPU
pipeline optimizations. Since this is very transformation specific, I
do not expect to write a 'pass manager' where transformations can
register itself to (except for exhaustive searches when explicitly
enabled). Instead, I'd write this in code that classifies loop nests
(e.g. compute bound, bandwidth bound, known idioms, tensor
contractions, stencils, etc.) and choses appropriate optimization
paths. We can, of course, make this more generic over time.



Writing an optimization pass
-----------------------------

Optimizations themselves are written in the following steps:
1. Check preconditions
2. Quick profitability analysis to skip obviously bad transformations
3. Create new green nodes to represent the changed loop nest; reuse
unchanged green nodes as much as possible
4. Create a red node to insert into the parent
5. Preserve analyses (most analyses should be able to use analysis
stored in reused green nodes and/or lazyly reevaluate new nodes)
6. Check whether the transformation was valid

Especially the possibility to analyze the correctness after the
transformation was applied could significantly reduce the amount of
code needed to verify the correctness.

A transformation like loop fusion could look like:

  void applyFuse(Red1, Red2) {
   // Assume Red1 and Red2 are consecutive sibling loops

  // Bail out on untypical things (atomic accesses, exceptions,
convergent instructions, ...)
  if (!Red1->isSimple() || !Red2->isSimple())
    return;

  // Consistency checks (can be less less strict by moving conditions
into the fused body).
  if (Red1->getIterationCount() != Red2->getIterationCount())
    return;
  if (Red1->getPredicate() != Red2->getPredicate())
    return;

  // Create new loop hierarchy with fused loops.
  GreenFused = new GreenLoop({ Red1->getGreen(), Red2->getGreen() },
                                                    
Red1->getIterationCount());
  RedFused = new RedNode(GreenFused, Red1->getParent());
  NewRoot = RedFused->getRoot();

  // Preserve analysis.
  // In this case not really necessary since the green nodes of the
  // loop bodies are reused.
  DependenceAnalysis->transferAnalysis(Red1->getBody(),
                                                   RedFused->getBody()[0],
ClosedExprAnalysis->makeIdentityMapping(Red1->getIterator(),
                                                  RedFused->getIterator()));
  DependenceAnalysis->transferAnalysis(Red2->getBody(),
                                                  RedFused->getBody()[1],
   ClosedExprAnalysis->makeIdentityMapping(Red2->getIterator(),
                                                  RedFused->getIterator()));

  ValidityCondition = DependenceAnalysis->isValid(NewRoot);
  if (ValidityCondition.isSatisfiable()) {
    RedFused.addAssumption(ValidityCondition);
    Candidates.push_back(NewRoot);
  }
}



Questions & Answers
-------------------

Q: Is the loop hierarchy a tree or DAG?
A: Both. A green node can be reused multiple times in the same
structure (e.g. due to unrolling, versioning, peelings, etc), but a
red node is specific to the position in its parent; there can be
multiple red nodes for the same green node.


Q: No SSA?
A: Since the location of a definition depends on the parent in a green
DAG, possibly from different loop hierarchies after transformations,
it is not possible to directly reference the definition with its use.
However, it is possible in the red tree, but runs against the goal of
avoiding heavy red nodes. Instead, to find the previous definition,
one has to traverse the green no up to find at which sibling level the
value is defined. However, this should not be a common query; loop
optimizations should use dependency information instead.


Q: Loops with multiple exits?
A: If every exiting edge a unique number to its __???_executed
variable, it can be handled like a switch.


Q: Unstructured control flow?
A: Unstructured control flow only introduces disjunctions to statement
predicated. That is, the predicate

    a or b

corresponds to block BB2 the CFG

    BB0:
      br i1 %a, label %BB2, label %BB2

    BB1:
      br i1 %b, label %BB2, label %BB3

    BB2:

In structured control flow, BB1 and/or BB2 would be post-dominated by
BB2 and the disjunction disappear.


Q: Irreducible loops?
A: Can be folded into the first iteration of the loop by including a
condition for statements not executed during the first iteration.


Q: The unreachable terminator?
A: If due to a non-returning function call, this introduces a loop
exit (which will also introduce control a dependency to everything
that follows) to the surrounding loops. If due to an assumption, we
may add the branching condition to the sufficient condition, but not
to the necessary condition.


Q: Infinite loops?
A: The closed-form expression identifying a statement instances and
loop trip count have no upper bound. Either use arbitrary range
arithmetic, use relative dependence vectors and/or fall back on
pessimistic assumptions. In addition, like with the unreachable
terminator, this may be handled as another loop exit.


Q: IndirectBr?
Since the list of possible targets is part of the instructions, can be
handled like a switch over the target address


Q: Loop guards?
A: Loop guards manifests as the execution condition of the loop
statement's predicate. A loop whose predicate evaluates to false can
be considered a loop with 0 iterations. The IR generator can also emit
the predicate as a loop guard. It might be useful to ensure that if
the predicate evaluates to true, the loop has at least one iteration
and the loop-condition is not evaluated for the first iteration (like
LoopRotation).


Q: Exceptions?
A: Could also be handled with control dependencies, but I would
instead mark the surrounding loops with "contains unpredictable
control flow" to be ignored by most transformations. Similar to
`!isSimple()` of LoadInst and StoreInst.


Q: Relation to the new/legacy pass manager?
A: LLVM's pass managers are unfortunately not designed to apply to
subtrees nor persistent data structures besides the LLVM-IR itself.
Instead, the loop tree optimizer would be its own monolithic pass on
the pass manager level (like MachinePassManager and VPlan). My idea is
to add it somewhere before LoopVectorize, but after the inliner,
potentially replace most other loop transformations. There should be
just one instance of the pass in the pipeline since creating an
alternative representation of LLVM-IR is not that cheap. It does not
need IR normalization (LoopSimplify, LICM, LoopRotate, LCSSA); these
can be performed on the loop tree and emitted into the LLVM-IR if a
non-modified tree is chosen. Possibly some of these passes can be
removed.
In order to preserve user-directed transformations (such as "#pragma
clang loop distribute"), it may also be interesting to put it earlier
into the pipeline to avoid that other passes such as LoopFullUnroll
change the loop before the transformation can be applied.
Alternatively, such passes can be taught the check the presence of a
user-directed transformation before applying its transformation.


Q: Relation to LoopVectorize/VPlan?
A: VPlan has similar design goals [9] but is intended for
vectorization only. It also has a hierarchical data structure, but
uses a CFG for acyclic control flow. It is also intended to have
multiple variants of the same source IR from which the best is
selected via a cost function. However, it lacks cheap copies. Instead
of instructions, it uses recipes/"meta-instructions" that handle what
happens to instructions after vectorization, e.g. do that operation on
each vector lane ("WIDEN"). VPlan is more oriented towards modifying
instructions instead of statements as collection of instructions. With
the loop hierarchy, I do not plan to redo the work of VPlan, but also
do not see a reason why it should not also be able to do
vectorization.


Q: Relation to Polly?
A: Polly uses ISL's schedule tree as the representation of the loop
hierarchy. Its creation may fail of something not representable in a
schedule tree is found. Polly tries to find the maximum representable
region, but once it is chosen, it cannot be changed (e.g. because
optimizing it takes too long). The schedule tree is stored internally
by ISL as a kind of green tree, but is walked over by an iterator that
remembers the path to the root. While this could be used to select a
best schedule tree using a cost function, Polly relies mostly on ISL's
reschedule algorithm to find a profitable transformation.


Q: Relation to MLIR?
A: MLIR is more similar to LLVM-IR than a loop hierarchy. For
instance, it also does not feature cheap copies. An advantage is that
loops and multi-dimensional arrays can be represented in the language
without the need of being rediscovered, but have to be inserted by a
front-end. That is, if Clang was generating MLIR, loops and arrays
still have to be rediscovered. However, a loop hierarchy optimizer
could be applied to MLIR just as well as to LLVM-IR.


References
----------
[1] http://llvm.org/devmtg/2018-10/talk-abstracts.html#talk11
[2] https://ieeexplore.ieee.org/abstract/document/5389392/
[3] http://sro.sussex.ac.uk/id/eprint/7576/1/Stanier,_James.pdf
[4]
https://blogs.msdn.microsoft.com/ericlippert/2012/06/08/persistence-facades-and-roslyns-red-green-trees/
[5] https://github.com/Meinersbur/llvm-project/tree/LOF/llvm/lib/LOF
[6] https://llvm.org/devmtg/2017-10/#src2
[7] https://arxiv.org/abs/1811.00632
[8] https://lists.llvm.org/pipermail/llvm-dev/2017-October/118125.html
[9] https://llvm.org/docs/Proposals/VectorizationPlan.html
[10] https://github.com/open64-compiler/open64/tree/master/osprey/be/lno
[11] https://reviews.llvm.org/D70688
[12] https://arxiv.org/abs/1801.07399

Michael Kruse via llvm-dev

2020-Jan-03 15:57 UTC

head link

[llvm-dev] RFC: Writing loop transformations on the right representation is more productive

I seem to have dropped the "RFC" in the title accidentally. That is,
this is meant to be a Request For Comments.

Michael


Am Fr., 3. Jan. 2020 um 12:26 Uhr schrieb Michael Kruse <llvmdev at
meinersbur.de>:>
> In the 2018 LLVM DevMtg [1], I presented some shortcomings of how LLVM
> optimizes loops. In summary, the biggest issues are (a) the complexity
> of writing a new loop optimization pass (including needing to deal
> with a variety of low-level issues, a significant amount of required
> boilerplate, the difficulty of analysis preservation, etc.), (b)
> independent optimization heuristics and a fixed pass ordering and (c)
> code explosion due to versioning. Currently, different people are
> working on improving different loop optimization passes such as
> LoopInterchange, LoopFuse and LoopVectorize. Even if LLVM had
> 'perfect' individual loop passes, they would still have the
> aforementioned problems due to still being independent and the system
> as a whole will be suboptimal.
>
> Instead of each loop pass being a major component, the heavy lifting
> could be done by a framework of which each transformation itself is a
> small part. In this RFC, I would like to work towards a consensus on
> how such a framework could look. I already outlined a possible
> solution in the same presentation [1] and a publication [7], which is
> partially based on a previous RFC [8]. All feedback is welcome,
> including a simple '+1'.
>
> The central idea is to use a modifiable loop tree -- similar to
> LoopInfo -- as the primary representation. LLVM-IR is converted to a
> loop tree, then optimized and finally LLVM-IR is generated again for
> subtrees that are considered profitable. This is not a new concept, it
> has already been used in compilers such as IBM XL Fortran (called ASTI
> [4]) and Silicon Graphics/Open64 (called LNO [10]), and in research
> such as the Value State Dependence Graph [3].
>
> Other features include the following:
>
> 1. Candidate selection through cost functions
> 2. Cheap copies using Red/Green Trees
> 3. Application of transformations from high-level to low-level
> 4. Represents loops and predicates instead of CFGs
> 5. Data and control dependencies in one graph
> 6. Late fallback versioning at IR regeneration
> 7. Standardized API for analyses with multiple implementations
> 8. Abstract representation of statements
> 9. Expansion of use-def chains to arrays when spanning loops
>
> To elaborate on each of these:
>
> 1. Candidate selection through cost function
> --------------------------------------------
> Instead of needing to know which transformation is profitable before
> applying it, create a copy of the data structure, modify it, and
> compare it to the original. Moreover, holding multiple, differently
> optimized, copies allows evaluating each variant using a cost function
> and select the 'best' when re-generating LLVM-IR again (or re-use
the
> original LLVM-IR).
>
> Otherwise, each transformation will have to implement its own
> profitability heuristic but cannot know what other passes will do. For
> instance, the motivation for LLVM's LoopDistribution is to enable
> vectorization of loops that can be separated from other loops.
> However, it does not know whether LoopVectorize will actually
> vectorize the loop, or whether two vectorizable loops are better
> vectorized together.
>
> Instantiating every possible sequence of transformations of course is
> not feasible, so the search space needs to be pruned. This could be
> made dependent on the optimization level.
>
>
> 2. Cheap copies using Red/Green Trees
> -------------------------------------
> A red/green trees [4] is a data structure for cheap copies of DAGs.
> Basically, a green tree node references only the nodes children but
> not the parent. A node from the red tree references its corresponding
> green node and its parent (red) node. When modifying the represented
> node through copy-on-write, this allows reusing all green nodes except
> the ones from the modified node to the tree's root. The same node can
> even be used multiple times in the same copy, which can be useful for
> e.g. loop unrolling and code versioning. As a result, storing many
> optimization variants of the same code only requires copies of the
> changed parts.
>
> By contrast, the red node is created on-demand. Traversing the entire
> tree should rarely be necessary (instead, use summary information in
> the green node) and when limited to subtrees that are considered
> interesting for optimization.
>
>
> 3. Apply transformations from high-level to low-level
> -----------------------------------------------------
> Optimization should be applied from very specialized to very general
> (potentially after some canonicalization). For instance, the first
> step could be detecting common idioms such as gemm and replace them
> with either a BLAS function call or apply well-studied optimizations
> like BLIS to them. After such an idiom has been detected, no other
> transformation should be applied to them.
>
> Mid-level transformations may try to map entire loop nests to cache-
> and compute hierarchies (SIMT threads, multiprocessors, offloading,
> etc) by applying transformations such as tiling, loop interchange and
> array packing.
>
> Low-level transformations commonly look at only one loop to improve
> processor pipelining and predication: Unrolling, software pipelining,
> etc.
>
>
> 4. Represents loops and predicates instead of CFGs
> -------------------------------------------------
> Control flow graphs make loop optimizations difficult, but is also not
> something a loop optimizer typically cares about. For instance, in the
> example below, loop distribution would need to understand the
> condition `c` and reconstruct the branch to fission
>
>     for (int i = 0; i < n; i += 1)
>       if (c) {
>         foo(i);
>         bar(i);
>       }
>
> into
>
>     for (int i = 0; i < n; i += 1)
>       if (c)
>         foo(i);
>     for (int i = 0; i < n; i += 1)
>       if (c)
>         bar(i);
>
> In the loop tree, control flow is represented through predicates of
> the statements.  In a first step, it is represented as
>
>     for (int i = 0; i < n; i += 1) {
>       bool __BB1_executed = 0;
>       { if (c) foo(i); __BB1_executed = 1; }
>       { if (__BB1_executed) bar(i); }
>     }
>
> The predicates (`c` and `__BB1_executed`) can be separated into a
> "necessary condition" (`c` must be true before `foo` can be
executed)
> and a "sufficient condition" (if `c` is true, `foo` must be
executed).
> These can be different to allow speculative execution of the
> statement. For instance, a necessary condition of `true` means that
> the statement can always be executed speculatively.
>
> The advantage is that control flow becomes a data-dependency problem.
> It can trivially be converted back to its CFG representation because
> `__BB1_executed` directly corresponds to the CFG edge to which it was
> connected. Data flow optimizations such as value propagation can be
> applied as well (here: `__BB1_executed == c`), such that we get
>
>     for (int i = 0; i < n; i += 1) {
>       { if (c) foo(i); }
>       { if (c) bar(i); }
>     }
>
> Which can be distributed trivially (assuming foo() and bar() do not
> cause dependencies). Whether the copy propagation is beneficial can be
> decided by a heuristic (considering whether the original flow can
> still be reconstructed) or just thrown away if the cost function does
> not select a loop tree based on the copy propagation.
>
> As a result of the predicated representation the loop tree has a
> regular pattern: sequence -> loop -> sequence -> loop -> ... .
A loop
> here is abstract in that it is something that is executed multiple
> times. This can be a sequential loop, but also a collection of SPMD
> processes, threads, vector lanes, etc.
>
> At IR regeneration, we have multiple choices of how to lower
> conditions. If we lower to LLVM-IR CFG, it may lower to
>  * Branch instructions, reusing branch conditions from previous statements
>  * Select instructions
>
> If lowering to SIMD instructions, we can generate
>  * Branch instructions for uniform predicates
>  * Execution masks for varying predicates
>  * A mix of both, since predicates can be uniform relative to previous
> taken conditions
>
> If lowering to NVPTX-style SIMT
>  * @p per-instruction execution masks
>  * Divergent branches; Every statement can be made a re-convergence
> point (by re-evaluating the condition), but will happen, at the
> latest, at the end of the loop
>  * Uniform predicates can be given the ".uni" suffix
>
> When lowering to AMD-style SIMT
>  * S_CBRANCH (explicitly managed execution mask)
>  * S_CBRANCH_I/G_FORK (stack of execution masks)
>
>
> 5. Data and control dependencies in one graph
> ---------------------------------------------
> Also as a result of the predicated from, during dependency analysis,
> control dependencies and data dependencies have the same
> representation, like in a program dependency graph. In LLVM-IR, passes
> such as AggressiveDeadCodeElimination use DominanceFrontier for this
> purpose. This would not be necessary.
>
> When re-generating IR, control dependencies can either be lowered to
> branching or as a boolean variables. For the purpose of loop
> optimizations, the choice of representation does not (and should not)
> make a difference.
>
>
> 6. Late fallback versioning at IR regeneration
> ------------------------------------------
> When a transformation is applied, it can attach conditions (no
> aliasing, no integer wrap, value restrictions, etc.) under which the
> transformation is valid. During LLVM-IR generation, these conditions
> are collected and emitted as run-time conditions. If the condition
> fails, the original code is executed.
>
> This avoids having every transformation emit its own fallback code.
> The fallback should never execute, or only in cases that do not do
> significant work (e.g. loop trip count is 0), such that optimizing the
> fallback itself is not necessary. In other cases, the transformation
> should implement proper loop versioning by adding the original
> statement with a `!rtc` condition.
>
>
> 7. Standardized API for analyses with multiple implementations
> --------------------------------------------------------------
> Common analyses use an interface that can be fulfilled by different
> implementations. Among these are:
>
>  * Closed form expressions
>    - None (expressions are always represented through dependencies
>      between statements)
>    - LLVM ScalarEvolution
>    - PolyhedralValueAnalysis [6]
>
> Closed form expressions can be used to remap values (typically array
> subscripts, loop lower/upper bounds), simplify expression computations
> (including predicates => is a predicate always true?), determine loop
> trip counts, etc. Most importantly, it identifies a statement
> execution (instance) relative to the loop counters.
>
>
>  * Array accesses / access range / provenance
>    - LLVM AliasAnalysis
>    - One-dimensional (always linearized)
>    - Multi-dimensional, delinearized if necessary
>    - From metadata (that a front-end emits for arrays of pointers to
>      arrays of data)
>
> Array access analysis determines whether two accesses at a specific
> index can alias each other, access ranges and connectivity.
>
>
>  * Dependencies
>    - none (cannot reorder statements)
>    - flat (depends on last execution of another statement)
>    - LLVM DependendencyAnalysis
>    - Polyhedral analysis
>
> There is intentionally no "does X depend on Y?" API as this would
make
> any analysis based on it quadratic in runtime. Instead, it has to list
> all statements another statement depends on such that a graph can be
> built and e.g. sorted topologically.
>
> Dependencies include: use-def chains via registers, memory
> flow/anti/output, general side-effects and control dependencies.
>
>
>  * Control flow
>    - Extracted from original CFG
>    - Evaluate statement conditions
>
> Control flow analysis answers questions about (post-)dominance and
> mutual exclusivity within the acyclic sequence/body of a loop.
>
>
>  * Cost heuristic
>    - TargetTransformationInfo
>
> Estimate the execution time of a (sub-)tree. Average trip counts from
> profiling [11] could be useful.
>
>
> 8. Abstract representation of statements
> ----------------------------------------
> There is no direct relationship between LLVM instructions and loop DAG
> statements. If possible, loop tree optimizations should only use the
> properties of the statement node, not its contents. Typically, a
> statement consists of multiple instructions that do not make sense to
> split. For instance, assuming that %add is not used a second time, in
> the example below
>
>     %add = add i64 %arg, 2
>     %mul = shl i64 %add, 1
>
> the two instructions should always be computed together in the same
> loop iteration. Combining instructions to statements reduces the
> computational complexity of loop optimizations and might be controlled
> by the optimization level (at lower optimization level more
> instructions would be combined). A loop node is itself a statement in
> it's parent loop node's acyclic body.
>
> Moreover, there is no reason to use only LLVM-IR to carry the
> semantics. We could also use e.g. Machine-IR or MLIR operations.
>
>
> 9. Expansion of use-def chains to arrays when spanning loops
> ------------------------------------------------------------
> Consider we would like to fission the following loop:
>
>     for (int i = 0; i < n; i += 1) {
>       { if(true) double a = f(i); }
>       { if(true) g(a); }
>     }
>
> The problem is that the split cuts the use-def dependency over `a`.
> However, the fission can still be done and results in:
>
>     double A[n];
>     for (int i = 0; i < n; i += 1)
>       { if(true) A[i] = f(i); }
>     for (int i = 0; i < n; i += 1) {
>       { if(true) g(A[i]); }
>
> The transforming pass has to consider this during its profitability
> model. The big advantage is that in terms of correctness, use-def
> chains do not manifest false dependencies.
>
>
> Possible Transformations
> ------------------------
> In principle, any transformation could be applied on this
> representation, but instruction-centric transformations such as
> InstCombine are less suitable. Obviously, loop-centric transformations
> are a better fit. Ideas for loop transformations include:
>
>  * Unroll(-and-jam)
>  * Peeling
>  * Iteration range split, concatenation
>  * Interchange
>  * Loop fusion and distribution(/fission)
>  * Tiling/stripmining/stripemining/blocking
>  * Code versioning (including loop unswitching)
>  * Maybe vectorization
>  * Accelerator offloading
>  * Parallelization
>  * Space-filling curve iteration [12]
>  * Array packing (e.g. for scratch pads), transposition
>  * Statement reordering
>  * SIMT-ization
>  * Detect and optimize reductions
>  * Loop skewing/wavefronting
>  * Loop reversal
>  * Solver-driven (polyhedral) optimization
>  * Loop idiom recognition
>
>
> Optimization Pipeline
> ---------------------
> The optimization pipeline for the hierarchy DAG is straightforward:
>
>     void optimizeLoopNest() {
>       RedOriginal = generateLoopHierarchy();
>       List = createCandidates(RedOriginal);
>       sort(List, [](Cand1,Cand2) { return estimateExecutionTime(Cand2)
>                                                    -
> estimateExecutionTime(Cand1); } );
>       if (List[0] != RedOriginal)
>         emit(List[0]);
>     }
>
> Subtree exploration is done recursively, optimizing inner nodes which
> can be (re-)used when optimizing outer nodes.
>
> The current node to optimize is compared a selection of known
> heuristics and its transformation is added to the list of candidates.
> That is, an optimization routine resemble the following, with more
> specific heuristics checked before more general ones.
>
> void exploreOptimizations(Root, Worklist) {
>    Matches = 0;
>
>    // High level: user annotations
>    if (Root.hasHint()) {
>      // Hint is mandatory: do not explore any other possibility
>      Worklist.replace(Root, applyHint(Root));
>      return;
>    }
>
>     // High level: Idiom recognition
>     if (isBLAS(Root)) {
>       Optimized = LinkAgainstBLASLibary ? convertToBlasLibCall(Root) :
>
> applyBLIS(Root):
>       Worklist.replace(Root, Optimized);
>       return;
>     }
>
>     // Mid level: recognize stencils
>     if (isStencil(Root))
>       ...
>
>     // Mid level: Map to compute hierarchy (here: simd -> SMT -> core
> -> socket for CPUs)
>     if (isParallel(Root) fitsWorkingSet(Root.getBody(), LLCSize) &&
&&
>       isParallel(Root.getPerfectlyNestedLoop(1)) &&
>        fitsWorkingSet(Root.getPerfectlyNestedLoop(2).getBody(), L2Size)
&&
>       isParallel(Root.getPerfectlyNestedLoop(2)) &&
>        fitsWorkingSet(Root.getPerfectlyNestedLoop(3).getBody(), L1Size)
&&
>       isVectorizable(Root.getPerfectlyNestedLoop(3))) {
>         Optimized = applyVectorization(Root.getPerfectlyNestedLoop(3));
>         Optimized = applyParallel(Optimized.getParent(), /*places=*/SMT);
>         Optimized = applyParallel(Optimized.getParent(), /*places=*/Cores);
>         Optimized = applyParallel(Optimized.getParent(),
/*places=*/Sockets);
>
>         // Still explore CPU pipeline optimizations of innermost
>         Worklist.insert(Root,Optimized);
>         Matches += 1;
>     }
>
>     if (Matches >= Threshold)
>       return;
>
>     // Low level: optimize working set for L1 cache size using tiling
>     Band = maximumTilableNest(Root);
>     WorkingSetSizePerIteration = estimatWorkingSet(Band.back().getBody());
>     TileSize = floor(nthroot(L1CacheSize / WorkingSetSizePerIteration,
> Band.size()));
>     if (TileSize > 1) {
>        Worklist.insert(applyTiling(Root, TilesSize.splat(Band.size()));
>        Matches += 1;
>     }
>
>     if (Matches >= Threshold)
>       return;
>
>     // Low level: vectorization for each SIMD level (potentially
>     // carried-out by VPlan)
>     foreach (Width : promisingSimdWidths(Root)) {
>       Vectorized = applyVectorization(Root, Width);
>       if (Vectorized.isSatisfiable()) {
>         Worklist.insert(Vectorized);
>         Matches += 1;
>       }
>     }
>
>     if (Matches >= Threshold)
>       return;
>
>     // Low level: rely on heuristic to find best unrolling factor
>     if (Factor = unrollFactorHeuristic(Root))  {
>       Worklist.insert(applyUnroll(Root, Factor).getRoot());
>       Matches += 1;
>     }
>
>     if (Matches >= Threshold)
>       return;
>
>     ...
> }
>
>
> The "levels" encode precedence optimizations. Depending on the
> optimization level, we might only apply the most promising
> optimization.
>
> Instead of this brute force implementation, for predefined
> optimization levels, the search space exploration should be guided:
> Encode the order of the most useful transformations in the search
> space exploration. For instance, find all idioms first, then apply
> memory hierarchy optimizations, then compute hierarchy, then CPU
> pipeline optimizations. Since this is very transformation specific, I
> do not expect to write a 'pass manager' where transformations can
> register itself to (except for exhaustive searches when explicitly
> enabled). Instead, I'd write this in code that classifies loop nests
> (e.g. compute bound, bandwidth bound, known idioms, tensor
> contractions, stencils, etc.) and choses appropriate optimization
> paths. We can, of course, make this more generic over time.
>
>
>
> Writing an optimization pass
> -----------------------------
>
> Optimizations themselves are written in the following steps:
> 1. Check preconditions
> 2. Quick profitability analysis to skip obviously bad transformations
> 3. Create new green nodes to represent the changed loop nest; reuse
> unchanged green nodes as much as possible
> 4. Create a red node to insert into the parent
> 5. Preserve analyses (most analyses should be able to use analysis
> stored in reused green nodes and/or lazyly reevaluate new nodes)
> 6. Check whether the transformation was valid
>
> Especially the possibility to analyze the correctness after the
> transformation was applied could significantly reduce the amount of
> code needed to verify the correctness.
>
> A transformation like loop fusion could look like:
>
>   void applyFuse(Red1, Red2) {
>    // Assume Red1 and Red2 are consecutive sibling loops
>
>   // Bail out on untypical things (atomic accesses, exceptions,
> convergent instructions, ...)
>   if (!Red1->isSimple() || !Red2->isSimple())
>     return;
>
>   // Consistency checks (can be less less strict by moving conditions
> into the fused body).
>   if (Red1->getIterationCount() != Red2->getIterationCount())
>     return;
>   if (Red1->getPredicate() != Red2->getPredicate())
>     return;
>
>   // Create new loop hierarchy with fused loops.
>   GreenFused = new GreenLoop({ Red1->getGreen(), Red2->getGreen() },
>                                                     
Red1->getIterationCount());
>   RedFused = new RedNode(GreenFused, Red1->getParent());
>   NewRoot = RedFused->getRoot();
>
>   // Preserve analysis.
>   // In this case not really necessary since the green nodes of the
>   // loop bodies are reused.
>   DependenceAnalysis->transferAnalysis(Red1->getBody(),
>                                                   
RedFused->getBody()[0],
> ClosedExprAnalysis->makeIdentityMapping(Red1->getIterator(),
>                                                  
RedFused->getIterator()));
>   DependenceAnalysis->transferAnalysis(Red2->getBody(),
>                                                  
RedFused->getBody()[1],
>    ClosedExprAnalysis->makeIdentityMapping(Red2->getIterator(),
>                                                  
RedFused->getIterator()));
>
>   ValidityCondition = DependenceAnalysis->isValid(NewRoot);
>   if (ValidityCondition.isSatisfiable()) {
>     RedFused.addAssumption(ValidityCondition);
>     Candidates.push_back(NewRoot);
>   }
> }
>
>
>
> Questions & Answers
> -------------------
>
> Q: Is the loop hierarchy a tree or DAG?
> A: Both. A green node can be reused multiple times in the same
> structure (e.g. due to unrolling, versioning, peelings, etc), but a
> red node is specific to the position in its parent; there can be
> multiple red nodes for the same green node.
>
>
> Q: No SSA?
> A: Since the location of a definition depends on the parent in a green
> DAG, possibly from different loop hierarchies after transformations,
> it is not possible to directly reference the definition with its use.
> However, it is possible in the red tree, but runs against the goal of
> avoiding heavy red nodes. Instead, to find the previous definition,
> one has to traverse the green no up to find at which sibling level the
> value is defined. However, this should not be a common query; loop
> optimizations should use dependency information instead.
>
>
> Q: Loops with multiple exits?
> A: If every exiting edge a unique number to its __???_executed
> variable, it can be handled like a switch.
>
>
> Q: Unstructured control flow?
> A: Unstructured control flow only introduces disjunctions to statement
> predicated. That is, the predicate
>
>     a or b
>
> corresponds to block BB2 the CFG
>
>     BB0:
>       br i1 %a, label %BB2, label %BB2
>
>     BB1:
>       br i1 %b, label %BB2, label %BB3
>
>     BB2:
>
> In structured control flow, BB1 and/or BB2 would be post-dominated by
> BB2 and the disjunction disappear.
>
>
> Q: Irreducible loops?
> A: Can be folded into the first iteration of the loop by including a
> condition for statements not executed during the first iteration.
>
>
> Q: The unreachable terminator?
> A: If due to a non-returning function call, this introduces a loop
> exit (which will also introduce control a dependency to everything
> that follows) to the surrounding loops. If due to an assumption, we
> may add the branching condition to the sufficient condition, but not
> to the necessary condition.
>
>
> Q: Infinite loops?
> A: The closed-form expression identifying a statement instances and
> loop trip count have no upper bound. Either use arbitrary range
> arithmetic, use relative dependence vectors and/or fall back on
> pessimistic assumptions. In addition, like with the unreachable
> terminator, this may be handled as another loop exit.
>
>
> Q: IndirectBr?
> Since the list of possible targets is part of the instructions, can be
> handled like a switch over the target address
>
>
> Q: Loop guards?
> A: Loop guards manifests as the execution condition of the loop
> statement's predicate. A loop whose predicate evaluates to false can
> be considered a loop with 0 iterations. The IR generator can also emit
> the predicate as a loop guard. It might be useful to ensure that if
> the predicate evaluates to true, the loop has at least one iteration
> and the loop-condition is not evaluated for the first iteration (like
> LoopRotation).
>
>
> Q: Exceptions?
> A: Could also be handled with control dependencies, but I would
> instead mark the surrounding loops with "contains unpredictable
> control flow" to be ignored by most transformations. Similar to
> `!isSimple()` of LoadInst and StoreInst.
>
>
> Q: Relation to the new/legacy pass manager?
> A: LLVM's pass managers are unfortunately not designed to apply to
> subtrees nor persistent data structures besides the LLVM-IR itself.
> Instead, the loop tree optimizer would be its own monolithic pass on
> the pass manager level (like MachinePassManager and VPlan). My idea is
> to add it somewhere before LoopVectorize, but after the inliner,
> potentially replace most other loop transformations. There should be
> just one instance of the pass in the pipeline since creating an
> alternative representation of LLVM-IR is not that cheap. It does not
> need IR normalization (LoopSimplify, LICM, LoopRotate, LCSSA); these
> can be performed on the loop tree and emitted into the LLVM-IR if a
> non-modified tree is chosen. Possibly some of these passes can be
> removed.
> In order to preserve user-directed transformations (such as "#pragma
> clang loop distribute"), it may also be interesting to put it earlier
> into the pipeline to avoid that other passes such as LoopFullUnroll
> change the loop before the transformation can be applied.
> Alternatively, such passes can be taught the check the presence of a
> user-directed transformation before applying its transformation.
>
>
> Q: Relation to LoopVectorize/VPlan?
> A: VPlan has similar design goals [9] but is intended for
> vectorization only. It also has a hierarchical data structure, but
> uses a CFG for acyclic control flow. It is also intended to have
> multiple variants of the same source IR from which the best is
> selected via a cost function. However, it lacks cheap copies. Instead
> of instructions, it uses recipes/"meta-instructions" that handle
what
> happens to instructions after vectorization, e.g. do that operation on
> each vector lane ("WIDEN"). VPlan is more oriented towards
modifying
> instructions instead of statements as collection of instructions. With
> the loop hierarchy, I do not plan to redo the work of VPlan, but also
> do not see a reason why it should not also be able to do
> vectorization.
>
>
> Q: Relation to Polly?
> A: Polly uses ISL's schedule tree as the representation of the loop
> hierarchy. Its creation may fail of something not representable in a
> schedule tree is found. Polly tries to find the maximum representable
> region, but once it is chosen, it cannot be changed (e.g. because
> optimizing it takes too long). The schedule tree is stored internally
> by ISL as a kind of green tree, but is walked over by an iterator that
> remembers the path to the root. While this could be used to select a
> best schedule tree using a cost function, Polly relies mostly on ISL's
> reschedule algorithm to find a profitable transformation.
>
>
> Q: Relation to MLIR?
> A: MLIR is more similar to LLVM-IR than a loop hierarchy. For
> instance, it also does not feature cheap copies. An advantage is that
> loops and multi-dimensional arrays can be represented in the language
> without the need of being rediscovered, but have to be inserted by a
> front-end. That is, if Clang was generating MLIR, loops and arrays
> still have to be rediscovered. However, a loop hierarchy optimizer
> could be applied to MLIR just as well as to LLVM-IR.
>
>
> References
> ----------
> [1] http://llvm.org/devmtg/2018-10/talk-abstracts.html#talk11
> [2] https://ieeexplore.ieee.org/abstract/document/5389392/
> [3] http://sro.sussex.ac.uk/id/eprint/7576/1/Stanier,_James.pdf
> [4]
https://blogs.msdn.microsoft.com/ericlippert/2012/06/08/persistence-facades-and-roslyns-red-green-trees/
> [5] https://github.com/Meinersbur/llvm-project/tree/LOF/llvm/lib/LOF
> [6] https://llvm.org/devmtg/2017-10/#src2
> [7] https://arxiv.org/abs/1811.00632
> [8] https://lists.llvm.org/pipermail/llvm-dev/2017-October/118125.html
> [9] https://llvm.org/docs/Proposals/VectorizationPlan.html
> [10] https://github.com/open64-compiler/open64/tree/master/osprey/be/lno
> [11] https://reviews.llvm.org/D70688
> [12] https://arxiv.org/abs/1801.07399

Renato Golin via llvm-dev

2020-Jan-10 22:09 UTC

head link

[llvm-dev] Writing loop transformations on the right representation is more productive

On Fri, 3 Jan 2020 at 11:27, Michael Kruse via llvm-dev
<llvm-dev at lists.llvm.org> wrote:> 1. Candidate selection through cost function
> --------------------------------------------
> Instead of needing to know which transformation is profitable before
> applying it, create a copy of the data structure, modify it, and
> compare it to the original. Moreover, holding multiple, differently
> optimized, copies allows evaluating each variant using a cost function
> and select the 'best' when re-generating LLVM-IR again (or re-use
the
> original LLVM-IR).
This sounds a lot like VPlan.
> Instantiating every possible sequence of transformations of course is
> not feasible, so the search space needs to be pruned. This could be
> made dependent on the optimization level.
Are you planning on using heuristic searches? This could make the
vectoriser unstable upon small input changes and therefore hard to get
consistent results and testing.

I'm not against such idea, but I think we need to be conservative in
such a core component of the compiler.

It would be nice to have -Ogocrazy to mean "keep going until you find
something", but usually, -O3 should terminate. :)
> 2. Cheap copies using Red/Green Trees
This seems like an elegant approach to a complex problem.

> 3. Apply transformations from high-level to low-level
> -----------------------------------------------------
> Optimization should be applied from very specialized to very general
> (potentially after some canonicalization). For instance, the first
> step could be detecting common idioms such as gemm and replace them
> with either a BLAS function call or apply well-studied optimizations
> like BLIS to them. After such an idiom has been detected, no other
> transformation should be applied to them.
I'm sceptical to such a machinery. People usually write bad code (me
included) and trying to mach multiple patterns to the same semantics
will be hard, considering how lenient C++ is to pointer handling and
type conversions.

If you do find a match and convert to a library function call, then
well, you can't do anything with it even if you wanted to. :)
> Mid-level transformations may try to map entire loop nests to cache-
> and compute hierarchies (SIMT threads, multiprocessors, offloading,
> etc) by applying transformations such as tiling, loop interchange and
> array packing.
This is hard but very profitable. However, feels to me again that this
is just VPlan packed differently.

While VPlan still has no way yet to handle even simple outer-loops
(has that landed yet?), once we do, then the natural progression will
be to start understanding their semantics and possibly make high level
assumptions like that.

> 6. Late fallback versioning at IR regeneration
> ------------------------------------------
> When a transformation is applied, it can attach conditions (no
> aliasing, no integer wrap, value restrictions, etc.) under which the
> transformation is valid. During LLVM-IR generation, these conditions
> are collected and emitted as run-time conditions. If the condition
> fails, the original code is executed.
This sounds like it will bloat code for a lot of cold cases. Or worse,
get it wrong, and put hot code in the cold path.

> 7. Standardized API for analyses with multiple implementations
These are good to have regardless of which vectorisation strategy we use.

> 8. Abstract representation of statements
> ----------------------------------------
> For instance, assuming that %add is not used a second time, in
> the example below
>
>     %add = add i64 %arg, 2
>     %mul = shl i64 %add, 1
>
> the two instructions should always be computed together in the same
> loop iteration.
This may inhibit further combines, or even detection of target
specific patterns for SIMD code that aren't common.

I agree that not forcefully binding statements with instructions is a
good idea, but this may need a target-specific pattern matcher to be
more sensitive to target idiosyncrasies.

> 9. Expansion of use-def chains to arrays when spanning loops
> ------------------------------------------------------------
> The transforming pass has to consider this during its profitability
> model. The big advantage is that in terms of correctness, use-def
> chains do not manifest false dependencies.
Sounds good, but also creates the problem of how to handle the array.
If 'n' is unknown, or dependent on SIMD widths or number of threads,
it's too low level to add anything that is guaranteed to not change
the performance profile of the original loop.

> Q: Relation to the new/legacy pass manager?
> A: LLVM's pass managers are unfortunately not designed to apply to
> subtrees nor persistent data structures besides the LLVM-IR itself.
By design. The more alternative persistent data structures you have
being modified by a series of passes, the harder it is to know what
did what and where you are.
> Instead, the loop tree optimizer would be its own monolithic pass on
> the pass manager level (like MachinePassManager and VPlan). My idea is
> to add it somewhere before LoopVectorize, but after the inliner,
> potentially replace most other loop transformations.
To me this almost sounds like Polly. Take LLVM IR into a completely
different representation, do a bunch of transformations with it,
re-generate LLVM IR and spits it back into the pipeline.

By that time, all analyses have to be invalidated. All
canonicalisations that had been done will probably be destroyed and
many current pattern matches will stop working. This infrastructure is
only meaningful at the function level or higher, so the potential for
wide range destruction is not trivial.

Don't get me wrong, I like the idea, it's a cool experiment using some
cool data structures and algorithms. But previous experiences with the
pass manager have, well, not gone smooth in any shape or form.

> Q: Relation to LoopVectorize/VPlan?
> A: VPlan has similar design goals [9] but is intended for
> vectorization only.
Again, by a conservative design. I think adding yet-another won't help.

My point is: if this is the way to go, then we should start to think
how we make everything that makes sense become part of this scheme.
Splitting the pass manager into SSA and Tree, run some passes in one
others in the other, and so on.

But creating multiple, incompatible and potentially destructive whole
new pass managers will make a hard job impossible.
> However, it lacks cheap copies. Instead
> of instructions, it uses recipes/"meta-instructions" that handle
what
> happens to instructions after vectorization, e.g. do that operation on
> each vector lane ("WIDEN").
Nothing stops us from implementing a leaner approach to VPlan. It
wouldn't be a trivial implementation, but the volume of work that
would be required in this proposal is staggering, too.

> VPlan is more oriented towards modifying
> instructions instead of statements as collection of instructions.
Fair enough, the design was to enhance SIMD code generation, not any
kind of parallel semantics. I guess it would be possible to add the
concept of higher level blocks to VPlan.

All in all, VPlan is young and in constant refactoring, and perhaps it
would be more productive to move it towards a more inclusive approach
than throwing it away before it fully matures to start a whole new
project.

https://xkcd.com/927/

> Q: Relation to MLIR?
> A: MLIR is more similar to LLVM-IR than a loop hierarchy. For
> instance, it also does not feature cheap copies.
If you treat MLIR as your red tree, you could create a green tree
(perhaps as a dialect) and have cheap copies (passing the dialects and
deltas without passing the base).

> An advantage is that
> loops and multi-dimensional arrays can be represented in the language
> without the need of being rediscovered, but have to be inserted by a
> front-end.
Not necessarily. We have discussed introducing dialect annotation to
MLIR during compile time from analysis passes that would basically do
what the front-end should have done.

Conclusions?

This was a long email, with too many proposals, so I don't have any
strong opinion or conclusions, not even from my own comments.

Overall, I like a lot of the ideas (red/green, tree optimisation,
different search strategy), but I dislike the encompassing proposal to
*replace* a lot of the existing infrastructure.

For better or worse, LLVM is a product of its age. Some things could
have been done better, but we have always adopted the "general
consensus and slow movement" way to change things. Sometimes too slow,
but...

Now, MLIR can be a way to speed that up.

It is a much more malleable format than LLVM IR, it was designed for
high-level representation, has a lot of parallelism concepts in it and
it's supposed to interact with LLVM IR seamlessly.

It may be much easier to use MLIR to interoperate the two "pass
managers" _together_ than converting from one to the other and vice
versa.

This is a bold claim and I have no evidence it could ever work. But I
think it would still be less work than creating yet another pass
manager from scratch.

cheers,
--renato

Michael Kruse via llvm-dev

2020-Jan-11 00:33 UTC

head link

[llvm-dev] Writing loop transformations on the right representation is more productive

Am Fr., 10. Jan. 2020 um 16:10 Uhr schrieb Renato Golin <rengolin at
gmail.com>:> On Fri, 3 Jan 2020 at 11:27, Michael Kruse via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
> > 1. Candidate selection through cost function
> > --------------------------------------------
> > Instead of needing to know which transformation is profitable before
> > applying it, create a copy of the data structure, modify it, and
> > compare it to the original. Moreover, holding multiple, differently
> > optimized, copies allows evaluating each variant using a cost function
> > and select the 'best' when re-generating LLVM-IR again (or
re-use the
> > original LLVM-IR).
>
> This sounds a lot like VPlan.
Yes, as mentioned in the Q&A. Unfortunately VPlan is able to represent
arbitrary code not has cheap copies.

> > Instantiating every possible sequence of transformations of course is
> > not feasible, so the search space needs to be pruned. This could be
> > made dependent on the optimization level.
>
> Are you planning on using heuristic searches? This could make the
> vectoriser unstable upon small input changes and therefore hard to get
> consistent results and testing.
>
> I'm not against such idea, but I think we need to be conservative in
> such a core component of the compiler.
>
> It would be nice to have -Ogocrazy to mean "keep going until you find
> something", but usually, -O3 should terminate. :)
I agree, as outlined in the RFC under "predefined optimization
levels".

> > 3. Apply transformations from high-level to low-level
> > -----------------------------------------------------
> > Optimization should be applied from very specialized to very general
> > (potentially after some canonicalization). For instance, the first
> > step could be detecting common idioms such as gemm and replace them
> > with either a BLAS function call or apply well-studied optimizations
> > like BLIS to them. After such an idiom has been detected, no other
> > transformation should be applied to them.
>
> I'm sceptical to such a machinery. People usually write bad code (me
> included) and trying to mach multiple patterns to the same semantics
> will be hard, considering how lenient C++ is to pointer handling and
> type conversions.
This conversion is a possibility and certainly not the main motivation
for a loop hierarchy. Smaller idioms exists as well, such as detecting
popcount. Even with gemm I think it would be nice if it could be
written in a naive version in the source code that compiles with any
compiler, but also benefit from the target platform's hand-optimized
performance primitives by adding a compiler switch (which could be
-O3).


> > Mid-level transformations may try to map entire loop nests to cache-
> > and compute hierarchies (SIMT threads, multiprocessors, offloading,
> > etc) by applying transformations such as tiling, loop interchange and
> > array packing.
>
> This is hard but very profitable. However, feels to me again that this
> is just VPlan packed differently.
>
> While VPlan still has no way yet to handle even simple outer-loops
> (has that landed yet?), once we do, then the natural progression will
> be to start understanding their semantics and possibly make high level
> assumptions like that.
I wouldn't have thought that parallelization and offloading was ever
considered on top of VPlan.


> > 6. Late fallback versioning at IR regeneration
> > ------------------------------------------
> > When a transformation is applied, it can attach conditions (no
> > aliasing, no integer wrap, value restrictions, etc.) under which the
> > transformation is valid. During LLVM-IR generation, these conditions
> > are collected and emitted as run-time conditions. If the condition
> > fails, the original code is executed.
>
> This sounds like it will bloat code for a lot of cold cases. Or worse,
> get it wrong, and put hot code in the cold path.
Are you arguing against code versioning? It is already done today by
multiple passes such as LoopVersioningLICM, LoopDistribute,
LoopUnrollAndJam and LoopVectorize. The proposal explicitly tries to
avoid code bloat by having just one fallback copy. Runtime conditions
can be chosen more or less optimistically, but I don't see how this
should be an argument for all kinds of versioning.

If you are concerned about bloat in cold paths, we could use profile
information to optimize cold functions with '-Os', something that GCC
does, but not Clang.

> > 7. Standardized API for analyses with multiple implementations
>
> These are good to have regardless of which vectorisation strategy we use.
In LLVM, AliasAnalysis does this, but hat not yet found another application.


> > 8. Abstract representation of statements
> > ----------------------------------------
> > For instance, assuming that %add is not used a second time, in
> > the example below
> >
> >     %add = add i64 %arg, 2
> >     %mul = shl i64 %add, 1
> >
> > the two instructions should always be computed together in the same
> > loop iteration.
>
> This may inhibit further combines, or even detection of target
> specific patterns for SIMD code that aren't common.
>
> I agree that not forcefully binding statements with instructions is a
> good idea, but this may need a target-specific pattern matcher to be
> more sensitive to target idiosyncrasies.
My idea here is that loop-level optimizations rarely need to know
which target-specific instructions are executed, as long as it knows
its performance-relevant properties. This might be a difference to
vectorization which may be more ISA-specific.


> > 9. Expansion of use-def chains to arrays when spanning loops
> > ------------------------------------------------------------
> > The transforming pass has to consider this during its profitability
> > model. The big advantage is that in terms of correctness, use-def
> > chains do not manifest false dependencies.
>
> Sounds good, but also creates the problem of how to handle the array.
> If 'n' is unknown, or dependent on SIMD widths or number of
threads,
> it's too low level to add anything that is guaranteed to not change
> the performance profile of the original loop.
As mentioned, the profitability model has to take this into account.
Conservatively, we may only do this if the resulting array is a small
constant size such that we can assume that even multiple of those fir
on the stack.

> > Q: Relation to the new/legacy pass manager?
> > A: LLVM's pass managers are unfortunately not designed to apply to
> > subtrees nor persistent data structures besides the LLVM-IR itself.
>
> By design. The more alternative persistent data structures you have
> being modified by a series of passes, the harder it is to know what
> did what and where you are.
The proposal avoids persistent data structures between separate passes.

Note that MachineFunctionPass maintains the MachineFunction data
structure in parallel to the LLVM-IR.

> > Instead, the loop tree optimizer would be its own monolithic pass on
> > the pass manager level (like MachinePassManager and VPlan). My idea is
> > to add it somewhere before LoopVectorize, but after the inliner,
> > potentially replace most other loop transformations.
>
> To me this almost sounds like Polly. Take LLVM IR into a completely
> different representation, do a bunch of transformations with it,
> re-generate LLVM IR and spits it back into the pipeline.
There is indeed an inspiration from Polly.

> By that time, all analyses have to be invalidated. All
> canonicalisations that had been done will probably be destroyed and
> many current pattern matches will stop working. This infrastructure is
> only meaningful at the function level or higher, so the potential for
> wide range destruction is not trivial.
>
> Don't get me wrong, I like the idea, it's a cool experiment using
some
> cool data structures and algorithms. But previous experiences with the
> pass manager have, well, not gone smooth in any shape or form.
What experiments? I don't see a problem if the pass manger has to
invalidate analysis are re-run canonicalization passes. This happens
many times in the default pass pipelines. In addition, this
invalidation is only necessary if the loop optimization pass optimizes
something, in which case the additional cost should be justified.

> > Q: Relation to LoopVectorize/VPlan?
> > A: VPlan has similar design goals [9] but is intended for
> > vectorization only.
>
> Again, by a conservative design. I think adding yet-another won't help.
>
> My point is: if this is the way to go, then we should start to think
> how we make everything that makes sense become part of this scheme.
> Splitting the pass manager into SSA and Tree, run some passes in one
> others in the other, and so on.
>
> But creating multiple, incompatible and potentially destructive whole
> new pass managers will make a hard job impossible.
I don't think the proposal qualifies as including a full-flexible new
pass manger, at least no more than the current mechanism LoopVectorize
uses to run passes on VPlan (LoopVectorizationPlanner::plan).

> > However, it lacks cheap copies. Instead
> > of instructions, it uses recipes/"meta-instructions" that
handle what
> > happens to instructions after vectorization, e.g. do that operation on
> > each vector lane ("WIDEN").
>
> Nothing stops us from implementing a leaner approach to VPlan. It
> wouldn't be a trivial implementation, but the volume of work that
> would be required in this proposal is staggering, too.
>
> > VPlan is more oriented towards modifying
> > instructions instead of statements as collection of instructions.
>
> Fair enough, the design was to enhance SIMD code generation, not any
> kind of parallel semantics. I guess it would be possible to add the
> concept of higher level blocks to VPlan.
>
> All in all, VPlan is young and in constant refactoring, and perhaps it
> would be more productive to move it towards a more inclusive approach
> than throwing it away before it fully matures to start a whole new
> project.
While I still think the goals of VPlan and a loop hierarchy are
different, I expect VPlan to be production-ready earlier than this
proposal. I fear that combining them would delay the both.

> https://xkcd.com/927/
While I can never find this xkcd not funny, a the loop hierarchy is
not intended to be universal.

> > Q: Relation to MLIR?
> > A: MLIR is more similar to LLVM-IR than a loop hierarchy. For
> > instance, it also does not feature cheap copies.
>
> If you treat MLIR as your red tree, you could create a green tree
> (perhaps as a dialect) and have cheap copies (passing the dialects and
> deltas without passing the base).
I don't see how this could work.

> > An advantage is that
> > loops and multi-dimensional arrays can be represented in the language
> > without the need of being rediscovered, but have to be inserted by a
> > front-end.
>
> Not necessarily. We have discussed introducing dialect annotation to
> MLIR during compile time from analysis passes that would basically do
> what the front-end should have done.
The argument is that MLIR has first-class expressions for
multi-dimensional array accesses ("MemRef") while LLVM-IR does not.

https://lists.llvm.org/pipermail/llvm-dev/2019-July/134063.html

Both of them can have analyses to raise the abstraction to a
multi-dimensional access ("delinearization").

> Conclusions?
>
> This was a long email, with too many proposals, so I don't have any
> strong opinion or conclusions, not even from my own comments.
Thank you for going through it!

> Overall, I like a lot of the ideas (red/green, tree optimization,
> different search strategy), but I dislike the encompassing proposal to
> *replace* a lot of the existing infrastructure.
Not a replacement, but an addition that does not always need to be
enabled (e.g. -O0).

In a previous RFC [8] I tried to NOT introduce a data structure but to
re-use LLVM-IR. The only discussion there was about the RFC, was about
not to 'abuse' the LLVM-IR.

https://lists.llvm.org/pipermail/llvm-dev/2017-October/118169.html
https://lists.llvm.org/pipermail/llvm-dev/2017-October/118258.html

I definitely see the merits of using fewer data structures, but it is
also hard to re-use something existing for a different purpose (in
this case: VPlan) without making both more complex.

> For better or worse, LLVM is a product of its age. Some things could
> have been done better, but we have always adopted the "general
> consensus and slow movement" way to change things. Sometimes too slow,
> but...
>
> Now, MLIR can be a way to speed that up.
>
> It is a much more malleable format than LLVM IR, it was designed for
> high-level representation, has a lot of parallelism concepts in it and
> it's supposed to interact with LLVM IR seamlessly.
>
> It may be much easier to use MLIR to interoperate the two "pass
> managers" _together_ than converting from one to the other and vice
> versa.
>
> This is a bold claim and I have no evidence it could ever work. But I
> think it would still be less work than creating yet another pass
> manager from scratch.
This is why I don't want the framework to be too tangled with LLVM-IR.
For the foreseeable future, Clang will generate LLVM-IR, but our
motivation is to (also) optimize C/C++ code. That is, I do not see a
way to not (also) handle LLVM-IR until Clang is changed to generate
MLIR (which then again will be another data struture in the system).


Michael

Chris Lattner via llvm-dev

2020-Jan-13 06:07 UTC

head link

[llvm-dev] [RFC] Writing loop transformations on the right representation is more productive

On Jan 3, 2020, at 3:26 AM, Michael Kruse via llvm-dev <llvm-dev at
lists.llvm.org> wrote:> In the 2018 LLVM DevMtg [1], I presented some shortcomings of how LLVM
> optimizes loops. In summary, the biggest issues are (a) the complexity
> of writing a new loop optimization pass (including needing to deal
> with a variety of low-level issues, a significant amount of required
> boilerplate, the difficulty of analysis preservation, etc.), (b)
> independent optimization heuristics and a fixed pass ordering and (c)
> code explosion due to versioning. Currently, different people are
> working on improving different loop optimization passes such as
> LoopInterchange, LoopFuse and LoopVectorize. Even if LLVM had
> 'perfect' individual loop passes, they would still have the
> aforementioned problems due to still being independent and the system
> as a whole will be suboptimal.
Hi Michael,

Thank you for bringing this up.  This is an area of interest, and I certainly
share you view of what a pain this all is right now.  I can tell you’ve put a
lot of thought into this and time into your RFC!

> The central idea is to use a modifiable loop tree -- similar to
> LoopInfo -- as the primary representation. LLVM-IR is converted to a
> loop tree, then optimized and finally LLVM-IR is generated again for
> subtrees that are considered profitable. This is not a new concept, it
> has already been used in compilers such as IBM XL Fortran (called ASTI
> [4]) and Silicon Graphics/Open64 (called LNO [10]), and in research
> such as the Value State Dependence Graph [3].
Ignoring the details of its representation, this is also conceptually how Polly
works: code is lifted into its representation, transformed, then lowered back
down.
> 4. Represents loops and predicates instead of CFGs
Yes, totally!

Overall, I think that this discussion would be easier to process if we broke it
into a few pieces.  There seems to be consensus that LLVM IR (as is) is not the
right representation for aggressive loop transformations.  If we don’t have
consensus on this, then I’d make sure to start there.

Once that is established, there is a question of “what is the right
representation to use”?  This question has two subcomponents: what data
structure should we use, and what is the IR within it.

If you propose introducing a brand new data structure, please expect me to push
back on that pretty hard.  This is a perfect application of MLIR, and using it
provides a lot of value (including amazing testing tools, round-tripping,
location tracking, etc) which would otherwise would have to be reinvented, and
doesn’t not dictate the IR structure otherwise.  MLIR absolutely supports nested
loop structures etc, as is seen in the affine dialect.

The MLIR community also is highly invested in HPC-style transformations on this,
and a lot of thought has gone into it.  You can learn more about this in the
slides and videos from the MLIR open design meetings
<https://docs.google.com/document/d/1y_9f1AbfgcoVdJh4_aM6-BaSHvrHl8zuA5G4jv_94K8/edit#heading=h.cite1kolful9>.

One you achieve consensus on data structure, there is the question of what IR to
use within it.  I would recommend starting with some combination of “existing
LLVM IR operations + high level control flow representation”, e.g. parallel and
affine loops.  The key here is that you need to always be able to lower in a
simple and predictable way to LLVM IR (this is one of the things that classic
polyhedral systems did sub optimally, making it difficult to reason about the
cost model of various transformations), and this is a natural incremental
starting point anyway.  Over time, more high level concepts can be gradually
introduced.  FYI, MLIR already has a reasonable LLVM dialect
<https://mlir.llvm.org/docs/Dialects/LLVM/> and can generate LLVM IR from
it, so we’d just need an “LLVM IR -> MLIR LLVM dialect” conversion, which
should be straightforward to build.

Once you have the data structure and the dialect within it decided, you have the
set of transformations.  Again, you’ve given a lot of thought to this, and that
all sounds great to me, but the priorities can be driven by whoever wants to
contribute and what concrete problems they have to solve.

Once the infra for “raising to this representation and lowering back down” is
figured out, we can open the box of having clang and other front ends directly
generate it.
> Q: Relation to MLIR?
> A: MLIR is more similar to LLVM-IR than a loop hierarchy.
This is not true, MLIR is great for dialects that want to model loop hierarchies
naturally, this is a major focus of the affine dialect
<https://mlir.llvm.org/docs/Dialects/Affine/> (e.g. see affine.for on that
page).  MLIR is not limited to affine loops, that is just a choice made by the
affine dialect - the loop dialect
<https://mlir.llvm.org/docs/Dialects/LoopOps/> has more general constructs
that are less developed.
> For
> instance, it also does not feature cheap copies.
I’m not sure what this means.
> An advantage is that
> loops and multi-dimensional arrays can be represented in the language
> without the need of being rediscovered, but have to be inserted by a
> front-end.
This is correct, but I don’t see how this helps if your focus is raising code
that has already been lowered to LLVM IR, e.g. by Clang or some other frontend
that generates LLVM IR today.
> That is, if Clang was generating MLIR, loops and arrays
> still have to be rediscovered.
This isn’t true, it would be perfectly sensible to lower C control flow
structures directly too LLVM.  The primary concern are things like unstructured
switches (think duff’s device) and unstructured gotos, but these occur rarely:
they need to be handled correctly, but can do so with a traditionally lowered
CFG and other “best effort” attempts to raise them.

Other frontends like Swift and Flang could also generate this directly if they
chose to, getting the benefits of progressive lowering.
> However, a loop hierarchy optimizer
> could be applied to MLIR just as well as to LLVM-IR.
Right!  In addition to structured control flow, MLIR has great support for CFG
representations like LLVM of course. :-)

-Chris

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200112/c16521ae/attachment.html>

Renato Golin via llvm-dev

2020-Jan-13 09:51 UTC

head link

[llvm-dev] [RFC] Writing loop transformations on the right representation is more productive

On Mon, 13 Jan 2020 at 06:07, Chris Lattner via llvm-dev
<llvm-dev at lists.llvm.org> wrote:> Overall, I think that this discussion would be easier to process if we
broke it into a few pieces. There seems to be consensus that LLVM IR (as is) is
not the right representation for aggressive loop transformations. If we don’t
have consensus on this, then I’d make sure to start there.
>From all meetings we had about parallelism representations in IR overthe past few dev meetings, this was pretty much the only agreement. :)

We couldn't find a clear way to represent most parallel concepts in IR
(without breaking the others), but there wasn't a single format on top
of LLVM IR that we could all agree on.

> If you propose introducing a brand new data structure, please expect me to
push back on that pretty hard. This is a perfect application of MLIR, and using
it provides a lot of value (including amazing testing tools, round-tripping,
location tracking, etc) which would otherwise would have to be reinvented, and
doesn’t not dictate the IR structure otherwise. MLIR absolutely supports nested
loop structures etc, as is seen in the affine dialect.
Today, I believe MLIR can serve that purpose, with (possibly)
overlapping dialects, and have cheap copies.

Cheap copies are required for heuristic searches, which have at least
polynomial compute/memory cost, but sometimes exponential.

Copying the whole module N^2 times (N = some pre-defined large budget)
won't work, that's why overlays (like the red/green tree idea) are
needed.

Journaling and snap-shooting are also possible ways to make memory
almost constant (and not exploding compute), and that should be
possible with MLIR dialects, I think.

> Once you have the data structure and the dialect within it decided, you
have the set of transformations. Again, you’ve given a lot of thought to this,
and that all sounds great to me, but the priorities can be driven by whoever
wants to contribute and what concrete problems they have to solve.
If we don't have a generic enough representation, early designs will
change how later ones will be implemented.

That's why I think MLIR would be the right choice, as it's much more
flexible and composable.

cheers,
--renato

Michael Kruse via llvm-dev

2020-Jan-15 04:39 UTC

head link

[llvm-dev] [RFC] Writing loop transformations on the right representation is more productive

Am So., 12. Jan. 2020 um 20:07 Uhr schrieb Chris Lattner <
clattner at nondot.org>:
> The central idea is to use a modifiable loop tree -- similar to
> LoopInfo -- as the primary representation. LLVM-IR is converted to a
> loop tree, then optimized and finally LLVM-IR is generated again for
> subtrees that are considered profitable. This is not a new concept, it
> has already been used in compilers such as IBM XL Fortran (called ASTI
> [4]) and Silicon Graphics/Open64 (called LNO [10]), and in research
> such as the Value State Dependence Graph [3].
>
>
> Ignoring the details of its representation, this is also conceptually how
> Polly works: code is lifted into its representation, transformed, then
> lowered back down.
>
Indeed I tried to improve on Polly's internal representation, and improve
on the issue that Polly can only represent a subset of LLVM-IR code.


> Overall, I think that this discussion would be easier to process if we
> broke it into a few pieces.  There seems to be consensus that LLVM IR (as
> is) is not the right representation for aggressive loop transformations.
> If we don’t have consensus on this, then I’d make sure to start there.
>
>
> Once that is established, there is a question of “what is the right
> representation to use”?  This question has two subcomponents: what data
> structure should we use, and what is the IR within it.
>
> If you propose introducing a brand new data structure, please expect me to
> push back on that pretty hard.
>
Which I think is a good thing since I also do not want too many data
structures being more-or-less well maintained. But I also think there is a
good argument to a loop-centric data structure.


> This is a perfect application of MLIR, and using it provides a lot of
> value (including amazing testing tools, round-tripping, location tracking,
> etc) which would otherwise would have to be reinvented, and doesn’t not
> dictate the IR structure otherwise.  MLIR absolutely supports nested loop
> structures etc, as is seen in the affine dialect.
>
> The MLIR community also is highly invested in HPC-style transformations on
> this, and a lot of thought has gone into it.  You can learn more about this
> in the slides and videos from the MLIR open design meetings
>
<https://docs.google.com/document/d/1y_9f1AbfgcoVdJh4_aM6-BaSHvrHl8zuA5G4jv_94K8/edit#heading=h.cite1kolful9>
> .
>
I have been following the development of MLIR.


One you achieve consensus on data structure, there is the question of
what> IR to use within it.  I would recommend starting with some combination of
> “existing LLVM IR operations + high level control flow representation”,
> e.g. parallel and affine loops.  The key here is that you need to always be
> able to lower in a simple and predictable way to LLVM IR (this is one of
> the things that classic polyhedral systems did sub optimally, making it
> difficult to reason about the cost model of various transformations), and
> this is a natural incremental starting point anyway.  Over time, more high
> level concepts can be gradually introduced.  FYI, MLIR already has a
> reasonable LLVM dialect <https://mlir.llvm.org/docs/Dialects/LLVM/>
and
> can generate LLVM IR from it, so we’d just need an “LLVM IR -> MLIR LLVM
> dialect” conversion, which should be straightforward to build.
>
Adding a LLVM-IR -> MLIR -> LLVM-IR round-trip would at the beginning just
introduce compile-time overhead and what Renato described as brittleness. I
fear this hurts adaption.


Once you have the data structure and the dialect within it decided,
you> have the set of transformations.  Again, you’ve given a lot of thought to
> this, and that all sounds great to me, but the priorities can be driven by
> whoever wants to contribute and what concrete problems they have to solve.
>
>
> Once the infra for “raising to this representation and lowering back down”
> is figured out, we can open the box of having clang and other front ends
> directly generate it.
>
This suggestions would also apply to VPlan. Ignoring that work on VPlan
started before MLIR, would you have suggested to implement VPlan on MLIR as
well? Would you maybe even advise to retarget VPlan on MLIR now?

Q: Relation to MLIR?>
> A: MLIR is more similar to LLVM-IR than a loop hierarchy.
>
>
> This is not true, MLIR is great for dialects that want to model loop
> hierarchies naturally, this is a major focus of the affine dialect
> <https://mlir.llvm.org/docs/Dialects/Affine/> (e.g. see affine.for on
> that page).  MLIR is not limited to affine loops, that is just a choice
> made by the affine dialect - the loop dialect
> <https://mlir.llvm.org/docs/Dialects/LoopOps/> has more general
> constructs that are less developed.
>

This is definitely subjective question. I think that MLIR is closer to
LLVM-IR for how it is processed. Both have a sequence of passes running
over a single source of truth. Both allow walking the entire structure from
every instruction/operation/block. Analyses are on function or module
level. Both have CFGs (I think for a certain kind of transformations it is
an advantage that control flow is handled implicitly).

For> instance, it also does not feature cheap copies.
>
>
> I’m not sure what this means.
>
>The possibility to make local changes speculatively without copying the
entire data structure. IMHO this is a central idea that allows applying a
transformations speculatively to pass it to a legality check and cost
heuristic without committing to apply it. As a consequence, passes do not
need to implement to implement these in a transformation-specific manner,
drastically reducing the burden of implementation.

For instance, more loop transformations are feasible if instructions are
moved into the innermost loops. With speculative transformations, we can
canonicalize the representation to sink computations into loops -- the
opposite of what LICM does -- and then see whether a transformation can
applied. If not, the speculative representation is discarded without having
an effect on the original representation (and not needing to hoist those
computations again).

Because the MLIR classes have many references to related objects (such as
pointer to parents and use-def chains), I don't think it is feasible to
implement on top of MLIR.

An advantage is that> loops and multi-dimensional arrays can be represented in the language
> without the need of being rediscovered, but have to be inserted by a
> front-end.
>
>
> This is correct, but I don’t see how this helps if your focus is raising
> code that has already been lowered to LLVM IR, e.g. by Clang or some other
> frontend that generates LLVM IR today.
>
Indeed, I would hope that LLVM-IR can preserve multi-dimensional array
accesses in some fashion as well (
https://lists.llvm.org/pipermail/llvm-dev/2019-July/134063.html). However,
currently MLIR has the advantage of being able represent it.


That is, if Clang was generating MLIR, loops and arrays> still have to be rediscovered.
>
>
> This isn’t true, it would be perfectly sensible to lower C control flow
> structures directly too LLVM.  The primary concern are things like
> unstructured switches (think duff’s device) and unstructured gotos, but
> these occur rarely: they need to be handled correctly, but can do so with a
> traditionally lowered CFG and other “best effort” attempts to raise them.
>
Moreover, syntactical loop structures are also not a reliable indicator
that there is a loop. Often enough, do,for and while are used for
syntactical reasons (`do { } while(0)`). Yes, you could eliminate them if a
non-loop is detected, but handling break, continue, etc correctly is a lot
of effort. Another case are corountines that are lowered with gotos into
loops, unless you think loop optimizers should handle coroutines directly.

On the other side, natural loop detection on CFGs is quite mature (with a
remaining issue of irreducible loops that might appear, but can also be
eliminated again). As a plus, optimization does depend less on how the
source code is written.

Thanks for the productive discussion,
Michael
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200114/2d6cceb8/attachment-0001.html>

Prashanth N R via llvm-dev

2020-Jan-22 04:52 UTC

head link

[llvm-dev] Writing loop transformations on the right representation is more productive

Hi Michael-

Liked your proposal and hope that it gets implemented in MLIR. Linearized
IR of LLVM is not suitable for LNO.

We have written multiple Loop Nest Optimizers (in LLVM) in past five
years.  We sent a talk proposal to LLVM developer meeting in 2017. It was
rejected. From the review comment it looked like polly was the preferred
path for Loop Nest Optimization. Hope it is not the case any more.

thanks,
-Prashanth


On Fri, Jan 3, 2020 at 4:57 PM Michael Kruse via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> In the 2018 LLVM DevMtg [1], I presented some shortcomings of how LLVM
> optimizes loops. In summary, the biggest issues are (a) the complexity
> of writing a new loop optimization pass (including needing to deal
> with a variety of low-level issues, a significant amount of required
> boilerplate, the difficulty of analysis preservation, etc.), (b)
> independent optimization heuristics and a fixed pass ordering and (c)
> code explosion due to versioning. Currently, different people are
> working on improving different loop optimization passes such as
> LoopInterchange, LoopFuse and LoopVectorize. Even if LLVM had
> 'perfect' individual loop passes, they would still have the
> aforementioned problems due to still being independent and the system
> as a whole will be suboptimal.
>
> Instead of each loop pass being a major component, the heavy lifting
> could be done by a framework of which each transformation itself is a
> small part. In this RFC, I would like to work towards a consensus on
> how such a framework could look. I already outlined a possible
> solution in the same presentation [1] and a publication [7], which is
> partially based on a previous RFC [8]. All feedback is welcome,
> including a simple '+1'.
>
> The central idea is to use a modifiable loop tree -- similar to
> LoopInfo -- as the primary representation. LLVM-IR is converted to a
> loop tree, then optimized and finally LLVM-IR is generated again for
> subtrees that are considered profitable. This is not a new concept, it
> has already been used in compilers such as IBM XL Fortran (called ASTI
> [4]) and Silicon Graphics/Open64 (called LNO [10]), and in research
> such as the Value State Dependence Graph [3].
>
> Other features include the following:
>
> 1. Candidate selection through cost functions
> 2. Cheap copies using Red/Green Trees
> 3. Application of transformations from high-level to low-level
> 4. Represents loops and predicates instead of CFGs
> 5. Data and control dependencies in one graph
> 6. Late fallback versioning at IR regeneration
> 7. Standardized API for analyses with multiple implementations
> 8. Abstract representation of statements
> 9. Expansion of use-def chains to arrays when spanning loops
>
> To elaborate on each of these:
>
> 1. Candidate selection through cost function
> --------------------------------------------
> Instead of needing to know which transformation is profitable before
> applying it, create a copy of the data structure, modify it, and
> compare it to the original. Moreover, holding multiple, differently
> optimized, copies allows evaluating each variant using a cost function
> and select the 'best' when re-generating LLVM-IR again (or re-use
the
> original LLVM-IR).
>
> Otherwise, each transformation will have to implement its own
> profitability heuristic but cannot know what other passes will do. For
> instance, the motivation for LLVM's LoopDistribution is to enable
> vectorization of loops that can be separated from other loops.
> However, it does not know whether LoopVectorize will actually
> vectorize the loop, or whether two vectorizable loops are better
> vectorized together.
>
> Instantiating every possible sequence of transformations of course is
> not feasible, so the search space needs to be pruned. This could be
> made dependent on the optimization level.
>
>
> 2. Cheap copies using Red/Green Trees
> -------------------------------------
> A red/green trees [4] is a data structure for cheap copies of DAGs.
> Basically, a green tree node references only the nodes children but
> not the parent. A node from the red tree references its corresponding
> green node and its parent (red) node. When modifying the represented
> node through copy-on-write, this allows reusing all green nodes except
> the ones from the modified node to the tree's root. The same node can
> even be used multiple times in the same copy, which can be useful for
> e.g. loop unrolling and code versioning. As a result, storing many
> optimization variants of the same code only requires copies of the
> changed parts.
>
> By contrast, the red node is created on-demand. Traversing the entire
> tree should rarely be necessary (instead, use summary information in
> the green node) and when limited to subtrees that are considered
> interesting for optimization.
>
>
> 3. Apply transformations from high-level to low-level
> -----------------------------------------------------
> Optimization should be applied from very specialized to very general
> (potentially after some canonicalization). For instance, the first
> step could be detecting common idioms such as gemm and replace them
> with either a BLAS function call or apply well-studied optimizations
> like BLIS to them. After such an idiom has been detected, no other
> transformation should be applied to them.
>
> Mid-level transformations may try to map entire loop nests to cache-
> and compute hierarchies (SIMT threads, multiprocessors, offloading,
> etc) by applying transformations such as tiling, loop interchange and
> array packing.
>
> Low-level transformations commonly look at only one loop to improve
> processor pipelining and predication: Unrolling, software pipelining,
> etc.
>
>
> 4. Represents loops and predicates instead of CFGs
> -------------------------------------------------
> Control flow graphs make loop optimizations difficult, but is also not
> something a loop optimizer typically cares about. For instance, in the
> example below, loop distribution would need to understand the
> condition `c` and reconstruct the branch to fission
>
>     for (int i = 0; i < n; i += 1)
>       if (c) {
>         foo(i);
>         bar(i);
>       }
>
> into
>
>     for (int i = 0; i < n; i += 1)
>       if (c)
>         foo(i);
>     for (int i = 0; i < n; i += 1)
>       if (c)
>         bar(i);
>
> In the loop tree, control flow is represented through predicates of
> the statements.  In a first step, it is represented as
>
>     for (int i = 0; i < n; i += 1) {
>       bool __BB1_executed = 0;
>       { if (c) foo(i); __BB1_executed = 1; }
>       { if (__BB1_executed) bar(i); }
>     }
>
> The predicates (`c` and `__BB1_executed`) can be separated into a
> "necessary condition" (`c` must be true before `foo` can be
executed)
> and a "sufficient condition" (if `c` is true, `foo` must be
executed).
> These can be different to allow speculative execution of the
> statement. For instance, a necessary condition of `true` means that
> the statement can always be executed speculatively.
>
> The advantage is that control flow becomes a data-dependency problem.
> It can trivially be converted back to its CFG representation because
> `__BB1_executed` directly corresponds to the CFG edge to which it was
> connected. Data flow optimizations such as value propagation can be
> applied as well (here: `__BB1_executed == c`), such that we get
>
>     for (int i = 0; i < n; i += 1) {
>       { if (c) foo(i); }
>       { if (c) bar(i); }
>     }
>
> Which can be distributed trivially (assuming foo() and bar() do not
> cause dependencies). Whether the copy propagation is beneficial can be
> decided by a heuristic (considering whether the original flow can
> still be reconstructed) or just thrown away if the cost function does
> not select a loop tree based on the copy propagation.
>
> As a result of the predicated representation the loop tree has a
> regular pattern: sequence -> loop -> sequence -> loop -> ... .
A loop
> here is abstract in that it is something that is executed multiple
> times. This can be a sequential loop, but also a collection of SPMD
> processes, threads, vector lanes, etc.
>
> At IR regeneration, we have multiple choices of how to lower
> conditions. If we lower to LLVM-IR CFG, it may lower to
>  * Branch instructions, reusing branch conditions from previous statements
>  * Select instructions
>
> If lowering to SIMD instructions, we can generate
>  * Branch instructions for uniform predicates
>  * Execution masks for varying predicates
>  * A mix of both, since predicates can be uniform relative to previous
> taken conditions
>
> If lowering to NVPTX-style SIMT
>  * @p per-instruction execution masks
>  * Divergent branches; Every statement can be made a re-convergence
> point (by re-evaluating the condition), but will happen, at the
> latest, at the end of the loop
>  * Uniform predicates can be given the ".uni" suffix
>
> When lowering to AMD-style SIMT
>  * S_CBRANCH (explicitly managed execution mask)
>  * S_CBRANCH_I/G_FORK (stack of execution masks)
>
>
> 5. Data and control dependencies in one graph
> ---------------------------------------------
> Also as a result of the predicated from, during dependency analysis,
> control dependencies and data dependencies have the same
> representation, like in a program dependency graph. In LLVM-IR, passes
> such as AggressiveDeadCodeElimination use DominanceFrontier for this
> purpose. This would not be necessary.
>
> When re-generating IR, control dependencies can either be lowered to
> branching or as a boolean variables. For the purpose of loop
> optimizations, the choice of representation does not (and should not)
> make a difference.
>
>
> 6. Late fallback versioning at IR regeneration
> ------------------------------------------
> When a transformation is applied, it can attach conditions (no
> aliasing, no integer wrap, value restrictions, etc.) under which the
> transformation is valid. During LLVM-IR generation, these conditions
> are collected and emitted as run-time conditions. If the condition
> fails, the original code is executed.
>
> This avoids having every transformation emit its own fallback code.
> The fallback should never execute, or only in cases that do not do
> significant work (e.g. loop trip count is 0), such that optimizing the
> fallback itself is not necessary. In other cases, the transformation
> should implement proper loop versioning by adding the original
> statement with a `!rtc` condition.
>
>
> 7. Standardized API for analyses with multiple implementations
> --------------------------------------------------------------
> Common analyses use an interface that can be fulfilled by different
> implementations. Among these are:
>
>  * Closed form expressions
>    - None (expressions are always represented through dependencies
>      between statements)
>    - LLVM ScalarEvolution
>    - PolyhedralValueAnalysis [6]
>
> Closed form expressions can be used to remap values (typically array
> subscripts, loop lower/upper bounds), simplify expression computations
> (including predicates => is a predicate always true?), determine loop
> trip counts, etc. Most importantly, it identifies a statement
> execution (instance) relative to the loop counters.
>
>
>  * Array accesses / access range / provenance
>    - LLVM AliasAnalysis
>    - One-dimensional (always linearized)
>    - Multi-dimensional, delinearized if necessary
>    - From metadata (that a front-end emits for arrays of pointers to
>      arrays of data)
>
> Array access analysis determines whether two accesses at a specific
> index can alias each other, access ranges and connectivity.
>
>
>  * Dependencies
>    - none (cannot reorder statements)
>    - flat (depends on last execution of another statement)
>    - LLVM DependendencyAnalysis
>    - Polyhedral analysis
>
> There is intentionally no "does X depend on Y?" API as this would
make
> any analysis based on it quadratic in runtime. Instead, it has to list
> all statements another statement depends on such that a graph can be
> built and e.g. sorted topologically.
>
> Dependencies include: use-def chains via registers, memory
> flow/anti/output, general side-effects and control dependencies.
>
>
>  * Control flow
>    - Extracted from original CFG
>    - Evaluate statement conditions
>
> Control flow analysis answers questions about (post-)dominance and
> mutual exclusivity within the acyclic sequence/body of a loop.
>
>
>  * Cost heuristic
>    - TargetTransformationInfo
>
> Estimate the execution time of a (sub-)tree. Average trip counts from
> profiling [11] could be useful.
>
>
> 8. Abstract representation of statements
> ----------------------------------------
> There is no direct relationship between LLVM instructions and loop DAG
> statements. If possible, loop tree optimizations should only use the
> properties of the statement node, not its contents. Typically, a
> statement consists of multiple instructions that do not make sense to
> split. For instance, assuming that %add is not used a second time, in
> the example below
>
>     %add = add i64 %arg, 2
>     %mul = shl i64 %add, 1
>
> the two instructions should always be computed together in the same
> loop iteration. Combining instructions to statements reduces the
> computational complexity of loop optimizations and might be controlled
> by the optimization level (at lower optimization level more
> instructions would be combined). A loop node is itself a statement in
> it's parent loop node's acyclic body.
>
> Moreover, there is no reason to use only LLVM-IR to carry the
> semantics. We could also use e.g. Machine-IR or MLIR operations.
>
>
> 9. Expansion of use-def chains to arrays when spanning loops
> ------------------------------------------------------------
> Consider we would like to fission the following loop:
>
>     for (int i = 0; i < n; i += 1) {
>       { if(true) double a = f(i); }
>       { if(true) g(a); }
>     }
>
> The problem is that the split cuts the use-def dependency over `a`.
> However, the fission can still be done and results in:
>
>     double A[n];
>     for (int i = 0; i < n; i += 1)
>       { if(true) A[i] = f(i); }
>     for (int i = 0; i < n; i += 1) {
>       { if(true) g(A[i]); }
>
> The transforming pass has to consider this during its profitability
> model. The big advantage is that in terms of correctness, use-def
> chains do not manifest false dependencies.
>
>
> Possible Transformations
> ------------------------
> In principle, any transformation could be applied on this
> representation, but instruction-centric transformations such as
> InstCombine are less suitable. Obviously, loop-centric transformations
> are a better fit. Ideas for loop transformations include:
>
>  * Unroll(-and-jam)
>  * Peeling
>  * Iteration range split, concatenation
>  * Interchange
>  * Loop fusion and distribution(/fission)
>  * Tiling/stripmining/stripemining/blocking
>  * Code versioning (including loop unswitching)
>  * Maybe vectorization
>  * Accelerator offloading
>  * Parallelization
>  * Space-filling curve iteration [12]
>  * Array packing (e.g. for scratch pads), transposition
>  * Statement reordering
>  * SIMT-ization
>  * Detect and optimize reductions
>  * Loop skewing/wavefronting
>  * Loop reversal
>  * Solver-driven (polyhedral) optimization
>  * Loop idiom recognition
>
>
> Optimization Pipeline
> ---------------------
> The optimization pipeline for the hierarchy DAG is straightforward:
>
>     void optimizeLoopNest() {
>       RedOriginal = generateLoopHierarchy();
>       List = createCandidates(RedOriginal);
>       sort(List, [](Cand1,Cand2) { return estimateExecutionTime(Cand2)
>                                                    -
> estimateExecutionTime(Cand1); } );
>       if (List[0] != RedOriginal)
>         emit(List[0]);
>     }
>
> Subtree exploration is done recursively, optimizing inner nodes which
> can be (re-)used when optimizing outer nodes.
>
> The current node to optimize is compared a selection of known
> heuristics and its transformation is added to the list of candidates.
> That is, an optimization routine resemble the following, with more
> specific heuristics checked before more general ones.
>
> void exploreOptimizations(Root, Worklist) {
>    Matches = 0;
>
>    // High level: user annotations
>    if (Root.hasHint()) {
>      // Hint is mandatory: do not explore any other possibility
>      Worklist.replace(Root, applyHint(Root));
>      return;
>    }
>
>     // High level: Idiom recognition
>     if (isBLAS(Root)) {
>       Optimized = LinkAgainstBLASLibary ? convertToBlasLibCall(Root) :
>
> applyBLIS(Root):
>       Worklist.replace(Root, Optimized);
>       return;
>     }
>
>     // Mid level: recognize stencils
>     if (isStencil(Root))
>       ...
>
>     // Mid level: Map to compute hierarchy (here: simd -> SMT -> core
> -> socket for CPUs)
>     if (isParallel(Root) fitsWorkingSet(Root.getBody(), LLCSize) &&
&&
>       isParallel(Root.getPerfectlyNestedLoop(1)) &&
>        fitsWorkingSet(Root.getPerfectlyNestedLoop(2).getBody(), L2Size)
&&
>       isParallel(Root.getPerfectlyNestedLoop(2)) &&
>        fitsWorkingSet(Root.getPerfectlyNestedLoop(3).getBody(), L1Size)
&&
>       isVectorizable(Root.getPerfectlyNestedLoop(3))) {
>         Optimized = applyVectorization(Root.getPerfectlyNestedLoop(3));
>         Optimized = applyParallel(Optimized.getParent(), /*places=*/SMT);
>         Optimized = applyParallel(Optimized.getParent(), /*places=*/Cores);
>         Optimized = applyParallel(Optimized.getParent(),
> /*places=*/Sockets);
>
>         // Still explore CPU pipeline optimizations of innermost
>         Worklist.insert(Root,Optimized);
>         Matches += 1;
>     }
>
>     if (Matches >= Threshold)
>       return;
>
>     // Low level: optimize working set for L1 cache size using tiling
>     Band = maximumTilableNest(Root);
>     WorkingSetSizePerIteration = estimatWorkingSet(Band.back().getBody());
>     TileSize = floor(nthroot(L1CacheSize / WorkingSetSizePerIteration,
> Band.size()));
>     if (TileSize > 1) {
>        Worklist.insert(applyTiling(Root, TilesSize.splat(Band.size()));
>        Matches += 1;
>     }
>
>     if (Matches >= Threshold)
>       return;
>
>     // Low level: vectorization for each SIMD level (potentially
>     // carried-out by VPlan)
>     foreach (Width : promisingSimdWidths(Root)) {
>       Vectorized = applyVectorization(Root, Width);
>       if (Vectorized.isSatisfiable()) {
>         Worklist.insert(Vectorized);
>         Matches += 1;
>       }
>     }
>
>     if (Matches >= Threshold)
>       return;
>
>     // Low level: rely on heuristic to find best unrolling factor
>     if (Factor = unrollFactorHeuristic(Root))  {
>       Worklist.insert(applyUnroll(Root, Factor).getRoot());
>       Matches += 1;
>     }
>
>     if (Matches >= Threshold)
>       return;
>
>     ...
> }
>
>
> The "levels" encode precedence optimizations. Depending on the
> optimization level, we might only apply the most promising
> optimization.
>
> Instead of this brute force implementation, for predefined
> optimization levels, the search space exploration should be guided:
> Encode the order of the most useful transformations in the search
> space exploration. For instance, find all idioms first, then apply
> memory hierarchy optimizations, then compute hierarchy, then CPU
> pipeline optimizations. Since this is very transformation specific, I
> do not expect to write a 'pass manager' where transformations can
> register itself to (except for exhaustive searches when explicitly
> enabled). Instead, I'd write this in code that classifies loop nests
> (e.g. compute bound, bandwidth bound, known idioms, tensor
> contractions, stencils, etc.) and choses appropriate optimization
> paths. We can, of course, make this more generic over time.
>
>
>
> Writing an optimization pass
> -----------------------------
>
> Optimizations themselves are written in the following steps:
> 1. Check preconditions
> 2. Quick profitability analysis to skip obviously bad transformations
> 3. Create new green nodes to represent the changed loop nest; reuse
> unchanged green nodes as much as possible
> 4. Create a red node to insert into the parent
> 5. Preserve analyses (most analyses should be able to use analysis
> stored in reused green nodes and/or lazyly reevaluate new nodes)
> 6. Check whether the transformation was valid
>
> Especially the possibility to analyze the correctness after the
> transformation was applied could significantly reduce the amount of
> code needed to verify the correctness.
>
> A transformation like loop fusion could look like:
>
>   void applyFuse(Red1, Red2) {
>    // Assume Red1 and Red2 are consecutive sibling loops
>
>   // Bail out on untypical things (atomic accesses, exceptions,
> convergent instructions, ...)
>   if (!Red1->isSimple() || !Red2->isSimple())
>     return;
>
>   // Consistency checks (can be less less strict by moving conditions
> into the fused body).
>   if (Red1->getIterationCount() != Red2->getIterationCount())
>     return;
>   if (Red1->getPredicate() != Red2->getPredicate())
>     return;
>
>   // Create new loop hierarchy with fused loops.
>   GreenFused = new GreenLoop({ Red1->getGreen(), Red2->getGreen() },
>
>  Red1->getIterationCount());
>   RedFused = new RedNode(GreenFused, Red1->getParent());
>   NewRoot = RedFused->getRoot();
>
>   // Preserve analysis.
>   // In this case not really necessary since the green nodes of the
>   // loop bodies are reused.
>   DependenceAnalysis->transferAnalysis(Red1->getBody(),
>                                                   
RedFused->getBody()[0],
> ClosedExprAnalysis->makeIdentityMapping(Red1->getIterator(),
>
> RedFused->getIterator()));
>   DependenceAnalysis->transferAnalysis(Red2->getBody(),
>                                                  
RedFused->getBody()[1],
>    ClosedExprAnalysis->makeIdentityMapping(Red2->getIterator(),
>
> RedFused->getIterator()));
>
>   ValidityCondition = DependenceAnalysis->isValid(NewRoot);
>   if (ValidityCondition.isSatisfiable()) {
>     RedFused.addAssumption(ValidityCondition);
>     Candidates.push_back(NewRoot);
>   }
> }
>
>
>
> Questions & Answers
> -------------------
>
> Q: Is the loop hierarchy a tree or DAG?
> A: Both. A green node can be reused multiple times in the same
> structure (e.g. due to unrolling, versioning, peelings, etc), but a
> red node is specific to the position in its parent; there can be
> multiple red nodes for the same green node.
>
>
> Q: No SSA?
> A: Since the location of a definition depends on the parent in a green
> DAG, possibly from different loop hierarchies after transformations,
> it is not possible to directly reference the definition with its use.
> However, it is possible in the red tree, but runs against the goal of
> avoiding heavy red nodes. Instead, to find the previous definition,
> one has to traverse the green no up to find at which sibling level the
> value is defined. However, this should not be a common query; loop
> optimizations should use dependency information instead.
>
>
> Q: Loops with multiple exits?
> A: If every exiting edge a unique number to its __???_executed
> variable, it can be handled like a switch.
>
>
> Q: Unstructured control flow?
> A: Unstructured control flow only introduces disjunctions to statement
> predicated. That is, the predicate
>
>     a or b
>
> corresponds to block BB2 the CFG
>
>     BB0:
>       br i1 %a, label %BB2, label %BB2
>
>     BB1:
>       br i1 %b, label %BB2, label %BB3
>
>     BB2:
>
> In structured control flow, BB1 and/or BB2 would be post-dominated by
> BB2 and the disjunction disappear.
>
>
> Q: Irreducible loops?
> A: Can be folded into the first iteration of the loop by including a
> condition for statements not executed during the first iteration.
>
>
> Q: The unreachable terminator?
> A: If due to a non-returning function call, this introduces a loop
> exit (which will also introduce control a dependency to everything
> that follows) to the surrounding loops. If due to an assumption, we
> may add the branching condition to the sufficient condition, but not
> to the necessary condition.
>
>
> Q: Infinite loops?
> A: The closed-form expression identifying a statement instances and
> loop trip count have no upper bound. Either use arbitrary range
> arithmetic, use relative dependence vectors and/or fall back on
> pessimistic assumptions. In addition, like with the unreachable
> terminator, this may be handled as another loop exit.
>
>
> Q: IndirectBr?
> Since the list of possible targets is part of the instructions, can be
> handled like a switch over the target address
>
>
> Q: Loop guards?
> A: Loop guards manifests as the execution condition of the loop
> statement's predicate. A loop whose predicate evaluates to false can
> be considered a loop with 0 iterations. The IR generator can also emit
> the predicate as a loop guard. It might be useful to ensure that if
> the predicate evaluates to true, the loop has at least one iteration
> and the loop-condition is not evaluated for the first iteration (like
> LoopRotation).
>
>
> Q: Exceptions?
> A: Could also be handled with control dependencies, but I would
> instead mark the surrounding loops with "contains unpredictable
> control flow" to be ignored by most transformations. Similar to
> `!isSimple()` of LoadInst and StoreInst.
>
>
> Q: Relation to the new/legacy pass manager?
> A: LLVM's pass managers are unfortunately not designed to apply to
> subtrees nor persistent data structures besides the LLVM-IR itself.
> Instead, the loop tree optimizer would be its own monolithic pass on
> the pass manager level (like MachinePassManager and VPlan). My idea is
> to add it somewhere before LoopVectorize, but after the inliner,
> potentially replace most other loop transformations. There should be
> just one instance of the pass in the pipeline since creating an
> alternative representation of LLVM-IR is not that cheap. It does not
> need IR normalization (LoopSimplify, LICM, LoopRotate, LCSSA); these
> can be performed on the loop tree and emitted into the LLVM-IR if a
> non-modified tree is chosen. Possibly some of these passes can be
> removed.
> In order to preserve user-directed transformations (such as "#pragma
> clang loop distribute"), it may also be interesting to put it earlier
> into the pipeline to avoid that other passes such as LoopFullUnroll
> change the loop before the transformation can be applied.
> Alternatively, such passes can be taught the check the presence of a
> user-directed transformation before applying its transformation.
>
>
> Q: Relation to LoopVectorize/VPlan?
> A: VPlan has similar design goals [9] but is intended for
> vectorization only. It also has a hierarchical data structure, but
> uses a CFG for acyclic control flow. It is also intended to have
> multiple variants of the same source IR from which the best is
> selected via a cost function. However, it lacks cheap copies. Instead
> of instructions, it uses recipes/"meta-instructions" that handle
what
> happens to instructions after vectorization, e.g. do that operation on
> each vector lane ("WIDEN"). VPlan is more oriented towards
modifying
> instructions instead of statements as collection of instructions. With
> the loop hierarchy, I do not plan to redo the work of VPlan, but also
> do not see a reason why it should not also be able to do
> vectorization.
>
>
> Q: Relation to Polly?
> A: Polly uses ISL's schedule tree as the representation of the loop
> hierarchy. Its creation may fail of something not representable in a
> schedule tree is found. Polly tries to find the maximum representable
> region, but once it is chosen, it cannot be changed (e.g. because
> optimizing it takes too long). The schedule tree is stored internally
> by ISL as a kind of green tree, but is walked over by an iterator that
> remembers the path to the root. While this could be used to select a
> best schedule tree using a cost function, Polly relies mostly on ISL's
> reschedule algorithm to find a profitable transformation.
>
>
> Q: Relation to MLIR?
> A: MLIR is more similar to LLVM-IR than a loop hierarchy. For
> instance, it also does not feature cheap copies. An advantage is that
> loops and multi-dimensional arrays can be represented in the language
> without the need of being rediscovered, but have to be inserted by a
> front-end. That is, if Clang was generating MLIR, loops and arrays
> still have to be rediscovered. However, a loop hierarchy optimizer
> could be applied to MLIR just as well as to LLVM-IR.
>
>
> References
> ----------
> [1] http://llvm.org/devmtg/2018-10/talk-abstracts.html#talk11
> [2] https://ieeexplore.ieee.org/abstract/document/5389392/
> [3] http://sro.sussex.ac.uk/id/eprint/7576/1/Stanier,_James.pdf
> [4]
>
https://blogs.msdn.microsoft.com/ericlippert/2012/06/08/persistence-facades-and-roslyns-red-green-trees/
> [5] https://github.com/Meinersbur/llvm-project/tree/LOF/llvm/lib/LOF
> [6] https://llvm.org/devmtg/2017-10/#src2
> [7] https://arxiv.org/abs/1811.00632
> [8] https://lists.llvm.org/pipermail/llvm-dev/2017-October/118125.html
> [9] https://llvm.org/docs/Proposals/VectorizationPlan.html
> [10] https://github.com/open64-compiler/open64/tree/master/osprey/be/lno
> [11] https://reviews.llvm.org/D70688
> [12] https://arxiv.org/abs/1801.07399
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200122/57a23b57/attachment-0001.html>

Michael Kruse via llvm-dev

2020-Jan-26 01:35 UTC

head link

[llvm-dev] Writing loop transformations on the right representation is more productive

Am Di., 21. Jan. 2020 um 18:52 Uhr schrieb Prashanth N R
<prashanth.nr at gmail.com>:> We have written multiple Loop Nest Optimizers (in LLVM) in past five years.
We sent a talk proposal to LLVM developer meeting in 2017. It was rejected. From
the review comment it looked like polly was the preferred path for Loop Nest
Optimization. Hope it is not the case any more.
We made some effort into mainstreaming Polly by integrating it into
the main repository. There were some hurdles in there, one of the
largest being that it relies on an external library written in C.
Others are that it requires well-formed IR to do anything and may
significantly increase compile time.  My proposal is intended to be a
solution for these problems.

Michael

Uday Kumar Reddy Bondhugula via llvm-dev

2020-Jan-28 04:05 UTC

head link

[llvm-dev] Writing loop transformations on the right representation is more productive

Hi Michael,

Although the approach to use a higher order in-memory abstraction like the
loop tree will make it easier than what you have today, if you used MLIR
for this representation, you already get a round trippable textual format
that is *very close* to your form. The affine.for/if, std.for/if in MLIR
are nearly isomorphic to the tree representation you want, and as such,
this drastically reduces the delta between the in-memory data structures
your passes want to operate on and what you see when you print the IR.
Normally, there'd be resistance to building a textual form / introducing a
first class concept in the IR for what you are proposing, but since this
was already done for MLIR, it looks like it would be a big win from a
compiler developers' productivity standpoint if you just used MLIR for this
loop tree representation. With regard to the FAQ, I can't tell whether you
meant something else or missed the representation used in MLIR for the
affine dialect or in general for "ops with regions".
> Q: Relation to MLIR?
> A: MLIR is more similar to LLVM-IR than a loop hierarchy. For
It's completely the opposite unless you are looking only at MLIR's std
dialect! The affine dialect as well as the std.for/if (currently misnamed
as loop.for/loop.if) are actually a loop tree. The affine ops are just an
affine loop AST isomorphic to the materialization of polyhedral
domains/schedules via code generation. Every IST AST or the output of
polyhedral code generation can be represented in the affine dialect and
vice versa. MLIR's loop/if ops are a hierarchy rather than a flat form /
list of blocks CFG.
> still have to be rediscovered. However, a loop hierarchy optimizer
> could be applied to MLIR just as well as to LLVM-IR.
This is true, but it's easier to apply it to MLIR because the actual IR is
close by miles to the in-memory thing your loop hierarchy optimizer would
be using. For eg., here's the input IR and the output IR of a simple outer
loop vectorization performed in MLIR:
https://github.com/bondhugula/llvm-project/blob/hop/mlir/test/Transforms/vectorize.mlir#L23

~ Uday

On Fri, 3 Jan 2020 at 16:57, Michael Kruse via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> In the 2018 LLVM DevMtg [1], I presented some shortcomings of how LLVM
> optimizes loops. In summary, the biggest issues are (a) the complexity
> of writing a new loop optimization pass (including needing to deal
> with a variety of low-level issues, a significant amount of required
> boilerplate, the difficulty of analysis preservation, etc.), (b)
> independent optimization heuristics and a fixed pass ordering and (c)
> code explosion due to versioning. Currently, different people are
> working on improving different loop optimization passes such as
> LoopInterchange, LoopFuse and LoopVectorize. Even if LLVM had
> 'perfect' individual loop passes, they would still have the
> aforementioned problems due to still being independent and the system
> as a whole will be suboptimal.
>
> Instead of each loop pass being a major component, the heavy lifting
> could be done by a framework of which each transformation itself is a
> small part. In this RFC, I would like to work towards a consensus on
> how such a framework could look. I already outlined a possible
> solution in the same presentation [1] and a publication [7], which is
> partially based on a previous RFC [8]. All feedback is welcome,
> including a simple '+1'.
>
> The central idea is to use a modifiable loop tree -- similar to
> LoopInfo -- as the primary representation. LLVM-IR is converted to a
> loop tree, then optimized and finally LLVM-IR is generated again for
> subtrees that are considered profitable. This is not a new concept, it
> has already been used in compilers such as IBM XL Fortran (called ASTI
> [4]) and Silicon Graphics/Open64 (called LNO [10]), and in research
> such as the Value State Dependence Graph [3].
>
> Other features include the following:
>
> 1. Candidate selection through cost functions
> 2. Cheap copies using Red/Green Trees
> 3. Application of transformations from high-level to low-level
> 4. Represents loops and predicates instead of CFGs
> 5. Data and control dependencies in one graph
> 6. Late fallback versioning at IR regeneration
> 7. Standardized API for analyses with multiple implementations
> 8. Abstract representation of statements
> 9. Expansion of use-def chains to arrays when spanning loops
>
> To elaborate on each of these:
>
> 1. Candidate selection through cost function
> --------------------------------------------
> Instead of needing to know which transformation is profitable before
> applying it, create a copy of the data structure, modify it, and
> compare it to the original. Moreover, holding multiple, differently
> optimized, copies allows evaluating each variant using a cost function
> and select the 'best' when re-generating LLVM-IR again (or re-use
the
> original LLVM-IR).
>
> Otherwise, each transformation will have to implement its own
> profitability heuristic but cannot know what other passes will do. For
> instance, the motivation for LLVM's LoopDistribution is to enable
> vectorization of loops that can be separated from other loops.
> However, it does not know whether LoopVectorize will actually
> vectorize the loop, or whether two vectorizable loops are better
> vectorized together.
>
> Instantiating every possible sequence of transformations of course is
> not feasible, so the search space needs to be pruned. This could be
> made dependent on the optimization level.
>
>
> 2. Cheap copies using Red/Green Trees
> -------------------------------------
> A red/green trees [4] is a data structure for cheap copies of DAGs.
> Basically, a green tree node references only the nodes children but
> not the parent. A node from the red tree references its corresponding
> green node and its parent (red) node. When modifying the represented
> node through copy-on-write, this allows reusing all green nodes except
> the ones from the modified node to the tree's root. The same node can
> even be used multiple times in the same copy, which can be useful for
> e.g. loop unrolling and code versioning. As a result, storing many
> optimization variants of the same code only requires copies of the
> changed parts.
>
> By contrast, the red node is created on-demand. Traversing the entire
> tree should rarely be necessary (instead, use summary information in
> the green node) and when limited to subtrees that are considered
> interesting for optimization.
>
>
> 3. Apply transformations from high-level to low-level
> -----------------------------------------------------
> Optimization should be applied from very specialized to very general
> (potentially after some canonicalization). For instance, the first
> step could be detecting common idioms such as gemm and replace them
> with either a BLAS function call or apply well-studied optimizations
> like BLIS to them. After such an idiom has been detected, no other
> transformation should be applied to them.
>
> Mid-level transformations may try to map entire loop nests to cache-
> and compute hierarchies (SIMT threads, multiprocessors, offloading,
> etc) by applying transformations such as tiling, loop interchange and
> array packing.
>
> Low-level transformations commonly look at only one loop to improve
> processor pipelining and predication: Unrolling, software pipelining,
> etc.
>
>
> 4. Represents loops and predicates instead of CFGs
> -------------------------------------------------
> Control flow graphs make loop optimizations difficult, but is also not
> something a loop optimizer typically cares about. For instance, in the
> example below, loop distribution would need to understand the
> condition `c` and reconstruct the branch to fission
>
>     for (int i = 0; i < n; i += 1)
>       if (c) {
>         foo(i);
>         bar(i);
>       }
>
> into
>
>     for (int i = 0; i < n; i += 1)
>       if (c)
>         foo(i);
>     for (int i = 0; i < n; i += 1)
>       if (c)
>         bar(i);
>
> In the loop tree, control flow is represented through predicates of
> the statements.  In a first step, it is represented as
>
>     for (int i = 0; i < n; i += 1) {
>       bool __BB1_executed = 0;
>       { if (c) foo(i); __BB1_executed = 1; }
>       { if (__BB1_executed) bar(i); }
>     }
>
> The predicates (`c` and `__BB1_executed`) can be separated into a
> "necessary condition" (`c` must be true before `foo` can be
executed)
> and a "sufficient condition" (if `c` is true, `foo` must be
executed).
> These can be different to allow speculative execution of the
> statement. For instance, a necessary condition of `true` means that
> the statement can always be executed speculatively.
>
> The advantage is that control flow becomes a data-dependency problem.
> It can trivially be converted back to its CFG representation because
> `__BB1_executed` directly corresponds to the CFG edge to which it was
> connected. Data flow optimizations such as value propagation can be
> applied as well (here: `__BB1_executed == c`), such that we get
>
>     for (int i = 0; i < n; i += 1) {
>       { if (c) foo(i); }
>       { if (c) bar(i); }
>     }
>
> Which can be distributed trivially (assuming foo() and bar() do not
> cause dependencies). Whether the copy propagation is beneficial can be
> decided by a heuristic (considering whether the original flow can
> still be reconstructed) or just thrown away if the cost function does
> not select a loop tree based on the copy propagation.
>
> As a result of the predicated representation the loop tree has a
> regular pattern: sequence -> loop -> sequence -> loop -> ... .
A loop
> here is abstract in that it is something that is executed multiple
> times. This can be a sequential loop, but also a collection of SPMD
> processes, threads, vector lanes, etc.
>
> At IR regeneration, we have multiple choices of how to lower
> conditions. If we lower to LLVM-IR CFG, it may lower to
>  * Branch instructions, reusing branch conditions from previous statements
>  * Select instructions
>
> If lowering to SIMD instructions, we can generate
>  * Branch instructions for uniform predicates
>  * Execution masks for varying predicates
>  * A mix of both, since predicates can be uniform relative to previous
> taken conditions
>
> If lowering to NVPTX-style SIMT
>  * @p per-instruction execution masks
>  * Divergent branches; Every statement can be made a re-convergence
> point (by re-evaluating the condition), but will happen, at the
> latest, at the end of the loop
>  * Uniform predicates can be given the ".uni" suffix
>
> When lowering to AMD-style SIMT
>  * S_CBRANCH (explicitly managed execution mask)
>  * S_CBRANCH_I/G_FORK (stack of execution masks)
>
>
> 5. Data and control dependencies in one graph
> ---------------------------------------------
> Also as a result of the predicated from, during dependency analysis,
> control dependencies and data dependencies have the same
> representation, like in a program dependency graph. In LLVM-IR, passes
> such as AggressiveDeadCodeElimination use DominanceFrontier for this
> purpose. This would not be necessary.
>
> When re-generating IR, control dependencies can either be lowered to
> branching or as a boolean variables. For the purpose of loop
> optimizations, the choice of representation does not (and should not)
> make a difference.
>
>
> 6. Late fallback versioning at IR regeneration
> ------------------------------------------
> When a transformation is applied, it can attach conditions (no
> aliasing, no integer wrap, value restrictions, etc.) under which the
> transformation is valid. During LLVM-IR generation, these conditions
> are collected and emitted as run-time conditions. If the condition
> fails, the original code is executed.
>
> This avoids having every transformation emit its own fallback code.
> The fallback should never execute, or only in cases that do not do
> significant work (e.g. loop trip count is 0), such that optimizing the
> fallback itself is not necessary. In other cases, the transformation
> should implement proper loop versioning by adding the original
> statement with a `!rtc` condition.
>
>
> 7. Standardized API for analyses with multiple implementations
> --------------------------------------------------------------
> Common analyses use an interface that can be fulfilled by different
> implementations. Among these are:
>
>  * Closed form expressions
>    - None (expressions are always represented through dependencies
>      between statements)
>    - LLVM ScalarEvolution
>    - PolyhedralValueAnalysis [6]
>
> Closed form expressions can be used to remap values (typically array
> subscripts, loop lower/upper bounds), simplify expression computations
> (including predicates => is a predicate always true?), determine loop
> trip counts, etc. Most importantly, it identifies a statement
> execution (instance) relative to the loop counters.
>
>
>  * Array accesses / access range / provenance
>    - LLVM AliasAnalysis
>    - One-dimensional (always linearized)
>    - Multi-dimensional, delinearized if necessary
>    - From metadata (that a front-end emits for arrays of pointers to
>      arrays of data)
>
> Array access analysis determines whether two accesses at a specific
> index can alias each other, access ranges and connectivity.
>
>
>  * Dependencies
>    - none (cannot reorder statements)
>    - flat (depends on last execution of another statement)
>    - LLVM DependendencyAnalysis
>    - Polyhedral analysis
>
> There is intentionally no "does X depend on Y?" API as this would
make
> any analysis based on it quadratic in runtime. Instead, it has to list
> all statements another statement depends on such that a graph can be
> built and e.g. sorted topologically.
>
> Dependencies include: use-def chains via registers, memory
> flow/anti/output, general side-effects and control dependencies.
>
>
>  * Control flow
>    - Extracted from original CFG
>    - Evaluate statement conditions
>
> Control flow analysis answers questions about (post-)dominance and
> mutual exclusivity within the acyclic sequence/body of a loop.
>
>
>  * Cost heuristic
>    - TargetTransformationInfo
>
> Estimate the execution time of a (sub-)tree. Average trip counts from
> profiling [11] could be useful.
>
>
> 8. Abstract representation of statements
> ----------------------------------------
> There is no direct relationship between LLVM instructions and loop DAG
> statements. If possible, loop tree optimizations should only use the
> properties of the statement node, not its contents. Typically, a
> statement consists of multiple instructions that do not make sense to
> split. For instance, assuming that %add is not used a second time, in
> the example below
>
>     %add = add i64 %arg, 2
>     %mul = shl i64 %add, 1
>
> the two instructions should always be computed together in the same
> loop iteration. Combining instructions to statements reduces the
> computational complexity of loop optimizations and might be controlled
> by the optimization level (at lower optimization level more
> instructions would be combined). A loop node is itself a statement in
> it's parent loop node's acyclic body.
>
> Moreover, there is no reason to use only LLVM-IR to carry the
> semantics. We could also use e.g. Machine-IR or MLIR operations.
>
>
> 9. Expansion of use-def chains to arrays when spanning loops
> ------------------------------------------------------------
> Consider we would like to fission the following loop:
>
>     for (int i = 0; i < n; i += 1) {
>       { if(true) double a = f(i); }
>       { if(true) g(a); }
>     }
>
> The problem is that the split cuts the use-def dependency over `a`.
> However, the fission can still be done and results in:
>
>     double A[n];
>     for (int i = 0; i < n; i += 1)
>       { if(true) A[i] = f(i); }
>     for (int i = 0; i < n; i += 1) {
>       { if(true) g(A[i]); }
>
> The transforming pass has to consider this during its profitability
> model. The big advantage is that in terms of correctness, use-def
> chains do not manifest false dependencies.
>
>
> Possible Transformations
> ------------------------
> In principle, any transformation could be applied on this
> representation, but instruction-centric transformations such as
> InstCombine are less suitable. Obviously, loop-centric transformations
> are a better fit. Ideas for loop transformations include:
>
>  * Unroll(-and-jam)
>  * Peeling
>  * Iteration range split, concatenation
>  * Interchange
>  * Loop fusion and distribution(/fission)
>  * Tiling/stripmining/stripemining/blocking
>  * Code versioning (including loop unswitching)
>  * Maybe vectorization
>  * Accelerator offloading
>  * Parallelization
>  * Space-filling curve iteration [12]
>  * Array packing (e.g. for scratch pads), transposition
>  * Statement reordering
>  * SIMT-ization
>  * Detect and optimize reductions
>  * Loop skewing/wavefronting
>  * Loop reversal
>  * Solver-driven (polyhedral) optimization
>  * Loop idiom recognition
>
>
> Optimization Pipeline
> ---------------------
> The optimization pipeline for the hierarchy DAG is straightforward:
>
>     void optimizeLoopNest() {
>       RedOriginal = generateLoopHierarchy();
>       List = createCandidates(RedOriginal);
>       sort(List, [](Cand1,Cand2) { return estimateExecutionTime(Cand2)
>                                                    -
> estimateExecutionTime(Cand1); } );
>       if (List[0] != RedOriginal)
>         emit(List[0]);
>     }
>
> Subtree exploration is done recursively, optimizing inner nodes which
> can be (re-)used when optimizing outer nodes.
>
> The current node to optimize is compared a selection of known
> heuristics and its transformation is added to the list of candidates.
> That is, an optimization routine resemble the following, with more
> specific heuristics checked before more general ones.
>
> void exploreOptimizations(Root, Worklist) {
>    Matches = 0;
>
>    // High level: user annotations
>    if (Root.hasHint()) {
>      // Hint is mandatory: do not explore any other possibility
>      Worklist.replace(Root, applyHint(Root));
>      return;
>    }
>
>     // High level: Idiom recognition
>     if (isBLAS(Root)) {
>       Optimized = LinkAgainstBLASLibary ? convertToBlasLibCall(Root) :
>
> applyBLIS(Root):
>       Worklist.replace(Root, Optimized);
>       return;
>     }
>
>     // Mid level: recognize stencils
>     if (isStencil(Root))
>       ...
>
>     // Mid level: Map to compute hierarchy (here: simd -> SMT -> core
> -> socket for CPUs)
>     if (isParallel(Root) fitsWorkingSet(Root.getBody(), LLCSize) &&
&&
>       isParallel(Root.getPerfectlyNestedLoop(1)) &&
>        fitsWorkingSet(Root.getPerfectlyNestedLoop(2).getBody(), L2Size)
&&
>       isParallel(Root.getPerfectlyNestedLoop(2)) &&
>        fitsWorkingSet(Root.getPerfectlyNestedLoop(3).getBody(), L1Size)
&&
>       isVectorizable(Root.getPerfectlyNestedLoop(3))) {
>         Optimized = applyVectorization(Root.getPerfectlyNestedLoop(3));
>         Optimized = applyParallel(Optimized.getParent(), /*places=*/SMT);
>         Optimized = applyParallel(Optimized.getParent(), /*places=*/Cores);
>         Optimized = applyParallel(Optimized.getParent(),
> /*places=*/Sockets);
>
>         // Still explore CPU pipeline optimizations of innermost
>         Worklist.insert(Root,Optimized);
>         Matches += 1;
>     }
>
>     if (Matches >= Threshold)
>       return;
>
>     // Low level: optimize working set for L1 cache size using tiling
>     Band = maximumTilableNest(Root);
>     WorkingSetSizePerIteration = estimatWorkingSet(Band.back().getBody());
>     TileSize = floor(nthroot(L1CacheSize / WorkingSetSizePerIteration,
> Band.size()));
>     if (TileSize > 1) {
>        Worklist.insert(applyTiling(Root, TilesSize.splat(Band.size()));
>        Matches += 1;
>     }
>
>     if (Matches >= Threshold)
>       return;
>
>     // Low level: vectorization for each SIMD level (potentially
>     // carried-out by VPlan)
>     foreach (Width : promisingSimdWidths(Root)) {
>       Vectorized = applyVectorization(Root, Width);
>       if (Vectorized.isSatisfiable()) {
>         Worklist.insert(Vectorized);
>         Matches += 1;
>       }
>     }
>
>     if (Matches >= Threshold)
>       return;
>
>     // Low level: rely on heuristic to find best unrolling factor
>     if (Factor = unrollFactorHeuristic(Root))  {
>       Worklist.insert(applyUnroll(Root, Factor).getRoot());
>       Matches += 1;
>     }
>
>     if (Matches >= Threshold)
>       return;
>
>     ...
> }
>
>
> The "levels" encode precedence optimizations. Depending on the
> optimization level, we might only apply the most promising
> optimization.
>
> Instead of this brute force implementation, for predefined
> optimization levels, the search space exploration should be guided:
> Encode the order of the most useful transformations in the search
> space exploration. For instance, find all idioms first, then apply
> memory hierarchy optimizations, then compute hierarchy, then CPU
> pipeline optimizations. Since this is very transformation specific, I
> do not expect to write a 'pass manager' where transformations can
> register itself to (except for exhaustive searches when explicitly
> enabled). Instead, I'd write this in code that classifies loop nests
> (e.g. compute bound, bandwidth bound, known idioms, tensor
> contractions, stencils, etc.) and choses appropriate optimization
> paths. We can, of course, make this more generic over time.
>
>
>
> Writing an optimization pass
> -----------------------------
>
> Optimizations themselves are written in the following steps:
> 1. Check preconditions
> 2. Quick profitability analysis to skip obviously bad transformations
> 3. Create new green nodes to represent the changed loop nest; reuse
> unchanged green nodes as much as possible
> 4. Create a red node to insert into the parent
> 5. Preserve analyses (most analyses should be able to use analysis
> stored in reused green nodes and/or lazyly reevaluate new nodes)
> 6. Check whether the transformation was valid
>
> Especially the possibility to analyze the correctness after the
> transformation was applied could significantly reduce the amount of
> code needed to verify the correctness.
>
> A transformation like loop fusion could look like:
>
>   void applyFuse(Red1, Red2) {
>    // Assume Red1 and Red2 are consecutive sibling loops
>
>   // Bail out on untypical things (atomic accesses, exceptions,
> convergent instructions, ...)
>   if (!Red1->isSimple() || !Red2->isSimple())
>     return;
>
>   // Consistency checks (can be less less strict by moving conditions
> into the fused body).
>   if (Red1->getIterationCount() != Red2->getIterationCount())
>     return;
>   if (Red1->getPredicate() != Red2->getPredicate())
>     return;
>
>   // Create new loop hierarchy with fused loops.
>   GreenFused = new GreenLoop({ Red1->getGreen(), Red2->getGreen() },
>
>  Red1->getIterationCount());
>   RedFused = new RedNode(GreenFused, Red1->getParent());
>   NewRoot = RedFused->getRoot();
>
>   // Preserve analysis.
>   // In this case not really necessary since the green nodes of the
>   // loop bodies are reused.
>   DependenceAnalysis->transferAnalysis(Red1->getBody(),
>                                                   
RedFused->getBody()[0],
> ClosedExprAnalysis->makeIdentityMapping(Red1->getIterator(),
>
> RedFused->getIterator()));
>   DependenceAnalysis->transferAnalysis(Red2->getBody(),
>                                                  
RedFused->getBody()[1],
>    ClosedExprAnalysis->makeIdentityMapping(Red2->getIterator(),
>
> RedFused->getIterator()));
>
>   ValidityCondition = DependenceAnalysis->isValid(NewRoot);
>   if (ValidityCondition.isSatisfiable()) {
>     RedFused.addAssumption(ValidityCondition);
>     Candidates.push_back(NewRoot);
>   }
> }
>
>
>
> Questions & Answers
> -------------------
>
> Q: Is the loop hierarchy a tree or DAG?
> A: Both. A green node can be reused multiple times in the same
> structure (e.g. due to unrolling, versioning, peelings, etc), but a
> red node is specific to the position in its parent; there can be
> multiple red nodes for the same green node.
>
>
> Q: No SSA?
> A: Since the location of a definition depends on the parent in a green
> DAG, possibly from different loop hierarchies after transformations,
> it is not possible to directly reference the definition with its use.
> However, it is possible in the red tree, but runs against the goal of
> avoiding heavy red nodes. Instead, to find the previous definition,
> one has to traverse the green no up to find at which sibling level the
> value is defined. However, this should not be a common query; loop
> optimizations should use dependency information instead.
>
>
> Q: Loops with multiple exits?
> A: If every exiting edge a unique number to its __???_executed
> variable, it can be handled like a switch.
>
>
> Q: Unstructured control flow?
> A: Unstructured control flow only introduces disjunctions to statement
> predicated. That is, the predicate
>
>     a or b
>
> corresponds to block BB2 the CFG
>
>     BB0:
>       br i1 %a, label %BB2, label %BB2
>
>     BB1:
>       br i1 %b, label %BB2, label %BB3
>
>     BB2:
>
> In structured control flow, BB1 and/or BB2 would be post-dominated by
> BB2 and the disjunction disappear.
>
>
> Q: Irreducible loops?
> A: Can be folded into the first iteration of the loop by including a
> condition for statements not executed during the first iteration.
>
>
> Q: The unreachable terminator?
> A: If due to a non-returning function call, this introduces a loop
> exit (which will also introduce control a dependency to everything
> that follows) to the surrounding loops. If due to an assumption, we
> may add the branching condition to the sufficient condition, but not
> to the necessary condition.
>
>
> Q: Infinite loops?
> A: The closed-form expression identifying a statement instances and
> loop trip count have no upper bound. Either use arbitrary range
> arithmetic, use relative dependence vectors and/or fall back on
> pessimistic assumptions. In addition, like with the unreachable
> terminator, this may be handled as another loop exit.
>
>
> Q: IndirectBr?
> Since the list of possible targets is part of the instructions, can be
> handled like a switch over the target address
>
>
> Q: Loop guards?
> A: Loop guards manifests as the execution condition of the loop
> statement's predicate. A loop whose predicate evaluates to false can
> be considered a loop with 0 iterations. The IR generator can also emit
> the predicate as a loop guard. It might be useful to ensure that if
> the predicate evaluates to true, the loop has at least one iteration
> and the loop-condition is not evaluated for the first iteration (like
> LoopRotation).
>
>
> Q: Exceptions?
> A: Could also be handled with control dependencies, but I would
> instead mark the surrounding loops with "contains unpredictable
> control flow" to be ignored by most transformations. Similar to
> `!isSimple()` of LoadInst and StoreInst.
>
>
> Q: Relation to the new/legacy pass manager?
> A: LLVM's pass managers are unfortunately not designed to apply to
> subtrees nor persistent data structures besides the LLVM-IR itself.
> Instead, the loop tree optimizer would be its own monolithic pass on
> the pass manager level (like MachinePassManager and VPlan). My idea is
> to add it somewhere before LoopVectorize, but after the inliner,
> potentially replace most other loop transformations. There should be
> just one instance of the pass in the pipeline since creating an
> alternative representation of LLVM-IR is not that cheap. It does not
> need IR normalization (LoopSimplify, LICM, LoopRotate, LCSSA); these
> can be performed on the loop tree and emitted into the LLVM-IR if a
> non-modified tree is chosen. Possibly some of these passes can be
> removed.
> In order to preserve user-directed transformations (such as "#pragma
> clang loop distribute"), it may also be interesting to put it earlier
> into the pipeline to avoid that other passes such as LoopFullUnroll
> change the loop before the transformation can be applied.
> Alternatively, such passes can be taught the check the presence of a
> user-directed transformation before applying its transformation.
>
>
> Q: Relation to LoopVectorize/VPlan?
> A: VPlan has similar design goals [9] but is intended for
> vectorization only. It also has a hierarchical data structure, but
> uses a CFG for acyclic control flow. It is also intended to have
> multiple variants of the same source IR from which the best is
> selected via a cost function. However, it lacks cheap copies. Instead
> of instructions, it uses recipes/"meta-instructions" that handle
what
> happens to instructions after vectorization, e.g. do that operation on
> each vector lane ("WIDEN"). VPlan is more oriented towards
modifying
> instructions instead of statements as collection of instructions. With
> the loop hierarchy, I do not plan to redo the work of VPlan, but also
> do not see a reason why it should not also be able to do
> vectorization.
>
>
> Q: Relation to Polly?
> A: Polly uses ISL's schedule tree as the representation of the loop
> hierarchy. Its creation may fail of something not representable in a
> schedule tree is found. Polly tries to find the maximum representable
> region, but once it is chosen, it cannot be changed (e.g. because
> optimizing it takes too long). The schedule tree is stored internally
> by ISL as a kind of green tree, but is walked over by an iterator that
> remembers the path to the root. While this could be used to select a
> best schedule tree using a cost function, Polly relies mostly on ISL's
> reschedule algorithm to find a profitable transformation.
>
>
> Q: Relation to MLIR?
> A: MLIR is more similar to LLVM-IR than a loop hierarchy. For
> instance, it also does not feature cheap copies. An advantage is that
> loops and multi-dimensional arrays can be represented in the language
> without the need of being rediscovered, but have to be inserted by a
> front-end. That is, if Clang was generating MLIR, loops and arrays
> still have to be rediscovered. However, a loop hierarchy optimizer
> could be applied to MLIR just as well as to LLVM-IR.
>
>
> References
> ----------
> [1] http://llvm.org/devmtg/2018-10/talk-abstracts.html#talk11
> [2] https://ieeexplore.ieee.org/abstract/document/5389392/
> [3] http://sro.sussex.ac.uk/id/eprint/7576/1/Stanier,_James.pdf
> [4]
>
https://blogs.msdn.microsoft.com/ericlippert/2012/06/08/persistence-facades-and-roslyns-red-green-trees/
> [5] https://github.com/Meinersbur/llvm-project/tree/LOF/llvm/lib/LOF
> [6] https://llvm.org/devmtg/2017-10/#src2
> [7] https://arxiv.org/abs/1811.00632
> [8] https://lists.llvm.org/pipermail/llvm-dev/2017-October/118125.html
> [9] https://llvm.org/docs/Proposals/VectorizationPlan.html
> [10] https://github.com/open64-compiler/open64/tree/master/osprey/be/lno
> [11] https://reviews.llvm.org/D70688
> [12] https://arxiv.org/abs/1801.07399
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

-- 
Founder and Director, PolyMage Labs
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200128/cfaeddc1/attachment-0001.html>

Michael Kruse via llvm-dev

2020-Jan-30 09:05 UTC

head link

[llvm-dev] Writing loop transformations on the right representation is more productive

Am Mo., 27. Jan. 2020 um 22:06 Uhr schrieb Uday Kumar Reddy Bondhugula <
uday at polymagelabs.com>:
> Hi Michael,
>
> Although the approach to use a higher order in-memory abstraction like the
> loop tree will make it easier than what you have today, if you used MLIR
> for this representation, you already get a round trippable textual format
> that is *very close* to your form. The affine.for/if, std.for/if in MLIR
> are nearly isomorphic to the tree representation you want, and as such,
> this drastically reduces the delta between the in-memory data structures
> your passes want to operate on and what you see when you print the IR.
> Normally, there'd be resistance to building a textual form /
introducing a
> first class concept in the IR for what you are proposing, but since this
> was already done for MLIR, it looks like it would be a big win from a
> compiler developers' productivity standpoint if you just used MLIR for
this
> loop tree representation. With regard to the FAQ, I can't tell whether
you
> meant something else or missed the representation used in MLIR for the
> affine dialect or in general for "ops with regions".
>
The point of the proposal is not having a first-class construct for loops,
but to allow speculative transformations. Please my response to Chris
Lattner:
https://lists.llvm.org/pipermail/llvm-dev/2020-January/138778.html

> > Q: Relation to MLIR?
> > A: MLIR is more similar to LLVM-IR than a loop hierarchy. For
>
> It's completely the opposite unless you are looking only at MLIR's
std
> dialect! The affine dialect as well as the std.for/if (currently misnamed
> as loop.for/loop.if) are actually a loop tree. The affine ops are just an
> affine loop AST isomorphic to the materialization of polyhedral
> domains/schedules via code generation. Every IST AST or the output of
> polyhedral code generation can be represented in the affine dialect and
> vice versa. MLIR's loop/if ops are a hierarchy rather than a flat form
/
> list of blocks CFG.
>
As per my discussion with Chris Lattner, this is a very subjective
question. It might be controversial, but I don't see MLIR regions as much
more than syntactic sugar for inlined function calls that allow referencing
the outer regions' definitions. This does not mean that I think they are
they are useless, on the contrary.

Regarding the affine dialect, I see the same problem that Polly has when
creating a schedule tree representation: A lot of work has to be done to
make IR originating from Clang compatible. Everything becomes easy if the
front-end can generate an affine dialect out-of-the box.


> > still have to be rediscovered. However, a loop hierarchy optimizer
> > could be applied to MLIR just as well as to LLVM-IR.
>
> This is true, but it's easier to apply it to MLIR because the actual IR
is
> close by miles to the in-memory thing your loop hierarchy optimizer would
> be using. For eg., here's the input IR and the output IR of a simple
outer
> loop vectorization performed in MLIR:
>
>
https://github.com/bondhugula/llvm-project/blob/hop/mlir/test/Transforms/vectorize.mlir#L23
>
>Again, the proposal is about the in-memory representation using red/green
trees (which I firmly disagree with being close to MLIR's in-memory
representation), not the textual format.


Michael
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200130/f95bb574/attachment.html>

llvm dev - Jan 2020 - [RFC] Writing loop transformations on the right representation is more productive

[llvm-dev] Writing loop transformations on the right representation is more productive

[llvm-dev] RFC: Writing loop transformations on the right representation is more productive

[llvm-dev] Writing loop transformations on the right representation is more productive

[llvm-dev] Writing loop transformations on the right representation is more productive

[llvm-dev] [RFC] Writing loop transformations on the right representation is more productive

[llvm-dev] [RFC] Writing loop transformations on the right representation is more productive

[llvm-dev] [RFC] Writing loop transformations on the right representation is more productive

[llvm-dev] Writing loop transformations on the right representation is more productive

[llvm-dev] Writing loop transformations on the right representation is more productive

[llvm-dev] Writing loop transformations on the right representation is more productive

[llvm-dev] Writing loop transformations on the right representation is more productive