thr3ads.net - llvm dev - [llvm-dev] Writing loop transformations on the right representation is more productive [Jan 2020]

If this information is useful, please help other people find it:
Share via:

Michael Kruse via llvm-dev

2020-Jan-11 00:33 UTC

[llvm-dev] Writing loop transformations on the right representation is more productive

Am Fr., 10. Jan. 2020 um 16:10 Uhr schrieb Renato Golin <rengolin at
gmail.com>:> On Fri, 3 Jan 2020 at 11:27, Michael Kruse via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
> > 1. Candidate selection through cost function
> > --------------------------------------------
> > Instead of needing to know which transformation is profitable before
> > applying it, create a copy of the data structure, modify it, and
> > compare it to the original. Moreover, holding multiple, differently
> > optimized, copies allows evaluating each variant using a cost function
> > and select the 'best' when re-generating LLVM-IR again (or
re-use the
> > original LLVM-IR).
>
> This sounds a lot like VPlan.
Yes, as mentioned in the Q&A. Unfortunately VPlan is able to represent
arbitrary code not has cheap copies.

> > Instantiating every possible sequence of transformations of course is
> > not feasible, so the search space needs to be pruned. This could be
> > made dependent on the optimization level.
>
> Are you planning on using heuristic searches? This could make the
> vectoriser unstable upon small input changes and therefore hard to get
> consistent results and testing.
>
> I'm not against such idea, but I think we need to be conservative in
> such a core component of the compiler.
>
> It would be nice to have -Ogocrazy to mean "keep going until you find
> something", but usually, -O3 should terminate. :)
I agree, as outlined in the RFC under "predefined optimization
levels".

> > 3. Apply transformations from high-level to low-level
> > -----------------------------------------------------
> > Optimization should be applied from very specialized to very general
> > (potentially after some canonicalization). For instance, the first
> > step could be detecting common idioms such as gemm and replace them
> > with either a BLAS function call or apply well-studied optimizations
> > like BLIS to them. After such an idiom has been detected, no other
> > transformation should be applied to them.
>
> I'm sceptical to such a machinery. People usually write bad code (me
> included) and trying to mach multiple patterns to the same semantics
> will be hard, considering how lenient C++ is to pointer handling and
> type conversions.
This conversion is a possibility and certainly not the main motivation
for a loop hierarchy. Smaller idioms exists as well, such as detecting
popcount. Even with gemm I think it would be nice if it could be
written in a naive version in the source code that compiles with any
compiler, but also benefit from the target platform's hand-optimized
performance primitives by adding a compiler switch (which could be
-O3).


> > Mid-level transformations may try to map entire loop nests to cache-
> > and compute hierarchies (SIMT threads, multiprocessors, offloading,
> > etc) by applying transformations such as tiling, loop interchange and
> > array packing.
>
> This is hard but very profitable. However, feels to me again that this
> is just VPlan packed differently.
>
> While VPlan still has no way yet to handle even simple outer-loops
> (has that landed yet?), once we do, then the natural progression will
> be to start understanding their semantics and possibly make high level
> assumptions like that.
I wouldn't have thought that parallelization and offloading was ever
considered on top of VPlan.


> > 6. Late fallback versioning at IR regeneration
> > ------------------------------------------
> > When a transformation is applied, it can attach conditions (no
> > aliasing, no integer wrap, value restrictions, etc.) under which the
> > transformation is valid. During LLVM-IR generation, these conditions
> > are collected and emitted as run-time conditions. If the condition
> > fails, the original code is executed.
>
> This sounds like it will bloat code for a lot of cold cases. Or worse,
> get it wrong, and put hot code in the cold path.
Are you arguing against code versioning? It is already done today by
multiple passes such as LoopVersioningLICM, LoopDistribute,
LoopUnrollAndJam and LoopVectorize. The proposal explicitly tries to
avoid code bloat by having just one fallback copy. Runtime conditions
can be chosen more or less optimistically, but I don't see how this
should be an argument for all kinds of versioning.

If you are concerned about bloat in cold paths, we could use profile
information to optimize cold functions with '-Os', something that GCC
does, but not Clang.

> > 7. Standardized API for analyses with multiple implementations
>
> These are good to have regardless of which vectorisation strategy we use.
In LLVM, AliasAnalysis does this, but hat not yet found another application.


> > 8. Abstract representation of statements
> > ----------------------------------------
> > For instance, assuming that %add is not used a second time, in
> > the example below
> >
> >     %add = add i64 %arg, 2
> >     %mul = shl i64 %add, 1
> >
> > the two instructions should always be computed together in the same
> > loop iteration.
>
> This may inhibit further combines, or even detection of target
> specific patterns for SIMD code that aren't common.
>
> I agree that not forcefully binding statements with instructions is a
> good idea, but this may need a target-specific pattern matcher to be
> more sensitive to target idiosyncrasies.
My idea here is that loop-level optimizations rarely need to know
which target-specific instructions are executed, as long as it knows
its performance-relevant properties. This might be a difference to
vectorization which may be more ISA-specific.


> > 9. Expansion of use-def chains to arrays when spanning loops
> > ------------------------------------------------------------
> > The transforming pass has to consider this during its profitability
> > model. The big advantage is that in terms of correctness, use-def
> > chains do not manifest false dependencies.
>
> Sounds good, but also creates the problem of how to handle the array.
> If 'n' is unknown, or dependent on SIMD widths or number of
threads,
> it's too low level to add anything that is guaranteed to not change
> the performance profile of the original loop.
As mentioned, the profitability model has to take this into account.
Conservatively, we may only do this if the resulting array is a small
constant size such that we can assume that even multiple of those fir
on the stack.

> > Q: Relation to the new/legacy pass manager?
> > A: LLVM's pass managers are unfortunately not designed to apply to
> > subtrees nor persistent data structures besides the LLVM-IR itself.
>
> By design. The more alternative persistent data structures you have
> being modified by a series of passes, the harder it is to know what
> did what and where you are.
The proposal avoids persistent data structures between separate passes.

Note that MachineFunctionPass maintains the MachineFunction data
structure in parallel to the LLVM-IR.

> > Instead, the loop tree optimizer would be its own monolithic pass on
> > the pass manager level (like MachinePassManager and VPlan). My idea is
> > to add it somewhere before LoopVectorize, but after the inliner,
> > potentially replace most other loop transformations.
>
> To me this almost sounds like Polly. Take LLVM IR into a completely
> different representation, do a bunch of transformations with it,
> re-generate LLVM IR and spits it back into the pipeline.
There is indeed an inspiration from Polly.

> By that time, all analyses have to be invalidated. All
> canonicalisations that had been done will probably be destroyed and
> many current pattern matches will stop working. This infrastructure is
> only meaningful at the function level or higher, so the potential for
> wide range destruction is not trivial.
>
> Don't get me wrong, I like the idea, it's a cool experiment using
some
> cool data structures and algorithms. But previous experiences with the
> pass manager have, well, not gone smooth in any shape or form.
What experiments? I don't see a problem if the pass manger has to
invalidate analysis are re-run canonicalization passes. This happens
many times in the default pass pipelines. In addition, this
invalidation is only necessary if the loop optimization pass optimizes
something, in which case the additional cost should be justified.

> > Q: Relation to LoopVectorize/VPlan?
> > A: VPlan has similar design goals [9] but is intended for
> > vectorization only.
>
> Again, by a conservative design. I think adding yet-another won't help.
>
> My point is: if this is the way to go, then we should start to think
> how we make everything that makes sense become part of this scheme.
> Splitting the pass manager into SSA and Tree, run some passes in one
> others in the other, and so on.
>
> But creating multiple, incompatible and potentially destructive whole
> new pass managers will make a hard job impossible.
I don't think the proposal qualifies as including a full-flexible new
pass manger, at least no more than the current mechanism LoopVectorize
uses to run passes on VPlan (LoopVectorizationPlanner::plan).

> > However, it lacks cheap copies. Instead
> > of instructions, it uses recipes/"meta-instructions" that
handle what
> > happens to instructions after vectorization, e.g. do that operation on
> > each vector lane ("WIDEN").
>
> Nothing stops us from implementing a leaner approach to VPlan. It
> wouldn't be a trivial implementation, but the volume of work that
> would be required in this proposal is staggering, too.
>
> > VPlan is more oriented towards modifying
> > instructions instead of statements as collection of instructions.
>
> Fair enough, the design was to enhance SIMD code generation, not any
> kind of parallel semantics. I guess it would be possible to add the
> concept of higher level blocks to VPlan.
>
> All in all, VPlan is young and in constant refactoring, and perhaps it
> would be more productive to move it towards a more inclusive approach
> than throwing it away before it fully matures to start a whole new
> project.
While I still think the goals of VPlan and a loop hierarchy are
different, I expect VPlan to be production-ready earlier than this
proposal. I fear that combining them would delay the both.

> https://xkcd.com/927/
While I can never find this xkcd not funny, a the loop hierarchy is
not intended to be universal.

> > Q: Relation to MLIR?
> > A: MLIR is more similar to LLVM-IR than a loop hierarchy. For
> > instance, it also does not feature cheap copies.
>
> If you treat MLIR as your red tree, you could create a green tree
> (perhaps as a dialect) and have cheap copies (passing the dialects and
> deltas without passing the base).
I don't see how this could work.

> > An advantage is that
> > loops and multi-dimensional arrays can be represented in the language
> > without the need of being rediscovered, but have to be inserted by a
> > front-end.
>
> Not necessarily. We have discussed introducing dialect annotation to
> MLIR during compile time from analysis passes that would basically do
> what the front-end should have done.
The argument is that MLIR has first-class expressions for
multi-dimensional array accesses ("MemRef") while LLVM-IR does not.

https://lists.llvm.org/pipermail/llvm-dev/2019-July/134063.html

Both of them can have analyses to raise the abstraction to a
multi-dimensional access ("delinearization").

> Conclusions?
>
> This was a long email, with too many proposals, so I don't have any
> strong opinion or conclusions, not even from my own comments.
Thank you for going through it!

> Overall, I like a lot of the ideas (red/green, tree optimization,
> different search strategy), but I dislike the encompassing proposal to
> *replace* a lot of the existing infrastructure.
Not a replacement, but an addition that does not always need to be
enabled (e.g. -O0).

In a previous RFC [8] I tried to NOT introduce a data structure but to
re-use LLVM-IR. The only discussion there was about the RFC, was about
not to 'abuse' the LLVM-IR.

https://lists.llvm.org/pipermail/llvm-dev/2017-October/118169.html
https://lists.llvm.org/pipermail/llvm-dev/2017-October/118258.html

I definitely see the merits of using fewer data structures, but it is
also hard to re-use something existing for a different purpose (in
this case: VPlan) without making both more complex.

> For better or worse, LLVM is a product of its age. Some things could
> have been done better, but we have always adopted the "general
> consensus and slow movement" way to change things. Sometimes too slow,
> but...
>
> Now, MLIR can be a way to speed that up.
>
> It is a much more malleable format than LLVM IR, it was designed for
> high-level representation, has a lot of parallelism concepts in it and
> it's supposed to interact with LLVM IR seamlessly.
>
> It may be much easier to use MLIR to interoperate the two "pass
> managers" _together_ than converting from one to the other and vice
> versa.
>
> This is a bold claim and I have no evidence it could ever work. But I
> think it would still be less work than creating yet another pass
> manager from scratch.
This is why I don't want the framework to be too tangled with LLVM-IR.
For the foreseeable future, Clang will generate LLVM-IR, but our
motivation is to (also) optimize C/C++ code. That is, I do not see a
way to not (also) handle LLVM-IR until Clang is changed to generate
MLIR (which then again will be another data struture in the system).


Michael

Renato Golin via llvm-dev

2020-Jan-11 17:43 UTC

head link

[llvm-dev] Writing loop transformations on the right representation is more productive

On Sat, 11 Jan 2020 at 00:34, Michael Kruse <llvmdev at meinersbur.de>
wrote:> Yes, as mentioned in the Q&A. Unfortunately VPlan is able to represent
> arbitrary code not has cheap copies.
Orthogonal, but we should also be looking into implementing the cheap
copies in VPlan if we want to search for composable plans.

> This conversion is a possibility and certainly not the main motivation
> for a loop hierarchy.
I know. There are many things that can be done with what you propose,
but we should focus on what's the main motivation.
>From what I can tell, the tree representation is a concrete proposalfor the many year discussion about parallel IR.

The short paper doesn't mention that, nor it discusses other
opportunities to fix pipeline complexity (that is inherent of any
compiler).

I still believe that many of the techniques you propose are meaningful
ways to solve them, but creating another IR will invariably create
some adoption barriers.

Especially when we already have VPlan and MLIR converging now, which
will need to find their own spaces, too.

> I wouldn't have thought that parallelization and offloading was ever
> considered on top of VPlan.
I don't see why not. VPlan is a structure for picking a path through
composable transformations.

While so far it's being mainly focused at replacing the monolithic
vectorisation, there are concrete plans to look at composition and
more complex idioms.

> Are you arguing against code versioning? It is already done today by
> multiple passes such as LoopVersioningLICM, LoopDistribute,
> LoopUnrollAndJam and LoopVectorize. The proposal explicitly tries to
> avoid code bloat by having just one fallback copy. Runtime conditions
> can be chosen more or less optimistically, but I don't see how this
> should be an argument for all kinds of versioning.
No. I'm cautious to the combination of heuristics search and
versioning, especially when the conditions are runtime based. It may
be hard to CSE them later.

The paths found may not be the most optimal in terms of intermediate states.

> > Don't get me wrong, I like the idea, it's a cool experiment
using some
> > cool data structures and algorithms. But previous experiences with the
> > pass manager have, well, not gone smooth in any shape or form.
>
> What experiments? I don't see a problem if the pass manger has to
> invalidate analysis are re-run canonicalization passes. This happens
> many times in the default pass pipelines. In addition, this
> invalidation is only necessary if the loop optimization pass optimizes
> something, in which case the additional cost should be justified.
My point goes back to doing that in VPlan, then tree. The more
back-and-forth IR transformations we add to the pipeline, the more
brittle it will be.

The original email also proposes, for the future, to do all sorts of
analyses and transformations in the tree representation, and that will
likely be incompatible with (or at least not propagated through) the
conversions.

> I don't think the proposal qualifies as including a full-flexible new
> pass manger, at least no more than the current mechanism LoopVectorize
> uses to run passes on VPlan (LoopVectorizationPlanner::plan).
Sorry, that came out stronger than it should have been. I agree it's
not a "whole new pass manager".

> While I still think the goals of VPlan and a loop hierarchy are
> different, I expect VPlan to be production-ready earlier than this
> proposal. I fear that combining them would delay the both.
I get it, but I fear taking a completely different approach may make
it harder to get your proposal to show benefits any time soon.

> > https://xkcd.com/927/
>
> While I can never find this xkcd not funny, a the loop hierarchy is
> not intended to be universal.
Sorry, poetic license. :)

I tried to reflect the perils of creating too many, sometimes competing, IRs.

> In a previous RFC [8] I tried to NOT introduce a data structure but to
> re-use LLVM-IR. The only discussion there was about the RFC, was about
> not to 'abuse' the LLVM-IR.
>
> https://lists.llvm.org/pipermail/llvm-dev/2017-October/118169.html
> https://lists.llvm.org/pipermail/llvm-dev/2017-October/118258.html
>
> I definitely see the merits of using fewer data structures, but it is
> also hard to re-use something existing for a different purpose (in
> this case: VPlan) without making both more complex.
My point about avoiding more structures and IRs was related to VPlan
and MLIR, not LLVM-IR.

I agree there should be an abstraction layer to do parallelisation
analysis, but we already have two, and I'd rather add many of your
good proposals on those than create a third.

Perhaps it's not clear how we could do that now, but we should at
least try to weigh the options.

I'd seriously look at adding a tree-like annotation as an MLIR
dialect, and use it for lean copies.

> For the foreseeable future, Clang will generate LLVM-IR, but our
> motivation is to (also) optimize C/C++ code. That is, I do not see a
> way to not (also) handle LLVM-IR until Clang is changed to generate
> MLIR (which then again will be another data struture in the system).
Even if/when Clang generates MLIR, there's no guarantee the high-level
dialects will be preserved until the vectorisation pass. And other
front-ends may not generate the same quality of annotations.

We may have to re-generate what we need anyway, so no point in waiting
all the front-ends to do what we need as well as all the previous
passes to guarantee to keep it.

cheers,
--renato

Michael Kruse via llvm-dev

2020-Jan-15 04:39 UTC

head link

[llvm-dev] Writing loop transformations on the right representation is more productive

Am Sa., 11. Jan. 2020 um 07:43 Uhr schrieb Renato Golin <rengolin at
gmail.com>:
> On Sat, 11 Jan 2020 at 00:34, Michael Kruse <llvmdev at
meinersbur.de> wrote:
> > Yes, as mentioned in the Q&A. Unfortunately VPlan is able to
represent
> > arbitrary code not has cheap copies.
>
> Orthogonal, but we should also be looking into implementing the cheap
> copies in VPlan if we want to search for composable plans.
VPlan structures have many references to neighboring structures such as
parents and use-def chains. This makes adding cheap copies as an
afterthought really hard.

> > This conversion is a possibility and certainly not the main motivation
> > for a loop hierarchy.
>
> I know. There are many things that can be done with what you propose,
> but we should focus on what's the main motivation.
>
> From what I can tell, the tree representation is a concrete proposal
> for the many year discussion about parallel IR.
As I recall, the Parallel IR approaches were trying to add parallel
constructs to the existing LLVM-IR. This added the issue that the current
infrastructure suddenly need to handle those as well, becoming a major
problem for adoption.

> The short paper doesn't mention that, nor it discusses other
> opportunities to fix pipeline complexity (that is inherent of any
> compiler).
>
> I still believe that many of the techniques you propose are meaningful
> ways to solve them, but creating another IR will invariably create
> some adoption barriers.
I see it as an advantage in respect of adoption: It can be switched on and
off without affecting other parts.

> > Are you arguing against code versioning? It is already done today by
> > multiple passes such as LoopVersioningLICM, LoopDistribute,
> > LoopUnrollAndJam and LoopVectorize. The proposal explicitly tries to
> > avoid code bloat by having just one fallback copy. Runtime conditions
> > can be chosen more or less optimistically, but I don't see how
this
> > should be an argument for all kinds of versioning.
>
> No. I'm cautious to the combination of heuristics search and
> versioning, especially when the conditions are runtime based. It may
> be hard to CSE them later.
>
> The paths found may not be the most optimal in terms of intermediatestates.

Versioning is always a trade-off between how likely the preconditions apply
and code size (and maybe how expensive the runtime checks are). IMHO this
concern is separate from how code versioning is implemented.

> > > Don't get me wrong, I like the idea, it's a cool
experiment using some
> > > cool data structures and algorithms. But previous experiences
with the
> > > pass manager have, well, not gone smooth in any shape or form.
> >
> > What experiments? I don't see a problem if the pass manger has to
> > invalidate analysis are re-run canonicalization passes. This happens
> > many times in the default pass pipelines. In addition, this
> > invalidation is only necessary if the loop optimization pass optimizes
> > something, in which case the additional cost should be justified.
>
> My point goes back to doing that in VPlan, then tree. The more
> back-and-forth IR transformations we add to the pipeline, the more
> brittle it will be.
Agreed, but IMHO this is the price to pay for better loop optimizations.

> The original email also proposes, for the future, to do all sorts of
> analyses and transformations in the tree representation, and that will
> likely be incompatible with (or at least not propagated through) the
> conversions.
Correct, but I'd argue these are different kinds of analyses not
necessarily even useful for different representations. MLIR also has its
set of analyses separate to those on MLIR.

> > In a previous RFC [8] I tried to NOT introduce a data structure but to
> > re-use LLVM-IR. The only discussion there was about the RFC, was about
> > not to 'abuse' the LLVM-IR.
> >
> > https://lists.llvm.org/pipermail/llvm-dev/2017-October/118169.html
> > https://lists.llvm.org/pipermail/llvm-dev/2017-October/118258.html
> >
> > I definitely see the merits of using fewer data structures, but it is
> > also hard to re-use something existing for a different purpose (in
> > this case: VPlan) without making both more complex.
>
> My point about avoiding more structures and IRs was related to VPlan
> and MLIR, not LLVM-IR.
>
> I agree there should be an abstraction layer to do parallelisation
> analysis, but we already have two, and I'd rather add many of your
> good proposals on those than create a third.
>
> Perhaps it's not clear how we could do that now, but we should at
> least try to weigh the options.
>
> I'd seriously look at adding a tree-like annotation as an MLIR
> dialect, and use it for lean copies.
Like VPlan, MLIR is a representation with many references between objects
from different levels. I do not see how to add cheap copies as an
afterthought.


> > For the foreseeable future, Clang will generate LLVM-IR, but our
> > motivation is to (also) optimize C/C++ code. That is, I do not see a
> > way to not (also) handle LLVM-IR until Clang is changed to generate
> > MLIR (which then again will be another data struture in the system).
>
> Even if/when Clang generates MLIR, there's no guarantee the high-level
> dialects will be preserved until the vectorisation pass.
I'd put loop optimizations earlier into the pipeline than vectorization.
Where exactly is a phase ordering problem. I'd want to at least preserve
multi-dimensional subscripts. Fortunately MemRef is a core MLIR construct
and unlikely to be lowered before lowering to another representation
(likely LLVM-IR).

> And other
> front-ends may not generate the same quality of annotations.
> We may have to re-generate what we need anyway, so no point in waiting
> all the front-ends to do what we need as well as all the previous
> passes to guarantee to keep it.
I don't see how this is relevant for a Clang-based pipeline. Other
languages likely need a different pipeline than one intended for C/C++ code.

There are not a lot of high-level semantics required to be preserved to
build a loop hierarchy.

Thanks for the productive discussion,
Michael
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200114/a658d69e/attachment.html>

llvm dev - Jan 2020 - Writing loop transformations on the right representation is more productive

[llvm-dev] Writing loop transformations on the right representation is more productive

[llvm-dev] Writing loop transformations on the right representation is more productive

[llvm-dev] Writing loop transformations on the right representation is more productive