Later today, I'm going to be reverting D87551. I first raised serious
concern on said review back in Oct, but this is a bit of an unusual case
because the change landed roughly a year before that.
This patch introduced a profile driven heuristic to selectively disable
hoisting of instructions out of loops. By doing so, it changes a long
standing design element without broad consensus following discussion on
llvm-dev. However, this email isn't really about the revert per se.
In the course of the discussion leading to this point, I realized we
didn't really have a cite-able resource describing the historical
design. This email is an attempt to provide that, and to highlight some
of the issues which need addressed if we do decide we want to change it.
*LICM as canonical form*
We have for many years treated hoisting instructions out of loops as a
canonical form. That is, hoisting is not done because it is profitable
(though it often is), but is instead done so that other parts of the
optimizer can rely on it.
We assume that an unprofitable hoist will be undone. Historically, we
have generally assumed this to be done in the backend, but more
recently, LoopSink has also been added towards the end of the IR
pipeline with the same goal.
*Why does this matter?*
Other transforms depend on us having hoisted instructions out of loops
for effectiveness. The largest source of such assumptions is that SCEV
is unable to compute trip counts for any exit condition involving a loop
varying load. Almost all of our loop transformations depend on SCEVs
trip count logic, so failing to hoist an otherwise hoistable load is a
severe pessimization.
For illustration purposes, consider this toy example:
for (int i = 0; ; i++) {
sum += a[i] + *b;
length = a.length;
if (i >= length) break;
}
This example involves a typical for-loop for which the exit test depends
on a loop varying load.
Here's a couple examples:
1. In unrolling, the form above is not unrollable. The trip count is
unknowable. We might be able to use profile information to do a
bounded full unroll if this loop is short running, but all other
forms of unrolling (exact full, partial, and runtime) will be
impossible.
2. In the vectorizer, we will be unable to establish a trip count, and
thus will not vectorize. Additionally, even if we can compute a
trip count, the cost model handling for uniformity depends on hoisting.
Other impacts worth noting
1. In loop idiom recognize, we will fail to recognize most counted
idioms (e.g. popcount, cttz). Additionally, things like memset
recognition will not happen if the value being stored was hoistable,
but not hoisted.
2. Our ability to analyze dominating conditions (e.g. cvp,
valuetracking, SCEV's isKnownPredicateAt) will all be crippled by
the inability to recognize values are loop invariant. When the RHS
of a comparison is a potentially different value every time it runs,
it really limits our ability to derive useful knowledge from that
comparison or cross correlate comparisons.
*But what about an unprofitable hoist?*
There are examples where hoisting is not profitable. Here's one such
example:
for (int i = 0; i < N; i++) {
if (dynamically_always_taken) continue;
sum += a[i];
length = a.length;
if (i >= length) break;
}
Our general posture has been that we will perform hoisting in the middle
end, and then undo that hoisting if needed later in the pass pipeline.
The basic reason for that is that it is nearly impossible to distinguish
profitable from unprofitable cases because the profitability of the
transform depends too heavily on which following transforms might run.
Here's a small example which might at first seem unprofitable - inspired
by the patch being reverted - but where hoisting is in fact the far more
profitable outcome.
for (int i = 0; i < N; i++) {
i8* addr = a;
if (invariant_cond_usually_false) {
// very, very rare block e.g. 1 in 100 million
addr = a + 1;
}
*addr = 0
}
Subtly, this example should be profitable to hoist even if the rare
condition is not invariant. We still know this loop writes to at most
two memory locations. While we might not exploit that fact today, an
extended form of load-store promotion could do so. If we don't hoist the
addressing expression under the rare branch, we can not (in general)
determine that at most two locations are written.
I will note that LoopSink appears to be a bit restricted in practice.
Someone with unprofitable examples could reasonably push this much further.
*How would we change this?*
I want to be very explicit about saying this design is only one
reasonable design. It would also be entirely reasonable to build a
design around a profit driven LICM. That's simply not what we have
today. The remainder of this section is about expanding on the work
which would need to be undertaken to make such a change.
First, we would need a clear set of examples where LICM was truly
unprofitable. These examples would need to be publicly accessible.
They would also need to be fairly minimal. In particular, there must
not be other obvious optimizations which if implemented makes the
hoisting profitable after all.
Second, we would need a proposal to llvm-dev which directly engages with
the fact that SCEV (and thus most of our loop passes) depends on having
loads hoisted for analysis quality. We could build a mechanism in SCEV
to model possible load hoisting. There are some tricky bits in doing
so, but it should be possible in theory.
The main problem with modeling possible hoists in SCEV is the need to
query both memory analysis and fault legality efficiently. Figuring out
how to make that available for all users of SCEV without introducing
nasty invalidation bugs or degenerate compile times is a hard design
problem, but might be feasible. (I think that MemorySSA gives an
interesting building block here, but have not deeply considered this.)
There's also an API design problem in making sure that analysis results
can't be consumed without committing to the hoisting in the IR. A
transform which e.g. assumes a trip count without hoisting the relevant
load would be subtly incorrect.
Third, a consensus must be built that the resulting additional
complexity for the mechanism build to address the previous point is
worthwhile for the project as a whole. This will be a judgement call
and would depend heavily on the solution chosen for the previous point.
Finally, to be explicit, I am using the SCEV use case by way of an
example only. There may be other ways that we depend on hoisting as a
canonical form that I did not happen to think of when writing this
email. The burden is on the person or person proposing a change to
identify any other dependencies, and to convince the community they have
done so. Discussion of a testing strategy to find those dependencies
should be a first class concern in any proposal.
Philip
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20211203/4a41d395/attachment.html>
I am in support of your proposal. Checking whether an llvm::Value is not defined by an instruction in the loop is the most straightforward way to determine whether a value is loop-invariant. Cf D87551 the profile information should probably be used in LoopSink or similar pass determining whether the only use of an instruction is so rare that it is worth moving into the conditional execution in the loop, with the added benefit that it would also sink instructions that were not in the loop in the the first place. Michael Am Fr., 3. Dez. 2021 um 17:28 Uhr schrieb Philip Reames via llvm-dev <llvm-dev at lists.llvm.org>:> > Later today, I'm going to be reverting D87551. I first raised serious concern on said review back in Oct, but this is a bit of an unusual case because the change landed roughly a year before that. > > This patch introduced a profile driven heuristic to selectively disable hoisting of instructions out of loops. By doing so, it changes a long standing design element without broad consensus following discussion on llvm-dev. However, this email isn't really about the revert per se. > > In the course of the discussion leading to this point, I realized we didn't really have a cite-able resource describing the historical design. This email is an attempt to provide that, and to highlight some of the issues which need addressed if we do decide we want to change it. > > LICM as canonical form > > We have for many years treated hoisting instructions out of loops as a canonical form. That is, hoisting is not done because it is profitable (though it often is), but is instead done so that other parts of the optimizer can rely on it. > > We assume that an unprofitable hoist will be undone. Historically, we have generally assumed this to be done in the backend, but more recently, LoopSink has also been added towards the end of the IR pipeline with the same goal. > > Why does this matter? > > Other transforms depend on us having hoisted instructions out of loops for effectiveness. The largest source of such assumptions is that SCEV is unable to compute trip counts for any exit condition involving a loop varying load. Almost all of our loop transformations depend on SCEVs trip count logic, so failing to hoist an otherwise hoistable load is a severe pessimization. > > For illustration purposes, consider this toy example: > > for (int i = 0; ; i++) { > sum += a[i] + *b; > length = a.length; > if (i >= length) break; > } > > This example involves a typical for-loop for which the exit test depends on a loop varying load. > > Here's a couple examples: > > In unrolling, the form above is not unrollable. The trip count is unknowable. We might be able to use profile information to do a bounded full unroll if this loop is short running, but all other forms of unrolling (exact full, partial, and runtime) will be impossible. > In the vectorizer, we will be unable to establish a trip count, and thus will not vectorize. Additionally, even if we can compute a trip count, the cost model handling for uniformity depends on hoisting. > > Other impacts worth noting > > In loop idiom recognize, we will fail to recognize most counted idioms (e.g. popcount, cttz). Additionally, things like memset recognition will not happen if the value being stored was hoistable, but not hoisted. > Our ability to analyze dominating conditions (e.g. cvp, valuetracking, SCEV's isKnownPredicateAt) will all be crippled by the inability to recognize values are loop invariant. When the RHS of a comparison is a potentially different value every time it runs, it really limits our ability to derive useful knowledge from that comparison or cross correlate comparisons. > > But what about an unprofitable hoist? > > There are examples where hoisting is not profitable. Here's one such example: > > for (int i = 0; i < N; i++) { > if (dynamically_always_taken) continue; > sum += a[i]; > length = a.length; > if (i >= length) break; > } > > Our general posture has been that we will perform hoisting in the middle end, and then undo that hoisting if needed later in the pass pipeline. The basic reason for that is that it is nearly impossible to distinguish profitable from unprofitable cases because the profitability of the transform depends too heavily on which following transforms might run. > > Here's a small example which might at first seem unprofitable - inspired by the patch being reverted - but where hoisting is in fact the far more profitable outcome. > > for (int i = 0; i < N; i++) { > i8* addr = a; > if (invariant_cond_usually_false) { > // very, very rare block e.g. 1 in 100 million > addr = a + 1; > } > *addr = 0 > } > > Subtly, this example should be profitable to hoist even if the rare condition is not invariant. We still know this loop writes to at most two memory locations. While we might not exploit that fact today, an extended form of load-store promotion could do so. If we don't hoist the addressing expression under the rare branch, we can not (in general) determine that at most two locations are written. > > I will note that LoopSink appears to be a bit restricted in practice. Someone with unprofitable examples could reasonably push this much further. > > How would we change this? > > I want to be very explicit about saying this design is only one reasonable design. It would also be entirely reasonable to build a design around a profit driven LICM. That's simply not what we have today. The remainder of this section is about expanding on the work which would need to be undertaken to make such a change. > > First, we would need a clear set of examples where LICM was truly unprofitable. These examples would need to be publicly accessible. They would also need to be fairly minimal. In particular, there must not be other obvious optimizations which if implemented makes the hoisting profitable after all. > > Second, we would need a proposal to llvm-dev which directly engages with the fact that SCEV (and thus most of our loop passes) depends on having loads hoisted for analysis quality. We could build a mechanism in SCEV to model possible load hoisting. There are some tricky bits in doing so, but it should be possible in theory. > > The main problem with modeling possible hoists in SCEV is the need to query both memory analysis and fault legality efficiently. Figuring out how to make that available for all users of SCEV without introducing nasty invalidation bugs or degenerate compile times is a hard design problem, but might be feasible. (I think that MemorySSA gives an interesting building block here, but have not deeply considered this.) > > There's also an API design problem in making sure that analysis results can't be consumed without committing to the hoisting in the IR. A transform which e.g. assumes a trip count without hoisting the relevant load would be subtly incorrect. > > Third, a consensus must be built that the resulting additional complexity for the mechanism build to address the previous point is worthwhile for the project as a whole. This will be a judgement call and would depend heavily on the solution chosen for the previous point. > > Finally, to be explicit, I am using the SCEV use case by way of an example only. There may be other ways that we depend on hoisting as a canonical form that I did not happen to think of when writing this email. The burden is on the person or person proposing a change to identify any other dependencies, and to convince the community they have done so. Discussion of a testing strategy to find those dependencies should be a first class concern in any proposal. > > Philip > > > > > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Hi Philip, Like others, I like where this is going. Like all, I'm worried how much weird stuff will come out of the box. :) On Fri, 3 Dec 2021 at 23:27, Philip Reames via llvm-dev < llvm-dev at lists.llvm.org> wrote:> Finally, to be explicit, I am using the SCEV use case by way of an example > only. There may be other ways that we depend on hoisting as a canonical > form that I did not happen to think of when writing this email. The burden > is on the person or person proposing a change to identify any other > dependencies, and to convince the community they have done so. Discussion > of a testing strategy to find those dependencies should be a first class > concern in any proposal. >There are a number of checks in loop optimisations that look beyond SCEV, and all of them will have to learn the new "canonical" forms. What I mean is that wherever the ranges, loop dependencies, target legalities will look at could have more than one shape (basically which BB to look for it, how complex the Phi graph is, etc). It should not have to look at *all* BBs in a given function nor build a complete graph of all changes to all variables across all BBs accessible from the loop context. So, perhaps we could propose at least two "canonical" forms (in quotes, because it's more than one), but you get the idea. Another issue is a potential cascading inability to perform transformations. If there are passes between LICM and other loop optimisations that happen to (perhaps implicitly) expect hoisted loads and thus fail to transform in case the profiler says LICM shouldn't happen, subsequent passes will have more and more "canonical" patterns to look for, or more likely, will fail to see anything. The potential change to everything that happens after LICM is worrying. We already have that problem today, as probably all other compilers do, but we have reached this local minimum by trial and error, and finding a better minimum elsewhere will be an iterative process and likely pass through maximums and create noise, especially in edge cases. So the real question to me is: how are we going to guide this through the solution space with minimal disruption while still making progress? cheers, --renato -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20211206/f0586ad0/attachment.html>
Thanks writing this up and the details on why the design is set up this way,
I've got a better picture of this now than from the comments on the diffs.
I'll look into our exact case more to see if this could be un-done in
LoopSink using profile information. If not, I'll extract out the example so
we can openly discuss what a solution would look like moving forward.
Modi
On 12/3/21, 3:27 PM, "llvm-dev on behalf of Philip Reames via
llvm-dev" <llvm-dev-bounces at lists.llvm.org on behalf of llvm-dev at
lists.llvm.org> wrote:
Later today, I'm going to be reverting D87551. I first raised serious
concern on said review back in Oct, but this is a bit of an unusual case because
the change landed roughly a year before that.
This patch introduced a profile driven heuristic to selectively disable hoisting
of instructions out of loops. By doing so, it changes a long standing design
element without broad consensus following discussion on llvm-dev. However, this
email isn't really about the revert per se.
In the course of the discussion leading to this point, I realized we didn't
really have a cite-able resource describing the historical design. This email
is an attempt to provide that, and to highlight some of the issues which need
addressed if we do decide we want to change it.
LICM as canonical form
We have for many years treated hoisting instructions out of loops as a canonical
form. That is, hoisting is not done because it is profitable (though it often
is), but is instead done so that other parts of the optimizer can rely on it.
We assume that an unprofitable hoist will be undone. Historically, we have
generally assumed this to be done in the backend, but more recently, LoopSink
has also been added towards the end of the IR pipeline with the same goal.
Why does this matter?
Other transforms depend on us having hoisted instructions out of loops for
effectiveness. The largest source of such assumptions is that SCEV is unable to
compute trip counts for any exit condition involving a loop varying load.
Almost all of our loop transformations depend on SCEVs trip count logic, so
failing to hoist an otherwise hoistable load is a severe pessimization.
For illustration purposes, consider this toy example:
for (int i = 0; ; i++) {
sum += a[i] + *b;
length = a.length;
if (i >= length) break;
}
This example involves a typical for-loop for which the exit test depends on a
loop varying load.
Here's a couple examples:
1. In unrolling, the form above is not unrollable. The trip count is
unknowable. We might be able to use profile information to do a bounded full
unroll if this loop is short running, but all other forms of unrolling (exact
full, partial, and runtime) will be impossible.
2. In the vectorizer, we will be unable to establish a trip count, and thus will
not vectorize. Additionally, even if we can compute a trip count, the cost
model handling for uniformity depends on hoisting.
Other impacts worth noting
1. In loop idiom recognize, we will fail to recognize most counted idioms (e.g.
popcount, cttz). Additionally, things like memset recognition will not happen
if the value being stored was hoistable, but not hoisted.
2. Our ability to analyze dominating conditions (e.g. cvp, valuetracking,
SCEV's isKnownPredicateAt) will all be crippled by the inability to
recognize values are loop invariant. When the RHS of a comparison is a
potentially different value every time it runs, it really limits our ability to
derive useful knowledge from that comparison or cross correlate comparisons.
But what about an unprofitable hoist?
There are examples where hoisting is not profitable. Here's one such
example:
for (int i = 0; i < N; i++) {
if (dynamically_always_taken) continue;
sum += a[i];
length = a.length;
if (i >= length) break;
}
Our general posture has been that we will perform hoisting in the middle end,
and then undo that hoisting if needed later in the pass pipeline. The basic
reason for that is that it is nearly impossible to distinguish profitable from
unprofitable cases because the profitability of the transform depends too
heavily on which following transforms might run.
Here's a small example which might at first seem unprofitable - inspired by
the patch being reverted - but where hoisting is in fact the far more profitable
outcome.
for (int i = 0; i < N; i++) {
i8* addr = a;
if (invariant_cond_usually_false) {
// very, very rare block e.g. 1 in 100 million
addr = a + 1;
}
*addr = 0
}
Subtly, this example should be profitable to hoist even if the rare condition is
not invariant. We still know this loop writes to at most two memory locations.
While we might not exploit that fact today, an extended form of load-store
promotion could do so. If we don't hoist the addressing expression under
the rare branch, we can not (in general) determine that at most two locations
are written.
I will note that LoopSink appears to be a bit restricted in practice. Someone
with unprofitable examples could reasonably push this much further.
How would we change this?
I want to be very explicit about saying this design is only one reasonable
design. It would also be entirely reasonable to build a design around a profit
driven LICM. That's simply not what we have today. The remainder of this
section is about expanding on the work which would need to be undertaken to make
such a change.
First, we would need a clear set of examples where LICM was truly unprofitable.
These examples would need to be publicly accessible. They would also need to be
fairly minimal. In particular, there must not be other obvious optimizations
which if implemented makes the hoisting profitable after all.
Second, we would need a proposal to llvm-dev which directly engages with the
fact that SCEV (and thus most of our loop passes) depends on having loads
hoisted for analysis quality. We could build a mechanism in SCEV to model
possible load hoisting. There are some tricky bits in doing so, but it should
be possible in theory.
The main problem with modeling possible hoists in SCEV is the need to query both
memory analysis and fault legality efficiently. Figuring out how to make that
available for all users of SCEV without introducing nasty invalidation bugs or
degenerate compile times is a hard design problem, but might be feasible. (I
think that MemorySSA gives an interesting building block here, but have not
deeply considered this.)
There's also an API design problem in making sure that analysis results
can't be consumed without committing to the hoisting in the IR. A transform
which e.g. assumes a trip count without hoisting the relevant load would be
subtly incorrect.
Third, a consensus must be built that the resulting additional complexity for
the mechanism build to address the previous point is worthwhile for the project
as a whole. This will be a judgement call and would depend heavily on the
solution chosen for the previous point.
Finally, to be explicit, I am using the SCEV use case by way of an example
only. There may be other ways that we depend on hoisting as a canonical form
that I did not happen to think of when writing this email. The burden is on the
person or person proposing a change to identify any other dependencies, and to
convince the community they have done so. Discussion of a testing strategy to
find those dependencies should be a first class concern in any proposal.
Philip