thr3ads.net - llvm dev - [llvm-dev] RFC Storing BB order in llvm::Instruction for faster local dominance [Sep 2018]

Reid Kleckner via llvm-dev

2018-Sep-26 18:55 UTC

[llvm-dev] RFC Storing BB order in llvm::Instruction for faster local dominance

On Tue, Sep 25, 2018 at 10:45 PM Chris Lattner <clattner at nondot.org>
wrote:
> So this is one of the reasons I find your patch to be problematic: it
> isn’t fixing N^2 behavior, it is merely changing one N^2 situation for
> another.  AFAICT there are one of two possible cases in these sorts of
> transformations:
>
> 1) These transformations are iteratively removing or inserting
> instructions, which invalidate the ordering with your approach, causing
> subsequent queries to rebuild the equivalent of OrderedBasicBlock.
>
I would say that this code fixes the quadratic behavior in practice, and
lays the foundation to fix it permanently if it becomes a problem.

First, removing instructions doesn't invalidate the ordering, and these
passes mostly delete instructions, so in practice, the latent quadratic
behavior is very hard to exercise.
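
A minimal sketch of the invalidation behavior described above, assuming a cached per-instruction order number and a per-block "valid" flag. `Inst`, `Block`, and `comesBefore` here are hypothetical stand-ins for illustration, not LLVM's actual classes or the code in the patch:

```cpp
#include <cassert>
#include <list>

// Hypothetical stand-in for llvm::Instruction (illustration only).
struct Inst {
  int Id;
  unsigned Order; // cached position within the block
  explicit Inst(int TheId) : Id(TheId), Order(0) {}
};

// Hypothetical stand-in for llvm::BasicBlock (illustration only).
class Block {
  std::list<Inst> Insts;
  bool OrderValid = false;
  unsigned Renumberings = 0;

  // O(n) walk that assigns consecutive numbers and marks the cache valid.
  void renumber() {
    unsigned N = 0;
    for (Inst &I : Insts)
      I.Order = N++;
    OrderValid = true;
    ++Renumberings;
  }

public:
  using iterator = std::list<Inst>::iterator;

  iterator end() { return Insts.end(); }

  // Insertion invalidates the cached numbering.
  iterator insert(iterator Pos, int Id) {
    OrderValid = false;
    return Insts.insert(Pos, Inst(Id));
  }

  // Removal keeps the relative order of the survivors, so the cached
  // numbers remain usable even though they now contain a gap.
  iterator erase(iterator Pos) { return Insts.erase(Pos); }

  // Local dominance query: O(1) if the cache is valid, O(n) otherwise.
  bool comesBefore(iterator A, iterator B) {
    if (!OrderValid)
      renumber();
    return A->Order < B->Order;
  }

  unsigned renumberings() const { return Renumberings; }
};
```

In this sketch a pass that only deletes instructions pays for one renumbering in total, while an "insert;query;insert;query" loop pays for one renumbering per query.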

In the future, if profiling shows that "insert;query;insert;query" is a
bottleneck, we can fix it by leaving gaps in the numbering space and
assigning new numbers halfway between adjacent instructions, renumbering as
required to maintain amortized O(1) insertion complexity. I think it's
unlikely that we will ever need to implement this fancy renumbering
algorithm, and people on the bug agree (https://llvm.org/pr38829#c2).
Existing code that uses OrderedBasicBlock already doesn't implement this
algorithm. According to existing code, invalidating on modification is good
enough.
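
A rough sketch of the gap-numbering idea, under stated assumptions: the spacing constant `Gap`, the `SparseList` class, and its members are invented for illustration. When a gap is exhausted, this version simply renumbers the whole list, which is O(n) and does not by itself reach the amortized O(1) bound mentioned above; a real implementation would need a more careful local relabeling scheme.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <list>

// Hypothetical spacing left between neighbours after a full renumbering.
constexpr std::uint64_t Gap = 1u << 16;

struct Node {
  int Id;
  std::uint64_t Order;
};

class SparseList {
  std::list<Node> Nodes;
  unsigned FullRenumberings = 0;

  // Reassign evenly spaced numbers to every node.
  void renumber() {
    std::uint64_t N = Gap;
    for (Node &X : Nodes) {
      X.Order = N;
      N += Gap;
    }
    ++FullRenumberings;
  }

public:
  using iterator = std::list<Node>::iterator;

  iterator end() { return Nodes.end(); }

  // Insert before Pos, numbering the new node halfway between its
  // neighbours. If the neighbours leave no free number, renumber.
  iterator insert(iterator Pos, int Id) {
    std::uint64_t Lo = (Pos == Nodes.begin()) ? 0 : std::prev(Pos)->Order;
    std::uint64_t Hi = (Pos == Nodes.end()) ? Lo + 2 * Gap : Pos->Order;
    iterator It = Nodes.insert(Pos, Node{Id, Lo + (Hi - Lo) / 2});
    if (Hi - Lo < 2)
      renumber();
    return It;
  }

  // Queries never renumber: the numbers are always consistent.
  bool comesBefore(iterator A, iterator B) const {
    return A->Order < B->Order;
  }

  unsigned fullRenumberings() const { return FullRenumberings; }
};
```

With this spacing, repeated insertion at the same position halves the remaining gap each time, so the fallback renumbering is triggered roughly once every 16 such insertions.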

> 2) These transformations are not doing anything, in which case this is all
> wasted work.
>
> As a next step, can you please instrument the calls from DSE
> and/or memcpyopt and see how many of the ones for the huge basic blocks
> actually turn into a transformation in the code?  If there are zero, then
> we should just disable these optimizations for large blocks.  If there are
> a few improvements, then we should see how to characterize them and what
> the right solution is based on the sorts of information they need.
>
> LLVM is occasionally maligned for the magic numbers like “6” in various
> optimizations that bound the work in various analyses (e.g. when computing
> masked bits) but they exist for very real reasons: the cases in which a
> large search will actually lead to benefits are marginal to non-existent,
> and the possibility for unbounded queries for core APIs cause all kinds of
> problems.
>
> I see the dependence analysis queries as the same sort of situation: until
> something like MemorySSA is able to define away these dense queries, we
> should be using bounded searches (in this case larger than 6 but less than
> 11,000 :-).
>
> It would be straight-forward to change llvm::BasicBlock to keep track of
> the number of instructions in it (doing so would not change any ilist
> algorithmic behavior that I’m aware of given the need to keep instruction
> parent pointers updated), and having that info would make it easy to cap
> linear passes like this or switch into local search modes.
>
> The cost of such a thing is the risk of performance regressions.  We
> control that risk by doing some analysis of the payoff of these
> transformations on large blocks - if the payoff doesn’t exist then it is a
> simple answer.  The real question here is not whether DSE is able to
> eliminate a single store, it is whether eliminating that store actually
> leads to a measurable performance improvement in the generated code.  Given
> huge blocks, I suspect the answer is “definitely not”.
>
>> As suggested in the bug, if we were to rewrite these passes to use
>> MemorySSA, this bottleneck would go away. I rebased a patch to do that for
>> DSE, but finishing it off and enabling it by default is probably out of
>> scope for me.
>
>
> Yes, understood, that would be a major change.  That said, changing the
> entire architecture of the compiler to work around this isn’t really
> appealing to me, I’d rather limit the problematic optimizations on the
> crazy large cases.
>
You're probably right, these passes probably aren't actually firing. But,
it sounds like we have two options:
1. Add cutoffs to poorly understood slow passes that are misbehaving
2. Make a core data structure change to make dominance calculations faster
and simpler for all transforms

I can do this analysis, and add these cutoffs, but I wouldn't feel very
good about it. It adds code complexity, we'll have to test it, and tomorrow
someone will add new quadratic dominance queries. I don't see how
heuristically limiting misbehaving optimizations builds a better foundation
for LLVM tomorrow.

Dominance has been and will always be an important analysis in LLVM. This
patch is basically pushing an analysis cache down into IR. Would it be
accurate to say that your objection to this approach has more to do with
analysis data living in IR data structures?

If I added more infrastructure to invert the dependency, store these
instruction numbers in DominatorTree, and make BasicBlock notify
DominatorTree of instruction insertion and deletion, would that be
preferable? That's very similar to the code we're writing today, where the
transform takes responsibility for marrying its modifications to its
invalidations. The question is, do we want to keep this stuff up to date
automatically, like we do for use lists, or is this something we want to
maintain on the side? I think dominance is something we might want to start
pushing down into IR.

Chris Lattner via llvm-dev

2018-Sep-27 05:24 UTC

> On Sep 26, 2018, at 11:55 AM, Reid Kleckner <rnk at google.com> wrote:
>
>> As suggested in the bug, if we were to rewrite these passes to use
>> MemorySSA, this bottleneck would go away. I rebased a patch to do that for
>> DSE, but finishing it off and enabling it by default is probably out of
>> scope for me.
>
> Yes, understood, that would be a major change.  That said, changing the
> entire architecture of the compiler to work around this isn’t really
> appealing to me, I’d rather limit the problematic optimizations on the
> crazy large cases.
>
> You're probably right, these passes probably aren't actually firing. But,
> it sounds like we have two options:
> 1. Add cutoffs to poorly understood slow passes that are misbehaving
> 2. Make a core data structure change to make dominance calculations faster
> and simpler for all transforms
>
> I can do this analysis, and add these cutoffs, but I wouldn't feel very
> good about it. It adds code complexity, we'll have to test it, and tomorrow
> someone will add new quadratic dominance queries. I don't see how
> heuristically limiting misbehaving optimizations builds a better foundation
> for LLVM tomorrow.

My answer to this is that the long term path is to move these passes to
MemorySSA, which doesn’t have these problems.  At which point, the core IR
change you are proposing becomes really the wrong thing.

> Dominance has been and will always be an important analysis in LLVM. This
> patch is basically pushing an analysis cache down into IR. Would it be
> accurate to say that your objection to this approach has more to do with
> analysis data living in IR data structures?
>
> If I added more infrastructure to invert the dependency, store these
> instruction numbers in DominatorTree, and make BasicBlock notify
> DominatorTree of instruction insertion and deletion, would that be
> preferable? That's very similar to the code we're writing today, where the
> transform takes responsibility for marrying its modifications to its
> invalidations. The question is, do we want to keep this stuff up to date
> automatically, like we do for use lists, or is this something we want to
> maintain on the side? I think dominance is something we might want to start
> pushing down into IR.

I have several objections to caches like this, and we have been through many
failed attempts at making analyses “autoadapt” to loosely coupled
transformations (e.g. the CallbackVH stuff, sigh).  My objections are things
like:

1) Caching and invalidation are very difficult to get right.
2) Invalidation logic slows down everyone even if they aren’t using the cache
(your “check to see if I need to invalidate” when permuting the ilist).
3) We try very hard not to put analysis/xform specific gunk into core IR types,
because it punishes everyone but benefits only a few passes.
4) LLVM is used for a LOT of widely varying use cases - clang is just one
client, and many of them don’t run these passes.  This change pessimizes all
those clients, particularly the most sensitive 32-bit systems.
5) If done wrong, these caches can break invariants that LLVM optimization
passes are supposed to maintain.  For example, if you do “sparse numbering” then
the actual numbering will depend on the series of transformations that happen,
and if someone accidentally uses the numbers in the wrong way, you could make
“opt -pass1 -pass2” behave differently than “opt -pass1 | opt -pass2”.
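
To make the concern in point 5 concrete, here is a small sketch, assuming a midpoint-style sparse numbering with a stride of 100. Both helper functions and the stride are hypothetical, chosen only to show that the raw numbers depend on history while the relative order does not:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical: the numbering a fresh process would assign after reading
// the IR back in, with a fixed stride between neighbours.
std::vector<std::uint64_t> freshNumbers(std::size_t Count,
                                        std::uint64_t Stride) {
  std::vector<std::uint64_t> Out;
  for (std::size_t I = 0; I < Count; ++I)
    Out.push_back((I + 1) * Stride);
  return Out;
}

// Hypothetical: insert a new instruction before position Index (which must
// be a valid index), numbering it halfway between its neighbours.
void insertMidpoint(std::vector<std::uint64_t> &Nums, std::size_t Index) {
  std::uint64_t Lo = Index == 0 ? 0 : Nums[Index - 1];
  std::uint64_t Hi = Nums[Index];
  Nums.insert(Nums.begin() + static_cast<std::ptrdiff_t>(Index),
              Lo + (Hi - Lo) / 2);
}

// A misuse of the numbers: treating their difference as a distance.
std::uint64_t distance(const std::vector<std::uint64_t> &Nums,
                       std::size_t From, std::size_t To) {
  return Nums[To] - Nums[From];
}
```

Starting from two instructions numbered 100 and 200, inserting one between them in-process yields 100, 150, 200, while a second process that reloads the same three instructions would number them 100, 200, 300. Comparisons agree in both cases, but anything computed from the raw values, such as `distance` above, does not.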

Putting this stuff into DominatorTree could make sense, but then clients of
dominator tree would have to invalidate this when doing transformations.  If you
put the invalidation logic into ilist, then many of the problems above reoccur. 
I don’t think (but don’t know) that it is unreasonable to make all clients of DT
invalidate this manually.

-Chris


Finkel, Hal J. via llvm-dev

2018-Sep-27 17:36 UTC

On 09/27/2018 12:24 AM, Chris Lattner via llvm-dev wrote:
>> On Sep 26, 2018, at 11:55 AM, Reid Kleckner <rnk at google.com> wrote:
>>
>>> As suggested in the bug, if we were to rewrite these passes to use
>>> MemorySSA, this bottleneck would go away. I rebased a patch to do that
>>> for DSE, but finishing it off and enabling it by default is probably out
>>> of scope for me.
>>
>> Yes, understood, that would be a major change. That said, changing the
>> entire architecture of the compiler to work around this isn’t really
>> appealing to me, I’d rather limit the problematic optimizations on the
>> crazy large cases.
>>
>> You're probably right, these passes probably aren't actually firing.
>> But, it sounds like we have two options:
>> 1. Add cutoffs to poorly understood slow passes that are misbehaving
>> 2. Make a core data structure change to make dominance calculations
>> faster and simpler for all transforms
>>
>> I can do this analysis, and add these cutoffs, but I wouldn't feel very
>> good about it. It adds code complexity, we'll have to test it, and
>> tomorrow someone will add new quadratic dominance queries. I don't see
>> how heuristically limiting misbehaving optimizations builds a better
>> foundation for LLVM tomorrow.
>
> My answer to this is that the long term path is to move these passes to
> MemorySSA, which doesn’t have these problems. At which point, the core IR
> change you are proposing becomes really the wrong thing.

Maybe I'm missing something, but why do you say this? MemorySSA certainly
collects the memory references in a basic block into a set of use/def chains,
but determining dominance still requires walking those chains. This might help
reduce the constant factor on the O(N), because it skips the instructions
that do not access memory, but the underlying complexity problem remains.
Maybe MemorySSA should cache a local numbering, but...

MemorySSA only helps if you're fine with skipping non-aliasing accesses.
It's not clear to me that this is always the case. For example, imagine that
we're trying to do something like SLP vectorization, and so we follow the
use-def chain of an address calculation and find two adjacent, but non-aliasing,
loads in the same basic block. We want to convert them into a vector load, so we
need to place the new vector load at the position of the first (dominating)
load. MemorySSA will help determine legality of the transformation, but
MemorySSA won't help determine the dominance because it will have the
MemorySSA nodes for both loads tied, potentially, to the same MemorySSA
definition node (i.e., you can't walk from one to the other, by design,
which is why it helps with the legality determination).
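
As an illustration of the query in the vectorization example above, here is a sketch of what answering "which of these two loads comes first?" looks like without any cached numbering. `Inst`, `Block`, `comesBeforeByScan`, and `earlierOf` are hypothetical names for this example, not LLVM APIs:

```cpp
#include <cassert>
#include <list>

// Hypothetical stand-ins for an instruction and a basic block.
struct Inst {
  int Id;
};
using Block = std::list<Inst>;

// Returns true if A appears no later than B in BB. The walk starts at the
// top of the block, so the cost is linear in the block size.
bool comesBeforeByScan(const Block &BB, const Inst *A, const Inst *B) {
  for (const Inst &I : BB) {
    if (&I == A)
      return true;
    if (&I == B)
      return false;
  }
  assert(false && "neither instruction is in the block");
  return false;
}

// Picks the dominating (earlier) of two instructions in the same block,
// e.g. the position at which a merged vector load would be placed.
const Inst *earlierOf(const Block &BB, const Inst *A, const Inst *B) {
  return comesBeforeByScan(BB, A, B) ? A : B;
}
```

Each call may walk the whole block, so a pass that issues one such query per instruction in a block of N instructions does O(N^2) work, which is the behavior the thread is about.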

>> Dominance has been and will always be an important analysis in LLVM. This
>> patch is basically pushing an analysis cache down into IR. Would it be
>> accurate to say that your objection to this approach has more to do with
>> analysis data living in IR data structures?
>>
>> If I added more infrastructure to invert the dependency, store these
>> instruction numbers in DominatorTree, and make BasicBlock notify
>> DominatorTree of instruction insertion and deletion, would that be
>> preferable? That's very similar to the code we're writing today, where
>> the transform takes responsibility for marrying its modifications to its
>> invalidations. The question is, do we want to keep this stuff up to date
>> automatically, like we do for use lists, or is this something we want to
>> maintain on the side? I think dominance is something we might want to
>> start pushing down into IR.
>
> I have several objections to caches like this, and we have been through many
> failed attempts at making analyses “autoadapt” to loosely coupled
> transformations (e.g. the CallbackVH stuff, sigh). My objections are things
> like:
>
> 1) Caching and invalidation are very difficult to get right.
> 2) Invalidation logic slows down everyone even if they aren’t using the cache
> (your “check to see if I need to invalidate” when permuting the ilist).
> 3) We try very hard not to put analysis/xform specific gunk into core IR
> types, because it punishes everyone but benefits only a few passes.

Dominance is a fundamental construct to an SSA-form IR. I certainly agree with
you in general, but I'm not convinced by this reasoning in this case.

> 4) LLVM is used for a LOT of widely varying use cases - clang is just one
> client, and many of them don’t run these passes. This change pessimizes all
> those clients, particularly the most sensitive 32-bit systems.

I don't see why this has anything to do with Clang. As I understood it, the
self-host compile time of a large Clang source file was being used only as an
example. This change may very well help many clients.

Thanks again,
Hal

> 5) If done wrong, these caches can break invariants that LLVM optimization
> passes are supposed to maintain. For example, if you do “sparse numbering”
> then the actual numbering will depend on the series of transformations that
> happen, and if someone accidentally uses the numbers in the wrong way, you
> could make “opt -pass1 -pass2” behave differently than
> “opt -pass1 | opt -pass2”.
>
> Putting this stuff into DominatorTree could make sense, but then clients of
> dominator tree would have to invalidate this when doing transformations. If
> you put the invalidation logic into ilist, then many of the problems above
> reoccur. I don’t think (but don’t know) that it is unreasonable to make all
> clients of DT invalidate this manually.
>
> -Chris

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory