thr3ads.net - llvm dev - [LLVMdev] Question regarding basic-block placement optimization [Oct 2011]

If this information is useful, please help other people find it:
Share via:

Chandler Carruth

2011-Oct-21 06:53 UTC

[LLVMdev] Question regarding basic-block placement optimization

On Thu, Oct 20, 2011 at 10:53 AM, Jakob Stoklund Olesen <stoklund at
2pi.dk>wrote:
>
> On Oct 20, 2011, at 9:56 AM, Chandler Carruth wrote:
>
> > A new patch is attached that is *much* less of a rough draft. Sorry
for
> any confusion due to the early state of the patch.
>
> Thanks, Chandler. This is great stuff.
>
> > Still, I never intended this to be on-by-default at first. This is a
> starting point, that I hope can be improved into something that is
> on-by-default eventually, but I'll be the first to say it's almost
certainly
> not ready for that just yet.
>
> I would like to see this go in ASAP under a flag. I prefer to see
> development as commits rather than a series of updated patches.
>
Awesome, same way I feel. Checked in as r142641. I've got some generic
cleanup I'll go ahead and commit to it over the next few days.

Also, let me know whether you'd prefer post-commit or pre-commit review for
various feature additions we've talked about here. I'm fine either way,
so
its really what makes things easier on your end. I'll definitely want review
on all of it as some of this (especially the loop structure issues) is still
a bit new to me. =D

At least some of it (the loop alignment from CodePlacementOpt) is obvious
how to port across, but I'm sure there will be more tricky elements around
the loop structures.

Could you rename it to MachineBlockPlacement.cpp first, though? That
makes> it clear it's a CodeGen pass, and the BlockPlacement2 name is icky.
>
Yea, the 2 thing was only intended to work around a brief duplication with
the existing pass. MachineBlockPlacement solves both problems nicely.

> > I'm more hopeful that we can delete the existing block placement
pass,
> and direct anyone interested in profile file guided stuff to write a simple
> pass to load profiles into metadata. I suspect this pass is already
superior
> to that one.
>
> I also see it as a replacement for CodePlacementOpt.cpp, so I think your
> flag should run your pass instead of CodePlacementOpt rather than before or
> after it.
>
> You should simply clone the loop alignment stuff, it's pretty trivial.

Done, and in progress. SHould have the loop alignment stuff cloned right
away, and then i'll start looking at the loop structure issues. I'll
probably have some questions for you and/or andy about exactly how that
should work, whether CodePlacementOpt is doing the right thing, etc.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20111020/8708a06e/attachment.html>

Andrew Trick

2011-Oct-22 01:17 UTC

head link

[LLVMdev] Question regarding basic-block placement optimization

On Oct 20, 2011, at 11:53 PM, Chandler Carruth wrote:> On Thu, Oct 20, 2011 at 10:53 AM, Jakob Stoklund Olesen <stoklund at
2pi.dk> wrote:
> On Oct 20, 2011, at 9:56 AM, Chandler Carruth wrote:
>
> > A new patch is attached that is *much* less of a rough draft. Sorry
for any confusion due to the early state of the patch.
>
> Thanks, Chandler. This is great stuff.
>
> > Still, I never intended this to be on-by-default at first. This is a
starting point, that I hope can be improved into something that is on-by-default
eventually, but I'll be the first to say it's almost certainly not ready
for that just yet.
>
> I would like to see this go in ASAP under a flag. I prefer to see
development as commits rather than a series of updated patches.
...> Awesome, same way I feel. Checked in as r142641. I've got some generic
cleanup I'll go ahead and commit to it over the next few days.
> Done, and in progress. SHould have the loop alignment stuff cloned right
away, and then i'll start looking at the loop structure issues. I'll
probably have some questions for you and/or andy about exactly how that should
work, whether CodePlacementOpt is doing the right thing, etc.
Your work looks good and there's talk of making this the default
layout pass. So let's figure out the goals and requirements. We agree
that preserving "structure" of the CFG is important in the absence of
profile information to the contrary. This is important for performance
stability in the presence of imperfect profile data and unbiased
branches. But it's also important for the sanity of fellow compiler
engineers and anyone debugging the disassembly. We can define
structure very simply by saying that we want contiguous loops and
topologically ordered regions within a loop. Those are not the
strongest constraints. We could still end up with strangely shuffled
code, but they are easy constraints to define.

Once you decide to break these constraints, you have effectively
designated the offending paths as "cold". It's up to you how to
decide which paths are cold. You'll need some threshold based on
confidence in the profile, and we can measure the effectiveness of
your heuristics and add more over time. The key point is that the
layout algorithm needs to make an explicit binary decision at some
point to violate the natural structure of the CFG.

I don't care much about how you layout the cold chains as long as they
are not interleaved with the rest of the code, thus breaking the
topological ordering and forcing extra branches. In practice, this
means they get their own little Siberia after the function return,
effectively splitting the function into two sections. However, the
notion of "coldness" is really relative to the current loop head. So
you may layout a nice contiguous, topologically ordered inner loop,
then determine the whole thing is cold relative to its outer loop, so
send the whole thing packing to Siberia. That's exactly what we
want. The loop itself is sane, but doesn't get in the way of hotter
enclosing code.

Even within the constraints imposed by CFG structure, you can reduce
the likelihood of taken branches. Here you can employ any
profile-based or static heuristics with lower confidence, because the
resulting layout will be reasonably good either way. This will still
do a great job in the absence of accurate profile information, builtin
expect, or very strong static heuristics (e.g. landing pads are cold).

This is where the algorithm gets interesting and you can be creative,
but I'll give you a rough idea so we're on the same page:

1) Visit each loop independently inside-out.

2) Construct chains of blocks within the loop. Here the "entry" chain
is the loop header, or function entry. You can probably represent
each already laid out inner loop as a single chain. The graph of
chains within a loop is *acyclic* except for irreducible control
flow, which doesn't have any structure to preserve anyway. There's
no need to compute SCCs.

3) Stop constructing the chain at a CFG merge. Chaining past a merge
would immediately violate topological ordering, which is only ok if
other predecessors of the merge have been designated "cold".

4) Splice the chains together, preserving topological order. Assigning
each block to a DFS reverse-post-order number provides a nice
ordering that is easy to compare or sort.

Let me give you a little example to think about:

A:
br B(80%), C(20%)
B:
br D
C:
br D(90%), E(10%)
D:
br F
E:
br F
F:
ret

This is a fine topological sort but bad layout based on expected
branch probability (even with low confidence in the profile).

The only good layout that preserves topological ordering is:
A, B, C, E, D, F

.8 * 1 branch + .18 * 2 branches + .02 * 2 branches

This approach likely covers the existing functionality of
CodePlacement, which attempts to keep loops contiguous.
You will probably need to do the loop tail adjustment that
CodePlacement currently does as a special case.
(Shift the tail of the loop to precede its header, resulting in
a fall-through loop test.)

We can discuss the design more, but you may want to think about it and
come with your own design first.

Finally, we need some way to validate the design and verify the
implementation. Weighted statistics will be very useful here, similar
to those that Jakob used in the register allocator. For example:

For each nonadjacent edge:
TakenBranchFrequency += frequency(Edge)

This metric will prefer the most "extreme" layout because it assumes
perfect accuracy of block frequencies. It would be nice to look at the
same metric computed from skewed branch probabilities to see how
sensitive it is to small shifts in branch behavior.

Taken branch frequency is the most obvious, but doesn't completely
capture locality. You could also measure distance jumped fairly
easily, also weighted by edge frequency.

It might be useful to gather these stats as late as possible to take
into consideration the final impact of layout, including the effect on
any later passes.

-Andy
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20111021/302813af/attachment.html>

Chandler Carruth

2011-Oct-22 11:25 UTC

head link

[LLVMdev] Question regarding basic-block placement optimization

On Fri, Oct 21, 2011 at 6:17 PM, Andrew Trick <atrick at apple.com> wrote:
> On Oct 20, 2011, at 11:53 PM, Chandler Carruth wrote:
>
> On Thu, Oct 20, 2011 at 10:53 AM, Jakob Stoklund Olesen <stoklund at
2pi.dk>wrote:
>>
>> On Oct 20, 2011, at 9:56 AM, Chandler Carruth wrote:
>>
>> > A new patch is attached that is *much* less of a rough draft.
Sorry for
>> any confusion due to the early state of the patch.
>>
>> Thanks, Chandler. This is great stuff.
>>
>> > Still, I never intended this to be on-by-default at first. This is
a
>> starting point, that I hope can be improved into something that is
>> on-by-default eventually, but I'll be the first to say it's
almost certainly
>> not ready for that just yet.
>>
>> I would like to see this go in ASAP under a flag. I prefer to see
>> development as commits rather than a series of updated patches.
>
> ...
>
> Awesome, same way I feel. Checked in as r142641. I've got some generic
> cleanup I'll go ahead and commit to it over the next few days.
> Done, and in progress. SHould have the loop alignment stuff cloned right
> away, and then i'll start looking at the loop structure issues.
I'll
> probably have some questions for you and/or andy about exactly how that
> should work, whether CodePlacementOpt is doing the right thing, etc.
>
>
> Your work looks good and there's talk of making this the default
> layout pass. So let's figure out the goals and requirements. We agree
> that preserving "structure" of the CFG is important in the
absence of
> profile information to the contrary. This is important for performance
> stability in the presence of imperfect profile data and unbiased
> branches. But it's also important for the sanity of fellow compiler
> engineers and anyone debugging the disassembly. We can define
> structure very simply by saying that we want contiguous loops and
> topologically ordered regions within a loop. Those are not the
> strongest constraints. We could still end up with strangely shuffled
> code, but they are easy constraints to define.
>
> Once you decide to break these constraints, you have effectively
> designated the offending paths as "cold". It's up to you how
to
> decide which paths are cold. You'll need some threshold based on
> confidence in the profile, and we can measure the effectiveness of
> your heuristics and add more over time. The key point is that the
> layout algorithm needs to make an explicit binary decision at some
> point to violate the natural structure of the CFG.
>
> I don't care much about how you layout the cold chains as long as they
> are not interleaved with the rest of the code, thus breaking the
> topological ordering and forcing extra branches. In practice, this
> means they get their own little Siberia after the function return,
> effectively splitting the function into two sections. However, the
> notion of "coldness" is really relative to the current loop head.
So
> you may layout a nice contiguous, topologically ordered inner loop,
> then determine the whole thing is cold relative to its outer loop, so
> send the whole thing packing to Siberia. That's exactly what we
> want. The loop itself is sane, but doesn't get in the way of hotter
> enclosing code.
>
> Even within the constraints imposed by CFG structure, you can reduce
> the likelihood of taken branches. Here you can employ any
> profile-based or static heuristics with lower confidence, because the
> resulting layout will be reasonably good either way. This will still
> do a great job in the absence of accurate profile information, builtin
> expect, or very strong static heuristics (e.g. landing pads are cold).
>
> This is where the algorithm gets interesting and you can be creative,
> but I'll give you a rough idea so we're on the same page:
>
> 1) Visit each loop independently inside-out.
>
> 2) Construct chains of blocks within the loop. Here the "entry"
chain
>    is the loop header, or function entry. You can probably represent
>    each already laid out inner loop as a single chain. The graph of
>    chains within a loop is *acyclic* except for irreducible control
>    flow, which doesn't have any structure to preserve anyway.
There's
>    no need to compute SCCs.
>
> 3) Stop constructing the chain at a CFG merge. Chaining past a merge
>    would immediately violate topological ordering, which is only ok if
>    other predecessors of the merge have been designated "cold".
>
> 4) Splice the chains together, preserving topological order. Assigning
>    each block to a DFS reverse-post-order number provides a nice
>    ordering that is easy to compare or sort.
>
FYI, I'm still working on the details, and the code is a complete hack, but
I have a big chunk of this working now. =] The more precise strategy I
implemented follows your suggestion very closely. I've got lots more ideas
to layer on top of this, but this is my first stab at providing a better
underpinning.

First, it walks all loops in the function from the inside out just as you
describe. =] Really like this, it's a very nice model to use.

For each loop, after having walked all sub-loops, it walks forward across
all the blocks in the loop. For each block not part of a chain, it creates a
single-block chain. If the block has more than one predecessor (I think this
is what you were after w/ a "merge point"?), we're done. If the
block is the
start of the chain it belongs to, and if it is the most likely of the
in-loop successors of the previous block, and that previous block is the
tail of its chain, we merge its chain to the previous block's chain. This
should only form chains which are valid in the existing topological ordering
within a loop, and don't cross loop boundaries or cross incoming branches.

Then we traverse the BB graph in RPO, and for each block look up its chain.
The first time we encounter a chain, we splice it into the function.

I think this satisfies your desire to preserve the structural components of
the CFG? If not, I'd love to understand where I've deviated / what
I've
misunderstood.

It already makes some minimal use of the probabilities to select which
topology-preserving fall-through options to prefer. I'm also going to
tie-break with the existing layout. I've not started to really integrate it
with probabilities though, I wanted to get something that fit the constraint
model you're looking for first, and had clean traversal semantics.

I'll clean the patch up and mail it out tomorrow. The implementation gets
tremendously simpler by A) using the loop info, and B) exclusively walking
forward over the blocks. The first simplified the traversal greatly, and the
second made it easy to just store pointers to basic blocks for the chains
rather than moving blocks around in the function itself. That gets rid of a
bunch of other code, and the most bug-prone of the other code. So all in
all, I like where this is going.

> Let me give you a little example to think about:
>
> A:
>   br B(80%), C(20%)
> B:
>   br D
> C:
>   br D(90%), E(10%)
> D:
>   br F
> E:
>   br F
> F:
>   ret
>
> This is a fine topological sort but bad layout based on expected
> branch probability (even with low confidence in the profile).
>
> The only good layout that preserves topological ordering is:
> A, B, C, E, D, F
>
> .8 * 1 branch + .18 * 2 branches + .02 * 2 branches
>
> This approach likely covers the existing functionality of
> CodePlacement, which attempts to keep loops contiguous.
> You will probably need to do the loop tail adjustment that
> CodePlacement currently does as a special case.
> (Shift the tail of the loop to precede its header, resulting in
> a fall-through loop test.)
>
> We can discuss the design more, but you may want to think about it and
> come with your own design first.
>
> Finally, we need some way to validate the design and verify the
> implementation. Weighted statistics will be very useful here, similar
> to those that Jakob used in the register allocator. For example:
>
> For each nonadjacent edge:
>   TakenBranchFrequency += frequency(Edge)
>
> This metric will prefer the most "extreme" layout because it
assumes
> perfect accuracy of block frequencies. It would be nice to look at the
> same metric computed from skewed branch probabilities to see how
> sensitive it is to small shifts in branch behavior.
>
> Taken branch frequency is the most obvious, but doesn't completely
> capture locality. You could also measure distance jumped fairly
> easily, also weighted by edge frequency.
>
> It might be useful to gather these stats as late as possible to take
> into consideration the final impact of layout, including the effect on
> any later passes.
>
> -Andy
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20111022/6a6a1fb5/attachment.html>

Chandler Carruth

2011-Oct-23 08:11 UTC

head link

[LLVMdev] Question regarding basic-block placement optimization

Ok, I think I have a working pass that is based much more on what we've
talked about here. The patch is attached. I'd love to commit it soon-ish and
then start tweaking it based on feedback from you, others, and looking at
how it actually works in the wild.

It's essentially a re-write, so it may be hard to read... Let me know if
some other form would be easier.

Some responses to your comments below, perhaps giving insight into the
choices I made or what the pass is currently doing and what I haven't
addressed yet.

On Fri, Oct 21, 2011 at 6:17 PM, Andrew Trick <atrick at apple.com> wrote:
> Once you decide to break these constraints, you have effectively
> designated the offending paths as "cold". It's up to you how
to
> decide which paths are cold. You'll need some threshold based on
> confidence in the profile, and we can measure the effectiveness of
> your heuristics and add more over time.
>
FYI, I ended up with a heuristic that requires a hot edge to be 4x more
likely than any other competing in the CFG. This is very loosely based on
the definition of a 'hot' edge in BranchProbabilityInfo (80% vs. 20%).

I don't care much about how you layout the cold chains as long as
they> are not interleaved with the rest of the code, thus breaking the
> topological ordering and forcing extra branches. In practice, this
> means they get their own little Siberia after the function return,
> effectively splitting the function into two sections. However, the
> notion of "coldness" is really relative to the current loop head.
>
I've not really tried to ensure this happens correctly. Mostly, this is just
forming hot-paths through the code when they are *really* hot. I'll work on
better sectioning off the cold paths that are cut off in a followup patch.

2) Construct chains of blocks within the loop. Here the "entry"
chain>    is the loop header, or function entry. You can probably represent
>    each already laid out inner loop as a single chain. The graph of
>    chains within a loop is *acyclic* except for irreducible control
>    flow, which doesn't have any structure to preserve anyway.
There's
>    no need to compute SCCs.
>
I'm actually not satisfied with how I've done this. I think there is
likely
a cleaner / better solution than the one I have in the patch, but it largely
works, and is easy to change later. The patch just walks forward over the
function, and tries to chain blocks together with their successors when the
CFG permits, choosing successors based on whatever (even weak) signal the
probability info gives us.

4) Splice the chains together, preserving topological order.
Assigning>    each block to a DFS reverse-post-order number provides a nice
>    ordering that is easy to compare or sort.
>
Currently I'm doing this once at the very end. I wonder if it would be
better to do iteratively as we walk up through the loops?

Let me give you a little example to think about:>
> A:
>   br B(80%), C(20%)
> B:
>   br D
> C:
>   br D(90%), E(10%)
> D:
>   br F
> E:
>   br F
> F:
>   ret
>
> This is a fine topological sort but bad layout based on expected
> branch probability (even with low confidence in the profile).
>
> The only good layout that preserves topological ordering is:
> A, B, C, E, D, F
>
I made a test case out of this, and I think it works. It's a bit confusing
at it just replaces branches to F with ret. That removes the edge from D to
F, and makes the following ordering also viable: A, B, C, D, E. That's the
ordering it chooses, and it also duplicates D up into B instead of branching
from B to D. :: shrug ::. It looks pretty good to me though. =]

> Finally, we need some way to validate the design and verify the
> implementation. Weighted statistics will be very useful here, similar
> to those that Jakob used in the register allocator. For example:
>
> For each nonadjacent edge:
>   TakenBranchFrequency += frequency(Edge)
>
> This metric will prefer the most "extreme" layout because it
assumes
> perfect accuracy of block frequencies. It would be nice to look at the
> same metric computed from skewed branch probabilities to see how
> sensitive it is to small shifts in branch behavior.
>
> Taken branch frequency is the most obvious, but doesn't completely
> capture locality. You could also measure distance jumped fairly
> easily, also weighted by edge frequency.
>
> It might be useful to gather these stats as late as possible to take
> into consideration the final impact of layout, including the effect on
> any later passes.
>
Cool, I'm working on this next. I've also uncovered some *completely*
broken
aspects of branch probability. Going back and adding tests and fixes for it.
=]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20111023/8781c895/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mbp-redux.patch
Type: application/octet-stream
Size: 33590 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20111023/8781c895/attachment.obj>

Maybe Matching Threads

Search for more reasonably related threads

llvm dev - Oct 2011 - [LLVMdev] Question regarding basic-block placement optimization

[LLVMdev] Question regarding basic-block placement optimization

[LLVMdev] Question regarding basic-block placement optimization

[LLVMdev] Question regarding basic-block placement optimization

[LLVMdev] Question regarding basic-block placement optimization

Maybe Matching Threads