Chandler Carruth
2011-Oct-21 06:53 UTC
[LLVMdev] Question regarding basic-block placement optimization
On Thu, Oct 20, 2011 at 10:53 AM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:

> On Oct 20, 2011, at 9:56 AM, Chandler Carruth wrote:
>
>> A new patch is attached that is *much* less of a rough draft. Sorry for
>> any confusion due to the early state of the patch.
>
> Thanks, Chandler. This is great stuff.
>
>> Still, I never intended this to be on-by-default at first. This is a
>> starting point, that I hope can be improved into something that is
>> on-by-default eventually, but I'll be the first to say it's almost
>> certainly not ready for that just yet.
>
> I would like to see this go in ASAP under a flag. I prefer to see
> development as commits rather than a series of updated patches.

Awesome, same way I feel. Checked in as r142641. I've got some generic cleanup I'll go ahead and commit to it over the next few days.

Also, let me know whether you'd prefer post-commit or pre-commit review for the various feature additions we've talked about here. I'm fine either way, so it's really whatever makes things easier on your end. I'll definitely want review on all of it, as some of this (especially the loop structure issues) is still a bit new to me. =D At least some of it (the loop alignment from CodePlacementOpt) is obvious how to port across, but I'm sure there will be trickier elements around the loop structures.

> Could you rename it to MachineBlockPlacement.cpp first, though? That makes
> it clear it's a CodeGen pass, and the BlockPlacement2 name is icky.

Yea, the 2 thing was only intended to work around a brief duplication with the existing pass. MachineBlockPlacement solves both problems nicely.

> I'm more hopeful that we can delete the existing block placement pass,
> and direct anyone interested in profile file guided stuff to write a
> simple pass to load profiles into metadata. I suspect this pass is
> already superior to that one.
>
> I also see it as a replacement for CodePlacementOpt.cpp, so I think your
> flag should run your pass instead of CodePlacementOpt rather than before
> or after it.
>
> You should simply clone the loop alignment stuff, it's pretty trivial.

Done, and in progress. Should have the loop alignment stuff cloned right away, and then I'll start looking at the loop structure issues. I'll probably have some questions for you and/or Andy about exactly how that should work, whether CodePlacementOpt is doing the right thing, etc.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111020/8708a06e/attachment.html>
Andrew Trick
2011-Oct-22 01:17 UTC
[LLVMdev] Question regarding basic-block placement optimization
On Oct 20, 2011, at 11:53 PM, Chandler Carruth wrote:

> On Thu, Oct 20, 2011 at 10:53 AM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:
>> On Oct 20, 2011, at 9:56 AM, Chandler Carruth wrote:
>>
>>> A new patch is attached that is *much* less of a rough draft. Sorry
>>> for any confusion due to the early state of the patch.
>>
>> Thanks, Chandler. This is great stuff.
>>
>>> Still, I never intended this to be on-by-default at first. This is a
>>> starting point, that I hope can be improved into something that is
>>> on-by-default eventually, but I'll be the first to say it's almost
>>> certainly not ready for that just yet.
>>
>> I would like to see this go in ASAP under a flag. I prefer to see
>> development as commits rather than a series of updated patches.
> ...
> Awesome, same way I feel. Checked in as r142641. I've got some generic
> cleanup I'll go ahead and commit to it over the next few days.
>
> Done, and in progress. Should have the loop alignment stuff cloned
> right away, and then I'll start looking at the loop structure issues.
> I'll probably have some questions for you and/or Andy about exactly how
> that should work, whether CodePlacementOpt is doing the right thing, etc.

Your work looks good and there's talk of making this the default layout pass. So let's figure out the goals and requirements. We agree that preserving "structure" of the CFG is important in the absence of profile information to the contrary. This is important for performance stability in the presence of imperfect profile data and unbiased branches. But it's also important for the sanity of fellow compiler engineers and anyone debugging the disassembly.

We can define structure very simply by saying that we want contiguous loops and topologically ordered regions within a loop. Those are not the strongest constraints. We could still end up with strangely shuffled code, but they are easy constraints to define. Once you decide to break these constraints, you have effectively designated the offending paths as "cold".
It's up to you how to decide which paths are cold. You'll need some threshold based on confidence in the profile, and we can measure the effectiveness of your heuristics and add more over time. The key point is that the layout algorithm needs to make an explicit binary decision at some point to violate the natural structure of the CFG.

I don't care much about how you lay out the cold chains as long as they are not interleaved with the rest of the code, thus breaking the topological ordering and forcing extra branches. In practice, this means they get their own little Siberia after the function return, effectively splitting the function into two sections. However, the notion of "coldness" is really relative to the current loop head. So you may lay out a nice contiguous, topologically ordered inner loop, then determine the whole thing is cold relative to its outer loop, and send the whole thing packing to Siberia. That's exactly what we want. The loop itself is sane, but doesn't get in the way of hotter enclosing code.

Even within the constraints imposed by CFG structure, you can reduce the likelihood of taken branches. Here you can employ any profile-based or static heuristics with lower confidence, because the resulting layout will be reasonably good either way. This will still do a great job in the absence of accurate profile information, builtin expect, or very strong static heuristics (e.g. landing pads are cold).

This is where the algorithm gets interesting and you can be creative, but I'll give you a rough idea so we're on the same page:

1) Visit each loop independently, inside-out.

2) Construct chains of blocks within the loop. Here the "entry" chain
   is the loop header, or function entry. You can probably represent
   each already laid out inner loop as a single chain. The graph of
   chains within a loop is *acyclic* except for irreducible control
   flow, which doesn't have any structure to preserve anyway. There's
   no need to compute SCCs.
3) Stop constructing the chain at a CFG merge. Chaining past a merge
   would immediately violate topological ordering, which is only OK if
   the other predecessors of the merge have been designated "cold".

4) Splice the chains together, preserving topological order. Assigning
   each block a DFS reverse-post-order number provides a nice ordering
   that is easy to compare or sort.

Let me give you a little example to think about:

A:
  br B(80%), C(20%)
B:
  br D
C:
  br D(90%), E(10%)
D:
  br F
E:
  br F
F:
  ret

This is a fine topological sort but a bad layout based on expected branch probability (even with low confidence in the profile). The only good layout that preserves topological ordering is:

A, B, C, E, D, F

.8 * 1 branch + .18 * 2 branches + .02 * 2 branches = 1.2 expected taken branches

This approach likely covers the existing functionality of CodePlacement, which attempts to keep loops contiguous. You will probably need to do the loop tail adjustment that CodePlacement currently does as a special case (shift the tail of the loop to precede its header, resulting in a fall-through loop test). We can discuss the design more, but you may want to think about it and come up with your own design first.

Finally, we need some way to validate the design and verify the implementation. Weighted statistics will be very useful here, similar to those that Jakob used in the register allocator. For example:

For each nonadjacent edge:
  TakenBranchFrequency += frequency(Edge)

This metric will prefer the most "extreme" layout because it assumes perfect accuracy of block frequencies. It would be nice to look at the same metric computed from skewed branch probabilities to see how sensitive it is to small shifts in branch behavior.

Taken branch frequency is the most obvious metric, but it doesn't completely capture locality. You could also measure distance jumped fairly easily, also weighted by edge frequency.
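Andy's metric and example can be checked with a small script. This is a sketch in plain Python, not LLVM code: the CFG encoding and function names are invented for illustration, but the edge probabilities come straight from the example above.

```python
# Compute Andy's TakenBranchFrequency metric for a candidate block layout:
# every CFG edge that is not a fall-through to the next block in the layout
# contributes its edge frequency (block frequency * branch probability).

# successor -> probability, per block, for the example CFG above
cfg = {
    "A": {"B": 0.8, "C": 0.2},
    "B": {"D": 1.0},
    "C": {"D": 0.9, "E": 0.1},
    "D": {"F": 1.0},
    "E": {"F": 1.0},
    "F": {},
}

def topo_order(cfg):
    """Kahn's algorithm; the example CFG is acyclic."""
    indeg = {b: 0 for b in cfg}
    for b in cfg:
        for s in cfg[b]:
            indeg[s] += 1
    order, work = [], [b for b in cfg if indeg[b] == 0]
    while work:
        b = work.pop()
        order.append(b)
        for s in cfg[b]:
            indeg[s] -= 1
            if indeg[s] == 0:
                work.append(s)
    return order

def block_freqs(cfg, entry="A"):
    """Propagate an entry frequency of 1.0 through the acyclic CFG."""
    freq = {b: 0.0 for b in cfg}
    freq[entry] = 1.0
    for b in topo_order(cfg):
        for succ, p in cfg[b].items():
            freq[succ] += freq[b] * p
    return freq

def taken_branch_frequency(cfg, layout):
    """Sum of edge frequencies over every non-fall-through (taken) edge."""
    freq = block_freqs(cfg)
    nxt = {layout[i]: layout[i + 1] for i in range(len(layout) - 1)}
    total = 0.0
    for b in cfg:
        for s, p in cfg[b].items():
            if nxt.get(b) != s:  # nonadjacent edge => taken branch
                total += freq[b] * p
    return total

print(round(taken_branch_frequency(cfg, ["A", "B", "C", "E", "D", "F"]), 6))  # 1.2
print(round(taken_branch_frequency(cfg, ["A", "B", "C", "D", "E", "F"]), 6))  # 2.0
```

The A, B, C, E, D, F layout costs 1.2 expected taken branches (matching the .8 + .18*2 + .02*2 arithmetic above), while the naive topological order A, B, C, D, E, F costs 2.0.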
It might be useful to gather these stats as late as possible, to take into consideration the final impact of layout, including its effect on any later passes.

-Andy
Chandler Carruth
2011-Oct-22 11:25 UTC
[LLVMdev] Question regarding basic-block placement optimization
On Fri, Oct 21, 2011 at 6:17 PM, Andrew Trick <atrick at apple.com> wrote:

> [...]
>
> Your work looks good and there's talk of making this the default
> layout pass. So let's figure out the goals and requirements. [...]
>
> This is where the algorithm gets interesting and you can be creative,
> but I'll give you a rough idea so we're on the same page:
>
> 1) Visit each loop independently inside-out.
>
> 2) Construct chains of blocks within the loop. Here the "entry" chain
>    is the loop header, or function entry. You can probably represent
>    each already laid out inner loop as a single chain. The graph of
>    chains within a loop is *acyclic* except for irreducible control
>    flow, which doesn't have any structure to preserve anyway. There's
>    no need to compute SCCs.
>
> 3) Stop constructing the chain at a CFG merge. Chaining past a merge
>    would immediately violate topological ordering, which is only ok if
>    other predecessors of the merge have been designated "cold".
>
> 4) Splice the chains together, preserving topological order. Assigning
>    each block to a DFS reverse-post-order number provides a nice
>    ordering that is easy to compare or sort.

FYI, I'm still working on the details, and the code is a complete hack, but I have a big chunk of this working now. =] The more precise strategy I implemented follows your suggestion very closely. I've got lots more ideas to layer on top of this, but this is my first stab at providing a better underpinning.

First, it walks all loops in the function from the inside out, just as you describe. =] Really like this; it's a very nice model to use.

For each loop, after having walked all sub-loops, it walks forward across all the blocks in the loop. For each block not part of a chain, it creates a single-block chain. If the block has more than one predecessor (I think this is what you were after with a "merge point"?), we're done. If the block is the start of the chain it belongs to, and if it is the most likely of the in-loop successors of the previous block, and that previous block is the tail of its chain, we merge its chain into the previous block's chain. This should only form chains which are valid in the existing topological ordering within a loop, and which don't cross loop boundaries or incoming branches.

Then we traverse the BB graph in RPO, and for each block look up its chain. The first time we encounter a chain, we splice it into the function.

I think this satisfies your desire to preserve the structural components of the CFG? If not, I'd love to understand where I've deviated / what I've misunderstood.

It already makes some minimal use of the probabilities to select which topology-preserving fall-through options to prefer. I'm also going to tie-break with the existing layout. I've not started to really integrate it with the probabilities, though; I wanted to get something that fit the constraint model you're looking for first, and had clean traversal semantics. I'll clean the patch up and mail it out tomorrow.

The implementation gets tremendously simpler by A) using the loop info, and B) exclusively walking forward over the blocks. The first simplified the traversal greatly, and the second made it easy to just store pointers to basic blocks for the chains rather than moving blocks around in the function itself. That gets rid of a bunch of other code, and the most bug-prone of the other code. So all in all, I like where this is going.

> Let me give you a little example to think about: [...]
>
> Finally, we need some way to validate the design and verify the
> implementation. Weighted statistics will be very useful here, similar
> to those that Jakob used in the register allocator. [...]
>
> -Andy
Chandler Carruth
2011-Oct-23 08:11 UTC
[LLVMdev] Question regarding basic-block placement optimization
Ok, I think I have a working pass that is based much more on what we've talked about here. The patch is attached. I'd love to commit it soon-ish and then start tweaking it based on feedback from you and others, and on how it actually works in the wild.

It's essentially a re-write, so it may be hard to read... Let me know if some other form would be easier. Some responses to your comments below, perhaps giving insight into the choices I made, what the pass is currently doing, and what I haven't addressed yet.

On Fri, Oct 21, 2011 at 6:17 PM, Andrew Trick <atrick at apple.com> wrote:

> Once you decide to break these constraints, you have effectively
> designated the offending paths as "cold". It's up to you how to
> decide which paths are cold. You'll need some threshold based on
> confidence in the profile, and we can measure the effectiveness of
> your heuristics and add more over time.

FYI, I ended up with a heuristic that requires a hot edge to be 4x more likely than any other competing edge in the CFG. This is very loosely based on the definition of a 'hot' edge in BranchProbabilityInfo (80% vs. 20%).

> I don't care much about how you layout the cold chains as long as they
> are not interleaved with the rest of the code, thus breaking the
> topological ordering and forcing extra branches. In practice, this
> means they get their own little Siberia after the function return,
> effectively splitting the function into two sections. However, the
> notion of "coldness" is really relative to the current loop head.

I've not really tried to ensure this happens correctly. Mostly, this is just forming hot paths through the code when they are *really* hot. I'll work on better sectioning off the cut-off cold paths in a follow-up patch.

> 2) Construct chains of blocks within the loop. Here the "entry" chain
>    is the loop header, or function entry. You can probably represent
>    each already laid out inner loop as a single chain. The graph of
>    chains within a loop is *acyclic* except for irreducible control
>    flow, which doesn't have any structure to preserve anyway. There's
>    no need to compute SCCs.

I'm actually not satisfied with how I've done this. I think there is likely a cleaner / better solution than the one I have in the patch, but it largely works, and is easy to change later. The patch just walks forward over the function and tries to chain blocks together with their successors when the CFG permits, choosing successors based on whatever (even weak) signal the probability info gives us.

> 4) Splice the chains together, preserving topological order. Assigning
>    each block to a DFS reverse-post-order number provides a nice
>    ordering that is easy to compare or sort.

Currently I'm doing this once at the very end. I wonder if it would be better to do it iteratively as we walk up through the loops?

> Let me give you a little example to think about:
>
> A:
>   br B(80%), C(20%)
> B:
>   br D
> C:
>   br D(90%), E(10%)
> D:
>   br F
> E:
>   br F
> F:
>   ret
>
> This is a fine topological sort but bad layout based on expected
> branch probability (even with low confidence in the profile).
>
> The only good layout that preserves topological ordering is:
> A, B, C, E, D, F

I made a test case out of this, and I think it works. It's a bit confusing as it just replaces the branches to F with ret. That removes the edge from D to F, and makes the following ordering also viable: A, B, C, D, E. That's the ordering it chooses, and it also duplicates D up into B instead of branching from B to D. :: shrug :: It looks pretty good to me, though. =]

> Finally, we need some way to validate the design and verify the
> implementation. Weighted statistics will be very useful here, similar
> to those that Jakob used in the register allocator. [...]
>
> It might be useful to gather these stats as late as possible to take
> into consideration the final impact of layout, including the effect on
> any later passes.

Cool, I'm working on this next. I've also uncovered some *completely* broken aspects of branch probability. Going back and adding tests and fixes for it. =]

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mbp-redux.patch
Type: application/octet-stream
Size: 33590 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111023/8781c895/attachment.obj>