On Mon, Sep 18, 2017 at 1:16 PM, Andrew Trick <atrick at apple.com> wrote:

> On Sep 14, 2017, at 6:53 PM, Kyle Butt via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>> I plan on rewriting the block placement algorithm to proceed by traces.
>>
>> A trace is a chain of blocks where each block in the chain may fall
>> through to the successor in the chain.
>>
>> The overall algorithm would be to first produce traces for a function,
>> and then order those traces to try and get cache locality.
>>
>> Currently block placement uses a greedy single-step approach to layout.
>> It produces chains working from inner to outer loops. Unlike a trace, a
>> chain may contain non-fallthrough edges. This causes problems with loop
>> layout. The main problems with loop layout are loop rotation and cold
>> blocks in a loop.
>>
>> Overview of proposed solution:
>>
>> Phase 1:
>> Greedily produce a set of traces through the function. A trace is a list
>> of blocks with each block in the list falling through (possibly
>> conditionally) to the next block in the list. Loop rotation will occur
>> naturally in this phase via the triangle replacement algorithm below.
>> Handling single-trace loops requires a tweak; see the detailed design.
>>
>> Phase 2:
>> After producing what we believe are the best traces, they need to be
>> ordered. They will be ordered topologically, except that traces that are
>> cold enough (as measured by their warmest block) will be floated later.
>> This may push them out of a loop or to the end of the function.
>>
>> Detailed Design
>>
>> Note: whenever an edge is used as a number, I am referring to the edge
>> frequency.
>>
>> Phase 1: Producing traces
>> Traces are produced according to the following algorithm:
>> * Sort the edges according to weight, stable-sorting them according to
>>   the incoming block and edge ordering.
>> * Place each block in a trace of length 1.
>> * For each edge in order:
>>   * If the source is at the end of a trace, and the target is at the
>>     beginning of a trace, glue those 2 traces into 1 longer trace.
>>   * If an edge has a target or source in the middle of another trace,
>>     consider tail duplication. The benefit calculation is the same as in
>>     the existing code.
>>   * If an edge has a source or target in the middle, check them to see
>>     if they can be replaced as a triangle. (Triangle replacement is
>>     described below.)
>>   * Compare the benefit of choosing the edge, along with any triangles
>>     found, with the cost of breaking the existing edges.
>>   * If it is a net benefit, perform the switch.
>> * Triangle checking:
>>   Consider a trace in 2 parts, A1->A2, where the current edge under
>>   consideration is A1->B (the case for C->A2 is the mirror image, and
>>   both may need to be done).
>>   * First find the best alternative C->B.
>>   * Check for an alternative for A2: D->A2.
>>   * Find D's best alternative: D->E.
>>   * Compare the frequencies: A1->A2 + C->B + D->E vs A1->B + D->A2.
>>   * If the 2nd sum is bigger, do the switch.
>> * Loop Rotation Tweak:
>>   If A contains a backedge A2->A1, then when considering A1->B or C->A2,
>>   we can include that backedge in the gain:
>>   A1->A2 + C->D + E->B vs A1->B + C->A2 + A2->A1
>>
>> Phase 2: Order traces.
>> First we compute the frequency of a trace by finding the max frequency
>> of any of its blocks.
>> Then we attempt to place the traces topologically. When a trace cannot
>> be placed topologically, we prefer warmer traces first.
>>
>> Questions and comments welcome.

> This algorithm should be easy enough to implement and experiment with out
> of tree. A reasonable goal would be to identify cases that the current,
> more sophisticated algorithm is handling poorly. At that point, it would
> be much more useful to fix the current algorithm’s heuristics rather than
> introducing another, less sophisticated alternative.
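For concreteness, the basic gluing step of Phase 1 above can be sketched in Python. This is a minimal illustration with invented names and plain data structures, not LLVM's actual implementation; it also omits the tail-duplication and triangle-replacement checks that the full proposal applies when an edge lands mid-trace:

```python
def form_traces(blocks, edges):
    """Greedily glue blocks into traces by visiting edges in decreasing
    frequency order, as in Phase 1. blocks: iterable of block ids;
    edges: list of (src, dst, freq) tuples."""
    trace_of = {b: [b] for b in blocks}  # each block starts as its own trace
    # sorted() is stable, so ties keep the original edge ordering.
    for src, dst, _freq in sorted(edges, key=lambda e: -e[2]):
        t_src, t_dst = trace_of[src], trace_of[dst]
        if t_src is t_dst:
            continue  # src and dst already share a trace
        # Glue only when src ends one trace and dst begins another.
        if t_src[-1] == src and t_dst[0] == dst:
            t_src.extend(t_dst)
            for b in t_dst:
                trace_of[b] = t_src
    # Collect the distinct traces in block order.
    seen, traces = set(), []
    for b in blocks:
        if id(trace_of[b]) not in seen:
            seen.add(id(trace_of[b]))
            traces.append(trace_of[b])
    return traces

# A diamond CFG where A->B->D is hot and A->C->D is cold yields the
# fallthrough trace [A, B, D] plus a singleton trace [C].
traces = form_traces(
    ["A", "B", "C", "D"],
    [("A", "B", 90), ("B", "D", 90), ("A", "C", 10), ("C", "D", 10)],
)
```

In the cold-diamond example, C cannot be glued anywhere because both its incident edges touch the middle of the hot trace; that is exactly the situation where the proposal would weigh tail duplication or a triangle switch instead.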
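Phase 2's ordering rule admits a similarly small sketch (again with invented names, not LLVM code): a trace's warmth is the max frequency of any of its blocks, a trace is topologically ready once the predecessors of its head are placed, and warmth decides when topology leaves a choice or cannot be satisfied:

```python
def order_traces(traces, block_freq, preds):
    """Order traces roughly topologically, preferring warmer traces.
    traces: list of traces (lists of blocks); block_freq: block -> frequency;
    preds: block -> collection of CFG predecessor blocks."""
    # A trace's frequency ("warmth") is the max frequency of its blocks.
    warmth = {id(t): max(block_freq[b] for b in t) for t in traces}
    placed, ordered, remaining = set(), [], list(traces)
    while remaining:
        def ready(t):
            # Topologically ready: every predecessor of the trace head is
            # already placed (or lives inside the trace itself).
            return all(p in placed or p in t for p in preds.get(t[0], ()))
        # When no trace is ready (e.g. a cycle of traces), fall back to all
        # remaining traces and let warmth alone decide.
        candidates = [t for t in remaining if ready(t)] or remaining
        best = max(candidates, key=lambda t: warmth[id(t)])
        remaining.remove(best)
        ordered.append(best)
        placed.update(best)
    return ordered

# With the diamond traces from before, [A, B, D] is placed first (it is
# topologically ready and warm); the cold trace [C] floats after it.
layout = order_traces(
    [["A", "B", "D"], ["C"]],
    {"A": 100, "B": 90, "C": 10, "D": 100},
    {"B": ["A"], "C": ["A"], "D": ["B", "C"]},
)
```

Preferring the warmest ready trace is one way to realize "topological order, with cold traces floated later"; a real implementation would need an explicit coldness threshold to decide when a trace may leave its loop.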
I wrote many of the current algorithm's heuristics, and they feel like hacks
to work around specific problems. What I'm proposing is more, not less,
sophisticated. Handling loop rotation as part of a general lookahead problem
is more sophisticated than the existing method of laying out the loop and
praying that one of the n rotations will be a good one. Even the
tail-duplication support I added is kind of a hack. It would be better to do
block cloning to handle cases where tail duplication currently fails.

> The current algorithm deals well with partial profile information and
> maintains a number of layout properties relative to the CFG structure. I’m
> not the best person to explain all of this, but you will certainly need to
> understand the original goals, and show concrete examples of how it “fails”
> and can’t be easily fixed, before you propose replacing it.

I'll make sure to include test cases with any diffs I mail out.

> -Andy

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170918/86525a71/attachment-0001.html>
> On Sep 18, 2017, at 5:17 PM, Kyle Butt <iteratee at google.com> wrote:
>
> On Mon, Sep 18, 2017 at 1:16 PM, Andrew Trick <atrick at apple.com> wrote:
>
>>> [original proposal quoted in full; snipped]
>>
>> This algorithm should be easy enough to implement and experiment with
>> out of tree.
>> A reasonable goal would be to identify cases that the current, more
>> sophisticated algorithm is handling poorly. At that point, it would be
>> much more useful to fix the current algorithm’s heuristics rather than
>> introducing another, less sophisticated alternative.
>
> I wrote many of the current algorithm's heuristics, and they feel like
> hacks to work around specific problems.
>
> What I'm proposing is more, not less, sophisticated. Handling loop
> rotation as part of a general lookahead problem is more sophisticated
> than the existing method of laying out the loop and praying that one of
> the n rotations will be a good one.
> Even the tail-duplication support I added is kind of a hack. It would be
> better to do block cloning to handle cases where tail duplication
> currently fails.

The “Algo2” that you referred to is the least sophisticated layout algorithm
I can imagine. I’ve always found it to be extremely unsatisfying when it
comes to partial profiles, unbiased branches, or general debugging sanity.

When I say the existing algorithm is more sophisticated, I mean that it is
designed to meet a couple of basic requirements:
- contiguous loops
- topologically ordered regions within a loop
with the exception that clearly cold paths are moved to the end of the
function.

I’m not sure how well your “triangle look-ahead” algorithm works, but if
you can also meet those requirements then I would say it’s also very
sophisticated. I haven’t spent any time evaluating the current LLVM
algorithm, so I won’t defend it over another approach that meets the same
goals.

I am very curious to see examples of situations where your approach yields
better performance.

>> The current algorithm deals well with partial profile information and
>> maintains a number of layout properties relative to the CFG structure.
>> I’m not the best person to explain all of this, but you will certainly
>> need to understand the original goals, and show concrete examples of how
>> it “fails” and can’t be easily fixed, before you propose replacing it.
>
> I'll make sure to include test cases with any diffs I mail out.

Ok. I’m not opposed to someone who has a lot of experience with the current
algorithm rewriting it. I’m just concerned that it will result in a lot of
arbitrary code shuffling, making it difficult to make sense of the codegen
output, and result in worse block layout for the typical situation: many
missing branch weights, or workloads that deviate from the profile.

-Andy
Chiming in to this discussion just a little bit late...

I share Andy's concerns about whether the proposed trace algorithm subtly
improves on the existing code. I'm perfectly happy to be convinced, but the
burden of proof here is high.

Thinking about the current algorithm, I find it helpful to think in terms of
extended basic blocks. At its most basic, the existing code (and your
alternate proposal, unless I'm misreading it) eagerly forms extended basic
blocks with a single entry, but potentially many early exits before a single
N-way terminator. At this stage, we're only handling "obvious" cases where
the exit path is substantially colder than the fall-through which continues
within the same EBB.

The "interesting bits" of both the current code and your new algorithm can
be phrased as merging these EBBs to form super-blocks. Once we start merging
EBBs, we have to allow multiple entries into our super-blocks. One nice
property of loops, triangles, and diamonds is that control flow *entirely
within* a super-block is invisible to other super-blocks. Note that there's,
for example, a policy choice as to whether cold loop blocks are placed into
the loop's super-block (hiding the control flow) or into a separate
super-block (outlining the cold path far away).

I suspect the current code could be substantially improved by making the
abstractions of EBBs and super-blocks explicit in the code. Then the
heuristic pieces would be much more obvious and easier to evaluate in
isolation.
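The eager EBB formation described above can be made concrete with a small sketch (a hypothetical helper, not actual LLVM code): every join point, i.e. a block with more than one predecessor, roots a new EBB, and an EBB grows through successors that have exactly one predecessor, so each EBB has a single entry but possibly many early exits:

```python
def form_ebbs(blocks, entry, succs, preds):
    """Partition a CFG into extended basic blocks: single-entry regions
    with potentially many early exits. succs/preds map a block to its
    successor/predecessor blocks."""
    # Roots: the function entry plus every join point.
    roots = [entry] + [b for b in blocks
                       if b != entry and len(preds.get(b, ())) > 1]
    ebbs = []
    for root in roots:
        ebb, work = [root], [root]
        while work:
            b = work.pop()
            for s in succs.get(b, ()):
                # Grow only through non-join successors (exactly one
                # predecessor) that are not the entry or already included.
                if s != entry and len(preds.get(s, ())) == 1 and s not in ebb:
                    ebb.append(s)
                    work.append(s)
        ebbs.append(ebb)
    return ebbs

# For a diamond A->{B,C}, B->D, C->D: A, B, and C form one EBB (B and C
# extend A without introducing a join), and the join point D roots its own.
ebbs = form_ebbs(
    ["A", "B", "C", "D"], "A",
    {"A": ["B", "C"], "B": ["D"], "C": ["D"]},
    {"B": ["A"], "C": ["A"], "D": ["B", "C"]},
)
```

Merging EBBs into super-blocks, as discussed above, would then be a separate pass over this partition, which is where the interesting policy choices (such as where cold loop blocks go) would live.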
Philip

On 09/18/2017 06:05 PM, Andrew Trick via llvm-dev wrote:
> [earlier messages in the thread quoted in full; snipped]
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev