Sean Silva via llvm-dev
2016-Jun-08 11:19 UTC
[llvm-dev] Intended behavior of CGSCC pass manager.
Hi Chandler, Philip, Mehdi, (and llvm-dev,) (this is partially a summary of some discussions that happened at the last LLVM bay area social, and partially a discussion about the direction of the CGSCC pass manager) A the last LLVM social we discussed the progress on the CGSCC pass manager. It seems like Chandler has a CGSCC pass manager working, but it is still unresolved exactly which semantics we want (more about this below) that are reasonably implementable. AFAICT, there has been no public discussion about what exact semantics we ultimately want to have. We should figure that out. The main difficulty which Chandler described is the apparently quite complex logic surrounding needing to run function passes nested within an SCC pass manager, while providing some guarantees about exactly what order the function passes are run. The existing CGSCC pass manager just punts on some of the problems that arise (look in CGPassManager::runOnModule, CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that Chandler has been trying to solve. ( Why is this "function passes inside CGSCC passes" stuff interesting? Because LLVM can do inlining on an SCC (often just a single function) and then run function passes to simplify the function(s) in the SCC before it tries to inline into a parent SCC. (the SCC visitation order is post-order) For example, we may inline a bunch of code, but after inlining we can tremendously simplify the function, and we want to do so before considering this function for inlining into its callers so that we get an accurate evaluation of the inline cost. Based on what Chandler said, it seems that LLVM is fairly unique in this regard and other compilers don't do this (which is why we can't just look at how other compilers solve this problem; they don't have this problem (maybe they should? or maybe we shouldn't?)). For example, he described that GCC uses different inlining "phases"; e.g. it does early inlining on the entire module, then does simplifications on the entire module, then does late inlining on the entire module; so it is not able to incrementally simplify as it inlines like LLVM does. ) As background for what is below, the LazyCallGraph tracks two graphs: the "call graph" and the "ref graph". Conceptually, the call graph is the graph of direct calls, where indirect calls and calls to external functions do not appear (or are connected to dummy nodes). The ref graph is basically the graph of all functions transitively accessible based on the globals/constants/etc. referenced by a function (e.g. if a function `foo` references a vtable that is defined in the module, there is an edge in the ref graph from `foo` to every function in the vtable). The call graph is a strict subset of the ref graph. Chandler described that he had a major breakthrough in that the CGSCC pass manager only had to deal with 3 classes of modifications that can occur: - a pass may e.g. propagate a load of a function pointer into an indirect call, turning it into an direct call. This requires adding an edge in the CG but not in the ref graph. - a pass may take a direct call and turn it into an indirect call. This requires removing an edge from the CG, but not in the ref graph. - a pass may delete a direct call. This removes an edge in the CG and also in the ref graph.>From the perspective of the CGSCC pass manager, these operations can affectthe SCC structure. Adding an edge might merge SCC's and deleting an edge might split SCC's. Chandler mentioned that apparently the issues of splitting and merging SCC's within the current infrastructure are actually quite challenging and lead to e.g. iterator invalidation issues, and that is what he is working on. ( The ref graph is important to guide the overall SCC visitation order because it basically represents "the largest graph that the CG may turn into due to our static analysis of this module". I.e. no transformation we can statically make in the CGSCC passes can ever cause us to need to merge SCC's in the ref graph. ) I have a couple overall questions/concerns: 1. The ref graph can easily go quadratic. E.g. typedef void (*fp)(); fp funcs[] = { &foo1, &foo2, ... &fooN } void foo1() { funcs[something](); } void foo2() { funcs[something](); } ... void fooN() { funcs[something](); } One real-world case where this might come about is in the presence of vtables. The existing CGSCC pass manager does not have this issue AFAIK because it does not consider the ref graph. Does anybody have any info/experience about how densely connected the ref graph can get in programs that might reasonably be fed to the compiler? I just did a quick sanity check with LLD/ELF using `--lto-newpm-passes=cgscc(no-op-cgscc)` and it at least seemed to terminate / not run out of memory. Based on some rough calculations looking at the profile, it seem like the entire run of the inliner in the old LTO pipeline (which is about 5% of total LTO time on this particular example I looked at) is only 2-3x as expensive as just `--lto-newpm-passes=cgscc(no-op-cgscc)`, so the LazyCallGraph construction is certainly not free. 2. What is the intended behavior of CGSCC passes when SCC's are split or merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run some function passes nested inside the CGSCC pass manager (e.g. to simplify things after inlining). Consider: a) These function passes are e.g. now able to devirtualize a call, adding an edge to the CG, forming a larger CG SCC. Do you re-run the CGSCC pass (say, the inliner) on this larger SCC? b) These function passes are e.g. able to DCE a call, removing an edge from the CG. This converts, say, a CG SCC which is a cycle graph (like a->b->c->a) into a path graph (a->b->c, with no edge back to a). The inliner had already visited a, b, and c as a single SCC. Now does it have to re-visit c, then b, then a, as single-node SCC's? btw: One way that I have found it useful to think about this is in terms of the visitation during Tarjan's SCC algorithm. I'll reference the pseudocode in https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm. Inside the "strongconnect" routine when we have identified an SCC (the true branch of `if (v.lowlink = v.index)` test ) we can visit stack[v.index:stack.size()] as an SCC. This may or may not invalidate some things on the stack (the variable `S` in the pseudocode) and we may need to fix it up (e.g. inlining deleted a function, so we can't have an entry on the stack). Then, we can run function passes as we pop individual functions off the stack, but it is easier to think about IMO than merging of SCC data structures: if we add edges to the CG then we have to do more DFS on the new edges and if we delete edges then the DFS order of the stack gives us certain guarantees. Personally I find this much easier to reason about than the description in terms of splitting and merging SCC's in the CG and ref graph (which the LazyCallGraph API makes one to think about since it hides the underlying Tarjan's algorithm). The LazyCallGraph API makes the current loop in http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100 very clean, but at least for my thinking about the problem, it seems like the wrong abstraction (and most of the LazyCallGraph API seems to be unused, so it seems like it may be overly heavyweight). E.g. I think that maybe the easiest thing to do is to turn the current approach inside out: instead of having the pass manager logic be the "normal code" and forcing the Tarjan algorithm to become a state machine of iterators, use an open-coded Tarjan algorithm with some callbacks and make the pass management logic be the state machine. This will also open the door to avoiding the potentially quadratic size of the ref graph, since e.g. in the example I gave above, we can mark the `funcs` array itself as already having been visited during the walk. In the current LazyCallGraph, this would require adding some sort of notion of hyperedge. Since this is such a high priority (due to blocking PGO inlining), I will probably try my hand at implementing the CGSCC pass manager sometime soon unless somebody beats me to it. (I'll probably try the "open-coded SCC visit" approach). Another possibility is implementing the new CGSCC pass manager that uses the same visitation semantics as the one in the old PM, and then we can refactor that as needed. In fact, that may be the best approach so that porting to the new PM is as NFC as possible and we can isolate the functional (i.e., need benchmarks, measurements ...) changes in separate commits. Sorry for the wall of text. -- Sean Silva -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160608/3ca09f70/attachment.html>
Hal Finkel via llvm-dev
2016-Jun-08 16:32 UTC
[llvm-dev] Intended behavior of CGSCC pass manager.
----- Original Message -----> From: "Sean Silva via llvm-dev" <llvm-dev at lists.llvm.org> > To: "llvm-dev" <llvm-dev at lists.llvm.org> > Sent: Wednesday, June 8, 2016 6:19:03 AM > Subject: [llvm-dev] Intended behavior of CGSCC pass manager.> Hi Chandler, Philip, Mehdi, (and llvm-dev,)> (this is partially a summary of some discussions that happened at the > last LLVM bay area social, and partially a discussion about the > direction of the CGSCC pass manager)> A the last LLVM social we discussed the progress on the CGSCC pass > manager. It seems like Chandler has a CGSCC pass manager working, > but it is still unresolved exactly which semantics we want (more > about this below) that are reasonably implementable.> AFAICT, there has been no public discussion about what exact > semantics we ultimately want to have. We should figure that out.> The main difficulty which Chandler described is the apparently quite > complex logic surrounding needing to run function passes nested > within an SCC pass manager, while providing some guarantees about > exactly what order the function passes are run. The existing CGSCC > pass manager just punts on some of the problems that arise (look in > CGPassManager::runOnModule, CGPassManager::RunAllPassesOnSCC, and > CGPassManager::RunPassOnSCC in > llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems > that Chandler has been trying to solve.> ( > Why is this "function passes inside CGSCC passes" stuff interesting? > Because LLVM can do inlining on an SCC (often just a single > function) and then run function passes to simplify the function(s) > in the SCC before it tries to inline into a parent SCC. (the SCC > visitation order is post-order) > For example, we may inline a bunch of code, but after inlining we can > tremendously simplify the function, and we want to do so before > considering this function for inlining into its callers so that we > get an accurate evaluation of the inline cost. > Based on what Chandler said, it seems that LLVM is fairly unique in > this regard and other compilers don't do this (which is why we can't > just look at how other compilers solve this problem; they don't have > this problem (maybe they should? or maybe we shouldn't?)). For > example, he described that GCC uses different inlining "phases"; > e.g. it does early inlining on the entire module, then does > simplifications on the entire module, then does late inlining on the > entire module; so it is not able to incrementally simplify as it > inlines like LLVM does.This incremental simplification is an important feature of our inliner, and one we should endeavor to keep. We might also want different phases at some point (e.g. a top-down and a bottom-up phase), but that's another story.> )> As background for what is below, the LazyCallGraph tracks two graphs: > the "call graph" and the "ref graph". > Conceptually, the call graph is the graph of direct calls, where > indirect calls and calls to external functions do not appear (or are > connected to dummy nodes). The ref graph is basically the graph of > all functions transitively accessible based on the > globals/constants/etc. referenced by a function (e.g. if a function > `foo` references a vtable that is defined in the module, there is an > edge in the ref graph from `foo` to every function in the vtable). > The call graph is a strict subset of the ref graph.> Chandler described that he had a major breakthrough in that the CGSCC > pass manager only had to deal with 3 classes of modifications that > can occur: > - a pass may e.g. propagate a load of a function pointer into an > indirect call, turning it into an direct call. This requires adding > an edge in the CG but not in the ref graph. > - a pass may take a direct call and turn it into an indirect call. > This requires removing an edge from the CG, but not in the ref > graph. > - a pass may delete a direct call. This removes an edge in the CG and > also in the ref graph.> From the perspective of the CGSCC pass manager, these operations can > affect the SCC structure. Adding an edge might merge SCC's and > deleting an edge might split SCC's. Chandler mentioned that > apparently the issues of splitting and merging SCC's within the > current infrastructure are actually quite challenging and lead to > e.g. iterator invalidation issues, and that is what he is working > on.> ( > The ref graph is important to guide the overall SCC visitation order > because it basically represents "the largest graph that the CG may > turn into due to our static analysis of this module". I.e. no > transformation we can statically make in the CGSCC passes can ever > cause us to need to merge SCC's in the ref graph. > )> I have a couple overall questions/concerns:> 1. The ref graph can easily go quadratic. E.g.> typedef void (*fp)(); > fp funcs[] = { > &foo1, > &foo2, > ... > &fooN > } > void foo1() { funcs[something](); }> void foo2() { funcs[something](); } > ... > void fooN() { funcs[something](); }> One real-world case where this might come about is in the presence of > vtables.> The existing CGSCC pass manager does not have this issue AFAIK > because it does not consider the ref graph.> Does anybody have any info/experience about how densely connected the > ref graph can get in programs that might reasonably be fed to the > compiler? > I just did a quick sanity check with LLD/ELF using > `--lto-newpm-passes=cgscc(no-op-cgscc)` and it at least seemed to > terminate / not run out of memory. Based on some rough calculations > looking at the profile, it seem like the entire run of the inliner > in the old LTO pipeline (which is about 5% of total LTO time on this > particular example I looked at) is only 2-3x as expensive as just > `--lto-newpm-passes=cgscc(no-op-cgscc)`, so the LazyCallGraph > construction is certainly not free.> 2. What is the intended behavior of CGSCC passes when SCC's are split > or merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now > we run some function passes nested inside the CGSCC pass manager > (e.g. to simplify things after inlining). Consider:This is not how I thought the current scheme worked ;) -- I was under the impression that we had a call graph with conservatively-connected dummy nodes for external/indirect functions. As a result, there is no semantics-preserving optimization that will merge SCCs, only split them. In that case, I'd expect that once an SCC is split, we re-run the CGSCC passes over the newly-separated SCCs. But this corresponds to running over the "ref graph", as you describe it. I don't understand why we want the non-conservative graph.> a) These function passes are e.g. now able to devirtualize a call, > adding an edge to the CG, forming a larger CG SCC. Do you re-run the > CGSCC pass (say, the inliner) on this larger SCC?> b) These function passes are e.g. able to DCE a call, removing an > edge from the CG. This converts, say, a CG SCC which is a cycle > graph (like a->b->c->a) into a path graph (a->b->c, with no edge > back to a). The inliner had already visited a, b, and c as a single > SCC. Now does it have to re-visit c, then b, then a, as single-node > SCC's?> btw:> One way that I have found it useful to think about this is in terms > of the visitation during Tarjan's SCC algorithm. I'll reference the > pseudocode in > https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm > . Inside the "strongconnect" routine when we have identified an SCC > (the true branch of `if (v.lowlink = v.index)` test ) we can visit > stack[v.index:stack.size()] as an SCC. This may or may not > invalidate some things on the stack (the variable `S` in the > pseudocode) and we may need to fix it up (e.g. inlining deleted a > function, so we can't have an entry on the stack). Then, we can run > function passes as we pop individual functions off the stack, but it > is easier to think about IMO than merging of SCC data structures: if > we add edges to the CG then we have to do more DFS on the new edges > and if we delete edges then the DFS order of the stack gives us > certain guarantees. > Personally I find this much easier to reason about than the > description in terms of splitting and merging SCC's in the CG and > ref graph (which the LazyCallGraph API makes one to think about > since it hides the underlying Tarjan's algorithm). > The LazyCallGraph API makes the current loop in > http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100 > very clean, but at least for my thinking about the problem, it seems > like the wrong abstraction (and most of the LazyCallGraph API seems > to be unused, so it seems like it may be overly heavyweight). > E.g. I think that maybe the easiest thing to do is to turn the > current approach inside out: instead of having the pass manager > logic be the "normal code" and forcing the Tarjan algorithm to > become a state machine of iterators, use an open-coded Tarjan > algorithm with some callbacks and make the pass management logic be > the state machine. > This will also open the door to avoiding the potentially quadratic > size of the ref graph, since e.g. in the example I gave above, we > can mark the `funcs` array itself as already having been visited > during the walk. In the current LazyCallGraph, this would require > adding some sort of notion of hyperedge.FWIW, I see no purpose in abstracting one algorithm, especially if that makes things algorithmically harder. Also, the LazyCallGraph abstraction and the iterator abstraction seem like separate issues. Iterator abstractions are often useful because you can use them in generic algorithms, etc.> Since this is such a high priority (due to blocking PGO inlining), I > will probably try my hand at implementing the CGSCC pass manager > sometime soon unless somebody beats me to it. (I'll probably try the > "open-coded SCC visit" approach).> Another possibility is implementing the new CGSCC pass manager that > uses the same visitation semantics as the one in the old PM, and > then we can refactor that as needed. In fact, that may be the best > approach so that porting to the new PM is as NFC as possible and we > can isolate the functional (i.e., need benchmarks, measurements ...) > changes in separate commits.I'm in favor of this approach for exactly the reason you mention. Being able to bisect regressions to the algorithmic change, separate from the infrastructure change, will likely make things easier in the long run (and will avoid the problem, to the extent possible, of performance regressions blocking the pass-manager work).> Sorry for the wall of text.No problem; much appreciated. -Hal> -- Sean Silva > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160608/e1eb0bc5/attachment.html>
Daniel Berlin via llvm-dev
2016-Jun-08 16:39 UTC
[llvm-dev] Intended behavior of CGSCC pass manager.
On Wed, Jun 8, 2016 at 4:19 AM, Sean Silva <chisophugis at gmail.com> wrote:> Hi Chandler, Philip, Mehdi, (and llvm-dev,) > > (this is partially a summary of some discussions that happened at the last > LLVM bay area social, and partially a discussion about the direction of the > CGSCC pass manager) > > > A the last LLVM social we discussed the progress on the CGSCC pass > manager. It seems like Chandler has a CGSCC pass manager working, but it is > still unresolved exactly which semantics we want (more about this below) > that are reasonably implementable. > > AFAICT, there has been no public discussion about what exact semantics we > ultimately want to have. We should figure that out. > > The main difficulty which Chandler described is the apparently quite > complex logic surrounding needing to run function passes nested within an > SCC pass manager, while providing some guarantees about exactly what order > the function passes are run. The existing CGSCC pass manager just punts on > some of the problems that arise (look in CGPassManager::runOnModule, > CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in > llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that > Chandler has been trying to solve. > > ( > Why is this "function passes inside CGSCC passes" stuff interesting? > Because LLVM can do inlining on an SCC (often just a single function) and > then run function passes to simplify the function(s) in the SCC before it > tries to inline into a parent SCC. (the SCC visitation order is post-order) > For example, we may inline a bunch of code, but after inlining we can > tremendously simplify the function, and we want to do so before considering > this function for inlining into its callers so that we get an accurate > evaluation of the inline cost. > Based on what Chandler said, it seems that LLVM is fairly unique in this > regard and other compilers don't do this (which is why we can't just look > at how other compilers solve this problem; they don't have this problem > (maybe they should? or maybe we shouldn't?)). For example, he described > that GCC uses different inlining "phases"; e.g. it does early inlining on > the entire module, then does simplifications on the entire module, then > does late inlining on the entire module; so it is not able to incrementally > simplify as it inlines like LLVM does. > ) > > As background for what is below, the LazyCallGraph tracks two graphs: the > "call graph" and the "ref graph". > Conceptually, the call graph is the graph of direct calls, where indirect > calls and calls to external functions do not appear (or are connected to > dummy nodes). The ref graph is basically the graph of all functions > transitively accessible based on the globals/constants/etc. referenced by a > function (e.g. if a function `foo` references a vtable that is defined in > the module, there is an edge in the ref graph from `foo` to every function > in the vtable). > The call graph is a strict subset of the ref graph. > > Chandler described that he had a major breakthrough in that the CGSCC pass > manager only had to deal with 3 classes of modifications that can occur: > - a pass may e.g. propagate a load of a function pointer into an indirect > call, turning it into an direct call. This requires adding an edge in the > CG but not in the ref graph. > - a pass may take a direct call and turn it into an indirect call. This > requires removing an edge from the CG, but not in the ref graph. > - a pass may delete a direct call. This removes an edge in the CG and also > in the ref graph. > > From the perspective of the CGSCC pass manager, these operations can > affect the SCC structure. Adding an edge might merge SCC's and deleting an > edge might split SCC's. Chandler mentioned that apparently the issues of > splitting and merging SCC's within the current infrastructure are actually > quite challenging and lead to e.g. iterator invalidation issues, and that > is what he is working on. > > ( > The ref graph is important to guide the overall SCC visitation order > because it basically represents "the largest graph that the CG may turn > into due to our static analysis of this module". I.e. no transformation we > can statically make in the CGSCC passes can ever cause us to need to merge > SCC's in the ref graph. > ) > > > > I have a couple overall questions/concerns: > > > 1. The ref graph can easily go quadratic. E.g. > > typedef void (*fp)(); > fp funcs[] = { > &foo1, > &foo2, > ... > &fooN > } > void foo1() { funcs[something](); } > void foo2() { funcs[something](); } > ... > void fooN() { funcs[something](); } > > One real-world case where this might come about is in the presence of > vtables. > > The existing CGSCC pass manager does not have this issue AFAIK because it > does not consider the ref graph. > > Does anybody have any info/experience about how densely connected the ref > graph can get in programs that might reasonably be fed to the compiler? >I can state that almost all call graphs of compilers include edges for indirect calls and external functions, so they are already quadratic in this sense. if what you state is correct, and we don't have a conservatively correct call graph, that would be ... interesting. The solution to most issues (large sccs, etc) that exist here that most other compilers take is to try to make the call graph more precise, not to avoid indirect/external calls in the call graph. In turn, this means the solution often take is to not have two graphs at all. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160608/4600990f/attachment.html>
Xinliang David Li via llvm-dev
2016-Jun-08 17:57 UTC
[llvm-dev] Intended behavior of CGSCC pass manager.
On Wed, Jun 8, 2016 at 4:19 AM, Sean Silva <chisophugis at gmail.com> wrote:> Hi Chandler, Philip, Mehdi, (and llvm-dev,) > > (this is partially a summary of some discussions that happened at the last > LLVM bay area social, and partially a discussion about the direction of the > CGSCC pass manager) > > > A the last LLVM social we discussed the progress on the CGSCC pass > manager. It seems like Chandler has a CGSCC pass manager working, but it is > still unresolved exactly which semantics we want (more about this below) > that are reasonably implementable. > > AFAICT, there has been no public discussion about what exact semantics we > ultimately want to have. We should figure that out. > > The main difficulty which Chandler described is the apparently quite > complex logic surrounding needing to run function passes nested within an > SCC pass manager, while providing some guarantees about exactly what order > the function passes are run. The existing CGSCC pass manager just punts on > some of the problems that arise (look in CGPassManager::runOnModule, > CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in > llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that > Chandler has been trying to solve. > > ( > Why is this "function passes inside CGSCC passes" stuff interesting? > Because LLVM can do inlining on an SCC (often just a single function) and > then run function passes to simplify the function(s) in the SCC before it > tries to inline into a parent SCC. (the SCC visitation order is post-order) > For example, we may inline a bunch of code, but after inlining we can > tremendously simplify the function, and we want to do so before considering > this function for inlining into its callers so that we get an accurate > evaluation of the inline cost. > Based on what Chandler said, it seems that LLVM is fairly unique in this > regard and other compilers don't do this (which is why we can't just look > at how other compilers solve this problem; they don't have this problem > (maybe they should? or maybe we shouldn't?)). For example, he described > that GCC uses different inlining "phases"; e.g. it does early inlining on > the entire module, then does simplifications on the entire module, then > does late inlining on the entire module; so it is not able to incrementally > simplify as it inlines like LLVM does. > ) > >I want to clarify a little more on GCC's behavior. GCC has two types of ipa_passes. One called simple_ipa pass and the other is called regular ipa_pass. A simple IPA pass does not transform IR by itself, but it can have function level sub-passes that does transformation. If it has function sub-passes, the function passes will be executed in bottom up order just like LLVM's CGSCC pass. A regular ipa pass is somewhat like the module pass in LLVM. GCC's pass pipeline is actually similar to LLVM's pipeline overall: it has the following components: (1) lowering passes (2) small/simple IPA passes including the early optimization pipeline (bottom-up) (3) full ipa pipelines (4) post-ipa optimization and code generation. The difference is that LLVM's (2) includes only inst-combine and simplfy CFG, but GCC's (2) is larger which happens also includes an early inlining pass. GCC's (2) is similar to LLVM's CGSCC pass but with fewer passes. The inliner is also targeting tiny functions that reduces size overall. LLVM's LTO pipeline is kind of like this too. GCC's (3) includes a full blown regular inlining pass. This inliner uses the priority order based algorithm, so bottom-up traversal does not make sense. It requires (2) to have pass to collect summary to drive the decisions (as well as profile data). David> As background for what is below, the LazyCallGraph tracks two graphs: the > "call graph" and the "ref graph". > Conceptually, the call graph is the graph of direct calls, where indirect > calls and calls to external functions do not appear (or are connected to > dummy nodes). The ref graph is basically the graph of all functions > transitively accessible based on the globals/constants/etc. referenced by a > function (e.g. if a function `foo` references a vtable that is defined in > the module, there is an edge in the ref graph from `foo` to every function > in the vtable). > The call graph is a strict subset of the ref graph. > > Chandler described that he had a major breakthrough in that the CGSCC pass > manager only had to deal with 3 classes of modifications that can occur: > - a pass may e.g. propagate a load of a function pointer into an indirect > call, turning it into an direct call. This requires adding an edge in the > CG but not in the ref graph. > - a pass may take a direct call and turn it into an indirect call. This > requires removing an edge from the CG, but not in the ref graph. > - a pass may delete a direct call. This removes an edge in the CG and also > in the ref graph. > > From the perspective of the CGSCC pass manager, these operations can > affect the SCC structure. Adding an edge might merge SCC's and deleting an > edge might split SCC's. Chandler mentioned that apparently the issues of > splitting and merging SCC's within the current infrastructure are actually > quite challenging and lead to e.g. iterator invalidation issues, and that > is what he is working on. > > ( > The ref graph is important to guide the overall SCC visitation order > because it basically represents "the largest graph that the CG may turn > into due to our static analysis of this module". I.e. no transformation we > can statically make in the CGSCC passes can ever cause us to need to merge > SCC's in the ref graph. > ) > > > > I have a couple overall questions/concerns: > > > 1. The ref graph can easily go quadratic. E.g. > > typedef void (*fp)(); > fp funcs[] = { > &foo1, > &foo2, > ... > &fooN > } > void foo1() { funcs[something](); } > void foo2() { funcs[something](); } > ... > void fooN() { funcs[something](); } > > One real-world case where this might come about is in the presence of > vtables. > > The existing CGSCC pass manager does not have this issue AFAIK because it > does not consider the ref graph. > > Does anybody have any info/experience about how densely connected the ref > graph can get in programs that might reasonably be fed to the compiler? > I just did a quick sanity check with LLD/ELF using > `--lto-newpm-passes=cgscc(no-op-cgscc)` and it at least seemed to terminate > / not run out of memory. Based on some rough calculations looking at the > profile, it seem like the entire run of the inliner in the old LTO pipeline > (which is about 5% of total LTO time on this particular example I looked > at) is only 2-3x as expensive as just > `--lto-newpm-passes=cgscc(no-op-cgscc)`, so the LazyCallGraph construction > is certainly not free. > > > 2. What is the intended behavior of CGSCC passes when SCC's are split or > merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run > some function passes nested inside the CGSCC pass manager (e.g. to simplify > things after inlining). Consider: > > a) These function passes are e.g. now able to devirtualize a call, adding > an edge to the CG, forming a larger CG SCC. Do you re-run the CGSCC pass > (say, the inliner) on this larger SCC? > > b) These function passes are e.g. able to DCE a call, removing an edge > from the CG. This converts, say, a CG SCC which is a cycle graph (like > a->b->c->a) into a path graph (a->b->c, with no edge back to a). The > inliner had already visited a, b, and c as a single SCC. Now does it have > to re-visit c, then b, then a, as single-node SCC's? > > > btw: > > One way that I have found it useful to think about this is in terms of the > visitation during Tarjan's SCC algorithm. I'll reference the pseudocode in > https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm. > Inside the "strongconnect" routine when we have identified an SCC (the true > branch of `if (v.lowlink = v.index)` test ) we can visit > stack[v.index:stack.size()] as an SCC. This may or may not invalidate some > things on the stack (the variable `S` in the pseudocode) and we may need to > fix it up (e.g. inlining deleted a function, so we can't have an entry on > the stack). Then, we can run function passes as we pop individual functions > off the stack, but it is easier to think about IMO than merging of SCC data > structures: if we add edges to the CG then we have to do more DFS on the > new edges and if we delete edges then the DFS order of the stack gives us > certain guarantees. > Personally I find this much easier to reason about than the description in > terms of splitting and merging SCC's in the CG and ref graph (which the > LazyCallGraph API makes one to think about since it hides the underlying > Tarjan's algorithm). > The LazyCallGraph API makes the current loop in > http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100 > very clean, but at least for my thinking about the problem, it seems like > the wrong abstraction (and most of the LazyCallGraph API seems to be > unused, so it seems like it may be overly heavyweight). > E.g. I think that maybe the easiest thing to do is to turn the current > approach inside out: instead of having the pass manager logic be the > "normal code" and forcing the Tarjan algorithm to become a state machine of > iterators, use an open-coded Tarjan algorithm with some callbacks and make > the pass management logic be the state machine. > This will also open the door to avoiding the potentially quadratic size of > the ref graph, since e.g. in the example I gave above, we can mark the > `funcs` array itself as already having been visited during the walk. In the > current LazyCallGraph, this would require adding some sort of notion of > hyperedge. > > Since this is such a high priority (due to blocking PGO inlining), I will > probably try my hand at implementing the CGSCC pass manager sometime soon > unless somebody beats me to it. (I'll probably try the "open-coded SCC > visit" approach). > > Another possibility is implementing the new CGSCC pass manager that uses > the same visitation semantics as the one in the old PM, and then we can > refactor that as needed. In fact, that may be the best approach so that > porting to the new PM is as NFC as possible and we can isolate the > functional (i.e., need benchmarks, measurements ...) changes in separate > commits. > > > Sorry for the wall of text. > > > -- Sean Silva >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160608/750f07ed/attachment.html>
Sanjoy Das via llvm-dev
2016-Jun-08 18:14 UTC
[llvm-dev] Intended behavior of CGSCC pass manager.
Hi Sean, Sean Silva wrote: > Hi Chandler, Philip, Mehdi, (and llvm-dev,) > > (this is partially a summary of some discussions that happened at the last LLVM bay area social, and partially a > discussion about the direction of the CGSCC pass manager) Thanks for writing this up! This sort of thing is very helpful for people who don't attend the social, or had to leave early. :) > Chandler described that he had a major breakthrough in that the CGSCC pass manager only had to deal with 3 classes of > modifications that can occur: > - a pass may e.g. propagate a load of a function pointer into an indirect call, turning it into an direct call. This > requires adding an edge in the CG but not in the ref graph. > - a pass may take a direct call and turn it into an indirect call. This requires removing an edge from the CG, but not > in the ref graph. > - a pass may delete a direct call. This removes an edge in the CG and also in the ref graph. At what granularity are we modeling these things? E.g. if SimplifyCFG deletes a basic block, will we remove call edges that start from that block? Note there is fourth kind of modification that isn't modeled here: devirtualization, in which we transform IR like %fnptr = compute_fnptr(%receiver, i32 <method id>) call %fptr(%receiver) to (via a pass that understands the semantics of compute_fnptr). call some.specific.Class::method(%receiver) However, at this time I don't think modeling "out of thin air" devirtualization of this sort is important, since nothing upstream does things like this (we do have ModulePasses downstream that do this). > From the perspective of the CGSCC pass manager, these operations can affect the SCC structure. Adding an edge might > merge SCC's and deleting an edge might split SCC's. Chandler mentioned that apparently the issues of splitting and > merging SCC's within the current infrastructure are actually quite challenging and lead to e.g. iterator invalidation > issues, and that is what he is working on. > > ( > The ref graph is important to guide the overall SCC visitation order because it basically represents "the largest graph > that the CG may turn into due to our static analysis of this module". I.e. no transformation we can statically make in > the CGSCC passes can ever cause us to need to merge SCC's in the ref graph. > ) Except in the above "out of thin air" devirtualization case. I think cross-function store forwarding can also be problematic: void foo(fnptr* bar) { if (bar) (*bar)(); } void baz() { foo(null); } void caller() { fnptr *t = malloc(); *t = baz; foo(t); } // RefGraph is // caller -> baz // caller -> foo // baz -> foo Now the RefSCCs are {foo}, {caller}, {baz} but if we forward the store to *t in caller into the load in foo (and are smart about the null check) we get: void foo(fnptr* bar) { if (bar) baz(); } void baz() { foo(null); } void caller() { fnptr *t = malloc(); *t = baz; foo(t); } // RefGraph is // foo -> baz // baz -> foo // caller -> foo // caller -> baz and now the RefSCCs are {foo, baz}, {caller} But again, I think this is fine since nothing upstream does this at this time; and we can cross that bridge when we come to it. > 2. What is the intended behavior of CGSCC passes when SCC's are split or merged? E.g. a CGSCC pass runs on an SCC (e.g. > the inliner). Now we run some function passes nested inside the CGSCC pass manager (e.g. to simplify things after > inlining). Consider: > > a) These function passes are e.g. now able to devirtualize a call, adding an edge to the CG, forming a larger CG SCC. Do > you re-run the CGSCC pass (say, the inliner) on this larger SCC? > > b) These function passes are e.g. able to DCE a call, removing an edge from the CG. This converts, say, a CG SCC which > is a cycle graph (like a->b->c->a) into a path graph (a->b->c, with no edge back to a). The inliner had already visited > a, b, and c as a single SCC. Now does it have to re-visit c, then b, then a, as single-node SCC's? Okay, so this is the same question I wrote above: "At what granularity are we modeling these things?", phrased more eloquently. :) > One way that I have found it useful to think about this is in terms of the visitation during Tarjan's SCC algorithm. > I'll reference the pseudocode in https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm. > Inside the "strongconnect" routine when we have identified an SCC (the true branch of `if (v.lowlink = v.index)` test ) > we can visit stack[v.index:stack.size()] as an SCC. This may or may not invalidate some things on the stack (the > variable `S` in the pseudocode) and we may need to fix it up (e.g. inlining deleted a function, so we can't have an > entry on the stack). Then, we can run function passes as we pop individual functions off the stack, but it is easier to > think about IMO than merging of SCC data structures: if we add edges to the CG then we have to do more DFS on the new > edges and if we delete edges then the DFS order of the stack gives us certain guarantees. I'm not sure how this will be easier E.g. consider X -> A A -> B B -> C C -> B B -> A Now you're going to pop A,B,C together, as an SCC; and you start optimizing them, and the edge from B to A falls out. Don't you have the same problem now (i.e. we've removed an edge from an SCC we're iterating over)? Perhaps I misunderstood your scheme? -- Sanjoy
Mehdi Amini via llvm-dev
2016-Jun-08 18:58 UTC
[llvm-dev] Intended behavior of CGSCC pass manager.
> On Jun 8, 2016, at 9:32 AM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > > From: "Sean Silva via llvm-dev" <llvm-dev at lists.llvm.org> > To: "llvm-dev" <llvm-dev at lists.llvm.org> > Sent: Wednesday, June 8, 2016 6:19:03 AM > Subject: [llvm-dev] Intended behavior of CGSCC pass manager. > > Hi Chandler, Philip, Mehdi, (and llvm-dev,) > > (this is partially a summary of some discussions that happened at the last LLVM bay area social, and partially a discussion about the direction of the CGSCC pass manager) > > > A the last LLVM social we discussed the progress on the CGSCC pass manager. It seems like Chandler has a CGSCC pass manager working, but it is still unresolved exactly which semantics we want (more about this below) that are reasonably implementable. > > AFAICT, there has been no public discussion about what exact semantics we ultimately want to have. We should figure that out. > > The main difficulty which Chandler described is the apparently quite complex logic surrounding needing to run function passes nested within an SCC pass manager, while providing some guarantees about exactly what order the function passes are run. The existing CGSCC pass manager just punts on some of the problems that arise (look in CGPassManager::runOnModule, CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that Chandler has been trying to solve. > > ( > Why is this "function passes inside CGSCC passes" stuff interesting? Because LLVM can do inlining on an SCC (often just a single function) and then run function passes to simplify the function(s) in the SCC before it tries to inline into a parent SCC. (the SCC visitation order is post-order) > For example, we may inline a bunch of code, but after inlining we can tremendously simplify the function, and we want to do so before considering this function for inlining into its callers so that we get an accurate evaluation of the inline cost. > Based on what Chandler said, it seems that LLVM is fairly unique in this regard and other compilers don't do this (which is why we can't just look at how other compilers solve this problem; they don't have this problem (maybe they should? or maybe we shouldn't?)). For example, he described that GCC uses different inlining "phases"; e.g. it does early inlining on the entire module, then does simplifications on the entire module, then does late inlining on the entire module; so it is not able to incrementally simplify as it inlines like LLVM does. > This incremental simplification is an important feature of our inliner, and one we should endeavor to keep. We might also want different phases at some point (e.g. a top-down and a bottom-up phase), but that's another story. > ) > > As background for what is below, the LazyCallGraph tracks two graphs: the "call graph" and the "ref graph". > Conceptually, the call graph is the graph of direct calls, where indirect calls and calls to external functions do not appear (or are connected to dummy nodes). The ref graph is basically the graph of all functions transitively accessible based on the globals/constants/etc. referenced by a function (e.g. if a function `foo` references a vtable that is defined in the module, there is an edge in the ref graph from `foo` to every function in the vtable). > The call graph is a strict subset of the ref graph. > > Chandler described that he had a major breakthrough in that the CGSCC pass manager only had to deal with 3 classes of modifications that can occur: > - a pass may e.g. propagate a load of a function pointer into an indirect call, turning it into an direct call. This requires adding an edge in the CG but not in the ref graph. > - a pass may take a direct call and turn it into an indirect call. This requires removing an edge from the CG, but not in the ref graph. > - a pass may delete a direct call. This removes an edge in the CG and also in the ref graph. > > From the perspective of the CGSCC pass manager, these operations can affect the SCC structure. Adding an edge might merge SCC's and deleting an edge might split SCC's. Chandler mentioned that apparently the issues of splitting and merging SCC's within the current infrastructure are actually quite challenging and lead to e.g. iterator invalidation issues, and that is what he is working on. > > ( > The ref graph is important to guide the overall SCC visitation order because it basically represents "the largest graph that the CG may turn into due to our static analysis of this module". I.e. no transformation we can statically make in the CGSCC passes can ever cause us to need to merge SCC's in the ref graph. > ) > > > > I have a couple overall questions/concerns: > > > 1. The ref graph can easily go quadratic. E.g. > > typedef void (*fp)(); > fp funcs[] = { > &foo1, > &foo2, > ... > &fooN > } > void foo1() { funcs[something](); } > void foo2() { funcs[something](); } > ... > void fooN() { funcs[something](); } > > One real-world case where this might come about is in the presence of vtables. > > The existing CGSCC pass manager does not have this issue AFAIK because it does not consider the ref graph. > > Does anybody have any info/experience about how densely connected the ref graph can get in programs that might reasonably be fed to the compiler? > I just did a quick sanity check with LLD/ELF using `--lto-newpm-passes=cgscc(no-op-cgscc)` and it at least seemed to terminate / not run out of memory. Based on some rough calculations looking at the profile, it seem like the entire run of the inliner in the old LTO pipeline (which is about 5% of total LTO time on this particular example I looked at) is only 2-3x as expensive as just `--lto-newpm-passes=cgscc(no-op-cgscc)`, so the LazyCallGraph construction is certainly not free. > > > 2. What is the intended behavior of CGSCC passes when SCC's are split or merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run some function passes nested inside the CGSCC pass manager (e.g. to simplify things after inlining). Consider: > This is not how I thought the current scheme worked ;) -- I was under the impression that we had a call graph with conservatively-connected dummy nodes for external/indirect functions.The fact that we have a separate nodes for calling into an external function and "being called" from an external function, these don't form SCC. So it means we can end up merging SCC IIUC. -- Mehdi> As a result, there is no semantics-preserving optimization that will merge SCCs, only split them. In that case, I'd expect that once an SCC is split, we re-run the CGSCC passes over the newly-separated SCCs. But this corresponds to running over the "ref graph", as you describe it. I don't understand why we want the non-conservative graph. > > > a) These function passes are e.g. now able to devirtualize a call, adding an edge to the CG, forming a larger CG SCC. Do you re-run the CGSCC pass (say, the inliner) on this larger SCC? > > b) These function passes are e.g. able to DCE a call, removing an edge from the CG. This converts, say, a CG SCC which is a cycle graph (like a->b->c->a) into a path graph (a->b->c, with no edge back to a). The inliner had already visited a, b, and c as a single SCC. Now does it have to re-visit c, then b, then a, as single-node SCC's? > > > btw: > > One way that I have found it useful to think about this is in terms of the visitation during Tarjan's SCC algorithm. I'll reference the pseudocode in https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm <https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm>. Inside the "strongconnect" routine when we have identified an SCC (the true branch of `if (v.lowlink = v.index)` test ) we can visit stack[v.index:stack.size()] as an SCC. This may or may not invalidate some things on the stack (the variable `S` in the pseudocode) and we may need to fix it up (e.g. inlining deleted a function, so we can't have an entry on the stack). Then, we can run function passes as we pop individual functions off the stack, but it is easier to think about IMO than merging of SCC data structures: if we add edges to the CG then we have to do more DFS on the new edges and if we delete edges then the DFS order of the stack gives us certain guarantees. > Personally I find this much easier to reason about than the description in terms of splitting and merging SCC's in the CG and ref graph (which the LazyCallGraph API makes one to think about since it hides the underlying Tarjan's algorithm). > The LazyCallGraph API makes the current loop in http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100 <http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100> very clean, but at least for my thinking about the problem, it seems like the wrong abstraction (and most of the LazyCallGraph API seems to be unused, so it seems like it may be overly heavyweight). > E.g. I think that maybe the easiest thing to do is to turn the current approach inside out: instead of having the pass manager logic be the "normal code" and forcing the Tarjan algorithm to become a state machine of iterators, use an open-coded Tarjan algorithm with some callbacks and make the pass management logic be the state machine. > This will also open the door to avoiding the potentially quadratic size of the ref graph, since e.g. in the example I gave above, we can mark the `funcs` array itself as already having been visited during the walk. In the current LazyCallGraph, this would require adding some sort of notion of hyperedge. > FWIW, I see no purpose in abstracting one algorithm, especially if that makes things algorithmically harder. Also, the LazyCallGraph abstraction and the iterator abstraction seem like separate issues. Iterator abstractions are often useful because you can use them in generic algorithms, etc. > > Since this is such a high priority (due to blocking PGO inlining), I will probably try my hand at implementing the CGSCC pass manager sometime soon unless somebody beats me to it. (I'll probably try the "open-coded SCC visit" approach). > > Another possibility is implementing the new CGSCC pass manager that uses the same visitation semantics as the one in the old PM, and then we can refactor that as needed. In fact, that may be the best approach so that porting to the new PM is as NFC as possible and we can isolate the functional (i.e., need benchmarks, measurements ...) changes in separate commits. > I'm in favor of this approach for exactly the reason you mention. Being able to bisect regressions to the algorithmic change, separate from the infrastructure change, will likely make things easier in the long run (and will avoid the problem, to the extent possible, of performance regressions blocking the pass-manager work). > > Sorry for the wall of text. > No problem; much appreciated. > > -Hal > > > -- Sean Silva > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev> > > > > -- > Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160608/e6970d41/attachment-0001.html>
Xinliang David Li via llvm-dev
2016-Jun-08 19:31 UTC
[llvm-dev] Intended behavior of CGSCC pass manager.
On Wed, Jun 8, 2016 at 4:19 AM, Sean Silva <chisophugis at gmail.com> wrote:> Hi Chandler, Philip, Mehdi, (and llvm-dev,) > > (this is partially a summary of some discussions that happened at the last > LLVM bay area social, and partially a discussion about the direction of the > CGSCC pass manager) > > > A the last LLVM social we discussed the progress on the CGSCC pass > manager. It seems like Chandler has a CGSCC pass manager working, but it is > still unresolved exactly which semantics we want (more about this below) > that are reasonably implementable. > > AFAICT, there has been no public discussion about what exact semantics we > ultimately want to have. We should figure that out. > > The main difficulty which Chandler described is the apparently quite > complex logic surrounding needing to run function passes nested within an > SCC pass manager, while providing some guarantees about exactly what order > the function passes are run. The existing CGSCC pass manager just punts on > some of the problems that arise (look in CGPassManager::runOnModule, > CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in > llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that > Chandler has been trying to solve. > > ( > Why is this "function passes inside CGSCC passes" stuff interesting? > Because LLVM can do inlining on an SCC (often just a single function) and > then run function passes to simplify the function(s) in the SCC before it > tries to inline into a parent SCC. (the SCC visitation order is post-order) > For example, we may inline a bunch of code, but after inlining we can > tremendously simplify the function, and we want to do so before considering > this function for inlining into its callers so that we get an accurate > evaluation of the inline cost. > Based on what Chandler said, it seems that LLVM is fairly unique in this > regard and other compilers don't do this (which is why we can't just look > at how other compilers solve this problem; they don't have this problem > (maybe they should? or maybe we shouldn't?)). For example, he described > that GCC uses different inlining "phases"; e.g. it does early inlining on > the entire module, then does simplifications on the entire module, then > does late inlining on the entire module; so it is not able to incrementally > simplify as it inlines like LLVM does. > ) > > As background for what is below, the LazyCallGraph tracks two graphs: the > "call graph" and the "ref graph". > Conceptually, the call graph is the graph of direct calls, where indirect > calls and calls to external functions do not appear (or are connected to > dummy nodes). The ref graph is basically the graph of all functions > transitively accessible based on the globals/constants/etc. referenced by a > function (e.g. if a function `foo` references a vtable that is defined in > the module, there is an edge in the ref graph from `foo` to every function > in the vtable). > The call graph is a strict subset of the ref graph. > > Chandler described that he had a major breakthrough in that the CGSCC pass > manager only had to deal with 3 classes of modifications that can occur: > - a pass may e.g. propagate a load of a function pointer into an indirect > call, turning it into an direct call. This requires adding an edge in the > CG but not in the ref graph. > - a pass may take a direct call and turn it into an indirect call. This > requires removing an edge from the CG, but not in the ref graph. > - a pass may delete a direct call. This removes an edge in the CG and also > in the ref graph. > > From the perspective of the CGSCC pass manager, these operations can > affect the SCC structure. Adding an edge might merge SCC's and deleting an > edge might split SCC's. Chandler mentioned that apparently the issues of > splitting and merging SCC's within the current infrastructure are actually > quite challenging and lead to e.g. iterator invalidation issues, and that > is what he is working on. > > ( > The ref graph is important to guide the overall SCC visitation order > because it basically represents "the largest graph that the CG may turn > into due to our static analysis of this module". I.e. no transformation we > can statically make in the CGSCC passes can ever cause us to need to merge > SCC's in the ref graph. > ) > > > > I have a couple overall questions/concerns: > > > 1. The ref graph can easily go quadratic. E.g. > > typedef void (*fp)(); > fp funcs[] = { > &foo1, > &foo2, > ... > &fooN > } > void foo1() { funcs[something](); } > void foo2() { funcs[something](); } > ... > void fooN() { funcs[something](); } > > One real-world case where this might come about is in the presence of > vtables. > > The existing CGSCC pass manager does not have this issue AFAIK because it > does not consider the ref graph. > > Does anybody have any info/experience about how densely connected the ref > graph can get in programs that might reasonably be fed to the compiler? > I just did a quick sanity check with LLD/ELF using > `--lto-newpm-passes=cgscc(no-op-cgscc)` and it at least seemed to terminate > / not run out of memory. Based on some rough calculations looking at the > profile, it seem like the entire run of the inliner in the old LTO pipeline > (which is about 5% of total LTO time on this particular example I looked > at) is only 2-3x as expensive as just > `--lto-newpm-passes=cgscc(no-op-cgscc)`, so the LazyCallGraph construction > is certainly not free. >Conceptually, reference graph should also include variable nodes. With variable nodes introduced, the quadratic behavior mentioned can be avoided.> > > 2. What is the intended behavior of CGSCC passes when SCC's are split or > merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run > some function passes nested inside the CGSCC pass manager (e.g. to simplify > things after inlining). Consider: > > a) These function passes are e.g. now able to devirtualize a call, adding > an edge to the CG, forming a larger CG SCC. Do you re-run the CGSCC pass > (say, the inliner) on this larger SCC? > > b) These function passes are e.g. able to DCE a call, removing an edge > from the CG. This converts, say, a CG SCC which is a cycle graph (like > a->b->c->a) into a path graph (a->b->c, with no edge back to a). The > inliner had already visited a, b, and c as a single SCC. Now does it have > to re-visit c, then b, then a, as single-node SCC's? > > > btw: > > One way that I have found it useful to think about this is in terms of the > visitation during Tarjan's SCC algorithm. I'll reference the pseudocode in > https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm. > Inside the "strongconnect" routine when we have identified an SCC (the true > branch of `if (v.lowlink = v.index)` test ) we can visit > stack[v.index:stack.size()] as an SCC. This may or may not invalidate some > things on the stack (the variable `S` in the pseudocode) and we may need to > fix it up (e.g. inlining deleted a function, so we can't have an entry on > the stack). Then, we can run function passes as we pop individual functions > off the stack, but it is easier to think about IMO than merging of SCC data > structures: if we add edges to the CG then we have to do more DFS on the > new edges and if we delete edges then the DFS order of the stack gives us > certain guarantees. > Personally I find this much easier to reason about than the description in > terms of splitting and merging SCC's in the CG and ref graph (which the > LazyCallGraph API makes one to think about since it hides the underlying > Tarjan's algorithm). > The LazyCallGraph API makes the current loop in > http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100 > very clean, but at least for my thinking about the problem, it seems like > the wrong abstraction (and most of the LazyCallGraph API seems to be > unused, so it seems like it may be overly heavyweight). > E.g. I think that maybe the easiest thing to do is to turn the current > approach inside out: instead of having the pass manager logic be the > "normal code" and forcing the Tarjan algorithm to become a state machine of > iterators, use an open-coded Tarjan algorithm with some callbacks and make > the pass management logic be the state machine. > This will also open the door to avoiding the potentially quadratic size of > the ref graph, since e.g. in the example I gave above, we can mark the > `funcs` array itself as already having been visited during the walk. In the > current LazyCallGraph, this would require adding some sort of notion of > hyperedge. > > Since this is such a high priority (due to blocking PGO inlining), I will > probably try my hand at implementing the CGSCC pass manager sometime soon > unless somebody beats me to it. (I'll probably try the "open-coded SCC > visit" approach). > > Another possibility is implementing the new CGSCC pass manager that uses > the same visitation semantics as the one in the old PM, and then we can > refactor that as needed. In fact, that may be the best approach so that > porting to the new PM is as NFC as possible and we can isolate the > functional (i.e., need benchmarks, measurements ...) changes in separate > commits. > >A very high level comment: why do we need to update callgraph on the fly ? Can we have a more general support of iterative SCC pass invocation? something like: 1) build the callgraph 2) cache the post-order traversal order 3) if the order list is empty -- done 4) traversal: invoke function passes for each function on the order (step 2 or 5). The call graph gets updated on the fly (with new edges, or new nodes for cloned functions) 5) update the function traversal order from new nodes and new edges created in 4) 6) go to step 3). David> > Sorry for the wall of text. > > > -- Sean Silva >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160608/eaea6e09/attachment.html>
Sean Silva via llvm-dev
2016-Jun-08 21:54 UTC
[llvm-dev] Intended behavior of CGSCC pass manager.
On Wed, Jun 8, 2016 at 9:39 AM, Daniel Berlin <dberlin at dberlin.org> wrote:> > > On Wed, Jun 8, 2016 at 4:19 AM, Sean Silva <chisophugis at gmail.com> wrote: > >> Hi Chandler, Philip, Mehdi, (and llvm-dev,) >> >> (this is partially a summary of some discussions that happened at the >> last LLVM bay area social, and partially a discussion about the direction >> of the CGSCC pass manager) >> >> >> A the last LLVM social we discussed the progress on the CGSCC pass >> manager. It seems like Chandler has a CGSCC pass manager working, but it is >> still unresolved exactly which semantics we want (more about this below) >> that are reasonably implementable. >> >> AFAICT, there has been no public discussion about what exact semantics we >> ultimately want to have. We should figure that out. >> >> The main difficulty which Chandler described is the apparently quite >> complex logic surrounding needing to run function passes nested within an >> SCC pass manager, while providing some guarantees about exactly what order >> the function passes are run. The existing CGSCC pass manager just punts on >> some of the problems that arise (look in CGPassManager::runOnModule, >> CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in >> llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that >> Chandler has been trying to solve. >> >> ( >> Why is this "function passes inside CGSCC passes" stuff interesting? >> Because LLVM can do inlining on an SCC (often just a single function) and >> then run function passes to simplify the function(s) in the SCC before it >> tries to inline into a parent SCC. (the SCC visitation order is post-order) >> For example, we may inline a bunch of code, but after inlining we can >> tremendously simplify the function, and we want to do so before considering >> this function for inlining into its callers so that we get an accurate >> evaluation of the inline cost. >> Based on what Chandler said, it seems that LLVM is fairly unique in this >> regard and other compilers don't do this (which is why we can't just look >> at how other compilers solve this problem; they don't have this problem >> (maybe they should? or maybe we shouldn't?)). For example, he described >> that GCC uses different inlining "phases"; e.g. it does early inlining on >> the entire module, then does simplifications on the entire module, then >> does late inlining on the entire module; so it is not able to incrementally >> simplify as it inlines like LLVM does. >> ) >> >> As background for what is below, the LazyCallGraph tracks two graphs: the >> "call graph" and the "ref graph". >> Conceptually, the call graph is the graph of direct calls, where indirect >> calls and calls to external functions do not appear (or are connected to >> dummy nodes). The ref graph is basically the graph of all functions >> transitively accessible based on the globals/constants/etc. referenced by a >> function (e.g. if a function `foo` references a vtable that is defined in >> the module, there is an edge in the ref graph from `foo` to every function >> in the vtable). >> The call graph is a strict subset of the ref graph. >> >> Chandler described that he had a major breakthrough in that the CGSCC >> pass manager only had to deal with 3 classes of modifications that can >> occur: >> - a pass may e.g. propagate a load of a function pointer into an indirect >> call, turning it into an direct call. This requires adding an edge in the >> CG but not in the ref graph. >> - a pass may take a direct call and turn it into an indirect call. This >> requires removing an edge from the CG, but not in the ref graph. >> - a pass may delete a direct call. This removes an edge in the CG and >> also in the ref graph. >> >> From the perspective of the CGSCC pass manager, these operations can >> affect the SCC structure. Adding an edge might merge SCC's and deleting an >> edge might split SCC's. Chandler mentioned that apparently the issues of >> splitting and merging SCC's within the current infrastructure are actually >> quite challenging and lead to e.g. iterator invalidation issues, and that >> is what he is working on. >> >> ( >> The ref graph is important to guide the overall SCC visitation order >> because it basically represents "the largest graph that the CG may turn >> into due to our static analysis of this module". I.e. no transformation we >> can statically make in the CGSCC passes can ever cause us to need to merge >> SCC's in the ref graph. >> ) >> >> >> >> I have a couple overall questions/concerns: >> >> >> 1. The ref graph can easily go quadratic. E.g. >> >> typedef void (*fp)(); >> fp funcs[] = { >> &foo1, >> &foo2, >> ... >> &fooN >> } >> void foo1() { funcs[something](); } >> void foo2() { funcs[something](); } >> ... >> void fooN() { funcs[something](); } >> >> One real-world case where this might come about is in the presence of >> vtables. >> >> The existing CGSCC pass manager does not have this issue AFAIK because it >> does not consider the ref graph. >> >> Does anybody have any info/experience about how densely connected the ref >> graph can get in programs that might reasonably be fed to the compiler? >> > > > I can state that almost all call graphs of compilers include edges for > indirect calls and external functions, so they are already quadratic in > this sense. > > if what you state is correct, and we don't have a conservatively correct > call graph, that would be ... interesting. >Mehdi clarified downthread the situation (at least for me). Now that I look more closely at the comments in https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Analysis/LazyCallGraph.h it intentionally does not model a call graph in this sense. I.e. - in a traditional CG, an edge means something like "at runtime there may be a call from A->B" - in the LazyCallGraph an edge (a "ref edge" as it calls it) represents something like "during optimization of this module, we may discover the existence of a direct call from A->B". There is also a distinguished subgraph of the ref graph (which I think LazyCallGraph calls just the "call graph") which represents the actual direct calls that are present in the module currently. The comments in LazyCallGraph.h are quite good, but the existing CallGraph.h doesn't seem to touch on this up front in its comments in quite the same way, but it does at least say that it models with two external nodes like Mehdi mentioned: https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Analysis/CallGraph.h#L30 So its edges don't represent "at runtime there may be a call from A->B". But since it doesn't maintain a "ref graph" I'm not sure what the edges exactly represent. -- Sean Silva> The solution to most issues (large sccs, etc) that exist here that most > other compilers take is to try to make the call graph more precise, not to > avoid indirect/external calls in the call graph. > > In turn, this means the solution often take is to not have two graphs at > all. > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160608/f812285c/attachment.html>
Sean Silva via llvm-dev
2016-Jun-08 22:22 UTC
[llvm-dev] Intended behavior of CGSCC pass manager.
On Wed, Jun 8, 2016 at 10:57 AM, Xinliang David Li <davidxl at google.com> wrote:> > > On Wed, Jun 8, 2016 at 4:19 AM, Sean Silva <chisophugis at gmail.com> wrote: > >> Hi Chandler, Philip, Mehdi, (and llvm-dev,) >> >> (this is partially a summary of some discussions that happened at the >> last LLVM bay area social, and partially a discussion about the direction >> of the CGSCC pass manager) >> >> >> A the last LLVM social we discussed the progress on the CGSCC pass >> manager. It seems like Chandler has a CGSCC pass manager working, but it is >> still unresolved exactly which semantics we want (more about this below) >> that are reasonably implementable. >> >> AFAICT, there has been no public discussion about what exact semantics we >> ultimately want to have. We should figure that out. >> >> The main difficulty which Chandler described is the apparently quite >> complex logic surrounding needing to run function passes nested within an >> SCC pass manager, while providing some guarantees about exactly what order >> the function passes are run. The existing CGSCC pass manager just punts on >> some of the problems that arise (look in CGPassManager::runOnModule, >> CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in >> llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that >> Chandler has been trying to solve. >> >> ( >> Why is this "function passes inside CGSCC passes" stuff interesting? >> Because LLVM can do inlining on an SCC (often just a single function) and >> then run function passes to simplify the function(s) in the SCC before it >> tries to inline into a parent SCC. (the SCC visitation order is post-order) >> For example, we may inline a bunch of code, but after inlining we can >> tremendously simplify the function, and we want to do so before considering >> this function for inlining into its callers so that we get an accurate >> evaluation of the inline cost. >> Based on what Chandler said, it seems that LLVM is fairly unique in this >> regard and other compilers don't do this (which is why we can't just look >> at how other compilers solve this problem; they don't have this problem >> (maybe they should? or maybe we shouldn't?)). For example, he described >> that GCC uses different inlining "phases"; e.g. it does early inlining on >> the entire module, then does simplifications on the entire module, then >> does late inlining on the entire module; so it is not able to incrementally >> simplify as it inlines like LLVM does. >> ) >> >> > I want to clarify a little more on GCC's behavior. GCC has two types of > ipa_passes. One called simple_ipa pass and the other is called regular > ipa_pass. A simple IPA pass does not transform IR by itself, but it can > have function level sub-passes that does transformation. If it has function > sub-passes, the function passes will be executed in bottom up order just > like LLVM's CGSCC pass. A regular ipa pass is somewhat like the module > pass in LLVM. > > GCC's pass pipeline is actually similar to LLVM's pipeline overall: it has > the following components: > > (1) lowering passes > (2) small/simple IPA passes including the early optimization pipeline > (bottom-up) > (3) full ipa pipelines > (4) post-ipa optimization and code generation. > > The difference is that LLVM's (2) includes only inst-combine and simplfy > CFG, but GCC's (2) is larger which happens also includes an early inlining > pass. >So the early inlining is a function pass? In LLVM I think that the function pass contract would not allow inlining to occur in a function pass (the contract is theoretically designed to allow function passes to run concurrently on functions within a module).> GCC's (2) is similar to LLVM's CGSCC pass but with fewer passes. The > inliner is also targeting tiny functions that reduces size overall. > LLVM's LTO pipeline is kind of like this too. > > GCC's (3) includes a full blown regular inlining pass. This inliner uses > the priority order based algorithm, so bottom-up traversal does not make > sense. It requires (2) to have pass to collect summary to drive the > decisions (as well as profile data). > >Thanks for this clarification; it is always good to know what other compilers do for comparison. -- Sean Silva> David > > > > >> As background for what is below, the LazyCallGraph tracks two graphs: the >> "call graph" and the "ref graph". >> Conceptually, the call graph is the graph of direct calls, where indirect >> calls and calls to external functions do not appear (or are connected to >> dummy nodes). The ref graph is basically the graph of all functions >> transitively accessible based on the globals/constants/etc. referenced by a >> function (e.g. if a function `foo` references a vtable that is defined in >> the module, there is an edge in the ref graph from `foo` to every function >> in the vtable). >> The call graph is a strict subset of the ref graph. >> >> Chandler described that he had a major breakthrough in that the CGSCC >> pass manager only had to deal with 3 classes of modifications that can >> occur: >> - a pass may e.g. propagate a load of a function pointer into an indirect >> call, turning it into an direct call. This requires adding an edge in the >> CG but not in the ref graph. >> - a pass may take a direct call and turn it into an indirect call. This >> requires removing an edge from the CG, but not in the ref graph. >> - a pass may delete a direct call. This removes an edge in the CG and >> also in the ref graph. >> >> From the perspective of the CGSCC pass manager, these operations can >> affect the SCC structure. Adding an edge might merge SCC's and deleting an >> edge might split SCC's. Chandler mentioned that apparently the issues of >> splitting and merging SCC's within the current infrastructure are actually >> quite challenging and lead to e.g. iterator invalidation issues, and that >> is what he is working on. >> >> ( >> The ref graph is important to guide the overall SCC visitation order >> because it basically represents "the largest graph that the CG may turn >> into due to our static analysis of this module". I.e. no transformation we >> can statically make in the CGSCC passes can ever cause us to need to merge >> SCC's in the ref graph. >> ) >> >> >> >> I have a couple overall questions/concerns: >> >> >> 1. The ref graph can easily go quadratic. E.g. >> >> typedef void (*fp)(); >> fp funcs[] = { >> &foo1, >> &foo2, >> ... >> &fooN >> } >> void foo1() { funcs[something](); } >> void foo2() { funcs[something](); } >> ... >> void fooN() { funcs[something](); } >> >> One real-world case where this might come about is in the presence of >> vtables. >> >> The existing CGSCC pass manager does not have this issue AFAIK because it >> does not consider the ref graph. >> >> Does anybody have any info/experience about how densely connected the ref >> graph can get in programs that might reasonably be fed to the compiler? >> I just did a quick sanity check with LLD/ELF using >> `--lto-newpm-passes=cgscc(no-op-cgscc)` and it at least seemed to terminate >> / not run out of memory. Based on some rough calculations looking at the >> profile, it seem like the entire run of the inliner in the old LTO pipeline >> (which is about 5% of total LTO time on this particular example I looked >> at) is only 2-3x as expensive as just >> `--lto-newpm-passes=cgscc(no-op-cgscc)`, so the LazyCallGraph construction >> is certainly not free. >> >> >> 2. What is the intended behavior of CGSCC passes when SCC's are split or >> merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run >> some function passes nested inside the CGSCC pass manager (e.g. to simplify >> things after inlining). Consider: >> >> a) These function passes are e.g. now able to devirtualize a call, adding >> an edge to the CG, forming a larger CG SCC. Do you re-run the CGSCC pass >> (say, the inliner) on this larger SCC? >> >> b) These function passes are e.g. able to DCE a call, removing an edge >> from the CG. This converts, say, a CG SCC which is a cycle graph (like >> a->b->c->a) into a path graph (a->b->c, with no edge back to a). The >> inliner had already visited a, b, and c as a single SCC. Now does it have >> to re-visit c, then b, then a, as single-node SCC's? >> >> >> btw: >> >> One way that I have found it useful to think about this is in terms of >> the visitation during Tarjan's SCC algorithm. I'll reference the pseudocode >> in >> https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm. >> Inside the "strongconnect" routine when we have identified an SCC (the true >> branch of `if (v.lowlink = v.index)` test ) we can visit >> stack[v.index:stack.size()] as an SCC. This may or may not invalidate some >> things on the stack (the variable `S` in the pseudocode) and we may need to >> fix it up (e.g. inlining deleted a function, so we can't have an entry on >> the stack). Then, we can run function passes as we pop individual functions >> off the stack, but it is easier to think about IMO than merging of SCC data >> structures: if we add edges to the CG then we have to do more DFS on the >> new edges and if we delete edges then the DFS order of the stack gives us >> certain guarantees. >> Personally I find this much easier to reason about than the description >> in terms of splitting and merging SCC's in the CG and ref graph (which the >> LazyCallGraph API makes one to think about since it hides the underlying >> Tarjan's algorithm). >> The LazyCallGraph API makes the current loop in >> http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100 >> very clean, but at least for my thinking about the problem, it seems like >> the wrong abstraction (and most of the LazyCallGraph API seems to be >> unused, so it seems like it may be overly heavyweight). >> E.g. I think that maybe the easiest thing to do is to turn the current >> approach inside out: instead of having the pass manager logic be the >> "normal code" and forcing the Tarjan algorithm to become a state machine of >> iterators, use an open-coded Tarjan algorithm with some callbacks and make >> the pass management logic be the state machine. >> This will also open the door to avoiding the potentially quadratic size >> of the ref graph, since e.g. in the example I gave above, we can mark the >> `funcs` array itself as already having been visited during the walk. In the >> current LazyCallGraph, this would require adding some sort of notion of >> hyperedge. >> >> Since this is such a high priority (due to blocking PGO inlining), I will >> probably try my hand at implementing the CGSCC pass manager sometime soon >> unless somebody beats me to it. (I'll probably try the "open-coded SCC >> visit" approach). >> >> Another possibility is implementing the new CGSCC pass manager that uses >> the same visitation semantics as the one in the old PM, and then we can >> refactor that as needed. In fact, that may be the best approach so that >> porting to the new PM is as NFC as possible and we can isolate the >> functional (i.e., need benchmarks, measurements ...) changes in separate >> commits. >> >> >> Sorry for the wall of text. >> >> >> -- Sean Silva >> > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160608/77063a75/attachment-0001.html>
Sean Silva via llvm-dev
2016-Jun-08 23:36 UTC
[llvm-dev] Intended behavior of CGSCC pass manager.
On Wed, Jun 8, 2016 at 9:32 AM, Hal Finkel <hfinkel at anl.gov> wrote:> > ------------------------------ > > *From: *"Sean Silva via llvm-dev" <llvm-dev at lists.llvm.org> > *To: *"llvm-dev" <llvm-dev at lists.llvm.org> > *Sent: *Wednesday, June 8, 2016 6:19:03 AM > *Subject: *[llvm-dev] Intended behavior of CGSCC pass manager. > > Hi Chandler, Philip, Mehdi, (and llvm-dev,) > > (this is partially a summary of some discussions that happened at the last > LLVM bay area social, and partially a discussion about the direction of the > CGSCC pass manager) > > > A the last LLVM social we discussed the progress on the CGSCC pass > manager. It seems like Chandler has a CGSCC pass manager working, but it is > still unresolved exactly which semantics we want (more about this below) > that are reasonably implementable. > > AFAICT, there has been no public discussion about what exact semantics we > ultimately want to have. We should figure that out. > > The main difficulty which Chandler described is the apparently quite > complex logic surrounding needing to run function passes nested within an > SCC pass manager, while providing some guarantees about exactly what order > the function passes are run. The existing CGSCC pass manager just punts on > some of the problems that arise (look in CGPassManager::runOnModule, > CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in > llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that > Chandler has been trying to solve. > > ( > Why is this "function passes inside CGSCC passes" stuff interesting? > Because LLVM can do inlining on an SCC (often just a single function) and > then run function passes to simplify the function(s) in the SCC before it > tries to inline into a parent SCC. (the SCC visitation order is post-order) > For example, we may inline a bunch of code, but after inlining we can > tremendously simplify the function, and we want to do so before considering > this function for inlining into its callers so that we get an accurate > evaluation of the inline cost. > Based on what Chandler said, it seems that LLVM is fairly unique in this > regard and other compilers don't do this (which is why we can't just look > at how other compilers solve this problem; they don't have this problem > (maybe they should? or maybe we shouldn't?)). For example, he described > that GCC uses different inlining "phases"; e.g. it does early inlining on > the entire module, then does simplifications on the entire module, then > does late inlining on the entire module; so it is not able to incrementally > simplify as it inlines like LLVM does. > > This incremental simplification is an important feature of our inliner, > and one we should endeavor to keep. We might also want different phases at > some point (e.g. a top-down and a bottom-up phase), but that's another > story. > > ) > > As background for what is below, the LazyCallGraph tracks two graphs: the > "call graph" and the "ref graph". > Conceptually, the call graph is the graph of direct calls, where indirect > calls and calls to external functions do not appear (or are connected to > dummy nodes). The ref graph is basically the graph of all functions > transitively accessible based on the globals/constants/etc. referenced by a > function (e.g. if a function `foo` references a vtable that is defined in > the module, there is an edge in the ref graph from `foo` to every function > in the vtable). > The call graph is a strict subset of the ref graph. > > Chandler described that he had a major breakthrough in that the CGSCC pass > manager only had to deal with 3 classes of modifications that can occur: > - a pass may e.g. propagate a load of a function pointer into an indirect > call, turning it into an direct call. This requires adding an edge in the > CG but not in the ref graph. > - a pass may take a direct call and turn it into an indirect call. This > requires removing an edge from the CG, but not in the ref graph. > - a pass may delete a direct call. This removes an edge in the CG and also > in the ref graph. > > From the perspective of the CGSCC pass manager, these operations can > affect the SCC structure. Adding an edge might merge SCC's and deleting an > edge might split SCC's. Chandler mentioned that apparently the issues of > splitting and merging SCC's within the current infrastructure are actually > quite challenging and lead to e.g. iterator invalidation issues, and that > is what he is working on. > > ( > The ref graph is important to guide the overall SCC visitation order > because it basically represents "the largest graph that the CG may turn > into due to our static analysis of this module". I.e. no transformation we > can statically make in the CGSCC passes can ever cause us to need to merge > SCC's in the ref graph. > ) > > > > I have a couple overall questions/concerns: > > > 1. The ref graph can easily go quadratic. E.g. > > typedef void (*fp)(); > fp funcs[] = { > &foo1, > &foo2, > ... > &fooN > } > void foo1() { funcs[something](); } > void foo2() { funcs[something](); } > ... > void fooN() { funcs[something](); } > > One real-world case where this might come about is in the presence of > vtables. > > The existing CGSCC pass manager does not have this issue AFAIK because it > does not consider the ref graph. > > Does anybody have any info/experience about how densely connected the ref > graph can get in programs that might reasonably be fed to the compiler? > I just did a quick sanity check with LLD/ELF using > `--lto-newpm-passes=cgscc(no-op-cgscc)` and it at least seemed to terminate > / not run out of memory. Based on some rough calculations looking at the > profile, it seem like the entire run of the inliner in the old LTO pipeline > (which is about 5% of total LTO time on this particular example I looked > at) is only 2-3x as expensive as just > `--lto-newpm-passes=cgscc(no-op-cgscc)`, so the LazyCallGraph construction > is certainly not free. > > > 2. What is the intended behavior of CGSCC passes when SCC's are split or > merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run > some function passes nested inside the CGSCC pass manager (e.g. to simplify > things after inlining). Consider: > > This is not how I thought the current scheme worked ;) -- I was under the > impression that we had a call graph with conservatively-connected dummy > nodes for external/indirect functions. As a result, there is no > semantics-preserving optimization that will merge SCCs, only split them. In > that case, I'd expect that once an SCC is split, we re-run the CGSCC passes > over the newly-separated SCCs. But this corresponds to running over the > "ref graph", as you describe it. I don't understand why we want the > non-conservative graph. > > > a) These function passes are e.g. now able to devirtualize a call, adding > an edge to the CG, forming a larger CG SCC. Do you re-run the CGSCC pass > (say, the inliner) on this larger SCC? > > b) These function passes are e.g. able to DCE a call, removing an edge > from the CG. This converts, say, a CG SCC which is a cycle graph (like > a->b->c->a) into a path graph (a->b->c, with no edge back to a). The > inliner had already visited a, b, and c as a single SCC. Now does it have > to re-visit c, then b, then a, as single-node SCC's? > > > btw: > > One way that I have found it useful to think about this is in terms of the > visitation during Tarjan's SCC algorithm. I'll reference the pseudocode in > https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm. > Inside the "strongconnect" routine when we have identified an SCC (the true > branch of `if (v.lowlink = v.index)` test ) we can visit > stack[v.index:stack.size()] as an SCC. This may or may not invalidate some > things on the stack (the variable `S` in the pseudocode) and we may need to > fix it up (e.g. inlining deleted a function, so we can't have an entry on > the stack). Then, we can run function passes as we pop individual functions > off the stack, but it is easier to think about IMO than merging of SCC data > structures: if we add edges to the CG then we have to do more DFS on the > new edges and if we delete edges then the DFS order of the stack gives us > certain guarantees. > Personally I find this much easier to reason about than the description in > terms of splitting and merging SCC's in the CG and ref graph (which the > LazyCallGraph API makes one to think about since it hides the underlying > Tarjan's algorithm). > The LazyCallGraph API makes the current loop in > http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100 > very clean, but at least for my thinking about the problem, it seems like > the wrong abstraction (and most of the LazyCallGraph API seems to be > unused, so it seems like it may be overly heavyweight). > E.g. I think that maybe the easiest thing to do is to turn the current > approach inside out: instead of having the pass manager logic be the > "normal code" and forcing the Tarjan algorithm to become a state machine of > iterators, use an open-coded Tarjan algorithm with some callbacks and make > the pass management logic be the state machine. > This will also open the door to avoiding the potentially quadratic size of > the ref graph, since e.g. in the example I gave above, we can mark the > `funcs` array itself as already having been visited during the walk. In the > current LazyCallGraph, this would require adding some sort of notion of > hyperedge. > > FWIW, I see no purpose in abstracting one algorithm, especially if that > makes things algorithmically harder. Also, the LazyCallGraph abstraction > and the iterator abstraction seem like separate issues. Iterator > abstractions are often useful because you can use them in generic > algorithms, etc. > > > Since this is such a high priority (due to blocking PGO inlining), I will > probably try my hand at implementing the CGSCC pass manager sometime soon > unless somebody beats me to it. (I'll probably try the "open-coded SCC > visit" approach). > > Another possibility is implementing the new CGSCC pass manager that uses > the same visitation semantics as the one in the old PM, and then we can > refactor that as needed. In fact, that may be the best approach so that > porting to the new PM is as NFC as possible and we can isolate the > functional (i.e., need benchmarks, measurements ...) changes in separate > commits. > > I'm in favor of this approach for exactly the reason you mention. Being > able to bisect regressions to the algorithmic change, separate from the > infrastructure change, will likely make things easier in the long run (and > will avoid the problem, to the extent possible, of performance regressions > blocking the pass-manager work). >Yeah. Last night after I went home I was thinking to myself that this really seems like the obvious path forward. I will start implementing this. -- Sean Silva> > > Sorry for the wall of text. > > No problem; much appreciated. > > -Hal > > > -- Sean Silva > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > > > > > -- > Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160608/7b29ab64/attachment-0001.html>
Sean Silva via llvm-dev
2016-Jun-08 23:38 UTC
[llvm-dev] Intended behavior of CGSCC pass manager.
On Wed, Jun 8, 2016 at 12:31 PM, Xinliang David Li <davidxl at google.com> wrote:> > > On Wed, Jun 8, 2016 at 4:19 AM, Sean Silva <chisophugis at gmail.com> wrote: > >> Hi Chandler, Philip, Mehdi, (and llvm-dev,) >> >> (this is partially a summary of some discussions that happened at the >> last LLVM bay area social, and partially a discussion about the direction >> of the CGSCC pass manager) >> >> >> A the last LLVM social we discussed the progress on the CGSCC pass >> manager. It seems like Chandler has a CGSCC pass manager working, but it is >> still unresolved exactly which semantics we want (more about this below) >> that are reasonably implementable. >> >> AFAICT, there has been no public discussion about what exact semantics we >> ultimately want to have. We should figure that out. >> >> The main difficulty which Chandler described is the apparently quite >> complex logic surrounding needing to run function passes nested within an >> SCC pass manager, while providing some guarantees about exactly what order >> the function passes are run. The existing CGSCC pass manager just punts on >> some of the problems that arise (look in CGPassManager::runOnModule, >> CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in >> llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that >> Chandler has been trying to solve. >> >> ( >> Why is this "function passes inside CGSCC passes" stuff interesting? >> Because LLVM can do inlining on an SCC (often just a single function) and >> then run function passes to simplify the function(s) in the SCC before it >> tries to inline into a parent SCC. (the SCC visitation order is post-order) >> For example, we may inline a bunch of code, but after inlining we can >> tremendously simplify the function, and we want to do so before considering >> this function for inlining into its callers so that we get an accurate >> evaluation of the inline cost. >> Based on what Chandler said, it seems that LLVM is fairly unique in this >> regard and other compilers don't do this (which is why we can't just look >> at how other compilers solve this problem; they don't have this problem >> (maybe they should? or maybe we shouldn't?)). For example, he described >> that GCC uses different inlining "phases"; e.g. it does early inlining on >> the entire module, then does simplifications on the entire module, then >> does late inlining on the entire module; so it is not able to incrementally >> simplify as it inlines like LLVM does. >> ) >> >> As background for what is below, the LazyCallGraph tracks two graphs: the >> "call graph" and the "ref graph". >> Conceptually, the call graph is the graph of direct calls, where indirect >> calls and calls to external functions do not appear (or are connected to >> dummy nodes). The ref graph is basically the graph of all functions >> transitively accessible based on the globals/constants/etc. referenced by a >> function (e.g. if a function `foo` references a vtable that is defined in >> the module, there is an edge in the ref graph from `foo` to every function >> in the vtable). >> The call graph is a strict subset of the ref graph. >> >> Chandler described that he had a major breakthrough in that the CGSCC >> pass manager only had to deal with 3 classes of modifications that can >> occur: >> - a pass may e.g. propagate a load of a function pointer into an indirect >> call, turning it into an direct call. This requires adding an edge in the >> CG but not in the ref graph. >> - a pass may take a direct call and turn it into an indirect call. This >> requires removing an edge from the CG, but not in the ref graph. >> - a pass may delete a direct call. This removes an edge in the CG and >> also in the ref graph. >> >> From the perspective of the CGSCC pass manager, these operations can >> affect the SCC structure. Adding an edge might merge SCC's and deleting an >> edge might split SCC's. Chandler mentioned that apparently the issues of >> splitting and merging SCC's within the current infrastructure are actually >> quite challenging and lead to e.g. iterator invalidation issues, and that >> is what he is working on. >> >> ( >> The ref graph is important to guide the overall SCC visitation order >> because it basically represents "the largest graph that the CG may turn >> into due to our static analysis of this module". I.e. no transformation we >> can statically make in the CGSCC passes can ever cause us to need to merge >> SCC's in the ref graph. >> ) >> >> >> >> I have a couple overall questions/concerns: >> >> >> 1. The ref graph can easily go quadratic. E.g. >> >> typedef void (*fp)(); >> fp funcs[] = { >> &foo1, >> &foo2, >> ... >> &fooN >> } >> void foo1() { funcs[something](); } >> void foo2() { funcs[something](); } >> ... >> void fooN() { funcs[something](); } >> >> One real-world case where this might come about is in the presence of >> vtables. >> >> The existing CGSCC pass manager does not have this issue AFAIK because it >> does not consider the ref graph. >> >> Does anybody have any info/experience about how densely connected the ref >> graph can get in programs that might reasonably be fed to the compiler? >> I just did a quick sanity check with LLD/ELF using >> `--lto-newpm-passes=cgscc(no-op-cgscc)` and it at least seemed to terminate >> / not run out of memory. Based on some rough calculations looking at the >> profile, it seem like the entire run of the inliner in the old LTO pipeline >> (which is about 5% of total LTO time on this particular example I looked >> at) is only 2-3x as expensive as just >> `--lto-newpm-passes=cgscc(no-op-cgscc)`, so the LazyCallGraph construction >> is certainly not free. >> > > > Conceptually, reference graph should also include variable nodes. With > variable nodes introduced, the quadratic behavior mentioned can be avoided. >Yeah, this is what I was trying to get at with the statement "In the current LazyCallGraph, this would require adding some sort of notion of hyperedge." But you are right: from an implementation perspective of a call graph data structure that is trying to model the "ref graph" of LazyCallGraph, it is cleaner to have nodes that are not functions than to introduce a notion of hyperedge.> > > > >> >> >> 2. What is the intended behavior of CGSCC passes when SCC's are split or >> merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run >> some function passes nested inside the CGSCC pass manager (e.g. to simplify >> things after inlining). Consider: >> >> a) These function passes are e.g. now able to devirtualize a call, adding >> an edge to the CG, forming a larger CG SCC. Do you re-run the CGSCC pass >> (say, the inliner) on this larger SCC? >> >> b) These function passes are e.g. able to DCE a call, removing an edge >> from the CG. This converts, say, a CG SCC which is a cycle graph (like >> a->b->c->a) into a path graph (a->b->c, with no edge back to a). The >> inliner had already visited a, b, and c as a single SCC. Now does it have >> to re-visit c, then b, then a, as single-node SCC's? >> >> >> btw: >> >> One way that I have found it useful to think about this is in terms of >> the visitation during Tarjan's SCC algorithm. I'll reference the pseudocode >> in >> https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm. >> Inside the "strongconnect" routine when we have identified an SCC (the true >> branch of `if (v.lowlink = v.index)` test ) we can visit >> stack[v.index:stack.size()] as an SCC. This may or may not invalidate some >> things on the stack (the variable `S` in the pseudocode) and we may need to >> fix it up (e.g. inlining deleted a function, so we can't have an entry on >> the stack). Then, we can run function passes as we pop individual functions >> off the stack, but it is easier to think about IMO than merging of SCC data >> structures: if we add edges to the CG then we have to do more DFS on the >> new edges and if we delete edges then the DFS order of the stack gives us >> certain guarantees. >> Personally I find this much easier to reason about than the description >> in terms of splitting and merging SCC's in the CG and ref graph (which the >> LazyCallGraph API makes one to think about since it hides the underlying >> Tarjan's algorithm). >> The LazyCallGraph API makes the current loop in >> http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100 >> very clean, but at least for my thinking about the problem, it seems like >> the wrong abstraction (and most of the LazyCallGraph API seems to be >> unused, so it seems like it may be overly heavyweight). >> E.g. I think that maybe the easiest thing to do is to turn the current >> approach inside out: instead of having the pass manager logic be the >> "normal code" and forcing the Tarjan algorithm to become a state machine of >> iterators, use an open-coded Tarjan algorithm with some callbacks and make >> the pass management logic be the state machine. >> This will also open the door to avoiding the potentially quadratic size >> of the ref graph, since e.g. in the example I gave above, we can mark the >> `funcs` array itself as already having been visited during the walk. In the >> current LazyCallGraph, this would require adding some sort of notion of >> hyperedge. >> >> Since this is such a high priority (due to blocking PGO inlining), I will >> probably try my hand at implementing the CGSCC pass manager sometime soon >> unless somebody beats me to it. (I'll probably try the "open-coded SCC >> visit" approach). >> >> Another possibility is implementing the new CGSCC pass manager that uses >> the same visitation semantics as the one in the old PM, and then we can >> refactor that as needed. In fact, that may be the best approach so that >> porting to the new PM is as NFC as possible and we can isolate the >> functional (i.e., need benchmarks, measurements ...) changes in separate >> commits. >> >> > > > A very high level comment: why do we need to update callgraph on the fly ? > Can we have a more general support of iterative SCC pass invocation? > > something like: > > 1) build the callgraph > 2) cache the post-order traversal order > > 3) if the order list is empty -- done > 4) traversal: invoke function passes for each function on the order (step > 2 or 5). The call graph gets updated on the fly (with new edges, or new > nodes for cloned functions) > 5) update the function traversal order from new nodes and new edges > created in 4) > 6) go to step 3). >(sorry for the delayed reply... this is a very poignant question / example)>From the discussion with Chandler, I think he wants to provide moreguarantees to function passes about the visitation order. He will need to explain his exact concerns. But IIRC the essence of one of the issues is captured in the example I gave in 2.b) in the OP: "These function passes are e.g. able to DCE a call, removing an edge from the CG. This converts, say, a CG SCC which is a cycle graph (like a->b->c->a) into a path graph (a->b->c, with no edge back to a)." The issue that I remember Chandler brought up is that before deleting an edge from an SCC, the nodes in the SCC are "unordered" w.r.t. each other; but after deleting an edge from the CG which splits the SCC, there is an order. In other words, that the cached post-order may no longer be a post-order traversal of the new graph after a function pass runs. For example, in the cached post-order traversal we may have entered the SCC `a->b->c->a` via an edge `x->b` and so our cached post-order traversal visits the functions in the order `a, then c, then b`. If a function pass visits `c` and deletes the call `c->a`, then "a, then c, then b" is not a valid post-order for visiting the functions. Looking at the explanation I gave, I think there isn't really a problem here. The cached post-order can only be invalidated for functions that have already been visited. Speaking more broadly about the algorithm you just described, did you intend to omit an SCC visitation step? The goal of the CGSCC pass manager is the ability to visit an SCC (e.g. inliner visits an SCC), then immediately run function passes to simplify the result of inlining. Only after the simplification has occurred do we visit the parent SCC. By running the simplifications in lock-step with the inliner SCC visitation we give parent SCC's a more accurate view of the real cost of inlining a function from a child SCC. Issues similar to 2.a) (i.e. adding edges to the CG) also affect the visitation order of SCC's (not just functions). For example, we visit an SCC with a CGSCC pass (e.g. inliner). Then we run the first function pass on that SCC, which may add an edge (e.g. promote indirect call to direct call) that may enlarge the SCC. Do we continue running the remaining function passes? Do we re-visit this new enlarged SCC? (if so, when?) These are the types of questions that motivated this post (hence the name "Intended behavior of CGSCC pass manager."). -- Sean Silva> > David > > > >> >> Sorry for the wall of text. >> >> >> -- Sean Silva >> > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160608/a0b26080/attachment.html>
Sean Silva via llvm-dev
2016-Jun-09 00:35 UTC
[llvm-dev] Intended behavior of CGSCC pass manager.
On Wed, Jun 8, 2016 at 11:14 AM, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:> Hi Sean, > > Sean Silva wrote: > > Hi Chandler, Philip, Mehdi, (and llvm-dev,) > > > > (this is partially a summary of some discussions that happened at the > last LLVM bay area social, and partially a > > discussion about the direction of the CGSCC pass manager) > > Thanks for writing this up! This sort of thing is very helpful for > people who don't attend the social, or had to leave early. :) > > > Chandler described that he had a major breakthrough in that the CGSCC > pass manager only had to deal with 3 classes of > > modifications that can occur: > > - a pass may e.g. propagate a load of a function pointer into an > indirect call, turning it into an direct call. This > > requires adding an edge in the CG but not in the ref graph. > > - a pass may take a direct call and turn it into an indirect call. This > requires removing an edge from the CG, but not > > in the ref graph. > > - a pass may delete a direct call. This removes an edge in the CG and > also in the ref graph. > > At what granularity are we modeling these things? E.g. if SimplifyCFG > deletes a basic block, will we remove call edges that start from that > block? > > Note there is fourth kind of modification that isn't modeled here: > devirtualization, in which we transform IR like > > %fnptr = compute_fnptr(%receiver, i32 <method id>) > call %fptr(%receiver) > > to (via a pass that understands the semantics of compute_fnptr). > > call some.specific.Class::method(%receiver) > > > However, at this time I don't think modeling "out of thin air" > devirtualization of this sort is important, since nothing upstream > does things like this (we do have ModulePasses downstream that do > this). > > > From the perspective of the CGSCC pass manager, these operations can > affect the SCC structure. Adding an edge might > > merge SCC's and deleting an edge might split SCC's. Chandler mentioned > that apparently the issues of splitting and > > merging SCC's within the current infrastructure are actually quite > challenging and lead to e.g. iterator invalidation > > issues, and that is what he is working on. > > > > ( > > The ref graph is important to guide the overall SCC visitation order > because it basically represents "the largest graph > > that the CG may turn into due to our static analysis of this module". > I.e. no transformation we can statically make in > > the CGSCC passes can ever cause us to need to merge SCC's in the ref > graph. > > ) > > Except in the above "out of thin air" devirtualization case. > > I think cross-function store forwarding can also be problematic: > > void foo(fnptr* bar) { > if (bar) > (*bar)(); > } > > void baz() { foo(null); } > > void caller() { > fnptr *t = malloc(); > *t = baz; > foo(t); > } > > // RefGraph is > // caller -> baz > // caller -> foo > // baz -> foo > > Now the RefSCCs are {foo}, {caller}, {baz} but if we forward the store > to *t in caller into the load in foo (and are smart about the null > check) we get: > > void foo(fnptr* bar) { > if (bar) > baz(); > } > > void baz() { foo(null); } > > void caller() { > fnptr *t = malloc(); > *t = baz; > foo(t); > } > > // RefGraph is > // foo -> baz > // baz -> foo > // caller -> foo > // caller -> baz > > and now the RefSCCs are {foo, baz}, {caller} >In both of these examples you gave (out-of-thin-air devirtualization and forwarding into a callee) is the contract of a CGSCC pass being violated? (I believe the contract is that a CGSCC pass cannot invalidate the analyses of a different SCC ( https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Analysis/CGSCCPassManager.h#L104) ) If not, then Chandler's analysis was missing a case (beyond the 3 cases I mentioned which he had convinced himself were the only possible ones).> > But again, I think this is fine since nothing upstream does this at > this time; and we can cross that bridge when we come to it. > > > 2. What is the intended behavior of CGSCC passes when SCC's are split or > merged? E.g. a CGSCC pass runs on an SCC (e.g. > > the inliner). Now we run some function passes nested inside the CGSCC > pass manager (e.g. to simplify things after > > inlining). Consider: > > > > a) These function passes are e.g. now able to devirtualize a call, > adding an edge to the CG, forming a larger CG SCC. Do > > you re-run the CGSCC pass (say, the inliner) on this larger SCC? > > > > b) These function passes are e.g. able to DCE a call, removing an edge > from the CG. This converts, say, a CG SCC which > > is a cycle graph (like a->b->c->a) into a path graph (a->b->c, with no > edge back to a). The inliner had already visited > > a, b, and c as a single SCC. Now does it have to re-visit c, then b, > then a, as single-node SCC's? > > Okay, so this is the same question I wrote above: "At what granularity > are we modeling these things?", phrased more eloquently. :) > > > One way that I have found it useful to think about this is in terms of > the visitation during Tarjan's SCC algorithm. > > I'll reference the pseudocode in > https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm > . > > Inside the "strongconnect" routine when we have identified an SCC (the > true branch of `if (v.lowlink = v.index)` test ) > > we can visit stack[v.index:stack.size()] as an SCC. This may or may not > invalidate some things on the stack (the > > variable `S` in the pseudocode) and we may need to fix it up (e.g. > inlining deleted a function, so we can't have an > > entry on the stack). Then, we can run function passes as we pop > individual functions off the stack, but it is easier to > > think about IMO than merging of SCC data structures: if we add edges to > the CG then we have to do more DFS on the new > > edges and if we delete edges then the DFS order of the stack gives us > certain guarantees. > > I'm not sure how this will be easier E.g. consider > > X -> A > A -> B > B -> C > C -> B > B -> A > > Now you're going to pop A,B,C together, as an SCC; and you start > optimizing them, and the edge from B to A falls out. Don't you have > the same problem now (i.e. we've removed an edge from an SCC we're > iterating over)? Perhaps I misunderstood your scheme?I wasn't really proposing a specific algorithm, but rather a different way to think about the problem and what semantics we really want (hence the thread title) and are implementable. A good example of what I mean is David's example of caching the post-order traversal: thinking in terms of a raw graph algorithm can provide some insights into the actual invariants that are (or can be) maintained. Whereas that example was in terms of just a post-order CG walk, to think about the whole CGSCC pass manager visitation, thinking in terms of something like Tarjan's algorithm seems likely to provide insight about the invariants both on the function order and the SCC order. (btw, http://www.webgraphviz.com/ helped me visualize your graph; handy!) -- Sean Silva> > > -- Sanjoy > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160608/73f857d9/attachment-0001.html>
Sean Silva via llvm-dev
2016-Jul-01 06:46 UTC
[llvm-dev] Intended behavior of CGSCC pass manager.
On Wed, Jun 8, 2016 at 9:32 AM, Hal Finkel <hfinkel at anl.gov> wrote:> > ------------------------------ > > *From: *"Sean Silva via llvm-dev" <llvm-dev at lists.llvm.org> > *To: *"llvm-dev" <llvm-dev at lists.llvm.org> > *Sent: *Wednesday, June 8, 2016 6:19:03 AM > *Subject: *[llvm-dev] Intended behavior of CGSCC pass manager. > > Hi Chandler, Philip, Mehdi, (and llvm-dev,) > > (this is partially a summary of some discussions that happened at the last > LLVM bay area social, and partially a discussion about the direction of the > CGSCC pass manager) > > > A the last LLVM social we discussed the progress on the CGSCC pass > manager. It seems like Chandler has a CGSCC pass manager working, but it is > still unresolved exactly which semantics we want (more about this below) > that are reasonably implementable. > > AFAICT, there has been no public discussion about what exact semantics we > ultimately want to have. We should figure that out. > > The main difficulty which Chandler described is the apparently quite > complex logic surrounding needing to run function passes nested within an > SCC pass manager, while providing some guarantees about exactly what order > the function passes are run. The existing CGSCC pass manager just punts on > some of the problems that arise (look in CGPassManager::runOnModule, > CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in > llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that > Chandler has been trying to solve. > > ( > Why is this "function passes inside CGSCC passes" stuff interesting? > Because LLVM can do inlining on an SCC (often just a single function) and > then run function passes to simplify the function(s) in the SCC before it > tries to inline into a parent SCC. (the SCC visitation order is post-order) > For example, we may inline a bunch of code, but after inlining we can > tremendously simplify the function, and we want to do so before considering > this function for inlining into its callers so that we get an accurate > evaluation of the inline cost. > Based on what Chandler said, it seems that LLVM is fairly unique in this > regard and other compilers don't do this (which is why we can't just look > at how other compilers solve this problem; they don't have this problem > (maybe they should? or maybe we shouldn't?)). For example, he described > that GCC uses different inlining "phases"; e.g. it does early inlining on > the entire module, then does simplifications on the entire module, then > does late inlining on the entire module; so it is not able to incrementally > simplify as it inlines like LLVM does. > > This incremental simplification is an important feature of our inliner, > and one we should endeavor to keep. We might also want different phases at > some point (e.g. a top-down and a bottom-up phase), but that's another > story. > > ) > > As background for what is below, the LazyCallGraph tracks two graphs: the > "call graph" and the "ref graph". > Conceptually, the call graph is the graph of direct calls, where indirect > calls and calls to external functions do not appear (or are connected to > dummy nodes). The ref graph is basically the graph of all functions > transitively accessible based on the globals/constants/etc. referenced by a > function (e.g. if a function `foo` references a vtable that is defined in > the module, there is an edge in the ref graph from `foo` to every function > in the vtable). > The call graph is a strict subset of the ref graph. > > Chandler described that he had a major breakthrough in that the CGSCC pass > manager only had to deal with 3 classes of modifications that can occur: > - a pass may e.g. propagate a load of a function pointer into an indirect > call, turning it into an direct call. This requires adding an edge in the > CG but not in the ref graph. > - a pass may take a direct call and turn it into an indirect call. This > requires removing an edge from the CG, but not in the ref graph. > - a pass may delete a direct call. This removes an edge in the CG and also > in the ref graph. > > From the perspective of the CGSCC pass manager, these operations can > affect the SCC structure. Adding an edge might merge SCC's and deleting an > edge might split SCC's. Chandler mentioned that apparently the issues of > splitting and merging SCC's within the current infrastructure are actually > quite challenging and lead to e.g. iterator invalidation issues, and that > is what he is working on. > > ( > The ref graph is important to guide the overall SCC visitation order > because it basically represents "the largest graph that the CG may turn > into due to our static analysis of this module". I.e. no transformation we > can statically make in the CGSCC passes can ever cause us to need to merge > SCC's in the ref graph. > ) > > > > I have a couple overall questions/concerns: > > > 1. The ref graph can easily go quadratic. E.g. > > typedef void (*fp)(); > fp funcs[] = { > &foo1, > &foo2, > ... > &fooN > } > void foo1() { funcs[something](); } > void foo2() { funcs[something](); } > ... > void fooN() { funcs[something](); } > > One real-world case where this might come about is in the presence of > vtables. > > The existing CGSCC pass manager does not have this issue AFAIK because it > does not consider the ref graph. > > Does anybody have any info/experience about how densely connected the ref > graph can get in programs that might reasonably be fed to the compiler? > I just did a quick sanity check with LLD/ELF using > `--lto-newpm-passes=cgscc(no-op-cgscc)` and it at least seemed to terminate > / not run out of memory. Based on some rough calculations looking at the > profile, it seem like the entire run of the inliner in the old LTO pipeline > (which is about 5% of total LTO time on this particular example I looked > at) is only 2-3x as expensive as just > `--lto-newpm-passes=cgscc(no-op-cgscc)`, so the LazyCallGraph construction > is certainly not free. > > > 2. What is the intended behavior of CGSCC passes when SCC's are split or > merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run > some function passes nested inside the CGSCC pass manager (e.g. to simplify > things after inlining). Consider: > > This is not how I thought the current scheme worked ;) -- I was under the > impression that we had a call graph with conservatively-connected dummy > nodes for external/indirect functions. As a result, there is no > semantics-preserving optimization that will merge SCCs, only split them. In > that case, I'd expect that once an SCC is split, we re-run the CGSCC passes > over the newly-separated SCCs. But this corresponds to running over the > "ref graph", as you describe it. I don't understand why we want the > non-conservative graph. > > > a) These function passes are e.g. now able to devirtualize a call, adding > an edge to the CG, forming a larger CG SCC. Do you re-run the CGSCC pass > (say, the inliner) on this larger SCC? > > b) These function passes are e.g. able to DCE a call, removing an edge > from the CG. This converts, say, a CG SCC which is a cycle graph (like > a->b->c->a) into a path graph (a->b->c, with no edge back to a). The > inliner had already visited a, b, and c as a single SCC. Now does it have > to re-visit c, then b, then a, as single-node SCC's? > > > btw: > > One way that I have found it useful to think about this is in terms of the > visitation during Tarjan's SCC algorithm. I'll reference the pseudocode in > https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm. > Inside the "strongconnect" routine when we have identified an SCC (the true > branch of `if (v.lowlink = v.index)` test ) we can visit > stack[v.index:stack.size()] as an SCC. This may or may not invalidate some > things on the stack (the variable `S` in the pseudocode) and we may need to > fix it up (e.g. inlining deleted a function, so we can't have an entry on > the stack). Then, we can run function passes as we pop individual functions > off the stack, but it is easier to think about IMO than merging of SCC data > structures: if we add edges to the CG then we have to do more DFS on the > new edges and if we delete edges then the DFS order of the stack gives us > certain guarantees. > Personally I find this much easier to reason about than the description in > terms of splitting and merging SCC's in the CG and ref graph (which the > LazyCallGraph API makes one to think about since it hides the underlying > Tarjan's algorithm). > The LazyCallGraph API makes the current loop in > http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100 > very clean, but at least for my thinking about the problem, it seems like > the wrong abstraction (and most of the LazyCallGraph API seems to be > unused, so it seems like it may be overly heavyweight). > E.g. I think that maybe the easiest thing to do is to turn the current > approach inside out: instead of having the pass manager logic be the > "normal code" and forcing the Tarjan algorithm to become a state machine of > iterators, use an open-coded Tarjan algorithm with some callbacks and make > the pass management logic be the state machine. > This will also open the door to avoiding the potentially quadratic size of > the ref graph, since e.g. in the example I gave above, we can mark the > `funcs` array itself as already having been visited during the walk. In the > current LazyCallGraph, this would require adding some sort of notion of > hyperedge. > > FWIW, I see no purpose in abstracting one algorithm, especially if that > makes things algorithmically harder. Also, the LazyCallGraph abstraction > and the iterator abstraction seem like separate issues. Iterator > abstractions are often useful because you can use them in generic > algorithms, etc. > > > Since this is such a high priority (due to blocking PGO inlining), I will > probably try my hand at implementing the CGSCC pass manager sometime soon > unless somebody beats me to it. (I'll probably try the "open-coded SCC > visit" approach). > > Another possibility is implementing the new CGSCC pass manager that uses > the same visitation semantics as the one in the old PM, and then we can > refactor that as needed. In fact, that may be the best approach so that > porting to the new PM is as NFC as possible and we can isolate the > functional (i.e., need benchmarks, measurements ...) changes in separate > commits. > > I'm in favor of this approach for exactly the reason you mention. Being > able to bisect regressions to the algorithmic change, separate from the > infrastructure change, will likely make things easier in the long run (and > will avoid the problem, to the extent possible, of performance regressions > blocking the pass-manager work). >I've just posted a proof of concept: http://reviews.llvm.org/D21921 The proof of concept also ports the remaining 3 CGSCC passes (inliner, argpromotion, prune-eh) Let me know what you think. -- Sean Silva> > > Sorry for the wall of text. > > No problem; much appreciated. > > -Hal > > > -- Sean Silva > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > > > > > -- > Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160630/6d35dcd2/attachment-0001.html>