On 7/16/13 5:23 AM, Evan Cheng wrote:
> Thanks for the proposal. This is important work, which is one step towards making LTO more applicable for large applications. Some comments inline.
>
> On Jul 12, 2013, at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
>
>> 3.1.1 Figure out partition scheme
>> ----------------------------------
>> We randomly pick some functions and put them in a partition.
>> It would be nice to perform some optimization at this moment. One opt
>> in my mind is to reorder functions in order to reduce the working set
>> and improve locality.
>>
>> Unfortunately, this opt seems to be a bit blind at this time, because
>>   - the CallGraph is not annotated with estimated or profiled frequency, and
>>   - some linkers don't respect the order. It seems they just
>>     remember the function order of the pristine input obj/fake-obj
>>     and enforce this order at the final link (link-exec/shared-lib) stage.
>>
>> Anyway, I try to ignore all these problems and perform the partitioning
>> via the following steps. Maybe we will have some luck on some platforms:
>>
>> o. DFS the call graph, ignoring the self-recursive edges; if frequency
>>    is available, prioritize the edges (i.e. the corresponding call
>>    sites) such that frequent edges are visited first.
>>
>> o. Cut the DFS spanning tree obtained from the previous step bottom-up.
>>    Each cut/partition contains a reasonable number of functions, and
>>    the aggregate size of the functions in the partition should not
>>    exceed a predefined threshold.
>
> I'd like to see more details about this step. How do you determine the "reasonable # of functions"? How do you define the threshold? It has to be the same for a given target/platform regardless of the configuration of the build machine, right?

Say, each module should contain:
  - no more than 100 functions, and
  - functions whose total size does not exceed the pre-defined threshold.

These thresholds can be overridden on the command line.

>> o. Repeat the previous step until the call graph's DFS spanning tree
>>    is empty.
>>
>> 3.1.2 Partition transformation
>> ------------------------------
>>
>> This is a bit involved. There are a bunch of problems we have to tackle.
>>
>> 1) When the use and def of a symbol are separated into different
>>    modules, its attributes, like linkage and visibility, need to be
>>    changed as well.
>>
>>    [Example 1] If a symbol is flagged as "internal" to the module where
>>    it is defined, the linkage needs to be changed into "internal"
>>    to the executable/lib being compiled.
>>
>>    [Example 2] For compile-time constants, the initializer value
>>    needs to be cloned into the partitions where it is referenced.
>>    The rationale is to let the post-IPO passes take advantage
>>    of the initialized value to squeeze out some performance.
>>
>>    In order not to bloat the code size, the cloned constant should be
>>    marked "don't emit". [end of Example 2]
>>
>>    Being able to precisely update symbols' attributes is not only
>>    vital to correctness; it has a significant impact on performance
>>    as well.
>>
>>    I have not yet made a thorough investigation of this issue. My
>>    rudimentary implementation simply flags a symbol "external" when
>>    its use and def are separated into different modules. I believe
>>    this is one of the most difficult parts of this work. I guess it
>>    is going to take a long time to become stable.
>>
>> 2) In order to compile each partition in a separate thread (see
>>    Section 3.2), we have to put the partitions in distinct
>>    LLVMContexts.
>>
>>    I could be wrong, but I can't find code which is able to perform
>>    function cloning across LLVMContexts.
>>
>>    My workaround in the patch is to perform function cloning within
>>    one LLVMContext (but in a different Module, of course), then save
>>    the module to a disk file and load it into memory using a new
>>    LLVMContext.
>>
>>    It is a bit circuitous and expensive.
>
> Do you plan to fix this? What are the issues that prevented function cloning across multiple LLVMContexts?
>
> Evan

We may fix it; I don't know for sure if it is a big gain at this moment.

The issue is that, as far as I can tell, the current code base does not have functions supporting copying IR across different LLVMContexts.

For example, when it copies an instruction from src to dest, it checks the "src", looks at its Type, derives the LLVMContext from that Type, and uses the same context for the dest. So we would need to change the existing code.
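The two partitioning steps above (a frequency-ordered DFS over the call graph, then a bottom-up cut) could be sketched roughly as follows. This is a standalone illustration, not LLVM code; the graph representation, the 100-function limit, and the size threshold are illustrative assumptions:

```python
MAX_FUNCS = 100   # max functions per partition (example value)
MAX_SIZE = 10000  # max aggregate function size per partition (example value)

def dfs_order(callgraph, roots, freq=None):
    """Post-order DFS over the call graph, skipping self-recursive edges.
    If edge frequencies are known, hotter callees are visited first."""
    visited, order = set(), []
    def visit(f):
        if f in visited:
            return
        visited.add(f)
        callees = [c for c in callgraph.get(f, []) if c != f]  # drop self-edges
        if freq:
            callees.sort(key=lambda c: -freq.get((f, c), 0))
        for c in callees:
            visit(c)
        order.append(f)  # post-order: leaves of the spanning tree come first
    for r in roots:
        visit(r)
    return order

def cut_partitions(order, sizes):
    """Greedily cut the bottom-up order whenever either threshold is hit."""
    parts, cur, cur_size = [], [], 0
    for f in order:
        if cur and (len(cur) >= MAX_FUNCS or cur_size + sizes[f] > MAX_SIZE):
            parts.append(cur)
            cur, cur_size = [], 0
        cur.append(f)
        cur_size += sizes[f]
    if cur:
        parts.append(cur)
    return parts
```

Since the order is post-order, cutting it greedily front-to-back corresponds to the "bottom-up" cuts of the spanning tree described above: callees tend to land in the same or an earlier partition than their callers.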
On Jul 16, 2013, at 10:37 AM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:

> On 7/16/13 5:23 AM, Evan Cheng wrote:
>> I'd like to see more details about this step. How do you determine the "reasonable # of functions"? How do you define the threshold? It has to be the same for a given target/platform regardless of the configuration of the build machine, right?
>
> Say, each module should contain:
>   - no more than 100 functions, and
>   - functions whose total size does not exceed the pre-defined threshold.
>
> These thresholds can be overridden on the command line.

But how do you come up with the thresholds? And are they fixed thresholds, or determined based on analysis of function size, etc.?

Evan
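The linkage fix-up described in point 1) of Section 3.1.2, in its rudimentary form (promote a module-internal symbol to "external" once its def and its uses land in different partitions), might look roughly like this. The dictionary-based representation and all names here are illustrative assumptions, not LLVM API:

```python
def adjust_linkage(symbols, partition_of, uses):
    """symbols: {name: linkage}; partition_of: {name: partition id of the def};
    uses: {name: set of partition ids that reference the symbol}."""
    adjusted = {}
    for name, linkage in symbols.items():
        def_part = partition_of[name]
        used_elsewhere = any(p != def_part for p in uses.get(name, ()))
        if linkage == "internal" and used_elsewhere:
            # def and uses were separated: promote so the linker can resolve
            # cross-partition references (correctness), at the cost of losing
            # the optimization benefits of internal linkage
            adjusted[name] = "external"
        else:
            adjusted[name] = linkage
    return adjusted
```

A more precise scheme would instead mark such symbols hidden/internal to the final executable or shared library, as Example 1 in the thread suggests, rather than fully external.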
On 7/16/13 10:40 AM, Evan Cheng wrote:

> On Jul 16, 2013, at 10:37 AM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
>
>> Say, each module should contain:
>>   - no more than 100 functions, and
>>   - functions whose total size does not exceed the pre-defined threshold.
>>
>> These thresholds can be overridden on the command line.
>
> But how do you come up with the thresholds? And are they fixed thresholds, or determined based on analysis of function size, etc.?

Yes, they are fixed: we just count the total # of instructions. I don't know which threshold is "comfortable" at this moment. We can keep changing it until we feel comfortable with it.
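The size metric discussed here (function "size" = instruction count, with a tunable cap on the partition's total) could be expressed as a small check like the one below. The threshold values and the function representation are illustrative assumptions:

```python
SIZE_THRESHOLD = 10000  # example value; intended to be command-line overridable

def function_size(body):
    """Approximate a function's size by its instruction count."""
    return len(body)

def partition_is_full(partition, max_funcs=100, max_size=SIZE_THRESHOLD):
    """partition: {function name: list of instructions}."""
    total = sum(function_size(body) for body in partition.values())
    return len(partition) >= max_funcs or total > max_size
```

Counting IR instructions keeps the metric independent of the build machine's configuration, which addresses Evan's earlier determinism concern, though it only loosely tracks final object size.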
I recall another reason for not adding the functionality for cloning functions across LLVMContexts.

Currently, LTO merges everything into a "merged module" before IPO and post-IPO start. I believe this has to change in the future if we want to improve scalability. As with other compilers, some passes only care about the "merged symbol table" and the "merged call graph", and parts of the merged module could be loaded on demand. I guess function cloning would need some changes if we switch to that mode. So, for now, we are better off using whatever we have today, instead of messing around implementing something that is about to change in the future.

>> Do you plan to fix this? What are the issues that prevented function cloning across multiple LLVMContexts?
>>
>> Evan
>
> We may fix it; I don't know for sure if it is a big gain at this moment.
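The workaround discussed in this thread (write the module to disk, then re-read it under a fresh LLVMContext) is essentially cloning by serialization. The toy model below (not LLVM API; every class and name here is an illustrative assumption) shows why a direct copy fails: each instruction derives its context from its source Type, so a naive clone stays tied to the old context, while a serialize-and-rebuild round trip lets us choose the destination context:

```python
import json

class Context:
    """Stands in for LLVMContext: owns uniqued types."""
    def __init__(self):
        self.types = {}
    def get_type(self, name):
        # types are uniqued per context, like LLVM's per-context type tables
        return self.types.setdefault(name, (self, name))

class Instr:
    def __init__(self, ctx, type_name, op):
        self.type = ctx.get_type(type_name)
        self.op = op
    @property
    def context(self):
        # the context is derived from the type, mirroring the LLVM code
        # pattern described in the thread
        return self.type[0]

def naive_clone(instr):
    # mirrors today's cloning code: the context comes from the *source* type,
    # so the "clone" is still owned by the original context
    return Instr(instr.context, instr.type[1], instr.op)

def roundtrip_clone(instr, dest_ctx):
    # the "save to disk and reload" workaround: serialize, then rebuild
    # every object under dest_ctx
    blob = json.dumps({"type": instr.type[1], "op": instr.op})
    data = json.loads(blob)
    return Instr(dest_ctx, data["type"], data["op"])
```

In LLVM terms, the round trip corresponds to emitting bitcode for the new module and parsing it back with a new LLVMContext, which is circuitous and expensive for exactly the reasons stated above.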