Sean Silva via llvm-dev
2017-Jan-16 23:40 UTC
[llvm-dev] Your help needed: List of LLVM Open Projects 2017
On Mon, Jan 16, 2017 at 3:34 PM, Sean Silva <chisophugis at gmail.com> wrote:> > > On Mon, Jan 16, 2017 at 3:32 PM, Sean Silva <chisophugis at gmail.com> wrote: > >> >> >> On Mon, Jan 16, 2017 at 2:31 PM, Davide Italiano <davide at freebsd.org> >> wrote: >> >>> On Mon, Jan 16, 2017 at 2:07 PM, Mehdi Amini <mehdi.amini at apple.com> >>> wrote: >>> > >>> > On Jan 16, 2017, at 1:47 PM, Sean Silva <chisophugis at gmail.com> wrote: >>> > >>> > >>> > >>> > On Mon, Jan 16, 2017 at 1:25 PM, Davide Italiano <davide at freebsd.org> >>> wrote: >>> >> >>> >> On Mon, Jan 16, 2017 at 12:31 PM, Sean Silva via llvm-dev >>> >> <llvm-dev at lists.llvm.org> wrote: >>> >> > Do we have any open projects on LLD? >>> >> > >>> >> > I know we usually try to avoid any big "projects" and mainly add/fix >>> >> > things >>> >> > in response to user needs, but just wondering if somebody has any >>> ideas. >>> >> > >>> >> >>> >> I'm not particularly active in lld anymore, but the last big item I'd >>> >> like to see implemented is Pettis-Hansen layout. >>> >> http://perso.ensta-paristech.fr/~bmonsuez/Cours/B6-4/Article >>> s/papers15.pdf >>> >> (mainly because it improves performances of the final executable). >>> >> GCC/gold have an implementation of the algorithm that can be used as >>> >> base. I'll expand if anybody is interested. >>> >> Side note: I'd like to propose a couple of llvm projects as well, I'll >>> >> sit down later today and write them. >>> > >>> > >>> > >>> > I’m not sure, can you confirm that such layout optimization on ELF >>> requires >>> > -ffunction-sections? >>> > >>> >>> For the non-LTO case, I think so. >>> >>> > Also, for clang on OSX the best layout we could get is to order >>> functions in >>> > the order in which they get executed at runtime. >>> > >>> >>> That's what we already do for lld. We collect and order file (run a >>> profiler) and pass that to the linker that lays out functions >>> accordingly. >>> This is to improve startup time for a class of startup-time-sensitive >>> operations. The algorithm proposed by Pettis (allegedly) aims to >>> reduce the TLB misses as it tries to lay out hot functions (or >>> functions that are likely to be called together near in the final >>> binary). >>> >> >> IIRC from when I looked at the paper a while ago, it is mostly just a >> "huffman tree construction" type algorithm (agglomerating based on highest >> probability) and assumes that if two functions are hot then they are likely >> to be needed together. This is not always the case. >> >> E.g. consider a server that accepts RPC requests and based on those >> requests either does Foo or Bar which are largely disjoint. It's entirely >> possible for the top two functions of the profile to be one in Foo and one >> in Bar, but laying them out near each other doesn't make sense since there >> is never locality (for a given RPC, either Foo or Bar gets run). A static >> call graph analysis can provide the needed signals to handle this case >> better. >> >> > Hence you said "allegedly" :) I know we've talked about this before. Just > wanted to put the backstory of the "allegedly" on the list. >Looks like I remembered this wrong. The algorithm in section 3.2 of the paper is call-graph aware. It does do greedy coalescing like a Huffman tree construction algorithms, but constrains the available coalescing operations at each step by call graph adjacency (in fact, what it is "greedy" about is the hotness of the edges between call graph nodes and not the nodes themselves). -- Sean Silva> > -- Sean Silva > > >> -- Sean Silva >> >> >>> >>> > >>> > For FullLTO it is conceptually pretty easy to get profile data we need >>> for >>> > this, but I'm not sure about the ThinLTO case. >>> > >>> > Teresa, Mehdi, >>> > >>> > Are there any plans (or things already working!) for getting profile >>> data >>> > from ThinLTO in a format that the linker can use for code layout? I >>> assume >>> > that profile data is being used already to guide importing, so it may >>> just >>> > be a matter of siphoning that off. >>> > >>> > >>> > I’m not sure what kind of “profile information” is needed, and what >>> makes it >>> > easier for MonolithicLTO compared to ThinLTO? >>> > >>> > Or maybe that layout code should be inside LLVM; maybe part of the >>> general >>> > LTO interface? It looks like the current gcc plugin calls back into >>> gcc for >>> > the actual layout algorithm itself (function call >>> > find_pettis_hansen_function_layout) rather than the reordering logic >>> living >>> > in the linker: >>> > https://android.googlesource.com/toolchain/gcc/+/3f73d6ef904 >>> 58b45bbbb33ef4c2b174d4662a22d/gcc-4.6/function_reordering_pl >>> ugin/function_reordering_plugin.c >>> > >>> > >>> > I was thinking about this: could this be done by reorganizing the >>> module >>> > itself for LTO? >>> > >>> > That wouldn’t help non-LTO and ThinLTO though. >>> >>> This is a dimension that I think can be explored. The fact that it >>> wouldn't help with other modes of operation is completely orthogonal, >>> in particular until it's proven that this kind of optimization makes >>> sense with ThinLTO (and if it doesn't, it can be an optimization ran >>> only during full LTO). >>> >>> -- >>> Davide >>> >>> "There are no solved problems; there are only problems that are more >>> or less solved" -- Henri Poincare >>> >> >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170116/5cab6ee6/attachment.html>
Xinliang David Li via llvm-dev
2017-Jan-17 00:33 UTC
[llvm-dev] Your help needed: List of LLVM Open Projects 2017
Google GCC records profile data (dynamic callgraph) in a special named section in ELF object file to be consumed by the plugin. Those sections will be discarded later by the linker. There are pros and cons of using xray for layout purpose. The call trace from xray is certainly more powerful for layout purpose, but it adds addtional complexity to the optimized build process. You would need to collect xray trace profile on the optimized binary (presumably built with PGO already) and rebuild without xray nop insertion and function layout. David On Mon, Jan 16, 2017 at 3:40 PM, Sean Silva via llvm-dev < llvm-dev at lists.llvm.org> wrote:> > > On Mon, Jan 16, 2017 at 3:34 PM, Sean Silva <chisophugis at gmail.com> wrote: > >> >> >> On Mon, Jan 16, 2017 at 3:32 PM, Sean Silva <chisophugis at gmail.com> >> wrote: >> >>> >>> >>> On Mon, Jan 16, 2017 at 2:31 PM, Davide Italiano <davide at freebsd.org> >>> wrote: >>> >>>> On Mon, Jan 16, 2017 at 2:07 PM, Mehdi Amini <mehdi.amini at apple.com> >>>> wrote: >>>> > >>>> > On Jan 16, 2017, at 1:47 PM, Sean Silva <chisophugis at gmail.com> >>>> wrote: >>>> > >>>> > >>>> > >>>> > On Mon, Jan 16, 2017 at 1:25 PM, Davide Italiano <davide at freebsd.org> >>>> wrote: >>>> >> >>>> >> On Mon, Jan 16, 2017 at 12:31 PM, Sean Silva via llvm-dev >>>> >> <llvm-dev at lists.llvm.org> wrote: >>>> >> > Do we have any open projects on LLD? >>>> >> > >>>> >> > I know we usually try to avoid any big "projects" and mainly >>>> add/fix >>>> >> > things >>>> >> > in response to user needs, but just wondering if somebody has any >>>> ideas. >>>> >> > >>>> >> >>>> >> I'm not particularly active in lld anymore, but the last big item I'd >>>> >> like to see implemented is Pettis-Hansen layout. >>>> >> http://perso.ensta-paristech.fr/~bmonsuez/Cours/B6-4/Article >>>> s/papers15.pdf >>>> >> (mainly because it improves performances of the final executable). >>>> >> GCC/gold have an implementation of the algorithm that can be used as >>>> >> base. I'll expand if anybody is interested. >>>> >> Side note: I'd like to propose a couple of llvm projects as well, >>>> I'll >>>> >> sit down later today and write them. >>>> > >>>> > >>>> > >>>> > I’m not sure, can you confirm that such layout optimization on ELF >>>> requires >>>> > -ffunction-sections? >>>> > >>>> >>>> For the non-LTO case, I think so. >>>> >>>> > Also, for clang on OSX the best layout we could get is to order >>>> functions in >>>> > the order in which they get executed at runtime. >>>> > >>>> >>>> That's what we already do for lld. We collect and order file (run a >>>> profiler) and pass that to the linker that lays out functions >>>> accordingly. >>>> This is to improve startup time for a class of startup-time-sensitive >>>> operations. The algorithm proposed by Pettis (allegedly) aims to >>>> reduce the TLB misses as it tries to lay out hot functions (or >>>> functions that are likely to be called together near in the final >>>> binary). >>>> >>> >>> IIRC from when I looked at the paper a while ago, it is mostly just a >>> "huffman tree construction" type algorithm (agglomerating based on highest >>> probability) and assumes that if two functions are hot then they are likely >>> to be needed together. This is not always the case. >>> >>> E.g. consider a server that accepts RPC requests and based on those >>> requests either does Foo or Bar which are largely disjoint. It's entirely >>> possible for the top two functions of the profile to be one in Foo and one >>> in Bar, but laying them out near each other doesn't make sense since there >>> is never locality (for a given RPC, either Foo or Bar gets run). A static >>> call graph analysis can provide the needed signals to handle this case >>> better. >>> >>> >> Hence you said "allegedly" :) I know we've talked about this before. Just >> wanted to put the backstory of the "allegedly" on the list. >> > > Looks like I remembered this wrong. The algorithm in section 3.2 of the > paper is call-graph aware. It does do greedy coalescing like a Huffman tree > construction algorithms, but constrains the available coalescing operations > at each step by call graph adjacency (in fact, what it is "greedy" about is > the hotness of the edges between call graph nodes and not the nodes > themselves). > > -- Sean Silva > > >> >> -- Sean Silva >> >> >>> -- Sean Silva >>> >>> >>>> >>>> > >>>> > For FullLTO it is conceptually pretty easy to get profile data we >>>> need for >>>> > this, but I'm not sure about the ThinLTO case. >>>> > >>>> > Teresa, Mehdi, >>>> > >>>> > Are there any plans (or things already working!) for getting profile >>>> data >>>> > from ThinLTO in a format that the linker can use for code layout? I >>>> assume >>>> > that profile data is being used already to guide importing, so it may >>>> just >>>> > be a matter of siphoning that off. >>>> > >>>> > >>>> > I’m not sure what kind of “profile information” is needed, and what >>>> makes it >>>> > easier for MonolithicLTO compared to ThinLTO? >>>> > >>>> > Or maybe that layout code should be inside LLVM; maybe part of the >>>> general >>>> > LTO interface? It looks like the current gcc plugin calls back into >>>> gcc for >>>> > the actual layout algorithm itself (function call >>>> > find_pettis_hansen_function_layout) rather than the reordering logic >>>> living >>>> > in the linker: >>>> > https://android.googlesource.com/toolchain/gcc/+/3f73d6ef904 >>>> 58b45bbbb33ef4c2b174d4662a22d/gcc-4.6/function_reordering_pl >>>> ugin/function_reordering_plugin.c >>>> > >>>> > >>>> > I was thinking about this: could this be done by reorganizing the >>>> module >>>> > itself for LTO? >>>> > >>>> > That wouldn’t help non-LTO and ThinLTO though. >>>> >>>> This is a dimension that I think can be explored. The fact that it >>>> wouldn't help with other modes of operation is completely orthogonal, >>>> in particular until it's proven that this kind of optimization makes >>>> sense with ThinLTO (and if it doesn't, it can be an optimization ran >>>> only during full LTO). >>>> >>>> -- >>>> Davide >>>> >>>> "There are no solved problems; there are only problems that are more >>>> or less solved" -- Henri Poincare >>>> >>> >>> >> > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170116/292af3a9/attachment.html>
Sean Silva via llvm-dev
2017-Jan-17 03:41 UTC
[llvm-dev] Your help needed: List of LLVM Open Projects 2017
Would it make sense for xray instrumentation be part of -fprofile-generate? PGO will affect inlining decisions etc for the optimized binary, but the collected traces during the instrumented build would still have quite a bit of useful information. -- Sean Silva On Mon, Jan 16, 2017 at 4:33 PM, Xinliang David Li <xinliangli at gmail.com> wrote:> Google GCC records profile data (dynamic callgraph) in a special named > section in ELF object file to be consumed by the plugin. Those sections > will be discarded later by the linker. > > There are pros and cons of using xray for layout purpose. The call trace > from xray is certainly more powerful for layout purpose, but it adds > addtional complexity to the optimized build process. You would need to > collect xray trace profile on the optimized binary (presumably built with > PGO already) and rebuild without xray nop insertion and function layout. > > David > > On Mon, Jan 16, 2017 at 3:40 PM, Sean Silva via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > >> >> >> On Mon, Jan 16, 2017 at 3:34 PM, Sean Silva <chisophugis at gmail.com> >> wrote: >> >>> >>> >>> On Mon, Jan 16, 2017 at 3:32 PM, Sean Silva <chisophugis at gmail.com> >>> wrote: >>> >>>> >>>> >>>> On Mon, Jan 16, 2017 at 2:31 PM, Davide Italiano <davide at freebsd.org> >>>> wrote: >>>> >>>>> On Mon, Jan 16, 2017 at 2:07 PM, Mehdi Amini <mehdi.amini at apple.com> >>>>> wrote: >>>>> > >>>>> > On Jan 16, 2017, at 1:47 PM, Sean Silva <chisophugis at gmail.com> >>>>> wrote: >>>>> > >>>>> > >>>>> > >>>>> > On Mon, Jan 16, 2017 at 1:25 PM, Davide Italiano <davide at freebsd.org> >>>>> wrote: >>>>> >> >>>>> >> On Mon, Jan 16, 2017 at 12:31 PM, Sean Silva via llvm-dev >>>>> >> <llvm-dev at lists.llvm.org> wrote: >>>>> >> > Do we have any open projects on LLD? >>>>> >> > >>>>> >> > I know we usually try to avoid any big "projects" and mainly >>>>> add/fix >>>>> >> > things >>>>> >> > in response to user needs, but just wondering if somebody has any >>>>> ideas. >>>>> >> > >>>>> >> >>>>> >> I'm not particularly active in lld anymore, but the last big item >>>>> I'd >>>>> >> like to see implemented is Pettis-Hansen layout. >>>>> >> http://perso.ensta-paristech.fr/~bmonsuez/Cours/B6-4/Article >>>>> s/papers15.pdf >>>>> >> (mainly because it improves performances of the final executable). >>>>> >> GCC/gold have an implementation of the algorithm that can be used as >>>>> >> base. I'll expand if anybody is interested. >>>>> >> Side note: I'd like to propose a couple of llvm projects as well, >>>>> I'll >>>>> >> sit down later today and write them. >>>>> > >>>>> > >>>>> > >>>>> > I’m not sure, can you confirm that such layout optimization on ELF >>>>> requires >>>>> > -ffunction-sections? >>>>> > >>>>> >>>>> For the non-LTO case, I think so. >>>>> >>>>> > Also, for clang on OSX the best layout we could get is to order >>>>> functions in >>>>> > the order in which they get executed at runtime. >>>>> > >>>>> >>>>> That's what we already do for lld. We collect and order file (run a >>>>> profiler) and pass that to the linker that lays out functions >>>>> accordingly. >>>>> This is to improve startup time for a class of startup-time-sensitive >>>>> operations. The algorithm proposed by Pettis (allegedly) aims to >>>>> reduce the TLB misses as it tries to lay out hot functions (or >>>>> functions that are likely to be called together near in the final >>>>> binary). >>>>> >>>> >>>> IIRC from when I looked at the paper a while ago, it is mostly just a >>>> "huffman tree construction" type algorithm (agglomerating based on highest >>>> probability) and assumes that if two functions are hot then they are likely >>>> to be needed together. This is not always the case. >>>> >>>> E.g. consider a server that accepts RPC requests and based on those >>>> requests either does Foo or Bar which are largely disjoint. It's entirely >>>> possible for the top two functions of the profile to be one in Foo and one >>>> in Bar, but laying them out near each other doesn't make sense since there >>>> is never locality (for a given RPC, either Foo or Bar gets run). A static >>>> call graph analysis can provide the needed signals to handle this case >>>> better. >>>> >>>> >>> Hence you said "allegedly" :) I know we've talked about this before. >>> Just wanted to put the backstory of the "allegedly" on the list. >>> >> >> Looks like I remembered this wrong. The algorithm in section 3.2 of the >> paper is call-graph aware. It does do greedy coalescing like a Huffman tree >> construction algorithms, but constrains the available coalescing operations >> at each step by call graph adjacency (in fact, what it is "greedy" about is >> the hotness of the edges between call graph nodes and not the nodes >> themselves). >> >> -- Sean Silva >> >> >>> >>> -- Sean Silva >>> >>> >>>> -- Sean Silva >>>> >>>> >>>>> >>>>> > >>>>> > For FullLTO it is conceptually pretty easy to get profile data we >>>>> need for >>>>> > this, but I'm not sure about the ThinLTO case. >>>>> > >>>>> > Teresa, Mehdi, >>>>> > >>>>> > Are there any plans (or things already working!) for getting profile >>>>> data >>>>> > from ThinLTO in a format that the linker can use for code layout? I >>>>> assume >>>>> > that profile data is being used already to guide importing, so it >>>>> may just >>>>> > be a matter of siphoning that off. >>>>> > >>>>> > >>>>> > I’m not sure what kind of “profile information” is needed, and what >>>>> makes it >>>>> > easier for MonolithicLTO compared to ThinLTO? >>>>> > >>>>> > Or maybe that layout code should be inside LLVM; maybe part of the >>>>> general >>>>> > LTO interface? It looks like the current gcc plugin calls back into >>>>> gcc for >>>>> > the actual layout algorithm itself (function call >>>>> > find_pettis_hansen_function_layout) rather than the reordering >>>>> logic living >>>>> > in the linker: >>>>> > https://android.googlesource.com/toolchain/gcc/+/3f73d6ef904 >>>>> 58b45bbbb33ef4c2b174d4662a22d/gcc-4.6/function_reordering_pl >>>>> ugin/function_reordering_plugin.c >>>>> > >>>>> > >>>>> > I was thinking about this: could this be done by reorganizing the >>>>> module >>>>> > itself for LTO? >>>>> > >>>>> > That wouldn’t help non-LTO and ThinLTO though. >>>>> >>>>> This is a dimension that I think can be explored. The fact that it >>>>> wouldn't help with other modes of operation is completely orthogonal, >>>>> in particular until it's proven that this kind of optimization makes >>>>> sense with ThinLTO (and if it doesn't, it can be an optimization ran >>>>> only during full LTO). >>>>> >>>>> -- >>>>> Davide >>>>> >>>>> "There are no solved problems; there are only problems that are more >>>>> or less solved" -- Henri Poincare >>>>> >>>> >>>> >>> >> >> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >> >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170116/8c57699e/attachment.html>