On Thu, Jan 17, 2019 at 2:47 PM Xinliang David Li <davidxl at google.com> wrote:> > > On Thu, Jan 17, 2019 at 2:32 PM Manman Ren <manman.ren at gmail.com> wrote: > >> >> >> On Thu, Jan 17, 2019 at 10:53 AM Xinliang David Li <davidxl at google.com> >> wrote: >> >>> Hi Manman, >>> >>> Ordering profiling is certainly something very useful to have to startup >>> time performance. GCC has something similar. >>> >>> In terms of implementation, it is possible to simply extend the edge >>> profiling counters by 1 for each function, and instrument the function to >>> record the time stamp the first time the function is executed. The overhead >>> will be minimized and you can leverage all the other existing support in >>> profiling runtime. >>> >> >> Hi David, >> >> Just to clarify, are you suggesting to add an edge profiling counter per >> function to record the time stamp? Where are the edge profiling counters >> defined? >> > > There is no needed to define the counter explicitly. What is needed is to > introduce a new intrinsic to update the order counter, and the InstProf > lowerer will create the counter for you. See > Transforms/Instrumentation/InstrProfiling.cpp. +Vedant Kumar > <vsk at apple.com> >We need the instrumentation to be run late as an IR pass, so it has all the function symbols. Are you suggesting a front-end instrumentation by mentioning InstrProfiling.cpp?> > >> So the difference will be where we store the profile information and in >> what format? >> >> > Time stamp can be simply a uint64 value just like any edge/block profile > count. llvm_profdata tool can be taught to dump the ordering file to be > exported to the linker. > > >> With the suggested approach, we need to allocate one time stamp for each >> function, what is implemented is a pair of numbers for each executed >> function. The runtime performance can be different as well, the suggested >> approach gets the time stamp, and saves it to memory, what is implemented >> is saving the pair of numbers and incrementing a counter. >> >> > The runtime cost is probably not much for either approach. The suggested > approach eliminates the trouble to maintain/implement a different way to > identify functions. >Agree that using existing infra is better! Manman> > > >> >>> Another possibility is to use xray to implement the functionality -- >>> xray is useful for trace like profiling by design. >>> >> >> We have not looked into XRay. We need something with low binary size >> penalty and low runtime perf degradation, not sure if XRay is a good fit! >> >> > xray can achieve very low runtime overhead. +Dean Michael Berris > <dberris at google.com> for additional comments. > > David > > > > >> Thanks, >> Manman >> >> >>> David >>> >>> On Thu, Jan 17, 2019 at 10:24 AM Manman Ren <manman.ren at gmail.com> >>> wrote: >>> >>>> Order file is used to teach ld64 how to order the functions in a >>>> binary. If we put all functions executed during startup together in the >>>> right order, we will greatly reduce the page faults during startup. >>>> >>>> To generate order file for iOS apps, we usually use dtrace, but some >>>> apps have various startup scenarios that we want to capture in the order >>>> file. dtrace approach is not easy to automate, it is hard to capture the >>>> different ways of starting an app without automation. Instrumented builds >>>> however can be deployed to phones and profile data can be automatically >>>> collected. >>>> >>>> For the Facebook app, by looking at the startup distribution, we are >>>> expecting a big win out of the order file instrumentation, from 100ms to >>>> 500ms+, in startup time. >>>> >>>> The basic idea of the pass is to use a circular buffer to log the >>>> execution ordering of the functions. We only log the function when it is >>>> first executed. Instead of logging the symbol name of the function, we log >>>> a pair of integers, with one integer specifying the module id, and the >>>> other specifying the function id within the module. >>>> >>>> In this pass, we add three global variables: >>>> (1) an order file buffer >>>> The order file buffer is a circular buffer at its own llvm section. >>>> Each entry is a pair of integers, with one integer specifying the module >>>> id, and the other specifying the function id within the module. >>>> (2) a bitmap for each module: one bit for each function to say if the >>>> function is already executed; >>>> (3) a global index to the buffer >>>> >>>> At the function prologue, if the function has not been executed (by >>>> checking the bitmap), log the module id and the function id, then >>>> atomically increase the index. >>>> >>>> This pass is intended to be used as a ThinLTO pass or a LTO pass. It >>>> maps each module to a distinct integer, it also generate a mapping file so >>>> we can decode the function symbol name from the pair of ids. >>>> >>>> clang has '-finstrument-function-entry-bare' which inserts a function >>>> call and is not as efficient. >>>> >>>> Three patches are attached, for llvm, clang, and compiler-rt >>>> respectively. >>>> >>>> TODO: >>>> (1) Migrate to the new pass manager with a shim for the legacy pass >>>> manager. >>>> (2) For the order file buffer, consider always emitting definitions, >>>> making them LinkOnceODR with a COMDAT group. >>>> (3) Add testing case for clang/compiler-rt patches. >>>> (4) Add utilities to deobfuscate the profile dump. >>>> (5) The size of the buffer is currently hard-coded ( >>>> INSTR_ORDER_FILE_BUFFER_SIZE). >>>> >>>> Thanks Kamal for contributing to the patches! Thanks to Aditya and >>>> Saleem for doing an initial review pass over the patches! >>>> >>>> Manman >>>> >>>> >>>>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190117/c0e2ac9b/attachment.html>
Xinliang David Li via llvm-dev
2019-Jan-17 23:57 UTC
[llvm-dev] [RFC] Order File Instrumentation
On Thu, Jan 17, 2019 at 3:25 PM Manman Ren <manman.ren at gmail.com> wrote:> > > On Thu, Jan 17, 2019 at 2:47 PM Xinliang David Li <davidxl at google.com> > wrote: > >> >> >> On Thu, Jan 17, 2019 at 2:32 PM Manman Ren <manman.ren at gmail.com> wrote: >> >>> >>> >>> On Thu, Jan 17, 2019 at 10:53 AM Xinliang David Li <davidxl at google.com> >>> wrote: >>> >>>> Hi Manman, >>>> >>>> Ordering profiling is certainly something very useful to have to >>>> startup time performance. GCC has something similar. >>>> >>>> In terms of implementation, it is possible to simply extend the edge >>>> profiling counters by 1 for each function, and instrument the function to >>>> record the time stamp the first time the function is executed. The overhead >>>> will be minimized and you can leverage all the other existing support in >>>> profiling runtime. >>>> >>> >>> Hi David, >>> >>> Just to clarify, are you suggesting to add an edge profiling counter per >>> function to record the time stamp? Where are the edge profiling counters >>> defined? >>> >> >> There is no needed to define the counter explicitly. What is needed is to >> introduce a new intrinsic to update the order counter, and the InstProf >> lowerer will create the counter for you. See >> Transforms/Instrumentation/InstrProfiling.cpp. +Vedant Kumar >> <vsk at apple.com> >> > > We need the instrumentation to be run late as an IR pass, so it has all > the function symbols. Are you suggesting a front-end instrumentation by > mentioning InstrProfiling.cpp? > >The lowering is actually shared by both FE instrumentation and IR PGO. For ordering profiling, it needs to run the pass after the inliner pass in theory (which is later even than the current IR PGO instrumentation pass). One way to achieve the same effect is lower the time stamp counter update intrinsic in a separate pass (after inlining). In the lowerer, the counter update intrinsic can be eliminated if it is in an inline instance, but this is not ideal either as the inlining decisions may be affected due to instrumentation. To combine the ordering profiling requirements (post inlining) and the benefit of sharing (runtime, tooling etc), the proposed implementation can be modified with the following changes: 0) keep the phase ordering in the patch 1) the instrumentation just does insert intrinsics 2) enhance lowering to handle new intrinsics. I assume ordering profiling is not intended to work together with PGO instrumentation (to make sure inlining decisions are the same), so the you end up with a counter section with one counter per out of line function. The rest will just work. The changes to runtime will be minimal. If PGO (IR or FE) + Ordering profiling is a desired use case, it can be made working too -- the lowerer needs to update the per function counter array size (appending one). David> >> >>> So the difference will be where we store the profile information and in >>> what format? >>> >>> >> Time stamp can be simply a uint64 value just like any edge/block profile >> count. llvm_profdata tool can be taught to dump the ordering file to be >> exported to the linker. >> >> >>> With the suggested approach, we need to allocate one time stamp for each >>> function, what is implemented is a pair of numbers for each executed >>> function. The runtime performance can be different as well, the suggested >>> approach gets the time stamp, and saves it to memory, what is implemented >>> is saving the pair of numbers and incrementing a counter. >>> >>> >> The runtime cost is probably not much for either approach. The suggested >> approach eliminates the trouble to maintain/implement a different way to >> identify functions. >> > > Agree that using existing infra is better! > >> Manman > >> >> >> >>> >>>> Another possibility is to use xray to implement the functionality -- >>>> xray is useful for trace like profiling by design. >>>> >>> >>> We have not looked into XRay. We need something with low binary size >>> penalty and low runtime perf degradation, not sure if XRay is a good fit! >>> >>> >> xray can achieve very low runtime overhead. +Dean Michael Berris >> <dberris at google.com> for additional comments. >> >> David >> >> >> >> >>> Thanks, >>> Manman >>> >>> >>>> David >>>> >>>> On Thu, Jan 17, 2019 at 10:24 AM Manman Ren <manman.ren at gmail.com> >>>> wrote: >>>> >>>>> Order file is used to teach ld64 how to order the functions in a >>>>> binary. If we put all functions executed during startup together in the >>>>> right order, we will greatly reduce the page faults during startup. >>>>> >>>>> To generate order file for iOS apps, we usually use dtrace, but some >>>>> apps have various startup scenarios that we want to capture in the order >>>>> file. dtrace approach is not easy to automate, it is hard to capture the >>>>> different ways of starting an app without automation. Instrumented builds >>>>> however can be deployed to phones and profile data can be automatically >>>>> collected. >>>>> >>>>> For the Facebook app, by looking at the startup distribution, we are >>>>> expecting a big win out of the order file instrumentation, from 100ms to >>>>> 500ms+, in startup time. >>>>> >>>>> The basic idea of the pass is to use a circular buffer to log the >>>>> execution ordering of the functions. We only log the function when it is >>>>> first executed. Instead of logging the symbol name of the function, we log >>>>> a pair of integers, with one integer specifying the module id, and the >>>>> other specifying the function id within the module. >>>>> >>>>> In this pass, we add three global variables: >>>>> (1) an order file buffer >>>>> The order file buffer is a circular buffer at its own llvm section. >>>>> Each entry is a pair of integers, with one integer specifying the module >>>>> id, and the other specifying the function id within the module. >>>>> (2) a bitmap for each module: one bit for each function to say if the >>>>> function is already executed; >>>>> (3) a global index to the buffer >>>>> >>>>> At the function prologue, if the function has not been executed (by >>>>> checking the bitmap), log the module id and the function id, then >>>>> atomically increase the index. >>>>> >>>>> This pass is intended to be used as a ThinLTO pass or a LTO pass. It >>>>> maps each module to a distinct integer, it also generate a mapping file so >>>>> we can decode the function symbol name from the pair of ids. >>>>> >>>>> clang has '-finstrument-function-entry-bare' which inserts a function >>>>> call and is not as efficient. >>>>> >>>>> Three patches are attached, for llvm, clang, and compiler-rt >>>>> respectively. >>>>> >>>>> TODO: >>>>> (1) Migrate to the new pass manager with a shim for the legacy pass >>>>> manager. >>>>> (2) For the order file buffer, consider always emitting definitions, >>>>> making them LinkOnceODR with a COMDAT group. >>>>> (3) Add testing case for clang/compiler-rt patches. >>>>> (4) Add utilities to deobfuscate the profile dump. >>>>> (5) The size of the buffer is currently hard-coded ( >>>>> INSTR_ORDER_FILE_BUFFER_SIZE). >>>>> >>>>> Thanks Kamal for contributing to the patches! Thanks to Aditya and >>>>> Saleem for doing an initial review pass over the patches! >>>>> >>>>> Manman >>>>> >>>>> >>>>>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190117/5ed4defd/attachment.html>
On Thu, Jan 17, 2019 at 3:58 PM Xinliang David Li <davidxl at google.com> wrote:> > > On Thu, Jan 17, 2019 at 3:25 PM Manman Ren <manman.ren at gmail.com> wrote: > >> >> >> On Thu, Jan 17, 2019 at 2:47 PM Xinliang David Li <davidxl at google.com> >> wrote: >> >>> >>> >>> On Thu, Jan 17, 2019 at 2:32 PM Manman Ren <manman.ren at gmail.com> wrote: >>> >>>> >>>> >>>> On Thu, Jan 17, 2019 at 10:53 AM Xinliang David Li <davidxl at google.com> >>>> wrote: >>>> >>>>> Hi Manman, >>>>> >>>>> Ordering profiling is certainly something very useful to have to >>>>> startup time performance. GCC has something similar. >>>>> >>>>> In terms of implementation, it is possible to simply extend the edge >>>>> profiling counters by 1 for each function, and instrument the function to >>>>> record the time stamp the first time the function is executed. The overhead >>>>> will be minimized and you can leverage all the other existing support in >>>>> profiling runtime. >>>>> >>>> >>>> Hi David, >>>> >>>> Just to clarify, are you suggesting to add an edge profiling counter >>>> per function to record the time stamp? Where are the edge profiling >>>> counters defined? >>>> >>> >>> There is no needed to define the counter explicitly. What is needed is >>> to introduce a new intrinsic to update the order counter, and the InstProf >>> lowerer will create the counter for you. See >>> Transforms/Instrumentation/InstrProfiling.cpp. +Vedant Kumar >>> <vsk at apple.com> >>> >> >> We need the instrumentation to be run late as an IR pass, so it has all >> the function symbols. Are you suggesting a front-end instrumentation by >> mentioning InstrProfiling.cpp? >> >> > > The lowering is actually shared by both FE instrumentation and IR PGO. > For ordering profiling, it needs to run the pass after the inliner pass in > theory (which is later even than the current IR PGO instrumentation pass). > One way to achieve the same effect is lower the time stamp counter update > intrinsic in a separate pass (after inlining). In the lowerer, the counter > update intrinsic can be eliminated if it is in an inline instance, but this > is not ideal either as the inlining decisions may be affected due to > instrumentation. > > To combine the ordering profiling requirements (post inlining) and the > benefit of sharing (runtime, tooling etc), the proposed implementation can > be modified with the following changes: > 0) keep the phase ordering in the patch > 1) the instrumentation just does insert intrinsics > 2) enhance lowering to handle new intrinsics. >> I assume ordering profiling is not intended to work together with PGO > instrumentation (to make sure inlining decisions are the same), so the you > end up with a counter section with one counter per out of line function. > The rest will just work. The changes to runtime will be minimal. >Got it, this should work! The overhead will be one counter for each function, and we also need to save the function names in the instrumented binary, right? Other than binary size, the other concern is the size of the profile data, as we need to download it from device. The profile data will be one counter for each function plus all the function names? In the proposed implementation, the profile data will be the size of the circular buffer, which should be the maximal number of functions that can be executed during startup.> If PGO (IR or FE) + Ordering profiling is a desired use case, it can be > made working too -- the lowerer needs to update the per function counter > array size (appending one). > >There may be a use case for IR Instr + ordering profiling. One important usage of IR Instr is to guide hot/cold splitting for startup time. It is always simpler if we only need to maintain one instrumentation build! But that depends on how much overhead IR Instr adds in terms of binary size, profile data size, and runtime perf. Manman> David > > > >> >>> >>>> So the difference will be where we store the profile information and in >>>> what format? >>>> >>>> >>> Time stamp can be simply a uint64 value just like any edge/block profile >>> count. llvm_profdata tool can be taught to dump the ordering file to be >>> exported to the linker. >>> >>> >>>> With the suggested approach, we need to allocate one time stamp for >>>> each function, what is implemented is a pair of numbers for each executed >>>> function. The runtime performance can be different as well, the suggested >>>> approach gets the time stamp, and saves it to memory, what is implemented >>>> is saving the pair of numbers and incrementing a counter. >>>> >>>> >>> The runtime cost is probably not much for either approach. The suggested >>> approach eliminates the trouble to maintain/implement a different way to >>> identify functions. >>> >> >> Agree that using existing infra is better! >> >> > > > > >> Manman >> >>> >>> >>> >>>> >>>>> Another possibility is to use xray to implement the functionality -- >>>>> xray is useful for trace like profiling by design. >>>>> >>>> >>>> We have not looked into XRay. We need something with low binary size >>>> penalty and low runtime perf degradation, not sure if XRay is a good fit! >>>> >>>> >>> xray can achieve very low runtime overhead. +Dean Michael Berris >>> <dberris at google.com> for additional comments. >>> >>> David >>> >>> >>> >>> >>>> Thanks, >>>> Manman >>>> >>>> >>>>> David >>>>> >>>>> On Thu, Jan 17, 2019 at 10:24 AM Manman Ren <manman.ren at gmail.com> >>>>> wrote: >>>>> >>>>>> Order file is used to teach ld64 how to order the functions in a >>>>>> binary. If we put all functions executed during startup together in the >>>>>> right order, we will greatly reduce the page faults during startup. >>>>>> >>>>>> To generate order file for iOS apps, we usually use dtrace, but some >>>>>> apps have various startup scenarios that we want to capture in the order >>>>>> file. dtrace approach is not easy to automate, it is hard to capture the >>>>>> different ways of starting an app without automation. Instrumented builds >>>>>> however can be deployed to phones and profile data can be automatically >>>>>> collected. >>>>>> >>>>>> For the Facebook app, by looking at the startup distribution, we are >>>>>> expecting a big win out of the order file instrumentation, from 100ms to >>>>>> 500ms+, in startup time. >>>>>> >>>>>> The basic idea of the pass is to use a circular buffer to log the >>>>>> execution ordering of the functions. We only log the function when it is >>>>>> first executed. Instead of logging the symbol name of the function, we log >>>>>> a pair of integers, with one integer specifying the module id, and the >>>>>> other specifying the function id within the module. >>>>>> >>>>>> In this pass, we add three global variables: >>>>>> (1) an order file buffer >>>>>> The order file buffer is a circular buffer at its own llvm section. >>>>>> Each entry is a pair of integers, with one integer specifying the module >>>>>> id, and the other specifying the function id within the module. >>>>>> (2) a bitmap for each module: one bit for each function to say if the >>>>>> function is already executed; >>>>>> (3) a global index to the buffer >>>>>> >>>>>> At the function prologue, if the function has not been executed (by >>>>>> checking the bitmap), log the module id and the function id, then >>>>>> atomically increase the index. >>>>>> >>>>>> This pass is intended to be used as a ThinLTO pass or a LTO pass. It >>>>>> maps each module to a distinct integer, it also generate a mapping file so >>>>>> we can decode the function symbol name from the pair of ids. >>>>>> >>>>>> clang has '-finstrument-function-entry-bare' which inserts a function >>>>>> call and is not as efficient. >>>>>> >>>>>> Three patches are attached, for llvm, clang, and compiler-rt >>>>>> respectively. >>>>>> >>>>>> TODO: >>>>>> (1) Migrate to the new pass manager with a shim for the legacy pass >>>>>> manager. >>>>>> (2) For the order file buffer, consider always emitting definitions, >>>>>> making them LinkOnceODR with a COMDAT group. >>>>>> (3) Add testing case for clang/compiler-rt patches. >>>>>> (4) Add utilities to deobfuscate the profile dump. >>>>>> (5) The size of the buffer is currently hard-coded ( >>>>>> INSTR_ORDER_FILE_BUFFER_SIZE). >>>>>> >>>>>> Thanks Kamal for contributing to the patches! Thanks to Aditya and >>>>>> Saleem for doing an initial review pass over the patches! >>>>>> >>>>>> Manman >>>>>> >>>>>> >>>>>>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190117/621aff78/attachment.html>