An order file tells ld64 how to order the functions in a binary. If we place all functions executed during startup together, in the right order, we greatly reduce page faults during startup.

To generate an order file for iOS apps we usually use dtrace, but some apps have several startup scenarios that we want to capture in the order file. The dtrace approach is not easy to automate, and without automation it is hard to capture the different ways of starting an app. Instrumented builds, however, can be deployed to phones, and profile data can be collected automatically.

For the Facebook app, by looking at the startup distribution, we are expecting a big win in startup time out of the order file instrumentation, from 100ms to 500ms+.

The basic idea of the pass is to use a circular buffer to log the execution order of the functions. We only log a function when it is first executed. Instead of logging the symbol name of the function, we log a pair of integers, with one integer specifying the module id and the other specifying the function id within the module.

In this pass, we add three global variables:
(1) an order file buffer: a circular buffer in its own LLVM section. Each entry is a pair of integers (module id, function id within the module);
(2) a bitmap for each module, with one bit per function recording whether the function has already been executed;
(3) a global index into the buffer.

At the function prologue, if the function has not been executed yet (checked via the bitmap), we log the module id and the function id, then atomically increment the index.

This pass is intended to be used as a ThinLTO pass or an LTO pass. It maps each module to a distinct integer, and it also generates a mapping file so that we can decode the function symbol name from the pair of ids.

clang has '-finstrument-function-entry-bare', which inserts a function call and is not as efficient.
Three patches are attached, for llvm, clang, and compiler-rt respectively.

TODO:
(1) Migrate to the new pass manager, with a shim for the legacy pass manager.
(2) For the order file buffer, consider always emitting definitions, making them LinkOnceODR with a COMDAT group.
(3) Add test cases for the clang/compiler-rt patches.
(4) Add utilities to deobfuscate the profile dump.
(5) The size of the buffer is currently hard-coded (INSTR_ORDER_FILE_BUFFER_SIZE).

Thanks Kamal for contributing to the patches! Thanks to Aditya and Saleem for doing an initial review pass over the patches!

Manman

Attachments:
orderfile-llvm.patch <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190117/f4b4c2b9/attachment-0003.obj>
orderfile-rt.patch <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190117/f4b4c2b9/attachment-0004.obj>
orderfile-clang.patch <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190117/f4b4c2b9/attachment-0005.obj>
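[Editor's sketch] The three globals and the prologue check described in the RFC can be illustrated in plain C roughly as follows. This is a minimal sketch under stated assumptions, not the code from the attached patches: the names `order_buffer`, `executed`, `buffer_index`, and `log_first_execution`, and the sizes `BUFFER_SIZE` and `NUM_FUNCS`, are hypothetical. The sketch also reserves a slot with the atomic increment before writing the entry and uses an atomic fetch-or on the bitmap, which may differ from what the pass actually emits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical sizes. The RFC only says the buffer size is hard-coded
   (INSTR_ORDER_FILE_BUFFER_SIZE) and does not give a value. */
#define BUFFER_SIZE 8u   /* entries in the circular buffer */
#define NUM_FUNCS   16u  /* functions in this module */

typedef struct {
    uint32_t module_id;
    uint32_t func_id;
} OrderEntry;

OrderEntry order_buffer[BUFFER_SIZE];        /* (1) order file buffer */
atomic_uchar executed[(NUM_FUNCS + 7) / 8];  /* (2) per-module bitmap */
atomic_uint buffer_index;                    /* (3) global index      */

/* Roughly what an instrumented prologue for function `func_id` does. */
void log_first_execution(uint32_t module_id, uint32_t func_id) {
    assert(func_id < NUM_FUNCS);
    unsigned char bit = (unsigned char)(1u << (func_id % 8));

    /* Set the bit and look at its previous value: if it was already set,
       this function has been logged before and there is nothing to do. */
    unsigned char prev = atomic_fetch_or(&executed[func_id / 8], bit);
    if (prev & bit)
        return;

    /* Reserve a slot; the modulo makes the buffer circular, so once the
       index passes BUFFER_SIZE the oldest entries are overwritten. */
    unsigned slot = atomic_fetch_add(&buffer_index, 1u) % BUFFER_SIZE;
    order_buffer[slot].module_id = module_id;
    order_buffer[slot].func_id = func_id;
}
```

Calling `log_first_execution(3, 5)` twice leaves a single `(3, 5)` entry in the buffer; a post-processing step would then use the mapping file to turn each pair back into a symbol name.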
Xinliang David Li via llvm-dev
2019-Jan-17 18:53 UTC
[llvm-dev] [RFC] Order File Instrumentation
Hi Manman,

Order profiling is certainly very useful for improving startup-time performance. GCC has something similar.

In terms of implementation, it is possible to simply extend the edge profiling counters by 1 for each function, and instrument the function to record the time stamp the first time the function is executed. The overhead will be minimized and you can leverage all the other existing support in the profiling runtime.

Another possibility is to use XRay to implement the functionality; XRay is designed for trace-like profiling.

David
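[Editor's sketch] The time-stamp alternative suggested above might look roughly like this: one extra slot per function holds the time of its first execution, and the startup order is recovered afterwards by sorting. This is a hedged illustration only. The names `first_seen`, `now_ticks`, `record_first_execution`, and `startup_order` are hypothetical, a logical tick counter stands in for a real clock so the example is deterministic, and the slot update is not thread-safe. It does not show how LLVM's edge profiling counters are actually laid out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_FUNCS 4u  /* hypothetical number of instrumented functions */

uint64_t first_seen[NUM_FUNCS];  /* one slot per function; 0 = never run */
atomic_uint_fast64_t ticks;      /* stand-in for a real time source */

/* Returns a strictly increasing, non-zero logical time stamp. */
uint64_t now_ticks(void) {
    return (uint64_t)atomic_fetch_add(&ticks, 1) + 1;
}

/* Roughly what the instrumented function entry would do. */
void record_first_execution(uint32_t func_id) {
    assert(func_id < NUM_FUNCS);
    if (first_seen[func_id] == 0)
        first_seen[func_id] = now_ticks();
}

/* qsort comparator: orders function ids by their first-execution time. */
static int by_first_seen(const void *a, const void *b) {
    uint64_t ta = first_seen[*(const uint32_t *)a];
    uint64_t tb = first_seen[*(const uint32_t *)b];
    return (ta > tb) - (ta < tb);
}

/* Writes the ids of executed functions to `out` in first-execution order
   and returns how many there are. `out` must hold NUM_FUNCS entries. */
size_t startup_order(uint32_t *out) {
    size_t n = 0;
    for (uint32_t f = 0; f < NUM_FUNCS; ++f)
        if (first_seen[f] != 0)
            out[n++] = f;
    qsort(out, n, sizeof out[0], by_first_seen);
    return n;
}
```

The trade-off is visible in the data layout: this variant allocates a slot for every function whether or not it runs, while the circular-buffer design stores one pair per function that actually executed.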
On Thu, Jan 17, 2019 at 10:53 AM Xinliang David Li <davidxl at google.com> wrote:
> Hi Manman,
>
> Ordering profiling is certainly something very useful to have to startup
> time performance. GCC has something similar.
>
> In terms of implementation, it is possible to simply extend the edge
> profiling counters by 1 for each function, and instrument the function to
> record the time stamp the first time the function is executed. The overhead
> will be minimized and you can leverage all the other existing support in
> profiling runtime.

Hi David,

Just to clarify, are you suggesting adding an edge profiling counter per function to record the time stamp? Where are the edge profiling counters defined?

So the difference will be where we store the profile information, and in what format? With the suggested approach, we need to allocate one time stamp for each function; what is implemented is a pair of numbers for each executed function. The runtime performance can differ as well: the suggested approach gets the time stamp and saves it to memory, whereas what is implemented saves the pair of numbers and increments a counter.

> Another possibility is to use xray to implement the functionality -- xray
> is useful for trace like profiling by design.

We have not looked into XRay. We need something with a low binary size penalty and low runtime performance degradation, so we are not sure XRay is a good fit!

Thanks,
Manman
Hans Wennborg via llvm-dev
2019-Jan-18 10:43 UTC
[llvm-dev] [RFC] Order File Instrumentation
On Thu, Jan 17, 2019 at 7:24 PM Manman Ren via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> [...]
> clang has '-finstrument-function-entry-bare' which inserts a function call and is not as efficient.

Can you elaborate on why this existing functionality is not efficient enough for you?

For Chrome on Windows, we use -finstrument-functions-after-inlining to insert calls at function entry (after inlining) to a function that captures the addresses in a buffer, and later symbolizes and dumps them to an order file that we feed to the linker. We use a similar approach for Chrome on Android, but I'm not as familiar with the details there.

Thanks,
Hans
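[Editor's sketch] The call-based approach described above can be sketched with the standard `__cyg_profile_func_enter` / `__cyg_profile_func_exit` hooks that `-finstrument-functions` and `-finstrument-functions-after-inlining` call. This is only an illustration, not Chrome's actual implementation: the buffer name `entered`, the counter `num_entered`, and the capacity `MAX_ENTRIES` are hypothetical, and symbolization and deduplication are assumed to happen offline.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_ENTRIES 1024u  /* hypothetical buffer capacity */

void *entered[MAX_ENTRIES];  /* addresses of entered functions, in order */
atomic_size_t num_entered;   /* number of entry events seen so far */

/* Called by instrumented code at every function entry. The attribute keeps
   the hook itself from being instrumented, which would recurse. */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site) {
    (void)call_site;
    size_t slot = atomic_fetch_add(&num_entered, 1);
    if (slot < MAX_ENTRIES)  /* drop events once the buffer is full */
        entered[slot] = this_fn;
}

/* Instrumented code also calls an exit hook; it is a no-op here. */
__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site) {
    (void)this_fn;
    (void)call_site;
}
```

Unlike the bitmap design in the RFC, this hook runs on every call rather than only the first one, and it records raw addresses that must be symbolized later, which is the overhead difference the question above is probing.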
Chris Bieneman via llvm-dev
2019-Jan-18 22:01 UTC
[llvm-dev] [RFC] Order File Instrumentation
I would love to see this kind of order profiling support. Using dtrace to generate function orders is actually really problematic, because dtrace made implementation tradeoffs that allow it to ignore probe execution if the performance impact on the system is too great. This can make dtrace non-deterministic, which is not ideal for generating optimization data.

Additionally, if order generation could be enabled at the same time as PGO generation, that would be a great solution for generating profile data for optimizing clang itself. Clang has some scripts and build-system goop under utils/perf-training that can generate order files using dtrace and PGO data; it would be great to apply this technique to those tools too.

-Chris