Xinliang David Li via llvm-dev
2019-Jan-19  00:11 UTC
[llvm-dev] [RFC] Order File Instrumentation
On Fri, Jan 18, 2019 at 3:56 PM Manman Ren <manman.ren at gmail.com> wrote:

> Some background information first, then a quick summary of what we have
> discussed so far!
>
> Background: Facebook app is one of the biggest iOS apps. Because of this,
> we want the instrumentation to be as lightweight as possible in terms of
> binary size, profile data size, and runtime performance. The plan to
> improve Facebook app start up time is to (1) implement order file
> instrumentation to be as light as possible, (2) push the order file
> instrumentation to internal users first, and then to external beta users if
> the overhead is low, (3) enable PGO instrumentation to collect information
> to guide hot/cold splitting, and (4) push PGO instrumentation to internal
> users.
>
> There are a few alternatives we have discussed:
> (A) What is proposed in the initial email: Log (module id, function id)
> into a circular buffer in its own profile section when a function is first
> executed.
>
> (B) Re-use existing infra of a per function counter to record the
> timestamp when a function is first executed
> Compared to option (A), the runtime overhead for option (B) should be
> higher since we will be calling timestamp for each function that is
> executed at startup time,

The 'timestamp' can just be a global index. Since there is one counter per
function, the counter can be initialized to '-1', so you don't need a bitmap
to track whether the function has been invoked. In other words, the runtime
overhead of (B) could be lower :)

David

> and the binary and the profile data will be larger since it needs one
> number for each function plus additional overhead in the per-function
> metadata recorded in llvm_prf_data. The buffer size for option (A) is
> controllable, it needs to be the number of functions executed at startup.
>
> For the Facebook app, we expect that the number of functions executed
> during startup is 1/3 to 1/2 of all functions in the binary. Profile data
> size is important since we need to upload the profile data from device to
> server.
>
> The plus side is to reuse the existing infra!
>
> In terms of integration with PGO instrumentation, both (A) and (B) should
> work. For (B), we need to increase the number of per function counters by
> one. For (A), they will be in different sections.
>
> (C) XRay
> We have not looked into this, but would like to hear more about it!
>
> (D) -finstrument-functions-after-inlining or
> -finstrument-function-entry-bare
> We are worried about the runtime overhead of calling a separate function
> when starting up the App.
>
> Thanks,
> Manman
>
> On Fri, Jan 18, 2019 at 2:01 PM Chris Bieneman <chris.bieneman at me.com>
> wrote:
>
>> I would love to see this kind of order profiling support. Using dtrace to
>> generate function orders is actually really problematic because dtrace made
>> tradeoffs in implementation allowing it to ignore probe execution if the
>> performance impact is too great on the system. This can result in dtrace
>> being non-deterministic which is not ideal for generating optimization data.
>>
>> Additionally if order generation could be enabled at the same time as PGO
>> generation that would be a great solution for generating profile data for
>> optimizing clang itself. Clang has some scripts and build-system goop under
>> utils/perf-training that can generate order files using dtrace and PGO
>> data, it would be great to apply this technique to those tools too.
>>
>> -Chris
>>
>> > On Jan 18, 2019, at 2:43 AM, Hans Wennborg via llvm-dev
>> > <llvm-dev at lists.llvm.org> wrote:
>> >
>> > On Thu, Jan 17, 2019 at 7:24 PM Manman Ren via llvm-dev
>> > <llvm-dev at lists.llvm.org> wrote:
>> >>
>> >> Order file is used to teach ld64 how to order the functions in a
>> >> binary. If we put all functions executed during startup together in the
>> >> right order, we will greatly reduce the page faults during startup.
>> >>
>> >> To generate order file for iOS apps, we usually use dtrace, but some
>> >> apps have various startup scenarios that we want to capture in the order
>> >> file. dtrace approach is not easy to automate, it is hard to capture the
>> >> different ways of starting an app without automation. Instrumented builds
>> >> however can be deployed to phones and profile data can be automatically
>> >> collected.
>> >>
>> >> For the Facebook app, by looking at the startup distribution, we are
>> >> expecting a big win out of the order file instrumentation, from 100ms to
>> >> 500ms+, in startup time.
>> >>
>> >> The basic idea of the pass is to use a circular buffer to log the
>> >> execution ordering of the functions. We only log the function when it is
>> >> first executed. Instead of logging the symbol name of the function, we log
>> >> a pair of integers, with one integer specifying the module id, and the
>> >> other specifying the function id within the module.
>> >
>> > [...]
>> >
>> >> clang has '-finstrument-function-entry-bare' which inserts a function
>> >> call and is not as efficient.
>> >
>> > Can you elaborate on why this existing functionality is not efficient
>> > enough for you?
>> >
>> > For Chrome on Windows, we use -finstrument-functions-after-inlining to
>> > insert calls at function entry (after inlining) that calls a function
>> > which captures the addresses in a buffer, and later symbolizes and
>> > dumps them to an order file that we feed the linker. We use a similar
>> > approach for Chrome on Android, but I'm not as familiar with the
>> > details there.
>> >
>> > Thanks,
>> > Hans
>> > _______________________________________________
>> > LLVM Developers mailing list
>> > llvm-dev at lists.llvm.org
>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
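
As a rough illustration of the suggestion above (one counter per function, initialized to -1, stamped with a global index on first entry), here is a small Python model of the bookkeeping. It is only a sketch: the class name, the function names, and the list-based layout are made up for the example, and real instrumentation would be emitted inline by the compiler rather than written like this.

```python
# Model of option (B) with the "-1 means not yet executed" convention.
NOT_EXECUTED = -1


class OrderProfile:
    def __init__(self, function_names):
        # One counter slot per instrumented function (hypothetical layout).
        self.names = list(function_names)
        self.counters = [NOT_EXECUTED] * len(self.names)
        # A global index stands in for a real timestamp.
        self.next_index = 0

    def on_entry(self, func_id):
        # The check that would run at every function entry: only the
        # first execution records anything, so no separate bitmap is needed.
        if self.counters[func_id] == NOT_EXECUTED:
            self.counters[func_id] = self.next_index
            self.next_index += 1


profile = OrderProfile(["main", "init_ui", "load_config", "never_called"])
for func_id in [0, 2, 1, 2, 0, 1]:  # simulated sequence of function entries
    profile.on_entry(func_id)

print(profile.counters)
```

Functions that never run keep the -1 sentinel, so they can be told apart from the first function executed (index 0) when the profile is read back.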
On Fri, Jan 18, 2019 at 4:11 PM Xinliang David Li <davidxl at google.com> wrote:

> On Fri, Jan 18, 2019 at 3:56 PM Manman Ren <manman.ren at gmail.com> wrote:
>
>> [...]
>>
>> (B) Re-use existing infra of a per function counter to record the
>> timestamp when a function is first executed
>> Compared to option (A), the runtime overhead for option (B) should be
>> higher since we will be calling timestamp for each function that is
>> executed at startup time,
>
> The 'timestamp' can just be a global index. Since there is one counter per
> function, the counter can be initialized to '-1', so you don't need a bitmap
> to track whether the function has been invoked. In other words, the runtime
> overhead of (B) could be lower :)

That actually works! We only care about the ordering of the functions. But
the concerns about profile data size and binary size still exist :]

> David
>
>> and the binary and the profile data will be larger since it needs one
>> number for each function plus additional overhead in the per-function
>> metadata recorded in llvm_prf_data. The buffer size for option (A) is
>> controllable, it needs to be the number of functions executed at startup.

Do you have a rough estimate of how much overhead the per-function metadata
adds?

Manman

>> [...]
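
For comparison, option (A) from the quoted summary can be modeled roughly as follows: a bitmap records which functions have already run, and the first execution appends a (module id, function id) pair to a fixed-size buffer. This is a sketch under assumptions, not the proposed pass: the class name, the dictionary-of-lists bitmap, and the wrap-around behavior when the buffer fills are all guesses made for illustration.

```python
class CircularOrderLog:
    def __init__(self, capacity, functions_per_module):
        # Buffer size is chosen up front; the thread notes it only needs
        # to cover the functions executed during startup.
        self.capacity = capacity
        self.entries = [None] * capacity
        self.writes = 0
        # "Bitmap": one flag per function, grouped by module id.
        self.seen = {
            module_id: [False] * count
            for module_id, count in functions_per_module.items()
        }

    def on_entry(self, module_id, function_id):
        # Only the first execution of a function is logged.
        if self.seen[module_id][function_id]:
            return
        self.seen[module_id][function_id] = True
        # Assumed wrap-around: once full, the oldest entries are overwritten.
        self.entries[self.writes % self.capacity] = (module_id, function_id)
        self.writes += 1


log = CircularOrderLog(capacity=4, functions_per_module={0: 3, 1: 2})
for module_id, function_id in [(0, 0), (1, 1), (0, 0), (0, 2), (1, 1)]:
    log.on_entry(module_id, function_id)

print(log.entries)
```

The size trade-off discussed in the thread shows up directly: here storage scales with the chosen buffer capacity plus one flag per function, whereas option (B) needs a full counter (and metadata) for every function whether or not it runs at startup.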
On Fri, Jan 18, 2019 at 9:10 PM Manman Ren <manman.ren at gmail.com> wrote:

> On Fri, Jan 18, 2019 at 4:11 PM Xinliang David Li <davidxl at google.com>
> wrote:
>
>> On Fri, Jan 18, 2019 at 3:56 PM Manman Ren <manman.ren at gmail.com> wrote:
>>
>>> [...]
>>>
>>> (B) Re-use existing infra of a per function counter to record the
>>> timestamp when a function is first executed
>>> Compared to option (A), the runtime overhead for option (B) should be
>>> higher since we will be calling timestamp for each function that is
>>> executed at startup time,
>>
>> The 'timestamp' can just be a global index. Since there is one counter per
>> function, the counter can be initialized to '-1', so you don't need a bitmap
>> to track whether the function has been invoked. In other words, the runtime
>> overhead of (B) could be lower :)
>
> That actually works! We only care about the ordering of the functions. But
> the concerns about profile data size and binary size still exist :]

The runtime should be similar, since we still need to check whether the
counter is "-1" before saving the global index. We don't need the separate
bitmap, though. Also, the counter can be initialized to 0 and the global
index can start from 1.

>> David
>>
>>> and the binary and the profile data will be larger since it needs one
>>> number for each function plus additional overhead in the per-function
>>> metadata recorded in llvm_prf_data. The buffer size for option (A) is
>>> controllable, it needs to be the number of functions executed at startup.
>
> Do you have a rough estimate of how much overhead the per-function metadata
> adds?
>
> Manman
>
>>> [...]
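
The zero-initialized variant mentioned above, together with the offline step of turning the recorded indices into a function order, might look like the following Python sketch. The helper name and the sample function names are hypothetical, and the sort step is an assumption about how the counters would be consumed; the thread itself only describes what gets recorded.

```python
def first_execution_order(names, entry_trace):
    """Return (counters, ordered function names) for a simulated run."""
    counters = [0] * len(names)  # 0 means "never executed"
    next_index = 1               # the global index starts from 1
    for func_id in entry_trace:
        # Same entry check as before, but against 0 instead of -1.
        if counters[func_id] == 0:
            counters[func_id] = next_index
            next_index += 1
    # Offline step: drop functions that never ran, sort the rest by index.
    executed = [(index, name) for index, name in zip(counters, names) if index != 0]
    return counters, [name for _, name in sorted(executed)]


names = ["main", "init_ui", "load_config", "never_called"]
counters, order = first_execution_order(names, [0, 2, 1, 2, 0])
print(counters)
print(order)
```

Only functions with a non-zero counter appear in the resulting order, which matches the point made earlier in the thread that only the ordering matters, not the actual time.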
Xinliang David Li via llvm-dev
2019-Jan-19  05:27 UTC
[llvm-dev] [RFC] Order File Instrumentation
On Fri, Jan 18, 2019 at 9:11 PM Manman Ren <manman.ren at gmail.com> wrote:

> On Fri, Jan 18, 2019 at 4:11 PM Xinliang David Li <davidxl at google.com>
> wrote:
>
>> On Fri, Jan 18, 2019 at 3:56 PM Manman Ren <manman.ren at gmail.com> wrote:
>>
>>> [...]
>>>
>>> (B) Re-use existing infra of a per function counter to record the
>>> timestamp when a function is first executed
>>> Compared to option (A), the runtime overhead for option (B) should be
>>> higher since we will be calling timestamp for each function that is
>>> executed at startup time,
>>
>> The 'timestamp' can just be a global index. Since there is one counter per
>> function, the counter can be initialized to '-1', so you don't need a bitmap
>> to track whether the function has been invoked. In other words, the runtime
>> overhead of (B) could be lower :)
>
> That actually works! We only care about the ordering of the functions. But
> the concerns about profile data size and binary size still exist :]
>
>> David
>>
>>> and the binary and the profile data will be larger since it needs one
>>> number for each function plus additional overhead in the per-function
>>> metadata recorded in llvm_prf_data. The buffer size for option (A) is
>>> controllable, it needs to be the number of functions executed at startup.
>
> Do you have a rough estimate of how much overhead the per-function metadata
> adds?

For PGO, it is 8 double words for one function, but 7 of the double words
are unnecessary. It is entirely reasonable to emit only *one* double word
(a reference to the name) in the per-function data when only order profiling
is turned on (encode this in the profile header version field). We can defer
support for mixed mode (with PGO instrumentation) until later.

David

> Manman
>
>>> [...]
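
To make the size argument concrete, here is a back-of-the-envelope calculation in Python. Every number in it is an assumption chosen for illustration: a "double word" is taken to be 8 bytes, each function is assumed to carry one 8-byte counter in addition to its metadata, and the function count of 100,000 is hypothetical rather than a figure from the thread.

```python
WORD_BYTES = 8  # assumed size of one "double word"


def option_b_bytes(num_functions, metadata_words):
    # One counter per function plus the per-function metadata record.
    counter_bytes = num_functions * WORD_BYTES
    metadata_bytes = num_functions * metadata_words * WORD_BYTES
    return counter_bytes + metadata_bytes


num_functions = 100_000  # hypothetical binary size, not from the thread

full = option_b_bytes(num_functions, metadata_words=8)  # full PGO-style record
slim = option_b_bytes(num_functions, metadata_words=1)  # name reference only

print(full, slim, full - slim)
```

Under these assumptions, trimming the record from eight words to one removes most of the per-function cost, which is the substance of the suggestion to emit only the name reference when order profiling is enabled on its own.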