On Fri, Jan 18, 2019 at 9:10 PM Manman Ren <manman.ren at gmail.com> wrote:

> On Fri, Jan 18, 2019 at 4:11 PM Xinliang David Li <davidxl at google.com>
> wrote:
>
>> On Fri, Jan 18, 2019 at 3:56 PM Manman Ren <manman.ren at gmail.com> wrote:
>>
>>> Some background information first, then a quick summary of what we have
>>> discussed so far!
>>>
>>> Background: Facebook app is one of the biggest iOS apps. Because of
>>> this, we want the instrumentation to be as lightweight as possible in terms
>>> of binary size, profile data size, and runtime performance. The plan to
>>> improve Facebook app start up time is to (1) implement order file
>>> instrumentation to be as light as possible, (2) push the order file
>>> instrumentation to internal users first, and then to external beta users if
>>> the overhead is low, (3) enable PGO instrumentation to collect information
>>> to guide hot/cold splitting, and (4) push PGO instrumentation to internal
>>> users.
>>>
>>> There are a few alternatives we have discussed:
>>> (A) What is proposed in the initial email: Log (module id, function id)
>>> into a circular buffer in its own profile section when a function is first
>>> executed.
>>>
>>> (B) Re-use existing infra of a per function counter to record the
>>> timestamp when a function is first executed
>>> Compared to option (A), the runtime overhead for option (B) should be
>>> higher since we will be calling timestamp for each function that is
>>> executed at startup time,
>>>
>>
>> The 'timestamp' can just be a global index. Since there is one
>> counter per func, the counter can be initialized to be '-1' so that you
>> don't need to use bitmap to track if the function has been invoked or not.
>> In other words, the runtime overhead of B) could be lower :)
>>
>
> That actually works! We only care about the ordering of the functions.
> But the concerns about profile data size and binary size still exist :]
>

The runtime should be similar as we still need to check if the counter is
"-1" before saving the global index. We don't need the separate bitmap
though. Also the counter can be initialized to 0 and the global index can
start from 1.

>> David
>>
>>> and the binary and the profile data will be larger since it needs one
>>> number for each function plus additional overhead in the per-function
>>> metadata recorded in llvm_prf_data. The buffer size for option (A) is
>>> controllable; it needs to be the number of functions executed at startup.
>>>
>>
> Do you have a rough estimate of how much overhead the per-function
> metadata is?
>
> Manman
>
>>> For the Facebook app, we expect that the number of functions executed
>>> during startup is 1/3 to 1/2 of all functions in the binary. Profile data
>>> size is important since we need to upload the profile data from device to
>>> server.
>>>
>>> The plus side is to reuse the existing infra!
>>>
>>> In terms of integration with PGO instrumentation, both (A) and (B)
>>> should work. For (B), we need to increase the number of per function
>>> counters by one. For (A), they will be in different sections.
>>>
>>> (C) XRay
>>> We have not looked into this, but would like to hear more about it!
>>>
>>> (D) -finstrument-functions-after-inlining or
>>> -finstrument-function-entry-bare
>>> We are worried about the runtime overhead of calling a separate function
>>> when starting up the App.
>>>
>>> Thanks,
>>> Manman
>>>
>>> On Fri, Jan 18, 2019 at 2:01 PM Chris Bieneman <chris.bieneman at me.com>
>>> wrote:
>>>
>>>> I would love to see this kind of order profiling support. Using dtrace
>>>> to generate function orders is actually really problematic because dtrace
>>>> made tradeoffs in implementation allowing it to ignore probe execution if
>>>> the performance impact is too great on the system. This can result in
>>>> dtrace being non-deterministic which is not ideal for generating
>>>> optimization data.
>>>>
>>>> Additionally if order generation could be enabled at the same time as
>>>> PGO generation that would be a great solution for generating profile data
>>>> for optimizing clang itself. Clang has some scripts and build-system goop
>>>> under utils/perf-training that can generate order files using dtrace and
>>>> PGO data, it would be great to apply this technique to those tools too.
>>>>
>>>> -Chris
>>>>
>>>> > On Jan 18, 2019, at 2:43 AM, Hans Wennborg via llvm-dev <
>>>> llvm-dev at lists.llvm.org> wrote:
>>>> >
>>>> > On Thu, Jan 17, 2019 at 7:24 PM Manman Ren via llvm-dev
>>>> > <llvm-dev at lists.llvm.org> wrote:
>>>> >>
>>>> >> Order file is used to teach ld64 how to order the functions in a
>>>> binary. If we put all functions executed during startup together in the
>>>> right order, we will greatly reduce the page faults during startup.
>>>> >>
>>>> >> To generate order file for iOS apps, we usually use dtrace, but some
>>>> apps have various startup scenarios that we want to capture in the order
>>>> file. dtrace approach is not easy to automate, it is hard to capture the
>>>> different ways of starting an app without automation. Instrumented builds
>>>> however can be deployed to phones and profile data can be automatically
>>>> collected.
>>>> >>
>>>> >> For the Facebook app, by looking at the startup distribution, we are
>>>> expecting a big win out of the order file instrumentation, from 100ms to
>>>> 500ms+, in startup time.
>>>> >>
>>>> >> The basic idea of the pass is to use a circular buffer to log the
>>>> execution ordering of the functions. We only log the function when it is
>>>> first executed. Instead of logging the symbol name of the function, we log
>>>> a pair of integers, with one integer specifying the module id, and the
>>>> other specifying the function id within the module.
>>>> >
>>>> > [...]
>>>> >
>>>> >> clang has '-finstrument-function-entry-bare' which inserts a
>>>> function call and is not as efficient.
>>>> >
>>>> > Can you elaborate on why this existing functionality is not efficient
>>>> > enough for you?
>>>> >
>>>> > For Chrome on Windows, we use -finstrument-functions-after-inlining to
>>>> > insert calls at function entry (after inlining) that calls a function
>>>> > which captures the addresses in a buffer, and later symbolizes and
>>>> > dumps them to an order file that we feed the linker. We use a similar
>>>> > approach for Chrome on Android, but I'm not as familiar with the
>>>> > details there.
>>>> >
>>>> > Thanks,
>>>> > Hans
>>>> > _______________________________________________
>>>> > LLVM Developers mailing list
>>>> > llvm-dev at lists.llvm.org
>>>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
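The zero-initialized counter variant described in the reply above (counter starts at 0, global index starts from 1) can be modeled in a few lines. This is only a sketch of the logic, not the instrumentation LLVM would emit: the names `FirstUseCounters`, `on_entry`, and `startup_order` are hypothetical, function ids are assumed to be small integers, and threading is ignored.

```python
class FirstUseCounters:
    """Toy model of option (B): one counter slot per function.

    A slot value of 0 means "not executed yet", so no separate bitmap
    is needed to track whether a function has already been logged.
    """

    def __init__(self, num_functions):
        # Zero-initialized, one slot per function in the binary.
        self.counters = [0] * num_functions
        # The global index starts from 1 so that 0 stays the sentinel.
        self.next_index = 1

    def on_entry(self, func_id):
        # The check mentioned in the thread: only the first execution
        # of a function stores the current global index.
        if self.counters[func_id] == 0:
            self.counters[func_id] = self.next_index
            self.next_index += 1

    def startup_order(self):
        # Executed functions, sorted by the index saved at first execution.
        executed = [(idx, f) for f, idx in enumerate(self.counters) if idx != 0]
        return [f for _, f in sorted(executed)]


# Hypothetical startup trace: functions 3, 1, 3, 4, 1 are entered in turn.
profile = FirstUseCounters(num_functions=5)
for f in [3, 1, 3, 4, 1]:
    profile.on_entry(f)
```

In this model `profile.startup_order()` yields the first-execution order of the functions that ran, while functions that never ran keep a 0 slot. Every function still occupies a slot whether it ran or not, which is the profile-size concern raised earlier in the thread.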
Xinliang David Li via llvm-dev
2019-Jan-19 05:37 UTC
[llvm-dev] [RFC] Order File Instrumentation
On Fri, Jan 18, 2019 at 9:19 PM Manman Ren <manman.ren at gmail.com> wrote:

> The runtime should be similar as we still need to check if the counter is
> "-1" before saving the global index. We don't need the separate bitmap
> though. Also the counter can be initialized to 0 and the global index can
> start from 1.
>

If we don't need bitmap, then the two approaches are converging!

David
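For comparison, option (A) from the quoted summary (log a (module id, function id) pair into a circular buffer the first time a function runs) can be modeled as follows. This is a hedged sketch: the class name `OrderBuffer` and its methods are hypothetical, a Python set stands in for the bitmap mentioned in the thread, and the overwrite-oldest behavior on wraparound is an assumption about how a circular buffer would behave here, not something the thread specifies.

```python
class OrderBuffer:
    """Toy model of option (A): fixed-size circular buffer of id pairs."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.write_pos = 0   # total number of entries ever written
        self.seen = set()    # stands in for the "already executed" bitmap

    def on_entry(self, module_id, func_id):
        key = (module_id, func_id)
        if key in self.seen:
            return           # only the first execution is logged
        self.seen.add(key)
        # Wrap around when the buffer is full (assumed behavior).
        self.slots[self.write_pos % len(self.slots)] = key
        self.write_pos += 1

    def entries(self):
        # Return surviving entries from oldest to newest.
        n = len(self.slots)
        if self.write_pos <= n:
            return self.slots[:self.write_pos]
        start = self.write_pos % n
        return self.slots[start:] + self.slots[:start]


# Buffer large enough for the startup set: order is preserved exactly.
buf = OrderBuffer(capacity=4)
for module_id, func_id in [(0, 7), (1, 2), (0, 7), (0, 3)]:
    buf.on_entry(module_id, func_id)

# Buffer that is too small: the oldest entry is overwritten.
small = OrderBuffer(capacity=2)
for module_id, func_id in [(0, 1), (0, 2), (0, 3)]:
    small.on_entry(module_id, func_id)
```

The second case illustrates why the thread says the buffer needs to hold the number of functions executed at startup: in this model, an undersized buffer silently loses the earliest entries.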
I chatted with David offline over the weekend. Thanks for the great
discussions, David!

The trimmed-down version of the current infra will require 2 x 8 bytes for
each function, while the circular buffer implementation requires 4 bytes
(2 bytes for the module id, 2 bytes for the function id) for each startup
function. For the Facebook app, that means the profile data will be 8 times
larger. Since we want to push the instrumented build to external test users,
we are trying to minimize the amount uploaded from device to servers.

The circular buffer implementation currently uses (module id, function id),
which only works in ThinLTO mode. David suggested decoupling from ThinLTO by
using the 8-byte MD5 of function names. I plan to revise the existing patches
to decouple from ThinLTO, following David's suggestion.

Let me know if you have questions on the general approach!

Thanks,
Manman

On Fri, Jan 18, 2019 at 9:37 PM Xinliang David Li <davidxl at google.com> wrote:

> If we don't need bitmap, then the two approaches are converging !
>
> David
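The size comparison in the message above can be checked with a little arithmetic. The sketch below uses the byte counts given in the message (2 x 8 bytes per function versus 4 bytes per startup function) and the earlier estimate that 1/3 to 1/2 of all functions run at startup. The helper name `size_ratio` is hypothetical, and the last line assumes that switching to an 8-byte MD5 makes each buffer entry exactly 8 bytes, which the thread does not state.

```python
from fractions import Fraction

PER_FUNCTION_BYTES = 2 * 8  # trimmed-down existing infra: 2 x 8 bytes per function
PAIR_ENTRY_BYTES = 2 + 2    # circular buffer: 2-byte module id + 2-byte function id


def size_ratio(startup_fraction, entry_bytes=PAIR_ENTRY_BYTES):
    """Per-function-counter profile size divided by circular-buffer size.

    startup_fraction is the share of all functions executed at startup.
    The total function count cancels out of the ratio.
    """
    return Fraction(PER_FUNCTION_BYTES) / (entry_bytes * Fraction(startup_fraction))


ratio_half = size_ratio(Fraction(1, 2))    # half of all functions run at startup
ratio_third = size_ratio(Fraction(1, 3))   # a third of all functions run at startup
ratio_md5_half = size_ratio(Fraction(1, 2), entry_bytes=8)  # assumed 8-byte entries
```

With half of the functions executed at startup the ratio is 8, matching the figure quoted in the message; with one third it rises to 12. Under the 8-byte-entry assumption the ratio at one half drops to 4.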