thr3ads.net - llvm dev - [llvm-dev] [RFC] Order File Instrumentation [Jan 2019]

If this information is useful, please help other people find it:
Share via:

Manman Ren via llvm-dev

2019-Jan-17 23:24 UTC

[llvm-dev] [RFC] Order File Instrumentation

On Thu, Jan 17, 2019 at 2:47 PM Xinliang David Li <davidxl at google.com>
wrote:
>
>
> On Thu, Jan 17, 2019 at 2:32 PM Manman Ren <manman.ren at gmail.com>
wrote:
>
>>
>>
>> On Thu, Jan 17, 2019 at 10:53 AM Xinliang David Li <davidxl at
google.com>
>> wrote:
>>
>>> Hi Manman,
>>>
>>> Ordering profiling is certainly something very useful to have to
startup
>>> time performance. GCC has something similar.
>>>
>>> In terms of implementation, it is possible to simply extend the
edge
>>> profiling counters by 1 for each function, and instrument the
function to
>>> record the time stamp the first time the function is executed. The
overhead
>>> will be minimized and you can leverage all the other existing
support in
>>> profiling runtime.
>>>
>>
>> Hi David,
>>
>> Just to clarify, are you suggesting to add an edge profiling counter
per
>> function to record the time stamp? Where are the edge profiling
counters
>> defined?
>>
>
> There is no needed to define the counter explicitly. What is needed is to
> introduce a new intrinsic to update the order counter, and the InstProf
> lowerer will create the counter for you. See
> Transforms/Instrumentation/InstrProfiling.cpp.   +Vedant Kumar
> <vsk at apple.com>
>
We need the instrumentation to be run late as an IR pass, so it has all the
function symbols. Are you suggesting a front-end instrumentation by
mentioning InstrProfiling.cpp?

>
>
>> So the difference will be where we store the profile information and in
>> what format?
>>
>>
> Time stamp can be simply a uint64 value just like any edge/block profile
> count.   llvm_profdata tool can be taught to dump the ordering file to be
> exported to the linker.
>
>
>> With the suggested approach, we need to allocate one time stamp for
each
>> function, what is implemented is a pair of numbers for each executed
>> function. The runtime performance can be different as well, the
suggested
>> approach gets the time stamp, and saves it to memory, what is
implemented
>> is saving the pair of numbers and incrementing a counter.
>>
>>
> The runtime cost is probably not much for either approach. The suggested
> approach eliminates the trouble to maintain/implement a different way to
> identify functions.
>
Agree that using existing infra is better!

Manman
>
>
>
>>
>>> Another possibility is to use xray to implement the functionality
--
>>> xray is useful for trace like profiling by design.
>>>
>>
>> We have not looked into XRay. We need something with low binary size
>> penalty and low runtime perf degradation, not sure if XRay is a good
fit!
>>
>>
> xray can achieve very low runtime overhead. +Dean Michael Berris
> <dberris at google.com> for additional comments.
>
> David
>
>
>
>
>> Thanks,
>> Manman
>>
>>
>>> David
>>>
>>> On Thu, Jan 17, 2019 at 10:24 AM Manman Ren <manman.ren at
gmail.com>
>>> wrote:
>>>
>>>> Order file is used to teach ld64 how to order the functions in
a
>>>> binary. If we put all functions executed during startup
together in the
>>>> right order, we will greatly reduce the page faults during
startup.
>>>>
>>>> To generate order file for iOS apps, we usually use dtrace, but
some
>>>> apps have various startup scenarios that we want to capture in
the order
>>>> file. dtrace approach is not easy to automate, it is hard to
capture the
>>>> different ways of starting an app without automation.
Instrumented builds
>>>> however can be deployed to phones and profile data can be
automatically
>>>> collected.
>>>>
>>>> For the Facebook app, by looking at the startup distribution,
we are
>>>> expecting a big win out of the order file instrumentation, from
100ms to
>>>> 500ms+, in startup time.
>>>>
>>>> The basic idea of the pass is to use a circular buffer to log
the
>>>> execution ordering of the functions. We only log the function
when it is
>>>> first executed. Instead of logging the symbol name of the
function, we log
>>>> a pair of integers, with one integer specifying the module id,
and the
>>>> other specifying the function id within the module.
>>>>
>>>> In this pass, we add three global variables:
>>>> (1) an order file buffer
>>>> The order file buffer is a circular buffer at its own llvm
section.
>>>> Each entry is a pair of integers, with one integer specifying
the module
>>>> id, and the other specifying the function id within the module.
>>>> (2) a bitmap for each module: one bit for each function to say
if the
>>>> function is already executed;
>>>> (3) a global index to the buffer
>>>>
>>>> At the function prologue, if the function has not been executed
(by
>>>> checking the bitmap), log the module id and the function id,
then
>>>> atomically increase the index.
>>>>
>>>> This pass is intended to be used as a ThinLTO pass or a LTO
pass. It
>>>> maps each module to a distinct integer, it also generate a
mapping file so
>>>> we can decode the function symbol name from the pair of ids.
>>>>
>>>> clang has '-finstrument-function-entry-bare' which
inserts a function
>>>> call and is not as efficient.
>>>>
>>>> Three patches are attached, for llvm, clang, and compiler-rt
>>>> respectively.
>>>>
>>>> TODO:
>>>> (1) Migrate to the new pass manager with a shim for the legacy
pass
>>>> manager.
>>>> (2) For the order file buffer, consider always emitting
definitions,
>>>> making them LinkOnceODR with a COMDAT group.
>>>> (3) Add testing case for clang/compiler-rt patches.
>>>> (4) Add utilities to deobfuscate the profile dump.
>>>> (5) The size of the buffer is currently hard-coded (
>>>> INSTR_ORDER_FILE_BUFFER_SIZE).
>>>>
>>>> Thanks Kamal for contributing to the patches! Thanks to Aditya
and
>>>> Saleem for doing an initial review pass over the patches!
>>>>
>>>> Manman
>>>>
>>>>
>>>>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190117/c0e2ac9b/attachment.html>

Xinliang David Li via llvm-dev

2019-Jan-17 23:57 UTC

head link

[llvm-dev] [RFC] Order File Instrumentation

On Thu, Jan 17, 2019 at 3:25 PM Manman Ren <manman.ren at gmail.com>
wrote:
>
>
> On Thu, Jan 17, 2019 at 2:47 PM Xinliang David Li <davidxl at
google.com>
> wrote:
>
>>
>>
>> On Thu, Jan 17, 2019 at 2:32 PM Manman Ren <manman.ren at
gmail.com> wrote:
>>
>>>
>>>
>>> On Thu, Jan 17, 2019 at 10:53 AM Xinliang David Li <davidxl at
google.com>
>>> wrote:
>>>
>>>> Hi Manman,
>>>>
>>>> Ordering profiling is certainly something very useful to have
to
>>>> startup time performance. GCC has something similar.
>>>>
>>>> In terms of implementation, it is possible to simply extend the
edge
>>>> profiling counters by 1 for each function, and instrument the
function to
>>>> record the time stamp the first time the function is executed.
The overhead
>>>> will be minimized and you can leverage all the other existing
support in
>>>> profiling runtime.
>>>>
>>>
>>> Hi David,
>>>
>>> Just to clarify, are you suggesting to add an edge profiling
counter per
>>> function to record the time stamp? Where are the edge profiling
counters
>>> defined?
>>>
>>
>> There is no needed to define the counter explicitly. What is needed is
to
>> introduce a new intrinsic to update the order counter, and the InstProf
>> lowerer will create the counter for you. See
>> Transforms/Instrumentation/InstrProfiling.cpp.   +Vedant Kumar
>> <vsk at apple.com>
>>
>
> We need the instrumentation to be run late as an IR pass, so it has all
> the function symbols. Are you suggesting a front-end instrumentation by
> mentioning InstrProfiling.cpp?
>
>
The lowering is actually shared by both FE instrumentation and IR PGO.
 For ordering profiling, it needs to run the pass after the inliner pass in
theory (which is later even than the current IR PGO instrumentation pass).
One way to achieve the same effect is lower the time stamp counter update
intrinsic in a separate pass (after inlining).  In the lowerer, the counter
update intrinsic can be eliminated if it is in an inline instance, but this
is not ideal either as the inlining decisions may be affected due to
instrumentation.

To combine the ordering profiling requirements (post inlining) and the
benefit of sharing (runtime, tooling etc),  the proposed implementation can
be modified with the following changes:
0) keep the phase ordering in the patch
1) the instrumentation just does insert intrinsics
2) enhance lowering to handle new intrinsics.

I assume ordering profiling is not intended to work together with PGO
instrumentation (to make sure inlining decisions are the same), so the you
end up with a counter section with one counter per out of line function.
The rest will just work. The changes to runtime will be minimal.

If  PGO (IR or FE) + Ordering profiling is a desired use case, it can be
made working too -- the lowerer needs to update the per function counter
array size (appending one).


David


>
>>
>>> So the difference will be where we store the profile information
and in
>>> what format?
>>>
>>>
>> Time stamp can be simply a uint64 value just like any edge/block
profile
>> count.   llvm_profdata tool can be taught to dump the ordering file to
be
>> exported to the linker.
>>
>>
>>> With the suggested approach, we need to allocate one time stamp for
each
>>> function, what is implemented is a pair of numbers for each
executed
>>> function. The runtime performance can be different as well, the
suggested
>>> approach gets the time stamp, and saves it to memory, what is
implemented
>>> is saving the pair of numbers and incrementing a counter.
>>>
>>>
>> The runtime cost is probably not much for either approach. The
suggested
>> approach eliminates the trouble to maintain/implement a different way
to
>> identify functions.
>>
>
> Agree that using existing infra is better!
>
>


> Manman
>
>>
>>
>>
>>>
>>>> Another possibility is to use xray to implement the
functionality --
>>>> xray is useful for trace like profiling by design.
>>>>
>>>
>>> We have not looked into XRay. We need something with low binary
size
>>> penalty and low runtime perf degradation, not sure if XRay is a
good fit!
>>>
>>>
>> xray can achieve very low runtime overhead. +Dean Michael Berris
>> <dberris at google.com> for additional comments.
>>
>> David
>>
>>
>>
>>
>>> Thanks,
>>> Manman
>>>
>>>
>>>> David
>>>>
>>>> On Thu, Jan 17, 2019 at 10:24 AM Manman Ren <manman.ren at
gmail.com>
>>>> wrote:
>>>>
>>>>> Order file is used to teach ld64 how to order the functions
in a
>>>>> binary. If we put all functions executed during startup
together in the
>>>>> right order, we will greatly reduce the page faults during
startup.
>>>>>
>>>>> To generate order file for iOS apps, we usually use dtrace,
but some
>>>>> apps have various startup scenarios that we want to capture
in the order
>>>>> file. dtrace approach is not easy to automate, it is hard
to capture the
>>>>> different ways of starting an app without automation.
Instrumented builds
>>>>> however can be deployed to phones and profile data can be
automatically
>>>>> collected.
>>>>>
>>>>> For the Facebook app, by looking at the startup
distribution, we are
>>>>> expecting a big win out of the order file instrumentation,
from 100ms to
>>>>> 500ms+, in startup time.
>>>>>
>>>>> The basic idea of the pass is to use a circular buffer to
log the
>>>>> execution ordering of the functions. We only log the
function when it is
>>>>> first executed. Instead of logging the symbol name of the
function, we log
>>>>> a pair of integers, with one integer specifying the module
id, and the
>>>>> other specifying the function id within the module.
>>>>>
>>>>> In this pass, we add three global variables:
>>>>> (1) an order file buffer
>>>>> The order file buffer is a circular buffer at its own llvm
section.
>>>>> Each entry is a pair of integers, with one integer
specifying the module
>>>>> id, and the other specifying the function id within the
module.
>>>>> (2) a bitmap for each module: one bit for each function to
say if the
>>>>> function is already executed;
>>>>> (3) a global index to the buffer
>>>>>
>>>>> At the function prologue, if the function has not been
executed (by
>>>>> checking the bitmap), log the module id and the function
id, then
>>>>> atomically increase the index.
>>>>>
>>>>> This pass is intended to be used as a ThinLTO pass or a LTO
pass. It
>>>>> maps each module to a distinct integer, it also generate a
mapping file so
>>>>> we can decode the function symbol name from the pair of
ids.
>>>>>
>>>>> clang has '-finstrument-function-entry-bare' which
inserts a function
>>>>> call and is not as efficient.
>>>>>
>>>>> Three patches are attached, for llvm, clang, and
compiler-rt
>>>>> respectively.
>>>>>
>>>>> TODO:
>>>>> (1) Migrate to the new pass manager with a shim for the
legacy pass
>>>>> manager.
>>>>> (2) For the order file buffer, consider always emitting
definitions,
>>>>> making them LinkOnceODR with a COMDAT group.
>>>>> (3) Add testing case for clang/compiler-rt patches.
>>>>> (4) Add utilities to deobfuscate the profile dump.
>>>>> (5) The size of the buffer is currently hard-coded (
>>>>> INSTR_ORDER_FILE_BUFFER_SIZE).
>>>>>
>>>>> Thanks Kamal for contributing to the patches! Thanks to
Aditya and
>>>>> Saleem for doing an initial review pass over the patches!
>>>>>
>>>>> Manman
>>>>>
>>>>>
>>>>>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190117/5ed4defd/attachment.html>

Manman Ren via llvm-dev

2019-Jan-18 04:19 UTC

head link

[llvm-dev] [RFC] Order File Instrumentation

On Thu, Jan 17, 2019 at 3:58 PM Xinliang David Li <davidxl at google.com>
wrote:
>
>
> On Thu, Jan 17, 2019 at 3:25 PM Manman Ren <manman.ren at gmail.com>
wrote:
>
>>
>>
>> On Thu, Jan 17, 2019 at 2:47 PM Xinliang David Li <davidxl at
google.com>
>> wrote:
>>
>>>
>>>
>>> On Thu, Jan 17, 2019 at 2:32 PM Manman Ren <manman.ren at
gmail.com> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Jan 17, 2019 at 10:53 AM Xinliang David Li <davidxl
at google.com>
>>>> wrote:
>>>>
>>>>> Hi Manman,
>>>>>
>>>>> Ordering profiling is certainly something very useful to
have to
>>>>> startup time performance. GCC has something similar.
>>>>>
>>>>> In terms of implementation, it is possible to simply extend
the edge
>>>>> profiling counters by 1 for each function, and instrument
the function to
>>>>> record the time stamp the first time the function is
executed. The overhead
>>>>> will be minimized and you can leverage all the other
existing support in
>>>>> profiling runtime.
>>>>>
>>>>
>>>> Hi David,
>>>>
>>>> Just to clarify, are you suggesting to add an edge profiling
counter
>>>> per function to record the time stamp? Where are the edge
profiling
>>>> counters defined?
>>>>
>>>
>>> There is no needed to define the counter explicitly. What is needed
is
>>> to introduce a new intrinsic to update the order counter, and the
InstProf
>>> lowerer will create the counter for you. See
>>> Transforms/Instrumentation/InstrProfiling.cpp.   +Vedant Kumar
>>> <vsk at apple.com>
>>>
>>
>> We need the instrumentation to be run late as an IR pass, so it has all
>> the function symbols. Are you suggesting a front-end instrumentation by
>> mentioning InstrProfiling.cpp?
>>
>>
>
> The lowering is actually shared by both FE instrumentation and IR PGO.
>  For ordering profiling, it needs to run the pass after the inliner pass in
> theory (which is later even than the current IR PGO instrumentation pass).
> One way to achieve the same effect is lower the time stamp counter update
> intrinsic in a separate pass (after inlining).  In the lowerer, the counter
> update intrinsic can be eliminated if it is in an inline instance, but this
> is not ideal either as the inlining decisions may be affected due to
> instrumentation.
>
> To combine the ordering profiling requirements (post inlining) and the
> benefit of sharing (runtime, tooling etc),  the proposed implementation can
> be modified with the following changes:
> 0) keep the phase ordering in the patch
> 1) the instrumentation just does insert intrinsics
> 2) enhance lowering to handle new intrinsics.
>
> I assume ordering profiling is not intended to work together with PGO
> instrumentation (to make sure inlining decisions are the same), so the you
> end up with a counter section with one counter per out of line function.
> The rest will just work. The changes to runtime will be minimal.
>
Got it, this should work! The overhead will be one counter for each
function, and we also need to save the function names in the instrumented
binary, right?
Other than binary size, the other concern is the size of the profile data,
as we need to download it from device. The profile data will be one counter
for each function plus all the function names?

In the proposed implementation, the profile data will be the size of the
circular buffer, which should be the maximal number of functions that can
be executed during startup.

> If  PGO (IR or FE) + Ordering profiling is a desired use case, it can be
> made working too -- the lowerer needs to update the per function counter
> array size (appending one).
>
>There may be a use case for IR Instr + ordering profiling. One important
usage of IR Instr is to guide hot/cold splitting for startup time.
It is always simpler if we only need to maintain one instrumentation build!
But that depends on how much overhead IR Instr adds in terms of binary
size, profile data size, and runtime perf.

Manman

> David
>
>
>
>>
>>>
>>>> So the difference will be where we store the profile
information and in
>>>> what format?
>>>>
>>>>
>>> Time stamp can be simply a uint64 value just like any edge/block
profile
>>> count.   llvm_profdata tool can be taught to dump the ordering file
to be
>>> exported to the linker.
>>>
>>>
>>>> With the suggested approach, we need to allocate one time stamp
for
>>>> each function, what is implemented is a pair of numbers for
each executed
>>>> function. The runtime performance can be different as well, the
suggested
>>>> approach gets the time stamp, and saves it to memory, what is
implemented
>>>> is saving the pair of numbers and incrementing a counter.
>>>>
>>>>
>>> The runtime cost is probably not much for either approach. The
suggested
>>> approach eliminates the trouble to maintain/implement a different
way to
>>> identify functions.
>>>
>>
>> Agree that using existing infra is better!
>>
>>
>
>
>
>
>> Manman
>>
>>>
>>>
>>>
>>>>
>>>>> Another possibility is to use xray to implement the
functionality --
>>>>> xray is useful for trace like profiling by design.
>>>>>
>>>>
>>>> We have not looked into XRay. We need something with low binary
size
>>>> penalty and low runtime perf degradation, not sure if XRay is a
good fit!
>>>>
>>>>
>>> xray can achieve very low runtime overhead. +Dean Michael Berris
>>> <dberris at google.com> for additional comments.
>>>
>>> David
>>>
>>>
>>>
>>>
>>>> Thanks,
>>>> Manman
>>>>
>>>>
>>>>> David
>>>>>
>>>>> On Thu, Jan 17, 2019 at 10:24 AM Manman Ren <manman.ren
at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Order file is used to teach ld64 how to order the
functions in a
>>>>>> binary. If we put all functions executed during startup
together in the
>>>>>> right order, we will greatly reduce the page faults
during startup.
>>>>>>
>>>>>> To generate order file for iOS apps, we usually use
dtrace, but some
>>>>>> apps have various startup scenarios that we want to
capture in the order
>>>>>> file. dtrace approach is not easy to automate, it is
hard to capture the
>>>>>> different ways of starting an app without automation.
Instrumented builds
>>>>>> however can be deployed to phones and profile data can
be automatically
>>>>>> collected.
>>>>>>
>>>>>> For the Facebook app, by looking at the startup
distribution, we are
>>>>>> expecting a big win out of the order file
instrumentation, from 100ms to
>>>>>> 500ms+, in startup time.
>>>>>>
>>>>>> The basic idea of the pass is to use a circular buffer
to log the
>>>>>> execution ordering of the functions. We only log the
function when it is
>>>>>> first executed. Instead of logging the symbol name of
the function, we log
>>>>>> a pair of integers, with one integer specifying the
module id, and the
>>>>>> other specifying the function id within the module.
>>>>>>
>>>>>> In this pass, we add three global variables:
>>>>>> (1) an order file buffer
>>>>>> The order file buffer is a circular buffer at its own
llvm section.
>>>>>> Each entry is a pair of integers, with one integer
specifying the module
>>>>>> id, and the other specifying the function id within the
module.
>>>>>> (2) a bitmap for each module: one bit for each function
to say if the
>>>>>> function is already executed;
>>>>>> (3) a global index to the buffer
>>>>>>
>>>>>> At the function prologue, if the function has not been
executed (by
>>>>>> checking the bitmap), log the module id and the
function id, then
>>>>>> atomically increase the index.
>>>>>>
>>>>>> This pass is intended to be used as a ThinLTO pass or a
LTO pass. It
>>>>>> maps each module to a distinct integer, it also
generate a mapping file so
>>>>>> we can decode the function symbol name from the pair of
ids.
>>>>>>
>>>>>> clang has '-finstrument-function-entry-bare'
which inserts a function
>>>>>> call and is not as efficient.
>>>>>>
>>>>>> Three patches are attached, for llvm, clang, and
compiler-rt
>>>>>> respectively.
>>>>>>
>>>>>> TODO:
>>>>>> (1) Migrate to the new pass manager with a shim for the
legacy pass
>>>>>> manager.
>>>>>> (2) For the order file buffer, consider always emitting
definitions,
>>>>>> making them LinkOnceODR with a COMDAT group.
>>>>>> (3) Add testing case for clang/compiler-rt patches.
>>>>>> (4) Add utilities to deobfuscate the profile dump.
>>>>>> (5) The size of the buffer is currently hard-coded (
>>>>>> INSTR_ORDER_FILE_BUFFER_SIZE).
>>>>>>
>>>>>> Thanks Kamal for contributing to the patches! Thanks to
Aditya and
>>>>>> Saleem for doing an initial review pass over the
patches!
>>>>>>
>>>>>> Manman
>>>>>>
>>>>>>
>>>>>>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190117/621aff78/attachment.html>

llvm dev - Jan 2019 - [RFC] Order File Instrumentation

[llvm-dev] [RFC] Order File Instrumentation

[llvm-dev] [RFC] Order File Instrumentation

[llvm-dev] [RFC] Order File Instrumentation