thr3ads.net - llvm dev - [llvm-dev] Add support for in-process profile merging in profile-runtime [Mar 2016]

Xinliang David Li via llvm-dev

2016-Feb-28 08:13 UTC

[llvm-dev] Add support for in-process profile merging in profile-runtime

On Sat, Feb 27, 2016 at 6:50 PM, Sean Silva <chisophugis at gmail.com>
wrote:
> I have thought about this issue too, in the context of games. We may want
> to turn profiling on only for certain frames (essentially, this is many
> small profile runs).
>
> However, I have not seen it demonstrated that this kind of refined data
> collection will actually improve PGO results in practice.
> The evidence I do have though is that IIRC Apple have found that almost
> all of the benefits of PGO for the Clang binary can be gotten with a
> handful of training runs of Clang. Are your findings different?
>
We have a very wide customer base, so we cannot claim that one usage model is
sufficient for all users. For instance, we have users who control profile
dumping programmatically at a fine granularity, as you described above. Other
possible use cases include dumping the profiles of different periodic phases
into per-phase files, so that the profile data of each phase can later be
merged with a different weight.
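
As a rough illustration of merging per-phase profiles with different weights, here is a minimal Python sketch. It models a profile as a plain dictionary of counter names to counts, which is an assumption made for readability and not the raw profile format; the phase names, counter names, and the function `merge_weighted` are all hypothetical.

```python
from collections import Counter


def merge_weighted(phase_profiles, weights):
    """Sum counters across phases, scaling each phase by its weight.

    phase_profiles: {phase_name: {counter_name: count}}  (toy model)
    weights:        {phase_name: integer weight}, default weight is 1
    """
    merged = Counter()
    for phase, counters in phase_profiles.items():
        weight = weights.get(phase, 1)
        for name, count in counters.items():
            merged[name] += weight * count
    return dict(merged)


# Hypothetical per-phase dumps of the same binary.
phase_profiles = {
    "startup": {"main": 1, "parse_config": 40},
    "steady": {"main": 1, "render_frame": 900},
}
# Assumed policy: the steady-state phase matters three times as much.
weights = {"startup": 1, "steady": 3}

merged = merge_weighted(phase_profiles, weights)
print(merged)
```

The point of keeping phases in separate files is that the weights can be chosen after the run, without re-profiling.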

>
> Also, in general, I am very wary of file locking. This can cause huge
> amounts of slowdown for a build and has potential portability problems.
>
I don't see much slow down with a clang build using instrumented clang as
the build compiler. With file locking and profile merging enabled, the
build time on my local machine looks like:

real    18m22.737s
user    293m18.924s
sys     9m55.532s

If profile merging/locking is disabled (i.e., the profile dumpers are allowed
to clobber each other's output), the real time is about 14m.
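
For readers who want the two timings side by side, the arithmetic can be sketched as follows. The only inputs are the numbers quoted above; the 14m baseline is stated as "about", so the resulting percentage is approximate, and the helper name `to_seconds` is made up for this example.

```python
def to_seconds(minutes, seconds=0.0):
    """Convert a `time`-style XmY.Zs reading to seconds."""
    return minutes * 60 + seconds


# Timings reported above, with merging and file locking enabled.
real = to_seconds(18, 22.737)
user = to_seconds(293, 18.924)
sys_time = to_seconds(9, 55.532)

# "About 14m" without merging/locking, so this is only an approximation.
baseline_real = to_seconds(14)

slowdown = real / baseline_real - 1          # extra wall-clock time
avg_busy_cores = (user + sys_time) / real    # CPU time per wall-clock second

print(f"wall-clock overhead vs. baseline: {slowdown:.0%}")
print(f"average busy cores during the build: {avg_busy_cores:.1f}")
```

With these inputs the wall-clock overhead comes out at roughly 31%, on a machine that keeps about 16 to 17 cores busy on average.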

> I don't see it as a substantially better solution than wrapping clang in a
> script that runs clang and then just calls llvm-profdata to do the merging.
> Running llvm-profdata is cheap compared to doing locking in a highly
> parallel situation like a build.
>
That would require synchronization for merging too.

From Justin's email, it looks like there is a key point I have not made
clear: the online profile merge is a very simple raw-profile-to-raw-profile
merge, which is very fast. The end result of the profile run is still in raw
format. The raw-to-indexed merge is still needed, but instead of merging
thousands of raw profiles, which can be very slow, this model needs only one
raw profile as input.
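
To make the "raw to raw" idea concrete, here is a minimal sketch of a read-modify-write merge under an exclusive file lock. It is not the prototype's code: it assumes a toy file layout consisting only of little-endian 64-bit counters (no header, names, or value profile data), it uses `fcntl.flock`, so it runs on Unix-like systems only, and `merge_into_file` is a hypothetical name.

```python
import fcntl
import os
import struct


def merge_into_file(path, counters):
    """Add `counters` element-wise into the counter file at `path`.

    The whole read-modify-write sequence happens under an exclusive lock,
    so concurrent processes merging into the same file do not lose updates.
    Returns the merged counter values.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+b") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until other writers are done
        try:
            data = f.read()
            fmt = f"<{len(counters)}Q"  # toy layout: N unsigned 64-bit counters
            if data:
                if len(data) != struct.calcsize(fmt):
                    # Stand-in for the binary/profile signature check.
                    raise ValueError("existing profile has a different layout")
                existing = struct.unpack(fmt, data)
                counters = [a + b for a, b in zip(existing, counters)]
            f.seek(0)
            f.write(struct.pack(fmt, *counters))
            f.truncate()
            f.flush()  # make the data visible before releasing the lock
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
    return list(counters)
```

Because the merge is a single pass of integer additions over data the process already holds in memory, the time spent holding the lock is dominated by the file read and write, which is why write latency matters so much for contention.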

thanks,

David

>
>
> -- Sean Silva
>
> On Sat, Feb 27, 2016 at 6:02 PM, Xinliang David Li via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> One of the main missing features in the Clang/LLVM profile runtime is
>> support for online/in-process profile merging. Profile data collected for
>> different workloads of the same executable binary currently has to be
>> merged later by the offline post-processing tool. This limitation makes it
>> hard to handle cases where the instrumented binary needs to be run with a
>> large number of small workloads, possibly in parallel. For instance, to do
>> PGO for clang, we may choose to build a large project with the
>> instrumented Clang binary. This is hard because:
>>  1) to keep profiles from different runs from overwriting each other, %p
>> substitution needs to be specified either on the command line or in an
>> environment variable so that each process dumps profile data into its own
>> file, named using its pid. This creates a huge disk storage requirement.
>> For instance, clang's raw profile size is typically 80M -- if the
>> instrumented clang is used to build a medium to large size project (such
>> as clang itself), profile data can easily use up hundreds of gigabytes of
>> local storage.
>>  2) pids can also be recycled. This means that some of the profile data
>> may be overwritten without being noticed.
>>
>> The way to solve this problem is to allow profile data to be merged in
>> process. I have a prototype implementation and plan to send it out for
>> review soon after some cleanups. By default, profile merging is off, and
>> it can be turned on with a user option or via an environment variable.
>> The following summarizes the issues involved in adding this feature:
>>  1. the target platform needs to have file locking support;
>>  2. there needs to be an efficient way to identify the profile data and
>> associate it with the binary using a binary/profdata signature;
>>  3. currently, without merging, profile data from shared libraries
>> (including dlopen/dlclose ones) is concatenated into the primary profile
>> file. This can complicate matters, as the merger also needs to find the
>> matching shared libs and to avoid unnecessary data movement/copying;
>>  4. value profile data is variable in length even for the same binary.
>>
>> All the above issues are resolved, and a clang self-build with the
>> instrumented binary passes (with both j1 and high parallelism).
>>
>> If you have any concerns, please let me know.
>>
>> thanks,
>>
>> David
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
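
The two problems listed in the quoted proposal (disk usage and pid reuse with %p) can be illustrated with a small sketch. Only the %p specifier is modeled, the file name pattern and the pid are made up, and the process count of 3000 is an assumed round number for a large build, not a figure from the proposal; the 80 MB per raw profile is the size quoted above.

```python
def expand_profile_name(pattern, pid):
    """Replace %p with the process id (the only specifier modeled here)."""
    return pattern.replace("%p", str(pid))


def disk_usage_gb(num_processes, raw_profile_mb=80):
    """Estimate total storage if every process writes its own raw profile."""
    return num_processes * raw_profile_mb / 1000


pattern = "clang_%p.profraw"  # hypothetical file name pattern

# Two compiler invocations that happen to receive the same (recycled) pid
# map to the same file, so the later one silently overwrites the earlier one.
first = expand_profile_name(pattern, 4242)
second = expand_profile_name(pattern, 4242)
collides = first == second

# Assumed: about 3000 compiler processes in one build.
estimate = disk_usage_gb(3000)

print(first, "collision:", collides)
print(f"estimated storage: {estimate:.0f} GB")
```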

Sean Silva via llvm-dev

2016-Feb-29 20:02 UTC

[llvm-dev] Add support for in-process profile merging in profile-runtime

On Sun, Feb 28, 2016 at 12:13 AM, Xinliang David Li <davidxl at google.com>
wrote:
> That would require synchronization for merging too.
>
> From Justin's email, it looks like there is a key point I have not made
> clear: the on-line profile merge is a very simple raw profile to raw
> profile merging which is super fast. The end result of the profile run is
> still in raw format. The raw to indexed merging is still needed -- but
> instead of merging thousands of raw profiles which can be very slow, with
> this model, only one raw profile input is needed.
>
I think that __llvm_profile_merge_buffers in the runtime would be a useful
primitive if it can be implemented simply (or
__llvm_profile_load_counters_from_buffer, perhaps). If you could post a
patch for that part as a first incremental step that would be a good
starting point for concrete discussion.

In combination with the buffer API and reset_counters this is all that is
needed for very fine-grained counter capture.
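
A toy model of that combination might look like the sketch below. It is written in Python purely for illustration: the class and function names are hypothetical stand-ins for the runtime's buffer and reset entry points and for the merge primitive proposed above, and counters are modeled as a plain list.

```python
class ProfileRuntime:
    """Toy stand-in for the counter state of one instrumented process."""

    def __init__(self, num_counters):
        self.counters = [0] * num_counters

    def hit(self, index):
        """Simulate an instrumented code region being executed."""
        self.counters[index] += 1

    def write_buffer(self):
        """Snapshot the current counters (models the buffer API)."""
        return list(self.counters)

    def reset_counters(self):
        """Zero all counters (models reset_counters)."""
        self.counters = [0] * len(self.counters)


def merge_buffers(dst, src):
    """Hypothetical merge primitive: add src into dst element-wise."""
    if len(dst) != len(src):
        raise ValueError("buffers come from different binaries")
    for i, value in enumerate(src):
        dst[i] += value


def capture(runtime, work, accumulated):
    """Profile only `work`: reset, run it, then merge the snapshot."""
    runtime.reset_counters()
    work(runtime)
    merge_buffers(accumulated, runtime.write_buffer())


rt = ProfileRuntime(3)
accumulated = [0, 0, 0]

rt.hit(0)  # an uninteresting frame; discarded by the next reset
capture(rt, lambda r: (r.hit(1), r.hit(2), r.hit(2)), accumulated)
rt.hit(0)  # another uninteresting frame
capture(rt, lambda r: r.hit(1), accumulated)

print(accumulated)
```

Only the work done inside `capture` ends up in the accumulated buffer, which is the fine-grained, per-frame style of capture discussed earlier in the thread.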

-- Sean Silva


Xinliang David Li via llvm-dev

2016-Mar-01 22:55 UTC

[llvm-dev] Add support for in-process profile merging in profile-runtime

I have implemented the profile pool idea from Mehdi, and collected
performance data related to profile merging and file locking.  The
following is the experiment setup:

1) the machine has 32 logical cores (Intel Sandy Bridge, 64G memory)
2) the workload is clang self build (~3.3K files to be built), and the
instrumented binary is Clang.
3) ninja parallelism j32

File systems tested (on Linux):
1) a local file system on an SSD
2) tmpfs
3) a local file system on a hard disk
4) an internal distributed file system

Configurations tested:
1) all processes dump to the same profile data file without locking (this
configuration of course produces useless profile data in the end, but it
serves as the performance baseline)
2) profile merging enabled with pool sizes: 1, 2, 3, 4, 5, 10, and 32
3) using LLVM_PROFILE_FILE=..._%p to enable each process to dump its own
copy of profile data (resulting in ~3.2K profile data files in the end).
This configuration is only tested on some FS due to size/quota constraints.

Here is a very high-level summary of the experiment results. The longer the
write latency, the more file locking contention there is (which is not
surprising). In some cases, file locking has close to zero overhead, while for
file systems with high write latencies, file locking can hurt performance.
In such cases, using a small pool of profile files can completely recover the
performance. The required pool size is capped at a small value (which depends
on many factors: write latency, the rate at which the instrumented binary
retires, I/O throughput/network bandwidth, etc.).

1) SSD

The performance is almost identical across *ALL* the test configurations.
The real time needed to complete the full self build is ~13m10s. There is
no visible file contention with file locking enabled, even with pool size == 1.

2) tmpfs

only tested with the following configs
a) shared profile with no merge
b) with merge (pool == 1), with merge (pool == 2)

Not surprisingly, the result is similar to SSD case -- consistently
finished building in a little more than 13m.

3) HDD

With this configuration, file locking starts to show some impact -- the
write is slow enough to introduce contention.

a) Shared profile without merging: ~13m10s
b) with merging
   b.1) pool size == 1:  ~18m20s
   b.2) pool size == 2:  ~16m30s
   b.3) pool size == 3:  ~15m55s
   b.4) pool size == 4:  ~16m20s
   b.5) pool size == 5:  ~16m42s
c) >3000 profile files without merging (%p): ~16m50s

Increasing the size of the merge pool increases dumping parallelism. The
performance improves initially, but once the pool size goes beyond 3 it
starts to degrade gradually: at that point the HDD I/O throughput is
saturated, and increasing parallelism does not help any more.

In short, with profile merging, we just need to dump 3 profile files to
achieve the same build performance as the configuration that dumps >3000
files (the current default behavior).

4) An internal file system using network attached storage

In such a file system, the file write has relatively long latency compared
with local file systems. The backend storage server does dynamic load
balancing so that it can achieve very high IO throughput with high
parallelism (at both FE/client side and backend).

a) Single profile without profile merging: ~60m
b) Profile merging enabled:
    b.1) pool size == 1:  ~80m
    b.2) pool size == 2:  ~47m
    b.3) pool size == 3:  ~43m
    b.4) pool size == 4:  ~40m40s
    b.5) pool size == 5:  ~38m50s
    b.6) pool size == 10: ~36m48s
    b.7) pool size == 32: ~36m24s
c) >3000 profile files without profile merging (%p): ~35m24s

b.6), b.7) and c) have the best performance among all.

Unlike in the HDD case, a) has poor performance here, due to low parallelism
in the storage backend.

With file dumping parallelism, the performance flattens out when the pool
size >= 10. This is because the client (ninja+clang) system has reached its
peak and becomes the new performance bottleneck.

Again, with profile merging, we only need 10 profile data files to achieve
the same performance as the default behavior, which requires >3000 files to
be dumped.
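
The HDD and network file system measurements above can be compared mechanically with a short script. The timings are copied from the two result lists; the helper names are made up, and since every figure is approximate (each is prefixed with "~"), the ratios should be read as rough indications only.

```python
def secs(minutes, seconds=0):
    """Convert an XmYs reading to seconds."""
    return minutes * 60 + seconds


# Build times with merging enabled, keyed by pool size.
hdd = {1: secs(18, 20), 2: secs(16, 30), 3: secs(15, 55),
       4: secs(16, 20), 5: secs(16, 42)}
net = {1: secs(80), 2: secs(47), 3: secs(43), 4: secs(40, 40),
       5: secs(38, 50), 10: secs(36, 48), 32: secs(36, 24)}

# Build times with one profile file per process (%p), no merging.
hdd_per_process = secs(16, 50)
net_per_process = secs(35, 24)


def fastest_pool(times):
    """Pool size with the shortest measured build time."""
    return min(times, key=times.get)


def relative_to(times, reference):
    """Build time of each pool size as a fraction of a reference time."""
    return {pool: t / reference for pool, t in times.items()}


hdd_rel = relative_to(hdd, hdd_per_process)
net_rel = relative_to(net, net_per_process)

print("fastest HDD pool size:", fastest_pool(hdd))
print("fastest network FS pool size:", fastest_pool(net))
for pool in sorted(net_rel):
    print(f"network FS, pool {pool:>2}: {net_rel[pool]:.2f}x of the %p time")
```

On these numbers the fastest HDD configuration is pool size 3, and on the network file system a pool of 10 is within about 4% of the per-process (%p) configuration.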

thanks,

David


Sean Silva via llvm-dev

2016-Mar-01 23:34 UTC

[llvm-dev] Add support for in-process profile merging in profile-runtime

Hi David,

This is wonderful data and demonstrates the viability of this feature. I
think this has alleviated the concerns regarding file locking.

As far as the implementation of the feature, I think we will probably want
the following incremental steps:
a) implement the core merging logic and add to buffer API a primitive for
merging two buffers
b) implement the file system glue to extend this to the filesystem APIs
(write_file etc.)
c) implement a profile filename format string which generates a random
number mod a specified amount (strawman:
`LLVM_PROFILE_FILE=default.profraw.%7u` which generates a _u_nique number
mod 7. Of course, in general it is `%<N>u`)

 b) depends on a), but c) can be done in parallel with both.
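
To show what step c) could mean in practice, here is a minimal sketch of expanding the strawman `%<N>u` specifier. This models the proposal as worded above and is not an existing runtime feature as far as this thread states; the function name `expand_pool_name` is hypothetical, and a pseudo-random pick is only one possible way to choose the pool member.

```python
import random
import re


def expand_pool_name(pattern, rng=random):
    """Replace each %<N>u with an integer drawn uniformly from [0, N)."""
    def pick(match):
        pool_size = int(match.group(1))
        return str(rng.randrange(pool_size))
    return re.sub(r"%(\d+)u", pick, pattern)


# Simulate 200 processes choosing a file from a pool of 7.
rng = random.Random(0)  # seeded only to make the demo repeatable
names = {expand_pool_name("default.profraw.%7u", rng) for _ in range(200)}

print(sorted(names))
```

However many processes run, at most 7 distinct files are produced, and each process only contends for the lock with the processes that drew the same number.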

Does this seem feasible?

-- Sean Silva

