Sean Silva via llvm-dev
2016-Mar-01 23:34 UTC
[llvm-dev] Add support for in-process profile merging in profile-runtime
Hi David,

This is wonderful data and demonstrates the viability of this feature. I
think this has alleviated the concerns regarding file locking.

As far as the implementation of the feature, I think we will probably want
the following incremental steps:
a) implement the core merging logic and add to the buffer API a primitive
for merging two buffers
b) implement the file system glue to extend this to the filesystem APIs
(write_file etc.)
c) implement a profile filename format string which generates a random
number mod a specified amount (strawman:
`LLVM_PROFILE_FILE=default.profraw.%7u`, which generates a _u_nique number
mod 7; in general it is `%<N>u`)

b) depends on a), but c) can be done in parallel with both.

Does this seem feasible?

-- Sean Silva
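A minimal sketch of the core merging logic in step (a), assuming the usual
raw-profile layout in which each function's counters form a contiguous array
of 64-bit values; the function name is illustrative and this is not the
compiler-rt API, only the counter accumulation such a primitive would perform
(value-profile records, being variable-length, need separate handling):

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative only: merging two raw profiles for the same binary
     * reduces to adding the counter arrays element-wise. */
    static void mergeCounters(uint64_t *Dst, const uint64_t *Src,
                              size_t NumCounters) {
      for (size_t I = 0; I < NumCounters; ++I)
        Dst[I] += Src[I]; /* block/edge counts accumulate across runs */
    }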
On Tue, Mar 1, 2016 at 2:55 PM, Xinliang David Li <davidxl at google.com>
wrote:

> I have implemented the profile pool idea from Mehdi, and collected
> performance data related to profile merging and file locking. The
> following is the experiment setup:
>
> 1) the machine has 32 logical cores (Intel Sandy Bridge machine / 64G memory)
> 2) the workload is a clang self build (~3.3K files to be built), and the
> instrumented binary is Clang
> 3) ninja parallelism: j32
>
> File systems tested (on Linux):
> 1) a local file system on an SSD drive
> 2) tmpfs
> 3) a local file system on a hard disk
> 4) an internal distributed file system
>
> Configurations tested:
> 1) all processes dump to the same profile data file without locking (this
> configuration of course produces useless profile data in the end, but it
> serves as the performance baseline)
> 2) profile merging enabled with pool sizes 1, 2, 3, 4, 5, 10, and 32
> 3) using LLVM_PROFILE_FILE=..._%p so that each process dumps its own copy
> of the profile data (resulting in ~3.2K profile data files in the end).
> This configuration is only tested on some file systems due to size/quota
> constraints.
>
> Here is a very high level summary of the experiment results. The longer
> the write latency, the more file locking contention there is (which is not
> surprising). In some cases, file locking has close to zero overhead, while
> for file systems with high write latencies, file locking can affect
> performance negatively. In such cases, using a small pool of profile files
> can completely recover the performance. The required pool size is capped
> at a small value (which depends on many different factors: write latency,
> the rate at which the instrumented processes exit, IO throughput/network
> bandwidth, etc.).
>
> 1) SSD
>
> The performance is almost identical across *ALL* the test configurations.
> The real time needed to complete the full self build is ~13m10s. There is
> no visible file contention with file locking enabled, even with pool size
> == 1.
>
> 2) tmpfs
>
> Only tested with the following configs:
> a) shared profile with no merge
> b) with merge (pool == 1), with merge (pool == 2)
>
> Not surprisingly, the result is similar to the SSD case -- the build
> consistently finished in a little more than 13m.
>
> 3) HDD
>
> With this configuration, file locking starts to show some impact -- the
> write is slow enough to introduce contention.
>
> a) shared profile without merging: ~13m10s
> b) with merging:
> b.1) pool size == 1: ~18m20s
> b.2) pool size == 2: ~16m30s
> b.3) pool size == 3: ~15m55s
> b.4) pool size == 4: ~16m20s
> b.5) pool size == 5: ~16m42s
> c) >3000 profile files without merging (%p): ~16m50s
>
> Increasing the size of the merge pool increases dumping parallelism -- the
> performance improves initially, but above 4 it starts to degrade
> gradually. The HDD IO throughput is saturated at that point, so increasing
> parallelism does not help any more.
>
> In short, with profile merging we just need to dump 3 profile files to
> achieve the same build performance as dumping >3000 files (the current
> default behavior).
>
> 4) An internal file system using network-attached storage
>
> In such a file system, file writes have relatively long latency compared
> with local file systems. The backend storage server does dynamic load
> balancing so that it can achieve very high IO throughput with high
> parallelism (at both the FE/client side and the backend).
>
> a) single profile without profile merging: ~60m
> b) profile merging enabled:
> b.1) pool size == 1: ~80m
> b.2) pool size == 2: ~47m
> b.3) pool size == 3: ~43m
> b.4) pool size == 4: ~40m40s
> b.5) pool size == 5: ~38m50s
> b.6) pool size == 10: ~36m48s
> b.7) pool size == 32: ~36m24s
> c) >3000 profile files without profile merging (%p): ~35m24s
>
> b.6), b.7) and c) have the best performance among all.
>
> Unlike in the HDD case, a) has poor performance here -- due to low
> parallelism in the storage backend.
>
> With file dumping parallelism, the performance flattens out when the pool
> size is >= 10. This is because the client (ninja+clang) system has reached
> its peak and becomes the new performance bottleneck.
>
> Again, with profile merging we only need 10 profile data files to achieve
> the same performance as the default behavior, which requires >3000 files
> to be dumped.
>
> thanks,
>
> David
>
>
> On Sun, Feb 28, 2016 at 12:13 AM, Xinliang David Li <davidxl at google.com>
> wrote:
>
>> On Sat, Feb 27, 2016 at 6:50 PM, Sean Silva <chisophugis at gmail.com>
>> wrote:
>>
>>> I have thought about this issue too, in the context of games. We may
>>> want to turn on profiling only for certain frames (essentially, this is
>>> many small profile runs).
>>>
>>> However, I have not seen it demonstrated that this kind of refined data
>>> collection will actually improve PGO results in practice.
>>> The evidence I do have, though, is that IIRC Apple has found that almost
>>> all of the benefits of PGO for the Clang binary can be gotten with a
>>> handful of training runs of Clang. Are your findings different?
>>
>> We have a very wide customer base, so we cannot claim that one use model
>> is sufficient for all users. For instance, we have users using
>> fine-grained profile dumping control (programmatically) as you described
>> above. There are also other possible use cases, such as dumping profiles
>> for different periodic phases into files associated with those phases.
>> Later, the different phases' profile data can be merged with different
>> weights.
>>
>>> Also, in general, I am very wary of file locking. This can cause huge
>>> amounts of slowdown for a build and has potential portability problems.
>>
>> I don't see much slowdown with a clang build using an instrumented clang
>> as the build compiler. With file locking and profile merging enabled, the
>> build time on my local machine looks like:
>>
>> real 18m22.737s
>> user 293m18.924s
>> sys 9m55.532s
>>
>> If profile merging/locking is disabled (i.e., the profile dumpers are
>> allowed to clobber/write over each other), the real time is about 14m.
>>
>>> I don't see it as a substantially better solution than wrapping clang in
>>> a script that runs clang and then just calls llvm-profdata to do the
>>> merging. Running llvm-profdata is cheap compared to doing locking in a
>>> highly parallel situation like a build.
>>
>> That would require synchronization for merging too.
>>
>> From Justin's email, it looks like there is a key point I have not made
>> clear: the online profile merge is a very simple raw-profile-to-raw-profile
>> merge, which is super fast. The end result of the profile run is still in
>> raw format. The raw-to-indexed merge is still needed -- but instead of
>> merging thousands of raw profiles, which can be very slow, with this model
>> only one raw profile input is needed.
>>
>> thanks,
>>
>> David
>>
>>> -- Sean Silva
>>>
>>> On Sat, Feb 27, 2016 at 6:02 PM, Xinliang David Li via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> One of the main missing features in the Clang/LLVM profile runtime is
>>>> support for online/in-process profile merging. Profile data collected
>>>> for different workloads of the same executable binary needs to be
>>>> merged later by the offline post-processing tool. This limitation makes
>>>> it hard to handle cases where the instrumented binary needs to be run
>>>> with a large number of small workloads, possibly in parallel. For
>>>> instance, to do PGO for clang, we may choose to build a large project
>>>> with the instrumented Clang binary. This is hard because:
>>>> 1) to avoid profiles from different runs overriding each other, %p
>>>> substitution needs to be specified on either the command line or in an
>>>> environment variable so that each process dumps profile data into its
>>>> own file, named using the pid. This creates a huge disk storage
>>>> requirement. For instance, clang's raw profile size is typically 80M --
>>>> if the instrumented clang is used to build a medium to large size
>>>> project (such as clang itself), the profile data can easily use up
>>>> hundreds of gigabytes of local storage.
>>>> 2) pids can also be recycled, which means that some of the profile data
>>>> may be overwritten without being noticed.
>>>>
>>>> The way to solve this problem is to allow profile data to be merged in
>>>> process. I have a prototype implementation and plan to send it out for
>>>> review soon after some cleanup. By default, profile merging is off; it
>>>> can be turned on with a user option or via an environment variable.
>>>> The following summarizes the issues involved in adding this feature:
>>>> 1. the target platform needs to have file locking support;
>>>> 2. there needs to be an efficient way to identify the profile data and
>>>> associate it with the binary, using a binary/profdata signature;
>>>> 3. currently, without merging, profile data from shared libraries
>>>> (including dlopen/dlclose ones) is concatenated into the primary
>>>> profile file. This can complicate matters, as the merger also needs to
>>>> find the matching shared libs, and the merger also needs to avoid
>>>> unnecessary data movement/copying;
>>>> 4. value profile data is variable in length, even for the same binary.
>>>>
>>>> All of the above issues are resolved, and a clang self build with the
>>>> instrumented binary passes (with both j1 and high parallelism).
>>>>
>>>> If you have any concerns, please let me know.
>>>>
>>>> thanks,
>>>>
>>>> David
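A sketch of the lock/merge/rewrite cycle the proposal above describes,
assuming a POSIX host with flock(); the function name and the elided middle
are illustrative, not the actual runtime code. The merged output stays in
raw format, so the final raw-to-indexed conversion (e.g. with
`llvm-profdata merge`) still happens offline, just against a handful of
files instead of thousands:

    #include <fcntl.h>
    #include <sys/file.h>
    #include <unistd.h>

    /* Hypothetical dump path: take an exclusive lock on the pool file,
     * fold any existing raw profile into the in-memory counters, then
     * write the merged raw profile back before releasing the lock. */
    static int dumpWithMerge(const char *Path) {
      int FD = open(Path, O_RDWR | O_CREAT, 0644);
      if (FD < 0)
        return -1;
      if (flock(FD, LOCK_EX) != 0) { /* serialize concurrent writers */
        close(FD);
        return -1;
      }
      /* ... read the existing raw profile (if any), verify the
         binary/profdata signature, add its counters into the live
         counters, rewind, and write the merged profile back ... */
      flock(FD, LOCK_UN);
      close(FD);
      return 0;
    }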
Xinliang David Li via llvm-dev
2016-Mar-01 23:41 UTC
[llvm-dev] Add support for in-process profile merging in profile-runtime
Sounds reasonable. My design of c) is different in many ways (e.g., using
getpid() % PoolSize), but we can delay discussion of that until code review.

thanks,

David

On Tue, Mar 1, 2016 at 3:34 PM, Sean Silva via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> c) implement a profile filename format string which generates a random
> number mod a specified amount (strawman:
> `LLVM_PROFILE_FILE=default.profraw.%7u`, which generates a _u_nique number
> mod 7; in general it is `%<N>u`)
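A minimal sketch of the getpid() % PoolSize idea mentioned above: each
process maps deterministically to one of PoolSize raw-profile files, so a
build with thousands of compile processes only ever touches a handful of
files (3-10 in the measurements earlier in the thread). The helper name is
hypothetical, not the actual implementation:

    #include <stddef.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical helper: pick the pool slot for this process and build
     * the raw-profile file name from it, e.g. "default.profraw.2". */
    static void poolProfileName(char *Buf, size_t Len, const char *Prefix,
                                unsigned PoolSize) {
      unsigned Slot = (unsigned)getpid() % PoolSize;
      snprintf(Buf, Len, "%s.%u", Prefix, Slot);
    }

With PoolSize == 1 every process merges into a single file; larger pools
trade a little storage for less lock contention, as in the numbers above.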
Sean Silva via llvm-dev
2016-Mar-01 23:54 UTC
[llvm-dev] Add support for in-process profile merging in profile-runtime
On Tue, Mar 1, 2016 at 3:41 PM, Xinliang David Li <xinliangli at gmail.com>
wrote:

> Sounds reasonable. My design of c) is different in many ways (e.g., using
> getpid() % PoolSize), but we can delay discussion of that until code
> review.

I like that (e.g. support %7p in addition to %p).

-- Sean Silva
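For the %p vs. %7p distinction above, a sketch of how such a specifier might
expand, assuming the runtime is handed the text that follows the '%' in the
filename pattern; the parser is hypothetical, only the behavior (plain %p
gives one file per pid, %<N>p gives pid mod N) comes from the thread:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical expansion of the pid specifier in a filename pattern:
     *   "p"  -> full pid  (one raw profile per process)
     *   "7p" -> pid % 7   (a pool of at most 7 raw profiles) */
    static void expandPidSpec(char *Out, size_t Len, const char *AfterPercent) {
      unsigned long Pid = (unsigned long)getpid();
      char *End;
      unsigned long N = strtoul(AfterPercent, &End, 10);
      if (N > 0 && *End == 'p')
        snprintf(Out, Len, "%lu", Pid % N); /* pooled file name suffix */
      else
        snprintf(Out, Len, "%lu", Pid);     /* per-process file name suffix */
    }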