Sean Silva via llvm-dev
2016-Mar-01 23:34 UTC
[llvm-dev] Add support for in-process profile merging in profile-runtime
Hi David,

This is wonderful data and demonstrates the viability of this feature. I
think this has alleviated the concerns regarding file locking.

As far as the implementation of the feature, I think we will probably want
the following incremental steps:
a) implement the core merging logic and add to the buffer API a primitive
for merging two buffers
b) implement the file system glue to extend this to the filesystem APIs
(write_file etc.)
c) implement a profile filename format string which generates a random
number mod a specified amount (strawman:
`LLVM_PROFILE_FILE=default.profraw.%7u`, which generates a _u_nique number
mod 7; in general it is `%<N>u`)

b) depends on a), but c) can be done in parallel with both.

Does this seem feasible?

-- Sean Silva
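A minimal sketch of the core merging logic in step (a), assuming the usual
raw-profile layout in which each function's counters form a contiguous array
of 64-bit values; the function name is illustrative and this is not the
compiler-rt API, only the counter accumulation such a primitive would perform
(value-profile records, being variable-length, need separate handling):

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative only: merging two raw profiles for the same binary
     * reduces to adding the counter arrays element-wise. */
    static void mergeCounters(uint64_t *Dst, const uint64_t *Src,
                              size_t NumCounters) {
      for (size_t I = 0; I < NumCounters; ++I)
        Dst[I] += Src[I]; /* block/edge counts accumulate across runs */
    }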
On Tue, Mar 1, 2016 at 2:55 PM, Xinliang David Li <davidxl at google.com>
wrote:

> I have implemented the profile pool idea from Mehdi, and collected
> performance data related to profile merging and file locking. The
> following is the experiment setup:
>
> 1) the machine has 32 logical cores (Intel Sandy Bridge machine / 64G memory)
> 2) the workload is a clang self build (~3.3K files to be built), and the
> instrumented binary is Clang
> 3) ninja parallelism: j32
>
> File systems tested (on Linux):
> 1) a local file system on an SSD drive
> 2) tmpfs
> 3) a local file system on a hard disk
> 4) an internal distributed file system
>
> Configurations tested:
> 1) all processes dump to the same profile data file without locking (this
> configuration of course produces useless profile data in the end, but it
> serves as the performance baseline)
> 2) profile merging enabled with pool sizes 1, 2, 3, 4, 5, 10, and 32
> 3) using LLVM_PROFILE_FILE=..._%p so that each process dumps its own copy
> of the profile data (resulting in ~3.2K profile data files in the end).
> This configuration is only tested on some file systems due to size/quota
> constraints.
>
> Here is a very high level summary of the experiment results. The longer
> the write latency, the more file locking contention there is (which is not
> surprising). In some cases, file locking has close to zero overhead, while
> for file systems with high write latencies, file locking can affect
> performance negatively. In such cases, using a small pool of profile files
> can completely recover the performance. The required pool size is capped
> at a small value (which depends on many different factors: write latency,
> the rate at which the instrumented processes exit, IO throughput/network
> bandwidth, etc.).
>
> 1) SSD
>
> The performance is almost identical across *ALL* the test configurations.
> The real time needed to complete the full self build is ~13m10s. There is
> no visible file contention with file locking enabled, even with pool size
> == 1.
>
> 2) tmpfs
>
> Only tested with the following configs:
> a) shared profile with no merge
> b) with merge (pool == 1), with merge (pool == 2)
>
> Not surprisingly, the result is similar to the SSD case -- the build
> consistently finished in a little more than 13m.
>
> 3) HDD
>
> With this configuration, file locking starts to show some impact -- the
> write is slow enough to introduce contention.
>
> a) shared profile without merging: ~13m10s
> b) with merging:
> b.1) pool size == 1: ~18m20s
> b.2) pool size == 2: ~16m30s
> b.3) pool size == 3: ~15m55s
> b.4) pool size == 4: ~16m20s
> b.5) pool size == 5: ~16m42s
> c) >3000 profile files without merging (%p): ~16m50s
>
> Increasing the size of the merge pool increases dumping parallelism -- the
> performance improves initially, but above 4 it starts to degrade
> gradually. The HDD IO throughput is saturated at that point, so increasing
> parallelism does not help any more.
>
> In short, with profile merging we just need to dump 3 profile files to
> achieve the same build performance as dumping >3000 files (the current
> default behavior).
>
> 4) An internal file system using network-attached storage
>
> In such a file system, file writes have relatively long latency compared
> with local file systems. The backend storage server does dynamic load
> balancing so that it can achieve very high IO throughput with high
> parallelism (at both the FE/client side and the backend).
>
> a) single profile without profile merging: ~60m
> b) profile merging enabled:
> b.1) pool size == 1: ~80m
> b.2) pool size == 2: ~47m
> b.3) pool size == 3: ~43m
> b.4) pool size == 4: ~40m40s
> b.5) pool size == 5: ~38m50s
> b.6) pool size == 10: ~36m48s
> b.7) pool size == 32: ~36m24s
> c) >3000 profile files without profile merging (%p): ~35m24s
>
> b.6), b.7) and c) have the best performance among all.
>
> Unlike in the HDD case, a) has poor performance here -- due to low
> parallelism in the storage backend.
>
> With file dumping parallelism, the performance flattens out when the pool
> size is >= 10. This is because the client (ninja+clang) system has reached
> its peak and becomes the new performance bottleneck.
>
> Again, with profile merging we only need 10 profile data files to achieve
> the same performance as the default behavior, which requires >3000 files
> to be dumped.
>
> thanks,
>
> David
>
>
> On Sun, Feb 28, 2016 at 12:13 AM, Xinliang David Li <davidxl at google.com>
> wrote:
>
>> On Sat, Feb 27, 2016 at 6:50 PM, Sean Silva <chisophugis at gmail.com>
>> wrote:
>>
>>> I have thought about this issue too, in the context of games. We may
>>> want to turn on profiling only for certain frames (essentially, this is
>>> many small profile runs).
>>>
>>> However, I have not seen it demonstrated that this kind of refined data
>>> collection will actually improve PGO results in practice.
>>> The evidence I do have, though, is that IIRC Apple has found that almost
>>> all of the benefits of PGO for the Clang binary can be gotten with a
>>> handful of training runs of Clang. Are your findings different?
>>
>> We have a very wide customer base, so we cannot claim that one use model
>> is sufficient for all users. For instance, we have users using
>> fine-grained profile dumping control (programmatically) as you described
>> above. There are also other possible use cases, such as dumping profiles
>> for different periodic phases into files associated with those phases.
>> Later, the different phases' profile data can be merged with different
>> weights.
>>
>>> Also, in general, I am very wary of file locking. This can cause huge
>>> amounts of slowdown for a build and has potential portability problems.
>>
>> I don't see much slowdown with a clang build using an instrumented clang
>> as the build compiler. With file locking and profile merging enabled, the
>> build time on my local machine looks like:
>>
>> real 18m22.737s
>> user 293m18.924s
>> sys 9m55.532s
>>
>> If profile merging/locking is disabled (i.e., the profile dumpers are
>> allowed to clobber/write over each other), the real time is about 14m.
>>
>>> I don't see it as a substantially better solution than wrapping clang in
>>> a script that runs clang and then just calls llvm-profdata to do the
>>> merging. Running llvm-profdata is cheap compared to doing locking in a
>>> highly parallel situation like a build.
>>
>> That would require synchronization for merging too.
>>
>> From Justin's email, it looks like there is a key point I have not made
>> clear: the online profile merge is a very simple raw-profile-to-raw-profile
>> merge, which is super fast. The end result of the profile run is still in
>> raw format. The raw-to-indexed merge is still needed -- but instead of
>> merging thousands of raw profiles, which can be very slow, with this model
>> only one raw profile input is needed.
>>
>> thanks,
>>
>> David
>>
>>> -- Sean Silva
>>>
>>> On Sat, Feb 27, 2016 at 6:02 PM, Xinliang David Li via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> One of the main missing features in the Clang/LLVM profile runtime is
>>>> support for online/in-process profile merging. Profile data collected
>>>> for different workloads of the same executable binary needs to be
>>>> merged later by the offline post-processing tool. This limitation makes
>>>> it hard to handle cases where the instrumented binary needs to be run
>>>> with a large number of small workloads, possibly in parallel. For
>>>> instance, to do PGO for clang, we may choose to build a large project
>>>> with the instrumented Clang binary. This is hard because:
>>>> 1) to avoid profiles from different runs overriding each other, %p
>>>> substitution needs to be specified on either the command line or in an
>>>> environment variable so that each process dumps profile data into its
>>>> own file, named using the pid. This creates a huge disk storage
>>>> requirement. For instance, clang's raw profile size is typically 80M --
>>>> if the instrumented clang is used to build a medium to large size
>>>> project (such as clang itself), the profile data can easily use up
>>>> hundreds of gigabytes of local storage.
>>>> 2) pids can also be recycled, which means that some of the profile data
>>>> may be overwritten without being noticed.
>>>>
>>>> The way to solve this problem is to allow profile data to be merged in
>>>> process. I have a prototype implementation and plan to send it out for
>>>> review soon after some cleanup. By default, profile merging is off; it
>>>> can be turned on with a user option or via an environment variable.
>>>> The following summarizes the issues involved in adding this feature:
>>>> 1. the target platform needs to have file locking support;
>>>> 2. there needs to be an efficient way to identify the profile data and
>>>> associate it with the binary, using a binary/profdata signature;
>>>> 3. currently, without merging, profile data from shared libraries
>>>> (including dlopen/dlclose ones) is concatenated into the primary
>>>> profile file. This can complicate matters, as the merger also needs to
>>>> find the matching shared libs, and the merger also needs to avoid
>>>> unnecessary data movement/copying;
>>>> 4. value profile data is variable in length, even for the same binary.
>>>>
>>>> All of the above issues are resolved, and a clang self build with the
>>>> instrumented binary passes (with both j1 and high parallelism).
>>>>
>>>> If you have any concerns, please let me know.
>>>>
>>>> thanks,
>>>>
>>>> David
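A sketch of the lock/merge/rewrite cycle the proposal above describes,
assuming a POSIX host with flock(); the function name and the elided middle
are illustrative, not the actual runtime code. The merged output stays in
raw format, so the final raw-to-indexed conversion (e.g. with
`llvm-profdata merge`) still happens offline, just against a handful of
files instead of thousands:

    #include <fcntl.h>
    #include <sys/file.h>
    #include <unistd.h>

    /* Hypothetical dump path: take an exclusive lock on the pool file,
     * fold any existing raw profile into the in-memory counters, then
     * write the merged raw profile back before releasing the lock. */
    static int dumpWithMerge(const char *Path) {
      int FD = open(Path, O_RDWR | O_CREAT, 0644);
      if (FD < 0)
        return -1;
      if (flock(FD, LOCK_EX) != 0) { /* serialize concurrent writers */
        close(FD);
        return -1;
      }
      /* ... read the existing raw profile (if any), verify the
         binary/profdata signature, add its counters into the live
         counters, rewind, and write the merged profile back ... */
      flock(FD, LOCK_UN);
      close(FD);
      return 0;
    }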
Xinliang David Li via llvm-dev
2016-Mar-01 23:41 UTC
[llvm-dev] Add support for in-process profile merging in profile-runtime
Sounds reasonable. My design of c) is different in many ways (e.g., using
getpid() % PoolSize), but we can delay discussion of that until code review.

thanks,

David

On Tue, Mar 1, 2016 at 3:34 PM, Sean Silva via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> c) implement a profile filename format string which generates a random
> number mod a specified amount (strawman:
> `LLVM_PROFILE_FILE=default.profraw.%7u`, which generates a _u_nique number
> mod 7; in general it is `%<N>u`)
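A minimal sketch of the getpid() % PoolSize idea mentioned above: each
process maps deterministically to one of PoolSize raw-profile files, so a
build with thousands of compile processes only ever touches a handful of
files (3-10 in the measurements earlier in the thread). The helper name is
hypothetical, not the actual implementation:

    #include <stddef.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical helper: pick the pool slot for this process and build
     * the raw-profile file name from it, e.g. "default.profraw.2". */
    static void poolProfileName(char *Buf, size_t Len, const char *Prefix,
                                unsigned PoolSize) {
      unsigned Slot = (unsigned)getpid() % PoolSize;
      snprintf(Buf, Len, "%s.%u", Prefix, Slot);
    }

With PoolSize == 1 every process merges into a single file; larger pools
trade a little storage for less lock contention, as in the numbers above.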
Sean Silva via llvm-dev
2016-Mar-01 23:54 UTC
[llvm-dev] Add support for in-process profile merging in profile-runtime
On Tue, Mar 1, 2016 at 3:41 PM, Xinliang David Li <xinliangli at gmail.com>
wrote:

> Sounds reasonable. My design of c) is different in many ways (e.g., using
> getpid() % PoolSize), but we can delay discussion of that until code
> review.

I like that (e.g. support %7p in addition to %p).

-- Sean Silva
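For the %p vs. %7p distinction above, a sketch of how such a specifier might
expand, assuming the runtime is handed the text that follows the '%' in the
filename pattern; the parser is hypothetical, only the behavior (plain %p
gives one file per pid, %<N>p gives pid mod N) comes from the thread:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical expansion of the pid specifier in a filename pattern:
     *   "p"  -> full pid  (one raw profile per process)
     *   "7p" -> pid % 7   (a pool of at most 7 raw profiles) */
    static void expandPidSpec(char *Out, size_t Len, const char *AfterPercent) {
      unsigned long Pid = (unsigned long)getpid();
      char *End;
      unsigned long N = strtoul(AfterPercent, &End, 10);
      if (N > 0 && *End == 'p')
        snprintf(Out, Len, "%lu", Pid % N); /* pooled file name suffix */
      else
        snprintf(Out, Len, "%lu", Pid);     /* per-process file name suffix */
    }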