Xinliang David Li via llvm-dev
2016-Feb-28 02:02 UTC
[llvm-dev] Add support for in-process profile merging in profile-runtime
One of the main missing features in the Clang/LLVM profile runtime is support for online (in-process) profile merging. Today, profile data collected from different workloads for the same executable binary has to be gathered and then merged later by the offline post-processing tool. This limitation makes it hard to handle cases where the instrumented binary needs to be run with a large number of small workloads, possibly in parallel. For instance, to do PGO for clang, we may choose to build a large project with the instrumented Clang binary. That is hard to do today because:

1) To avoid profiles from different runs overwriting each other, %p substitution needs to be specified either on the command line or in an environment variable so that each process dumps profile data into its own file, named using its pid. This creates a huge disk storage requirement. For instance, clang's raw profile size is typically 80M -- if the instrumented clang is used to build a medium to large size project (such as clang itself), profile data can easily use up hundreds of gigabytes of local storage.
2) pids can also be recycled. This means that some of the profile data may be overwritten without being noticed.

The way to solve this problem is to allow profile data to be merged in process. I have a prototype implementation and plan to send it out for review soon after some cleanups. By default, profile merging is off, and it can be turned on with a user option or via an environment variable. The following summarizes the issues involved in adding this feature:

1. The target platform needs to have file locking support.
2. There needs to be an efficient way to identify the profile data and associate it with the binary, using a binary/profdata signature.
3. Currently, without merging, profile data from shared libraries (including dlopen/dlclose ones) is concatenated into the primary profile file. This can complicate matters, as the merger also needs to find the matching shared libs, and the merger also needs to avoid unnecessary data movement/copy.
4. Value profile data is variable in length even for the same binary.

All the above issues are resolved, and a clang self-build with the instrumented binary passes (with both -j1 and high parallelism).

If you have any concerns, please let me know.

thanks,

David
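A minimal sketch of the general idea of merging counters in process under a file lock. This is not the prototype described above and does not use the real raw profile format: the file layout (a flat array of 64-bit counters) and the function name `merge_counters_into_file` are assumptions made purely for illustration, and `fcntl.flock` restricts the sketch to Unix-like systems.

```python
import fcntl
import os
import struct
import tempfile


def merge_counters_into_file(path, counters):
    """Add this process's counters to the shared file while holding an exclusive lock.

    Hypothetical layout: the file is just len(counters) little-endian uint64 values.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+b") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no other process holds the lock
        try:
            fmt = "<%dQ" % len(counters)
            data = f.read()
            if data:
                # Stand-in for the binary/profdata signature check: the layout must match.
                if len(data) != struct.calcsize(fmt):
                    raise ValueError("profile layout mismatch")
                existing = struct.unpack(fmt, data)
                counters = [a + b for a, b in zip(existing, counters)]
            f.seek(0)
            f.write(struct.pack(fmt, *counters))
            f.flush()
            f.truncate()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
    return list(counters)


# Two "runs" of the same binary accumulate into a single file.
demo_path = os.path.join(tempfile.mkdtemp(), "merged.raw")
merge_counters_into_file(demo_path, [1, 2, 3])
print(merge_counters_into_file(demo_path, [10, 0, 5]))  # prints [11, 2, 8]
```

The point of the sketch is that the file never grows: however many processes run, storage stays at the size of one profile, at the cost of serializing the read-add-write step.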
Sean Silva via llvm-dev
2016-Feb-28 02:50 UTC
[llvm-dev] Add support for in-process profile merging in profile-runtime
I have thought about this issue too, in the context of games. We may want to turn profiling on only for certain frames (essentially, this is many small profile runs).

However, I have not seen it demonstrated that this kind of refined data collection will actually improve PGO results in practice. The evidence I do have, though, is that IIRC Apple have found that almost all of the benefits of PGO for the Clang binary can be gotten with a handful of training runs of Clang. Are your findings different?

Also, in general, I am very wary of file locking. This can cause huge amounts of slowdown for a build and has potential portability problems. I don't see it as a substantially better solution than wrapping clang in a script that runs clang and then just calls llvm-profdata to do the merging. Running llvm-profdata is cheap compared to doing locking in a highly parallel situation like a build.

-- Sean Silva
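A rough sketch of what such a wrapper would have to run for each compiler invocation. It only plans the commands and does not execute them. `LLVM_PROFILE_FILE` and `llvm-profdata merge -o` are the real environment variable and subcommand; the function name `plan_wrapper_run`, the file names, and the `.tmp` suffix are hypothetical.

```python
def plan_wrapper_run(compile_cmd, raw_profile, merged_profile, merged_exists):
    """Return (compile command, extra environment, merge command) for one wrapped run."""
    # Tell the instrumented compiler where to dump its raw profile for this run.
    env_update = {"LLVM_PROFILE_FILE": raw_profile}
    # Merge the fresh raw profile with the accumulated profile, if one exists yet.
    inputs = [raw_profile] + ([merged_profile] if merged_exists else [])
    # Write to a temporary name; the wrapper would rename it into place afterwards.
    merge_cmd = ["llvm-profdata", "merge", "-o", merged_profile + ".tmp"] + inputs
    return compile_cmd, env_update, merge_cmd


_, env, merge = plan_wrapper_run(["clang", "-c", "a.c"], "a.profraw", "all.profdata", True)
print(env)
print(" ".join(merge))
```

In a parallel build, several wrappers could reach the merge step at the same time, so the accumulated file would still need some form of synchronization around that step.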
Hal Finkel via llvm-dev
2016-Feb-28 04:14 UTC
[llvm-dev] Add support for in-process profile merging in profile-runtime
----- Original Message -----
> From: "Sean Silva via llvm-dev" <llvm-dev at lists.llvm.org>
> To: "Xinliang David Li" <davidxl at google.com>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Saturday, February 27, 2016 8:50:05 PM
> Subject: Re: [llvm-dev] Add support for in-process profile merging in profile-runtime
>
> Also, in general, I am very wary of file locking.

As am I (especially since it often does not operate correctly, or is very slow, on distributed file systems). Why don't you just read in an existing file to pre-populate the counters section when it exists at startup?

-Hal

> This can cause huge amounts of slowdown for a build and has potential portability problems. I don't see it as a substantially better solution than wrapping clang in a script that runs clang and then just calls llvm-profdata to do the merging. Running llvm-profdata is cheap compared to doing locking in a highly parallel situation like a build.

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
Justin Bogner via llvm-dev
2016-Feb-28 07:44 UTC
[llvm-dev] Add support for in-process profile merging in profile-runtime
Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org> writes:
> One of the main missing features in Clang/LLVM profile runtime is the lack of support for online/in-process profile merging support. Profile data collected for different workloads for the same executable binary need to be collected and merged later by the offline post-processing tool. This limitation makes it hard to handle cases where the instrumented binary needs to be run with large number of small workloads, possibly in parallel. For instance, to do PGO for clang, we may choose to build a large project with the instrumented Clang binary. This is because
>
> 1) to avoid profile from different runs from overriding others, %p substitution needs to be specified in either the command line or an environment variable so that different process can dump profile data into its own file named using pid.

... or you can specify a more specific name that describes what's under test, instead of %p.

> This will create huge requirement on the disk storage. For instance, clang's raw profile size is typically 80M -- if the instrumented clang is used to build a medium to large size project (such as clang itself), profile data can easily use up hundreds of Gig bytes of local storage.

This argument is kind of confusing. It says that one profile is typically 80M, then claims that this uses 100s of GB of data. I suppose that only makes sense if you run 1000 profiling runs without merging the data in between. Is that what you're talking about, or did I miss something?

> 2) pid can also be recycled. This means that some of the profile data may be overridden without being noticed.
>
> The way to solve this problem is to allow profile data to be merged in process.

I'm not convinced. Can you provide some more concrete examples of where the out-of-process merging model fails? This was a *very deliberate* design decision in how clang's profiling works, and most of the subsequent decisions have been based on this initial one. Changing it has far-reaching effects.

> I have a prototype implementation and plan to send it out for review soon after some clean ups. By default, the profiling merging is off and it can be turned on with an user option or via an environment variable. The following summarizes the issues involved in adding this feature:
> 1. the target platform needs to have file locking support
> 2. there needs an efficient way to identify the profile data and associate it with the binary using binary/profdata signature;
> 3. Currently without merging, profile data from shared libraries (including dlopen/dlclose ones) are concatenated into the primary profile file. This can complicate matters, as the merger also needs to find the matching shared libs, and the merger also needs to avoid unnecessary data movement/copy;
> 4. value profile data is variable in length even for the same binary.

If we actually want this, we should reconsider the design of having a raw vs processed profiling format. The raw profile format is specifically designed to be fast to write out and not to consider merging profiles at all. This feature would make it nearly as complicated as the processed format and lose all of the advantages of making them different.

> All the above issues are resolved and clang self build with instrumented binary passes (with both j1 and high parallelism).
>
> If you have any concerns, please let me know.
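The storage question above is simple arithmetic, sketched here for concreteness. The 80M figure is the one quoted in the thread; treating it as 80 MB (decimal) and the process counts are assumptions for illustration.

```python
RAW_PROFILE_MB = 80  # typical clang raw profile size quoted in the thread


def total_profile_storage_gb(num_processes, profile_mb=RAW_PROFILE_MB):
    """Disk space needed if every process keeps its own unmerged raw profile."""
    return num_processes * profile_mb / 1000.0


print(total_profile_storage_gb(1000))  # prints 80.0
print(total_profile_storage_gb(2500))  # prints 200.0
```

So 1000 unmerged runs need about 80 GB, and "hundreds of gigabytes" corresponds to a few thousand compiler invocations, each leaving its own raw profile behind.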
Xinliang David Li via llvm-dev
2016-Feb-28 08:13 UTC
[llvm-dev] Add support for in-process profile merging in profile-runtime
On Sat, Feb 27, 2016 at 6:50 PM, Sean Silva <chisophugis at gmail.com> wrote:
> I have thought about this issue too, in the context of games. We may want to turn profiling only for certain frames (essentially, this is many small profile runs).
>
> However, I have not seen it demonstrated that this kind of refined data collection will actually improve PGO results in practice. The evidence I do have though is that IIRC Apple have found that almost all of the benefits of PGO for the Clang binary can be gotten with a handful of training runs of Clang. Are your findings different?

We have a very wide customer base, so we cannot claim that one use model is sufficient for all users. For instance, we have users using fine-grained profile dumping control (programmatically), as you described above. There are also other possible use cases, such as dumping profiles for different periodic phases into files associated with those phases. Later, the different phases' profile data can be merged with different weights.

> Also, in general, I am very wary of file locking. This can cause huge amounts of slowdown for a build and has potential portability problems.

I don't see much slowdown with a clang build using instrumented clang as the build compiler. With file locking and profile merging enabled, the build time on my local machine looks like:

real 18m22.737s
user 293m18.924s
sys 9m55.532s

If profile merging/locking is disabled (i.e., letting the profile dumpers clobber/write over each other), the real time is about 14m.

> I don't see it as a substantially better solution than wrapping clang in a script that runs clang and then just calls llvm-profdata to do the merging. Running llvm-profdata is cheap compared to doing locking in a highly parallel situation like a build.

That would require synchronization for merging too.

From Justin's email, it looks like there is a key point I have not made clear: the online profile merge is a very simple raw-profile-to-raw-profile merge, which is super fast. The end result of the profile run is still in raw format. The raw-to-indexed merging is still needed -- but instead of merging thousands of raw profiles, which can be very slow, with this model only one raw profile input is needed.

thanks,

David
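For reference, the relative cost implied by the two wall-clock figures above can be worked out as follows. The baseline is only given as "about 14m", so the result is approximate; the helper name `to_seconds` is just for this sketch.

```python
def to_seconds(minutes, seconds=0.0):
    """Convert a 'XmY.Zs' style time into seconds."""
    return minutes * 60 + seconds


merged_real = to_seconds(18, 22.737)   # build with locking and merging enabled
baseline_real = to_seconds(14)         # "about 14m" with merging disabled

overhead = merged_real / baseline_real - 1.0
print("wall-clock overhead: about %.0f%%" % (overhead * 100))  # prints about 31%
```

Under these numbers, enabling locking and merging costs roughly four and a half minutes of wall-clock time, or about 31% over the clobbering baseline.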
Xinliang David Li via llvm-dev
2016-Feb-28 08:46 UTC
[llvm-dev] Add support for in-process profile merging in profile-runtime
Justin, it looks like there is some misunderstanding of my email. I want to clarify it here first:

1) I am not proposing changing the default profile dumping model as used today. The online merging is totally optional.
2) The online profile merging does not do conversion from raw to indexed format. It does very simple raw-to-raw merging using existing runtime APIs.
3) The change to the existing profile runtime code is just a few lines. All the new functionality is isolated in one new file. It will become clear when the patch is sent out later.

My inline replies are below:

On Sat, Feb 27, 2016 at 11:44 PM, Justin Bogner <mail at justinbogner.com> wrote:
> Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org> writes:
> > One of the main missing features in Clang/LLVM profile runtime is the lack of support for online/in-process profile merging support. Profile data collected for different workloads for the same executable binary need to be collected and merged later by the offline post-processing tool. This limitation makes it hard to handle cases where the instrumented binary needs to be run with large number of small workloads, possibly in parallel. For instance, to do PGO for clang, we may choose to build a large project with the instrumented Clang binary. This is because
> >
> > 1) to avoid profile from different runs from overriding others, %p substitution needs to be specified in either the command line or an environment variable so that different process can dump profile data into its own file named using pid.
>
> ... or you can specify a more specific name that describes what's under test, instead of %p.

Yes -- but the problem still exists -- each training process will need its own copy of the raw profile.

> > This will create huge requirement on the disk storage. For instance, clang's raw profile size is typically 80M -- if the instrumented clang is used to build a medium to large size project (such as clang itself), profile data can easily use up hundreds of Gig bytes of local storage.
>
> This argument is kind of confusing. It says that one profile is typically 80M, then claims that this uses 100s of GB of data. I suppose that only makes sense if you run 1000 profiling runs without merging the data in between. Is that what you're talking about, or did I miss something?

Yes. For instance, first build a clang with -fprofile-instr-generate=prof.data.%p, and use this instrumented clang to build another large project such as clang itself. The second build will produce tons of profile data.

> > 2) pid can also be recycled. This means that some of the profile data may be overridden without being noticed.
> >
> > The way to solve this problem is to allow profile data to be merged in process.
>
> I'm not convinced. Can you provide some more concrete examples of where the out of process merging model fails? This was a *very deliberate* design decision in how clang's profiling works, and most of the subsequent decisions have been based on this initial one. Changing it has far reaching effects.

I am not proposing changing the out-of-process merging -- it is still needed. What I meant is that, in a scenario where the instrumented binaries are run multiple times (using their existing running harness), there is no good/automatic way of making sure that different processes' profile data won't have name conflicts.

Take clang's self-build (using instrumented clang as the build compiler for profile bootstrapping) as an example. Ideally this should all be done transparently -- i.e., set the instrumented compiler as the build compiler, run ninja or make, and things will just work. But with the current default profile dumping mode, it can fail in many different ways:

1) Just run ninja/make -- all clang processes will dump profiles into the same file concurrently -- the result is a corrupted profile -- FAIL
2) Run ninja with LLVM_PROFILE_FILE=....%p
   2.1) failure mode #1 --> really slow build due to large IO, or running out of disk space
   2.2) failure mode #2 --> pid recycling leading to profile file name conflicts -- profile overwriting happens and we lose data

Suppose 2) above finally succeeds: the user will still have to merge thousands of raw profiles into an indexed profile. With the proposed online profile merging, you just need to use the instrumented clang, and one merged raw profile is automagically produced in the end. The raw-to-indexed merge is also much faster.

The online merge feature has a huge advantage when considering integrating the instrumented binary with existing make systems or load-testing harnesses -- it is almost seamless.

> > I have a prototype implementation and plan to send it out for review soon after some clean ups. By default, the profiling merging is off and it can be turned on with an user option or via an environment variable. The following summarizes the issues involved in adding this feature:
> > 1. the target platform needs to have file locking support
> > 2. there needs an efficient way to identify the profile data and associate it with the binary using binary/profdata signature;
> > 3. Currently without merging, profile data from shared libraries (including dlopen/dlclose ones) are concatenated into the primary profile file. This can complicate matters, as the merger also needs to find the matching shared libs, and the merger also needs to avoid unnecessary data movement/copy;
> > 4. value profile data is variable in length even for the same binary.
>
> If we actually want this, we should reconsider the design of having a raw vs processed profiling format. The raw profile format is specifically designed to be fast to write out and not to consider merging profiles at all. This feature would make it nearly as complicated as the processed format and lose all of the advantages of making them different.

See above -- all of the nice raw profile dumping mechanism is still kept -- that will not change.

thanks,

David

> > All the above issues are resolved and clang self build with instrumented binary passes (with both j1 and high parallelism).
> >
> > If you have any concerns, please let me know.
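Failure mode 2.2 can be illustrated with a small simulation. The pid allocator below (sequential pids that wrap around at a maximum) is a deliberately simplified assumption, not how any particular OS assigns pids, and the function names are hypothetical; only the `%p` substitution in the file name pattern comes from the thread.

```python
def recycled_pids(num_processes, pid_max, first_pid=1000):
    """Hypothetical allocator: hand out sequential pids, wrapping around at pid_max."""
    span = pid_max - first_pid
    return [first_pid + (i % span) for i in range(num_processes)]


def simulate_profile_files(pids, pattern="prof.data.%p"):
    """Return (files on disk, number of profiles silently overwritten)."""
    files = {}
    overwritten = 0
    for run_index, pid in enumerate(pids):
        name = pattern.replace("%p", str(pid))
        if name in files:
            overwritten += 1  # an earlier run's profile is lost without any error
        files[name] = run_index
    return files, overwritten


files, lost = simulate_profile_files(recycled_pids(5000, pid_max=4000))
print(len(files), lost)  # prints 3000 2000
```

In this toy setup, 5000 compiler processes leave only 3000 profile files behind, and nothing in the build output indicates that 2000 profiles were replaced.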