thr3ads.net - llvm dev - [llvm-dev] Add support for in-process profile merging in profile-runtime [Feb 2016]

If this information is useful, please help other people find it:
Share via:

Xinliang David Li via llvm-dev

2016-Feb-28 08:46 UTC

[llvm-dev] Add support for in-process profile merging in profile-runtime

Justin, looks like there is some misunderstanding in my email. I want to
clarify it here first:

1) I am not proposing changing the default profile dumping model as used
today. The online merging is totally optional;
2) the on-line profile merging is not doing conversion from raw to index
format. It does very simple raw-to-raw merging using existing runtime APIs.
3) the change to existing profile runtime code is just a few lines. All the
new functionality is isolated in one new file. It will become clear when
the patch is sent out later.

My inline replies below:


On Sat, Feb 27, 2016 at 11:44 PM, Justin Bogner <mail at justinbogner.com>
wrote:
> Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org> writes:
> > One of the main missing features in Clang/LLVM profile runtime is the
> lack of
> > support for online/in-process profile merging support. Profile data
> collected
> > for different workloads for the same executable binary need to be
> collected
> > and merged later by the offline post-processing tool.  This limitation
> makes
> > it hard to handle cases where the instrumented binary needs to be run
> with
> > large number of small workloads, possibly in parallel.  For instance,
to
> do
> > PGO for clang, we may choose to  build  a large project with the
> instrumented
> > Clang binary. This is because
> >
> > 1) to avoid profile from different runs from overriding others, %p
> >    substitution needs to be specified in either the command line or an
> >    environment variable so that different process can dump profile
data
> >    into its own file named using pid.
>
> ... or you can specify a more specific name that describes what's under
> test, instead of %p.
>
yes -- but the problem still exists -- each training process will need its
own copy of raw profile.

>
> >    This will create huge requirement on the disk storage. For
> >    instance, clang's raw profile size is typically 80M -- if the
> >    instrumented clang is used to build a medium to large size project
> >    (such as clang itself), profile data can easily use up hundreds of
> >    Gig bytes of local storage.
>
> This argument is kind of confusing. It says that one profile is
> typicially 80M, then claims that this uses 100s of GB of data. From
> these statements that only makes sense I suppose that's true if you run
> 1000 profiling runs without merging the data in between. Is that what
> you're talking about, or did I miss something?
>
Yes. For instance,  first build a  clang with
-fprofile-instr-generate=prof.data.%p, and use this instrumented clang to
build another large project such as clang itself. The second build will
produce tons of profile data.

>
> > 2) pid can also be recycled. This means that some of the profile data
> may be
> >    overridden without being noticed.
> >
> > The way to solve this problem is to allow profile data to be merged in
> > process.
>
> I'm not convinced. Can you provide some more concrete examples of where
> the out of process merging model fails? This was a *very deliberate*
> design decision in how clang's profiling works, and most of the
> subsequent decisions have been based on this initial one. Changing it
> has far reaching effects.

I am not proposing changing the out of process merging -- it is still
needed. What I meant is that, in a scenario where the instrumented binaries
are running multiple times (using their existing running harness), there is
no good/automatic way of making sure different process's profile data
won't
have name conflict.

Using clang's self build (using instrumented clang as build compiler for
profile bootstrapping) as an example.  Ideally this should all be done
transparently  -- i.e, set the instrumented compiler as the build compiler,
run ninja or make and things will just work, but with the current default
profile dumping mode, it can fail in many different ways:
1) Just run ninja/make -- all clang processes will dump profile into the
same file concurrently -- the result is a corrupted profile -- FAIL
2) run ninja with LLVM_PROFILE_FILE=....%p
   2.1) failure mode #1 --> really slow build due to large IO; or running
out of diskspace
   2.2) failure mode #2 --> pid recyling leading to profile file name
conflict -- profile overwriting happens and we loss data

Suppose 2) above finally succeeds, the user will have to merge thousands of
raw profiles to indexed profile.

With the proposed profile on-line merging, you just need to use the
instrumented clang, and one merged raw profile data automagically produced
in the end. The raw to indexed merge is also much faster.

The online merge feature has a huge advantage when considering integrating
the instrumented binary with existing make systems or loadtesting harness
-- it is almost seamless.


>
> > I have a prototype implementation and plan to send it out for review
> > soon after some clean ups. By default, the profiling merging is off
and
> it can
> > be turned on with an user option or via an environment variable. The
> following
> > summarizes the issues involved in adding this feature:
> > 1. the target platform needs to have file locking support
> > 2. there needs an efficient way to identify the profile data and
> associate it
> >    with the binary using binary/profdata signature;
> > 3. Currently without merging, profile data from shared libraries
> >    (including dlopen/dlcose ones) are concatenated into the primary
> >    profile file. This can complicate matters, as the merger also needs
to
> >    find the matching shared libs, and the merger also needs to avoid
> >    unnecessary data movement/copy;
> > 4. value profile data is variable in length even for the same binary.
>
> If we actually want this, we should reconsider the design of having a
> raw vs processed profiling format. The raw profile format is
> specifically designed to be fast to write out and not to consider
> merging profiles at all. This feature would make it nearly as
> complicated as the processed format and lose all of the advantages of
> making them different.
>
See above -- all the nice raw profile dumping mechanism is still kept --
there won't be a change of that.

thanks,

David


>
> > All the above issues are resolved and clang self build with
instrumented
> > binary passes (with both j1 and high parallelism).
> >
> > If you have any concerns, please let me know.
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160228/72787166/attachment-0001.html>

Mehdi Amini via llvm-dev

2016-Feb-28 18:45 UTC

head link

[llvm-dev] Add support for in-process profile merging in profile-runtime

> On Feb 28, 2016, at 12:46 AM, Xinliang David Li via llvm-dev <llvm-dev
at lists.llvm.org> wrote:
> 
> Justin, looks like there is some misunderstanding in my email. I want to
clarify it here first:
> 
> 1) I am not proposing changing the default profile dumping model as used
today. The online merging is totally optional;
> 2) the on-line profile merging is not doing conversion from raw to index
format. It does very simple raw-to-raw merging using existing runtime APIs.
> 3) the change to existing profile runtime code is just a few lines. All the
new functionality is isolated in one new file. It will become clear when the
patch is sent out later.
> 
> My inline replies below:
> 
> 
> On Sat, Feb 27, 2016 at 11:44 PM, Justin Bogner <mail at
justinbogner.com <mailto:mail at justinbogner.com>> wrote:
> Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org
<mailto:llvm-dev at lists.llvm.org>> writes:
> > One of the main missing features in Clang/LLVM profile runtime is the
lack of
> > support for online/in-process profile merging support. Profile data
collected
> > for different workloads for the same executable binary need to be
collected
> > and merged later by the offline post-processing tool.  This limitation
makes
> > it hard to handle cases where the instrumented binary needs to be run
with
> > large number of small workloads, possibly in parallel.  For instance,
to do
> > PGO for clang, we may choose to  build  a large project with the
instrumented
> > Clang binary. This is because
> >
> > 1) to avoid profile from different runs from overriding others, %p
> >    substitution needs to be specified in either the command line or an
> >    environment variable so that different process can dump profile
data
> >    into its own file named using pid.
> 
> ... or you can specify a more specific name that describes what's under
> test, instead of %p.
> 
> yes -- but the problem still exists -- each training process will need its
own copy of raw profile.
>  
> 
> >    This will create huge requirement on the disk storage. For
> >    instance, clang's raw profile size is typically 80M -- if the
> >    instrumented clang is used to build a medium to large size project
> >    (such as clang itself), profile data can easily use up hundreds of
> >    Gig bytes of local storage.
> 
> This argument is kind of confusing. It says that one profile is
> typicially 80M, then claims that this uses 100s of GB of data. From
> these statements that only makes sense I suppose that's true if you run
> 1000 profiling runs without merging the data in between. Is that what
> you're talking about, or did I miss something?
> 
> Yes. For instance,  first build a  clang with
-fprofile-instr-generate=prof.data.%p, and use this instrumented clang to build
another large project such as clang itself. The second build will produce tons
of profile data.
>  
> 
> > 2) pid can also be recycled. This means that some of the profile data
may be
> >    overridden without being noticed.
> >
> > The way to solve this problem is to allow profile data to be merged in
> > process.
> 
> I'm not convinced. Can you provide some more concrete examples of where
> the out of process merging model fails? This was a *very deliberate*
> design decision in how clang's profiling works, and most of the
> subsequent decisions have been based on this initial one. Changing it
> has far reaching effects.
> 
> I am not proposing changing the out of process merging -- it is still
needed. What I meant is that, in a scenario where the instrumented binaries are
running multiple times (using their existing running harness), there is no
good/automatic way of making sure different process's profile data won't
have name conflict.
Could the profile file be named from a hash of the profile data themselves?

Even with that I still like the direction you're heading to. To avoid
contention on the "file locking", could there be a pool of output
files? I don't know how to map a new process to one output file without a
lock somewhere though.

-- 
Mehdi

> 
> Using clang's self build (using instrumented clang as build compiler
for profile bootstrapping) as an example.  Ideally this should all be done
transparently  -- i.e, set the instrumented compiler as the build compiler, run
ninja or make and things will just work, but with the current default profile
dumping mode, it can fail in many different ways:
> 1) Just run ninja/make -- all clang processes will dump profile into the
same file concurrently -- the result is a corrupted profile -- FAIL
> 2) run ninja with LLVM_PROFILE_FILE=....%p
>    2.1) failure mode #1 --> really slow build due to large IO; or
running out of diskspace
>    2.2) failure mode #2 --> pid recyling leading to profile file name
conflict -- profile overwriting happens and we loss data
> 
> Suppose 2) above finally succeeds, the user will have to merge thousands of
raw profiles to indexed profile.
>    
> With the proposed profile on-line merging, you just need to use the
instrumented clang, and one merged raw profile data automagically produced in
the end. The raw to indexed merge is also much faster.
> 
> The online merge feature has a huge advantage when considering integrating
the instrumented binary with existing make systems or loadtesting harness -- it
is almost seamless.
> 
>  
> 
> > I have a prototype implementation and plan to send it out for review
> > soon after some clean ups. By default, the profiling merging is off
and it can
> > be turned on with an user option or via an environment variable. The
following
> > summarizes the issues involved in adding this feature:
> > 1. the target platform needs to have file locking support
> > 2. there needs an efficient way to identify the profile data and
associate it
> >    with the binary using binary/profdata signature;
> > 3. Currently without merging, profile data from shared libraries
> >    (including dlopen/dlcose ones) are concatenated into the primary
> >    profile file. This can complicate matters, as the merger also needs
to
> >    find the matching shared libs, and the merger also needs to avoid
> >    unnecessary data movement/copy;
> > 4. value profile data is variable in length even for the same binary.
> 
> If we actually want this, we should reconsider the design of having a
> raw vs processed profiling format. The raw profile format is
> specifically designed to be fast to write out and not to consider
> merging profiles at all. This feature would make it nearly as
> complicated as the processed format and lose all of the advantages of
> making them different.
> 
> See above -- all the nice raw profile dumping mechanism is still kept --
there won't be a change of that.
> 
> thanks,
> 
> David
> 
>  
> 
> > All the above issues are resolved and clang self build with
instrumented
> > binary passes (with both j1 and high parallelism). 
> >
> > If you have any concerns, please let me know.
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160228/bed02bc4/attachment.html>

Xinliang David Li via llvm-dev

2016-Feb-28 18:59 UTC

head link

[llvm-dev] Add support for in-process profile merging in profile-runtime

On Sun, Feb 28, 2016 at 10:45 AM, Mehdi Amini <mehdi.amini at apple.com>
wrote:
>
> On Feb 28, 2016, at 12:46 AM, Xinliang David Li via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Justin, looks like there is some misunderstanding in my email. I want to
> clarify it here first:
>
> 1) I am not proposing changing the default profile dumping model as used
> today. The online merging is totally optional;
> 2) the on-line profile merging is not doing conversion from raw to index
> format. It does very simple raw-to-raw merging using existing runtime APIs.
> 3) the change to existing profile runtime code is just a few lines. All
> the new functionality is isolated in one new file. It will become clear
> when the patch is sent out later.
>
> My inline replies below:
>
>
> On Sat, Feb 27, 2016 at 11:44 PM, Justin Bogner <mail at
justinbogner.com>
> wrote:
>
>> Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org>
writes:
>> > One of the main missing features in Clang/LLVM profile runtime is
the
>> lack of
>> > support for online/in-process profile merging support. Profile
data
>> collected
>> > for different workloads for the same executable binary need to be
>> collected
>> > and merged later by the offline post-processing tool.  This
limitation
>> makes
>> > it hard to handle cases where the instrumented binary needs to be
run
>> with
>> > large number of small workloads, possibly in parallel.  For
instance,
>> to do
>> > PGO for clang, we may choose to  build  a large project with the
>> instrumented
>> > Clang binary. This is because
>> >
>> > 1) to avoid profile from different runs from overriding others, %p
>> >    substitution needs to be specified in either the command line
or an
>> >    environment variable so that different process can dump profile
data
>> >    into its own file named using pid.
>>
>> ... or you can specify a more specific name that describes what's
under
>> test, instead of %p.
>>
>
> yes -- but the problem still exists -- each training process will need its
> own copy of raw profile.
>
>
>>
>> >    This will create huge requirement on the disk storage. For
>> >    instance, clang's raw profile size is typically 80M -- if
the
>> >    instrumented clang is used to build a medium to large size
project
>> >    (such as clang itself), profile data can easily use up hundreds
of
>> >    Gig bytes of local storage.
>>
>> This argument is kind of confusing. It says that one profile is
>> typicially 80M, then claims that this uses 100s of GB of data. From
>> these statements that only makes sense I suppose that's true if you
run
>> 1000 profiling runs without merging the data in between. Is that what
>> you're talking about, or did I miss something?
>>
>
> Yes. For instance,  first build a  clang with
> -fprofile-instr-generate=prof.data.%p, and use this instrumented clang to
> build another large project such as clang itself. The second build will
> produce tons of profile data.
>
>
>>
>> > 2) pid can also be recycled. This means that some of the profile
data
>> may be
>> >    overridden without being noticed.
>> >
>> > The way to solve this problem is to allow profile data to be
merged in
>> > process.
>>
>> I'm not convinced. Can you provide some more concrete examples of
where
>> the out of process merging model fails? This was a *very deliberate*
>> design decision in how clang's profiling works, and most of the
>> subsequent decisions have been based on this initial one. Changing it
>> has far reaching effects.
>
>
> I am not proposing changing the out of process merging -- it is still
> needed. What I meant is that, in a scenario where the instrumented binaries
> are running multiple times (using their existing running harness), there is
> no good/automatic way of making sure different process's profile data
won't
> have name conflict.
>
>
> Could the profile file be named from a hash of the profile data themselves?
>
That will solve the name conflict problem -- but the problem with larger
resource requirement is still there :)
>
> Even with that I still like the direction you're heading to. To avoid
> contention on the "file locking", could there be a pool of output
files? I
> don't know how to map a new process to one output file without a lock
> somewhere though.
>
We can certainly introduce something like that when lock contention becomes
an issue (which is a great idea by the way, and quite straightforward to
implement). From my initial experiment, it does seem to be a problem.

thanks,

David
>
> --
> Mehdi
>
>
>
> Using clang's self build (using instrumented clang as build compiler
for
> profile bootstrapping) as an example.  Ideally this should all be done
> transparently  -- i.e, set the instrumented compiler as the build compiler,
> run ninja or make and things will just work, but with the current default
> profile dumping mode, it can fail in many different ways:
> 1) Just run ninja/make -- all clang processes will dump profile into the
> same file concurrently -- the result is a corrupted profile -- FAIL
> 2) run ninja with LLVM_PROFILE_FILE=....%p
>    2.1) failure mode #1 --> really slow build due to large IO; or
running
> out of diskspace
>    2.2) failure mode #2 --> pid recyling leading to profile file name
> conflict -- profile overwriting happens and we loss data
>
> Suppose 2) above finally succeeds, the user will have to merge thousands
> of raw profiles to indexed profile.
>
> With the proposed profile on-line merging, you just need to use the
> instrumented clang, and one merged raw profile data automagically produced
> in the end. The raw to indexed merge is also much faster.
>
> The online merge feature has a huge advantage when considering integrating
> the instrumented binary with existing make systems or loadtesting harness
> -- it is almost seamless.
>
>
>
>>
>> > I have a prototype implementation and plan to send it out for
review
>> > soon after some clean ups. By default, the profiling merging is
off and
>> it can
>> > be turned on with an user option or via an environment variable.
The
>> following
>> > summarizes the issues involved in adding this feature:
>> > 1. the target platform needs to have file locking support
>> > 2. there needs an efficient way to identify the profile data and
>> associate it
>> >    with the binary using binary/profdata signature;
>> > 3. Currently without merging, profile data from shared libraries
>> >    (including dlopen/dlcose ones) are concatenated into the
primary
>> >    profile file. This can complicate matters, as the merger also
needs
>> to
>> >    find the matching shared libs, and the merger also needs to
avoid
>> >    unnecessary data movement/copy;
>> > 4. value profile data is variable in length even for the same
binary.
>>
>> If we actually want this, we should reconsider the design of having a
>> raw vs processed profiling format. The raw profile format is
>> specifically designed to be fast to write out and not to consider
>> merging profiles at all. This feature would make it nearly as
>> complicated as the processed format and lose all of the advantages of
>> making them different.
>>
>
> See above -- all the nice raw profile dumping mechanism is still kept --
> there won't be a change of that.
>
> thanks,
>
> David
>
>
>
>>
>> > All the above issues are resolved and clang self build with
instrumented
>> > binary passes (with both j1 and high parallelism).
>> >
>> > If you have any concerns, please let me know.
>>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160228/bb49fa62/attachment.html>

Vedant Kumar via llvm-dev

2016-Feb-29 16:41 UTC

head link

[llvm-dev] Add support for in-process profile merging in profile-runtime

+ 1 to Sean's suggestion of using a wrapper script to call profdata merge.

David, does that work for your use case?

Some inline comments ---
> On Feb 28, 2016, at 10:45 AM, Mehdi Amini via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
>> 
>> On Feb 28, 2016, at 12:46 AM, Xinliang David Li via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>> 
>> Justin, looks like there is some misunderstanding in my email. I want
to clarify it here first:
>> 
>> 1) I am not proposing changing the default profile dumping model as
used today. The online merging is totally optional;
>> 2) the on-line profile merging is not doing conversion from raw to
index format. It does very simple raw-to-raw merging using existing runtime
APIs.
>> 3) the change to existing profile runtime code is just a few lines. All
the new functionality is isolated in one new file. It will become clear when the
patch is sent out later.
>> 
>> My inline replies below:
>> 
>> 
>> On Sat, Feb 27, 2016 at 11:44 PM, Justin Bogner <mail at
justinbogner.com> wrote:
>> Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org>
writes:
>> > One of the main missing features in Clang/LLVM profile runtime is
the lack of
>> > support for online/in-process profile merging support. Profile
data collected
>> > for different workloads for the same executable binary need to be
collected
>> > and merged later by the offline post-processing tool.  This
limitation makes
>> > it hard to handle cases where the instrumented binary needs to be
run with
>> > large number of small workloads, possibly in parallel.  For
instance, to do
>> > PGO for clang, we may choose to  build  a large project with the
instrumented
>> > Clang binary. This is because
>> >
>> > 1) to avoid profile from different runs from overriding others, %p
>> >    substitution needs to be specified in either the command line
or an
>> >    environment variable so that different process can dump profile
data
>> >    into its own file named using pid.
>> 
>> ... or you can specify a more specific name that describes what's
under
>> test, instead of %p.
>> 
>> yes -- but the problem still exists -- each training process will need
its own copy of raw profile.
>>  
>> 
>> >    This will create huge requirement on the disk storage. For
>> >    instance, clang's raw profile size is typically 80M -- if
the
>> >    instrumented clang is used to build a medium to large size
project
>> >    (such as clang itself), profile data can easily use up hundreds
of
>> >    Gig bytes of local storage.
>> 
>> This argument is kind of confusing. It says that one profile is
>> typicially 80M, then claims that this uses 100s of GB of data. From
>> these statements that only makes sense I suppose that's true if you
run
>> 1000 profiling runs without merging the data in between. Is that what
>> you're talking about, or did I miss something?
>> 
>> Yes. For instance,  first build a  clang with
-fprofile-instr-generate=prof.data.%p, and use this instrumented clang to build
another large project such as clang itself. The second build will produce tons
of profile data.
>>  
>> 
>> > 2) pid can also be recycled. This means that some of the profile
data may be
>> >    overridden without being noticed.
>> >
>> > The way to solve this problem is to allow profile data to be
merged in
>> > process.
>> 
>> I'm not convinced. Can you provide some more concrete examples of
where
>> the out of process merging model fails? This was a *very deliberate*
>> design decision in how clang's profiling works, and most of the
>> subsequent decisions have been based on this initial one. Changing it
>> has far reaching effects.
>> 
>> I am not proposing changing the out of process merging -- it is still
needed. What I meant is that, in a scenario where the instrumented binaries are
running multiple times (using their existing running harness), there is no
good/automatic way of making sure different process's profile data won't
have name conflict.
> 
> Could the profile file be named from a hash of the profile data themselves?
IIUC David's plan is to repeatedly use the same on-disk raw-profile.

> Even with that I still like the direction you're heading to. To avoid
contention on the "file locking", could there be a pool of output
files? I don't know how to map a new process to one output file without a
lock somewhere though.
> 
> -- 
> Mehdi
> 
> 
>> 
>> Using clang's self build (using instrumented clang as build
compiler for profile bootstrapping) as an example.  Ideally this should all be
done transparently  -- i.e, set the instrumented compiler as the build compiler,
run ninja or make and things will just work, but with the current default
profile dumping mode, it can fail in many different ways:
>> 1) Just run ninja/make -- all clang processes will dump profile into
the same file concurrently -- the result is a corrupted profile -- FAIL
>> 2) run ninja with LLVM_PROFILE_FILE=....%p
>>    2.1) failure mode #1 --> really slow build due to large IO; or
running out of diskspace
>>    2.2) failure mode #2 --> pid recyling leading to profile file
name conflict -- profile overwriting happens and we loss data
>> 
>> Suppose 2) above finally succeeds, the user will have to merge
thousands of raw profiles to indexed profile.
>>    
>> With the proposed profile on-line merging, you just need to use the
instrumented clang, and one merged raw profile data automagically produced in
the end. The raw to indexed merge is also much faster.
>> 
>> The online merge feature has a huge advantage when considering
integrating the instrumented binary with existing make systems or loadtesting
harness -- it is almost seamless.
>> 
>>  
>> 
>> > I have a prototype implementation and plan to send it out for
review
>> > soon after some clean ups. By default, the profiling merging is
off and it can
>> > be turned on with an user option or via an environment variable.
The following
>> > summarizes the issues involved in adding this feature:
>> > 1. the target platform needs to have file locking support
>> > 2. there needs an efficient way to identify the profile data and
associate it
>> >    with the binary using binary/profdata signature;
>> > 3. Currently without merging, profile data from shared libraries
>> >    (including dlopen/dlcose ones) are concatenated into the
primary
>> >    profile file. This can complicate matters, as the merger also
needs to
>> >    find the matching shared libs, and the merger also needs to
avoid
>> >    unnecessary data movement/copy;
>> > 4. value profile data is variable in length even for the same
binary.
>> 
>> If we actually want this, we should reconsider the design of having a
>> raw vs processed profiling format. The raw profile format is
>> specifically designed to be fast to write out and not to consider
>> merging profiles at all. This feature would make it nearly as
>> complicated as the processed format and lose all of the advantages of
>> making them different.
>> 
>> See above -- all the nice raw profile dumping mechanism is still kept
-- there won't be a change of that.
>> 
>> thanks,
>> 
>> David
>> 
>>  
>> 
>> > All the above issues are resolved and clang self build with
instrumented
>> > binary passes (with both j1 and high parallelism). 
>> >
>> > If you have any concerns, please let me know.
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 
> However, I have not seen it demonstrated that this kind of refined data
collection will actually improve PGO results in practice.
That's an important question to ask. But, if users do not want profiles from
certain runs, shouldn't they either (1) pass LLVM_PROFILE_FILE=/dev/null, or
(2) not instrument the binary?

> Also, in general, I am very wary of file locking. This can cause huge
amounts of slowdown for a build and has potential portability problems.
Yeah, these things aren't great, but as long as they are opt-in I think
it'd be reasonable.

Also to be fair, writing out tons of raw profiles to separate files is a lot
slower than overwriting the same raw profile.

> I don't see it as a substantially better solution than wrapping clang
in a script that runs clang and then just calls llvm-profdata to do the merging.
Running llvm-profdata is cheap compared to doing locking in a highly parallel
situation like a build.
^ + 1.

vedant

Sean Silva via llvm-dev

2016-Feb-29 19:55 UTC

head link

[llvm-dev] Add support for in-process profile merging in profile-runtime

On Sun, Feb 28, 2016 at 10:45 AM, Mehdi Amini via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
>
> On Feb 28, 2016, at 12:46 AM, Xinliang David Li via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Justin, looks like there is some misunderstanding in my email. I want to
> clarify it here first:
>
> 1) I am not proposing changing the default profile dumping model as used
> today. The online merging is totally optional;
> 2) the on-line profile merging is not doing conversion from raw to index
> format. It does very simple raw-to-raw merging using existing runtime APIs.
> 3) the change to existing profile runtime code is just a few lines. All
> the new functionality is isolated in one new file. It will become clear
> when the patch is sent out later.
>
> My inline replies below:
>
>
> On Sat, Feb 27, 2016 at 11:44 PM, Justin Bogner <mail at
justinbogner.com>
> wrote:
>
>> Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org>
writes:
>> > One of the main missing features in Clang/LLVM profile runtime is
the
>> lack of
>> > support for online/in-process profile merging support. Profile
data
>> collected
>> > for different workloads for the same executable binary need to be
>> collected
>> > and merged later by the offline post-processing tool.  This
limitation
>> makes
>> > it hard to handle cases where the instrumented binary needs to be
run
>> with
>> > large number of small workloads, possibly in parallel.  For
instance,
>> to do
>> > PGO for clang, we may choose to  build  a large project with the
>> instrumented
>> > Clang binary. This is because
>> >
>> > 1) to avoid profile from different runs from overriding others, %p
>> >    substitution needs to be specified in either the command line
or an
>> >    environment variable so that different process can dump profile
data
>> >    into its own file named using pid.
>>
>> ... or you can specify a more specific name that describes what's
under
>> test, instead of %p.
>>
>
> yes -- but the problem still exists -- each training process will need its
> own copy of raw profile.
>
>
>>
>> >    This will create huge requirement on the disk storage. For
>> >    instance, clang's raw profile size is typically 80M -- if
the
>> >    instrumented clang is used to build a medium to large size
project
>> >    (such as clang itself), profile data can easily use up hundreds
of
>> >    Gig bytes of local storage.
>>
>> This argument is kind of confusing. It says that one profile is
>> typicially 80M, then claims that this uses 100s of GB of data. From
>> these statements that only makes sense I suppose that's true if you
run
>> 1000 profiling runs without merging the data in between. Is that what
>> you're talking about, or did I miss something?
>>
>
> Yes. For instance,  first build a  clang with
> -fprofile-instr-generate=prof.data.%p, and use this instrumented clang to
> build another large project such as clang itself. The second build will
> produce tons of profile data.
>
>
>>
>> > 2) pid can also be recycled. This means that some of the profile
data
>> may be
>> >    overridden without being noticed.
>> >
>> > The way to solve this problem is to allow profile data to be
merged in
>> > process.
>>
>> I'm not convinced. Can you provide some more concrete examples of
where
>> the out of process merging model fails? This was a *very deliberate*
>> design decision in how clang's profiling works, and most of the
>> subsequent decisions have been based on this initial one. Changing it
>> has far reaching effects.
>
>
> I am not proposing changing the out of process merging -- it is still
> needed. What I meant is that, in a scenario where the instrumented binaries
> are running multiple times (using their existing running harness), there is
> no good/automatic way of making sure different process's profile data
won't
> have name conflict.
>
>
> Could the profile file be named from a hash of the profile data themselves?
>
> Even with that I still like the direction you're heading to. To avoid
> contention on the "file locking", could there be a pool of output
files? I
> don't know how to map a new process to one output file without a lock
> somewhere though.
>
We could add something like %42H to the format string, which expands to a
hash of the file contents mod 42. That would allow controlling how many
output files there are.

Still, I think that having a %H (or whatever) which expands to a unique ID
is a good primitive to have in the format string and at least addresses the
issue of clobbering.

-- Sean Silva

>
> --
> Mehdi
>
>
>
> Using clang's self build (using instrumented clang as build compiler
for
> profile bootstrapping) as an example.  Ideally this should all be done
> transparently  -- i.e, set the instrumented compiler as the build compiler,
> run ninja or make and things will just work, but with the current default
> profile dumping mode, it can fail in many different ways:
> 1) Just run ninja/make -- all clang processes will dump profile into the
> same file concurrently -- the result is a corrupted profile -- FAIL
> 2) run ninja with LLVM_PROFILE_FILE=....%p
>    2.1) failure mode #1 --> really slow build due to large IO; or
running
> out of diskspace
>    2.2) failure mode #2 --> pid recyling leading to profile file name
> conflict -- profile overwriting happens and we loss data
>
> Suppose 2) above finally succeeds, the user will have to merge thousands
> of raw profiles to indexed profile.
>
> With the proposed profile on-line merging, you just need to use the
> instrumented clang, and one merged raw profile data automagically produced
> in the end. The raw to indexed merge is also much faster.
>
> The online merge feature has a huge advantage when considering integrating
> the instrumented binary with existing make systems or loadtesting harness
> -- it is almost seamless.
>
>
>
>>
>> > I have a prototype implementation and plan to send it out for
review
>> > soon after some clean ups. By default, the profiling merging is
off and
>> it can
>> > be turned on with an user option or via an environment variable.
The
>> following
>> > summarizes the issues involved in adding this feature:
>> > 1. the target platform needs to have file locking support
>> > 2. there needs an efficient way to identify the profile data and
>> associate it
>> >    with the binary using binary/profdata signature;
>> > 3. Currently without merging, profile data from shared libraries
>> >    (including dlopen/dlcose ones) are concatenated into the
primary
>> >    profile file. This can complicate matters, as the merger also
needs
>> to
>> >    find the matching shared libs, and the merger also needs to
avoid
>> >    unnecessary data movement/copy;
>> > 4. value profile data is variable in length even for the same
binary.
>>
>> If we actually want this, we should reconsider the design of having a
>> raw vs processed profiling format. The raw profile format is
>> specifically designed to be fast to write out and not to consider
>> merging profiles at all. This feature would make it nearly as
>> complicated as the processed format and lose all of the advantages of
>> making them different.
>>
>
> See above -- all the nice raw profile dumping mechanism is still kept --
> there won't be a change of that.
>
> thanks,
>
> David
>
>
>
>>
>> > All the above issues are resolved and clang self build with
instrumented
>> > binary passes (with both j1 and high parallelism).
>> >
>> > If you have any concerns, please let me know.
>>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160229/d1985799/attachment.html>

Justin Bogner via llvm-dev

2016-Feb-29 20:14 UTC

head link

[llvm-dev] Add support for in-process profile merging in profile-runtime

Xinliang David Li <davidxl at google.com> writes:> Justin, looks like there is some misunderstanding in my email. I want to
> clarify it here first:
>
> 1) I am not proposing changing the default profile dumping model as used
> today. The online merging is totally optional;
> 2) the on-line profile merging is not doing conversion from raw to index
> format. It does very simple raw-to-raw merging using existing runtime APIs.
> 3) the change to existing profile runtime code is just a few lines. All the
> new functionality is isolated in one new file. It will become clear when
> the patch is sent out later.
This doesn't sound quite as scary with this explanation. I'll wait for
the patch.

Possibly Parallel Threads

Search for more possibly parallel threads

llvm dev - Feb 2016 - Add support for in-process profile merging in profile-runtime

[llvm-dev] Add support for in-process profile merging in profile-runtime

[llvm-dev] Add support for in-process profile merging in profile-runtime

[llvm-dev] Add support for in-process profile merging in profile-runtime

[llvm-dev] Add support for in-process profile merging in profile-runtime

[llvm-dev] Add support for in-process profile merging in profile-runtime

[llvm-dev] Add support for in-process profile merging in profile-runtime

Possibly Parallel Threads