Vedant Kumar via llvm-dev
2016-Mar-11 03:21 UTC
[llvm-dev] RFC: Pass to prune redundant profiling instrumentation
Hi,

I'd like to add a new pass to LLVM which removes redundant profile counter
updates. The goal is to speed up code coverage testing and profile generation
for PGO.

I'm sending this email out to describe my approach, share some early results,
and gather feedback.


Problem Overview
================

A profile counter is redundant if it's incremented in exactly the same basic
blocks as some other profile counter. Consider the following module:

    local void f1() {
      instrprof_increment(profc_f1);
    }

    void f2() {
      instrprof_increment(profc_f2);
      f1();
    }

Once the inliner runs and deletes f1, we're left with:

    void f2() {
      instrprof_increment(profc_f2);
      instrprof_increment(profc_f1);
    }

Now we can say profc_f1 is redundant (or, an alias for profc_f2).

I've noticed that frontend-based instrumentation can generate many redundant
profile counters. This degrades performance and increases code size. We can
address the problem by getting rid of redundant counter updates. The trick is
to make sure we get back the same profiles.


Proposed Solution
=================

I propose a pruning pass which takes the following steps:

  1. Delete functions with local linkage and only one use, if that use is in
     a profile data record.

     These functions are left around by the inliner (it doesn't know that
     they're safe to delete). Deleting them reduces code size and simplifies
     subsequent analysis of profile counters.

  2. Determine which profile counters are essential.

  3. Erase all updates to redundant profile counters.

  4. Emit the aliases into a new section in the binary.

     Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some changes
     in compiler-rt are required to walk the alias section and fill in the
     correct execution counts at program exit time. (A sketch of this runtime
     step follows this message.)

This pass needs to be run after the inliner in order to be effective.

The complexity of this pass is O(N*M), where N is the number of profile
counters, and M is the average number of updates per counter. In practice it
is a bit faster, since we can skip the analysis of counters which are
discovered to be redundant early on in the process.


Early Results
=============

The pruning pass results in a 25% speed improvement in the example program
above (where f2 is called in a loop 10^8 times).

Here is a slightly less contrived example:

    #include <vector>
    #include <algorithm>
    #include <cstdlib>

    static void escape(void *p) {
      asm volatile("" : : "g"(p) : "memory");
    }

    int main(int argc, char **argv) {
      std::vector<int> V(atoi(argv[1]));
      escape(reinterpret_cast<void *>(V.data()));
      std::sort(V.begin(), V.end());
      return V[0];
    }

I get the following results on my desktop (10^8 elements, 5 runs each):

    O3:                           0.262s
    O3 + PGOInstr:                0.663s
    O3 + PGOInstr + Pruning:      0.606s (8.6% performance win, 672 aliases)
    O3 + CoverageInstr:           0.690s
    O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, 688 aliases)


Next Steps?
===========

Is the performance of instrumented code something we think we need to fix?
What's an acceptable compile-time overhead for running this pruning pass? Is
the general approach a non-starter for anybody?

I'd like to get some feedback and gauge interest before pursuing this further.
Possible next steps include benchmarking instrumented versions of clang and
swift on the relevant test suites, running performance tests from lnt, running
compile-time tests, and measuring any code size differences.

thanks
vedant
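To make step 4 of the proposal concrete, here is a minimal sketch of the
runtime side. The section-bounds symbols, the exit-hook name, and the
direction of the copy are assumptions for illustration, not the actual
compiler-rt interface:

    // Sketch only: fold each essential counter's final value into the
    // redundant counter whose updates were pruned, so the emitted profile
    // is unchanged.
    #include <cstdint>

    struct CounterAlias {
      uint64_t *Dst; // redundant counter: receives the final count
      uint64_t *Src; // essential counter: its updates were kept
    };

    // Hypothetical linker-synthesized bounds of the __llvm_prf_alias section.
    extern CounterAlias __start___llvm_prf_alias[];
    extern CounterAlias __stop___llvm_prf_alias[];

    static void fixupAliasedCounters() {
      for (CounterAlias *A = __start___llvm_prf_alias;
           A != __stop___llvm_prf_alias; ++A)
        *A->Dst = *A->Src;
    }

The runtime would call something like fixupAliasedCounters() before writing
out the profile at program exit.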
Mehdi Amini via llvm-dev
2016-Mar-11 03:48 UTC
[llvm-dev] RFC: Pass to prune redundant profiling instrumentation
> On Mar 10, 2016, at 7:21 PM, Vedant Kumar via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> I propose a pruning pass which takes the following steps:
>
> 1. Delete functions with local linkage and only one use, if that use is in
>    a profile data record.
>
>    These functions are left around by the inliner (it doesn't know that
>    they're safe to delete). Deleting them reduces code size and simplifies
>    subsequent analysis of profile counters.

I think that this is something global-DCE should be taught about (if it does
not know already).

> 2. Determine which profile counters are essential.

What is an "essential counter"?

> 3. Erase all updates to redundant profile counters.

When you are saying "erase", you need to actually replace the multiple
counter increments with *a new* counter, right? I.e. when replacing:

    instrprof_increment(profc_f2);
    instrprof_increment(profc_f1);

you need to emit:

    instrprof_increment(profc_f2f1_fused);

> 4. Emit the aliases into a new section in the binary.
>
>    Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some changes
>    in compiler-rt are required to walk the alias section and fill in the
>    correct execution counts at program exit time.
>
> This pass needs to be run after the inliner in order to be effective.
>
> The complexity of this pass is O(N*M), where N is the number of profile
> counters, and M is the average number of updates per counter.

What you're describing here is not clear to me. But I may have been totally
misunderstanding the whole concept :)

--
Mehdi
Sean Silva via llvm-dev
2016-Mar-11 04:33 UTC
[llvm-dev] RFC: Pass to prune redundant profiling instrumentation
On Thu, Mar 10, 2016 at 7:21 PM, Vedant Kumar via llvm-dev
<llvm-dev at lists.llvm.org> wrote:

> Hi,
>
> I'd like to add a new pass to LLVM which removes redundant profile counter
> updates. The goal is to speed up code coverage testing and profile
> generation for PGO.

We may want to have a focused discussion about this goal, rather than a
particular suggestion. There are a lot of things we can do. Off the top of my
head, some are:

1. add some sort of alias annotation (such as an independent TBAA root node)
   to all the counter increment memory instructions to tell the optimizer
   they don't interfere with the usual loads and stores.

2. communicate to the optimizer that counters can be registerized. In a loop
   like:

       for (int i = 0; i < N; i++) {
         if (foo())
           bar();
         else
           baz();
       }

   we perform O(N) counter increments (i.e. load, increment, store) last I
   checked. However, if the counters are in registers, then we only perform
   O(1) memory operations. This can dramatically reduce the pressure on the
   CPU's load/store units and also relieve cross-core cache line ping-pong
   when two cores are executing the same code. Note that the latter benefit
   is attained even if we ultimately end up spilling the counters due to
   increased register pressure. (A sketch of this transformation follows
   this message.)

I actually don't know what is preventing the usual optimization pipeline from
getting 2 right.

> The complexity of this pass is O(N*M), where N is the number of profile
> counters, and M is the average number of updates per counter. In practice
> it is a bit faster, since we can skip the analysis of counters which are
> discovered to be redundant early on in the process.

I think a conceptually simpler design is something like:

    for each CFG edge:
      record which FE counters have ended up associated with it
    remove FE counters
    run IR instrumentation pass
    emit a side table mapping IR instr counters to FE counters (more
    generally: how to reconstruct FE counters from the IR counters)

The problem is simply reduced to the IR instrumentation pass.

> Is the performance of instrumented code something we think we need to fix?

With frontend instrumentation it definitely is.

> What's an acceptable compile-time overhead for running this pruning pass?

Instrumented builds are "special" anyway, so a fair slowdown is probably
acceptable; i.e. this doesn't affect regular developer compilation times. As
a baseline, maybe compare compile time of `-fprofile-instr-generate -O2` vs.
`-O2`. If `-fprofile-instr-generate -O2` is 10% slower than the control, then
that indicates that a 5% extra slowdown is probably reasonable for a
substantial reduction in profiling overhead (which can result in a
qualitative improvement to the actual usability of the instrumented program).

-- Sean Silva
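For point 2, here is a hand-written sketch of the registerized form the
optimizer could in principle produce. The profc_main array and its indices
are illustrative assumptions; this shows the transformation in source terms,
not actual compiler output:

    #include <cstdint>

    extern uint64_t profc_main[3]; // illustrative FE counter array
    bool foo(); void bar(); void baz();

    void loop(int N) {
      // Counters live in locals inside the loop and are flushed with O(1)
      // memory operations afterwards, instead of a load-increment-store
      // per iteration.
      uint64_t ThenCount = 0, ElseCount = 0;
      for (int i = 0; i < N; i++) {
        if (foo()) {
          ++ThenCount;
          bar();
        } else {
          ++ElseCount;
          baz();
        }
      }
      profc_main[1] += ThenCount;
      profc_main[2] += ElseCount;
    }

Note that this is only sound if the calls are known not to read or modify the
counters; that is the mod-ref issue David raises later in the thread.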
Daniel Berlin via llvm-dev
2016-Mar-11 04:42 UTC
[llvm-dev] RFC: Pass to prune redundant profiling instrumentation
>>> These functions are left around by the inliner (it doesn't know that
>>> they're safe to delete). Deleting them reduces code size and simplifies
>>> subsequent analysis of profile counters.
>>
>> I think that this is something global-DCE should be taught about (if it
>> does not know already).

What attribute do you think we could use to do that?

>>> 4. Emit the aliases into a new section in the binary.
>>>
>>>    Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some
>>>    changes in compiler-rt are required to walk the alias section and
>>>    fill in the correct execution counts at program exit time.
>>>
>>> This pass needs to be run after the inliner in order to be effective.
>>>
>>> The complexity of this pass is O(N*M), where N is the number of profile
>>> counters, and M is the average number of updates per counter.
>>
>> What you're describing here is not clear to me. But I may have been
>> totally misunderstanding the whole concept :)

Also, describing the complexity like this is strange. It should be just O(N),
where N is the number of instructions.
Xinliang David Li via llvm-dev
2016-Mar-11 05:34 UTC
[llvm-dev] RFC: Pass to prune redundant profiling instrumentation
Vedant, are you concerned about instrumentation run performance for PGO or
for coverage testing? For coverage testing, is coverage information enough,
or is count information also needed? Depending on the answers to these
questions, the solution may be very different.

If the answer is PGO, then a much better long-term solution is to migrate to
IR based instrumentation. Not only does IR based instrumentation place
counter updates 'optimally'; the early optimization (including pre-inlining
of tiny funcs) is very effective in reducing instrumentation overhead, plus
there is the benefit of better profile quality due to context sensitivity.
For more details and performance numbers, see Rong's RFC about late
instrumentation.

If the answer is PGO, but for various reasons the FE based instrumentation
has to be used, then I think there is another, more effective way previously
suggested by Sean. Basically you can skip single-BB functions completely
during instrumentation. There are various reasons why profile data for a
single-BB function is not useful:

  1) they are usually small and beneficial to inline regardless of profile
     data
  2) the BBs of the inline instances can get profile data from the caller
     context
  3) the profile data for the out-of-line single-BB func is also useless

Rong has some data showing the effectiveness of this method -- not as good as
the pre-optimization approach, but IIRC also very good.

If the answer is coverage, but the actual count value does not really matter,
then a more effective way of reducing overhead is to teach the
instrumentation lowering pass to lower the counter update into a simple
store:

    counter_1 = 1;

Such stores from the inline instances of the same func can be easily
eliminated. It will also help multithreaded programs a lot when there is
heavy counter contention. Another benefit is that the size of the counter can
be effectively reduced to 1 byte instead of 8 bytes. (A before/after sketch
of this lowering follows this message.)

The tricky situation is when you want coverage data but count per line is
also important -- but I wonder whether having a special pass just to handle
this scenario is worth the effort.

Also, a side note: I think longer term we should unify the three
instrumentation mechanisms into one: FE based, IR based, and the old gcda
instrumentation. IR based is the most efficient implementation -- when
combined with the gcda profiling runtime, it can be used to replace the
current gcda profiling, which is not efficient. It is also possible to use IR
based instrumentation for coverage mapping (with the coverage map format
mostly unchanged), but the main challenge is passing source info etc.

thanks,

David

On Thu, Mar 10, 2016 at 7:21 PM, Vedant Kumar via llvm-dev
<llvm-dev at lists.llvm.org> wrote:

> I'm sending this email out to describe my approach, share some early
> results, and gather feedback.

From your example, it seems only one scenario is handled -- local function
with a single callsite? This seems to be quite narrow in scope. Before
pursuing further, it is better to measure the impact of this on larger
benchmarks.
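A before/after sketch of the coverage-only lowering described above, with
illustrative counter names:

    #include <cstdint>

    extern uint64_t profc_f1;   // illustrative 8-byte count counter
    extern uint8_t  profcov_f1; // illustrative 1-byte coverage flag

    void instrumentationPoint() {
      // Generic lowering of a counter update: a read-modify-write on an
      // 8-byte counter.
      profc_f1 = profc_f1 + 1;

      // Coverage-only lowering: a blind store of a constant. Duplicate
      // stores from inline instances are trivially redundant and easy to
      // eliminate, there is no read-modify-write contention across threads,
      // and the counter can shrink to a single byte.
      profcov_f1 = 1;
    }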
Xinliang David Li via llvm-dev
2016-Mar-11 05:42 UTC
[llvm-dev] RFC: Pass to prune redundant profiling instrumentation
On Thu, Mar 10, 2016 at 8:33 PM, Sean Silva via llvm-dev
<llvm-dev at lists.llvm.org> wrote:

> 2. communicate to the optimizer that counters can be registerized. In a
>    loop like:
>
>        for (int i = 0; i < N; i++) {
>          if (foo())
>            bar();
>          else
>            baz();
>        }
>
>    we perform O(N) counter increments (i.e. load, increment, store) last I
>    checked. However, if the counters are in registers, then we only perform
>    O(1) memory operations.
>
> I actually don't know what is preventing the usual optimization pipeline
> from getting 2 right.

Call mod-ref. We need to teach the optimizer that the counter owned by the
current function (if the function is proved to be non-recursive in some way)
cannot be modified by any other calls.

David
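A minimal example of the mod-ref obstacle, with an illustrative counter name:

    #include <cstdint>

    extern uint64_t profc_f; // counter owned by f's instrumentation
    void callee();

    void f() {
      for (int i = 0; i < 1000; i++) {
        // Without mod-ref information, the optimizer must assume callee()
        // may load or store profc_f (it is a visible global), so the
        // counter cannot be promoted to a register across the call.
        profc_f++;
        callee();
      }
      // If the optimizer knew callee() cannot touch profc_f (e.g. f is not
      // recursive and the counter is private to f), it could instead emit:
      //   profc_f += 1000;
    }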
Justin Bogner via llvm-dev
2016-Mar-11 19:17 UTC
[llvm-dev] RFC: Pass to prune redundant profiling instrumentation
Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org> writes:

> If the answer is coverage, but the actual count value does not really
> matter, then a more effective way of reducing overhead is to teach the
> instrumentation lowering pass to lower the counter update into a simple
> store:
>
>     counter_1 = 1;

This won't work. The FE based instrumentation has implicit counters for
various AST constructs that can only be worked out by doing math with the
explicit counters. If they don't have actual counts, this doesn't work.
(See the example after this message.)

> Such stores from the inline instances of the same func can be easily
> eliminated. It will also help multithreaded programs a lot when there is
> heavy counter contention. Another benefit is that the size of the counter
> can be effectively reduced to 1 byte instead of 8 bytes.
>
> The tricky situation is when you want coverage data but count per line is
> also important -- but I wonder whether having a special pass just to handle
> this scenario is worth the effort.

Isn't "coverage with count-per-line" the whole point of a coverage feature?

It's also convenient to gather data for PGO and coverage in the same run,
rather than having to build instrumented binaries twice and run the same
tests/training runs twice. Performance of the FE based instrumentation is
pretty important for productivity reasons.

> Also, a side note: I think longer term we should unify the three
> instrumentation mechanisms into one: FE based, IR based, and the old gcda
> instrumentation. IR based is the most efficient implementation -- when
> combined with the gcda profiling runtime, it can be used to replace the
> current gcda profiling, which is not efficient. It is also possible to use
> IR based instrumentation for coverage mapping (with the coverage map format
> mostly unchanged), but the main challenge is passing source info etc.

I don't disagree, but I think that getting the same quality of coverage data
from the IR based profiles as we do from the FE ones is a fairly large
undertaking.
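An illustrative example of the implicit-counter arithmetic: the counter
placement below mirrors the general FE scheme but is an assumption, not
copied from clang's actual lowering:

    #include <cstdint>

    extern uint64_t profc[2]; // illustrative: C0 = entry, C1 = then-branch

    void g(bool cond) {
      profc[0]++;        // C0: explicit counter for the enclosing region
      if (cond) {
        profc[1]++;      // C1: explicit counter for the then-branch
      } else {
        // No explicit counter here: the FE reconstructs
        //   count(else) = C0 - C1
        // If counters become 0/1 flags, C0 - C1 is 0 whenever both branches
        // ran at least once, so the else-branch would be misreported as
        // unexecuted.
      }
    }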
Justin Bogner via llvm-dev
2016-Mar-11 19:27 UTC
[llvm-dev] RFC: Pass to prune redundant profiling instrumentation
Sean Silva via llvm-dev <llvm-dev at lists.llvm.org> writes:

> We may want to have a focused discussion about this goal, rather than a
> particular suggestion. There are a lot of things we can do. Off the top of
> my head, some are:
>
> 1. add some sort of alias annotation (such as an independent TBAA root
>    node) to all the counter increment memory instructions to tell the
>    optimizer they don't interfere with the usual loads and stores.
>
> 2. communicate to the optimizer that counters can be registerized.
>
> I actually don't know what is preventing the usual optimization pipeline
> from getting 2 right.

Yes, we definitely want to teach optimizations to be clever about
instrumentation counters. This kind of thing should help both the FE and IR
based instrumentation.

I think it's somewhat orthogonal to Vedant's approach here, though - even if
we can hoist increments out of loops or do the work in registers some of the
time, reducing counts that are known to be identical to one memory location
should still bring a fair bit of value.
Vedant Kumar via llvm-dev
2016-Mar-11 20:47 UTC
[llvm-dev] RFC: Pass to prune redundant profiling instrumentation
There have been a lot of responses. I'll try to summarize the thread and
respond to some of the questions/feedback.

Summary
=======

1. We should teach GlobalDCE to strip out instrumented functions which the
   inliner cannot delete.

2. Sean suggests adding metadata to loads/stores of counters to convey that
   they do not alias normal program data. I'm not familiar with AA, but
   off-hand this seems like a good idea.

3. Sean also suggests improving the optimizer to registerize counters when
   possible. Ditto, seems like a good idea.

4. Sean has an alternate proposal for counter pruning which works by
   remapping FE-generated counters to counters generated by the IR
   instrumentation pass. David likes this proposal. I have some questions
   about this but am willing to try it. IMO, the pruning pass I've suggested
   could complement this nicely -- I don't view them as mutually exclusive
   (though I could be wrong).

5. There seems to be consensus over the need to improve performance of
   FE-instrumented programs.

6. David is concerned that the proposed pruning pass is too narrow in scope,
   and would like to see results on a larger benchmark. I'll try to get some
   numbers.

7. David mentioned some optimizations that are possible if we make coverage
   information "yes or no" (per line). Justin pointed out that this doesn't
   work for a few of Apple's use cases.


Clarifications
==============

1. In my original email, two of my numbers were incorrect. I listed the sizes
   of the __llvm_prf_alias sections, not the number of actual aliases. The
   corrected lines are:

     O3 + PGOInstr + Pruning:      0.606s (8.6% performance win, _42_ aliases)
     O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, _44_ aliases)

2.

>> Determine which profile counters are essential.
>
> What is an "essential counter"?

The pass I proposed works like this:

    for (PC : ProfileCounters) {
      if (PC is already marked "redundant")
        continue;
      for (Other-PC : ProfileCounters other than PC) {
        if (Update-Sites-Of(PC) == Update-Sites-Of(Other-PC)) {
          mark PC as "essential"
          mark Other-PC as "redundant" (i.e., an alias of PC)
        }
      }
    }

where Update-Sites-Of(PC) is the set of basic blocks in which PC is updated.
If there are multiple counters updated in each update site for PC, they are
all marked as aliases in one pass. (A slightly more concrete sketch follows
this message.)

3.

>> Erase all updates to redundant profile counters.
>
> When you are saying "erase", you need to actually replace the multiple
> counter increments with *a new* counter, right? I.e. when replacing:
>
>     instrprof_increment(profc_f2);
>     instrprof_increment(profc_f1);
>
> you need to emit:
>
>     instrprof_increment(profc_f2f1_fused);

No. The pass only erases updates to counters which it can prove are truly
redundant. There is no need to create a new (or fused) counter, because
profc_f1 == profc_f2 in all possible profiles.

4.

>> The complexity of this pass is O(N*M), where N is the number of profile
>> counters, and M is the average number of updates per counter.
>
> What you're describing here is not clear to me. But I may have been totally
> misunderstanding the whole concept :)

Hopefully the pseudo-code helps? Alternatively, I can put a WIP patch up on
Phab.

> It should be just O(N) where N is the number of instructions.

The complexity of the pass is not upper-bounded by the number of instructions
in the program, because certain loads and stores can be visited more than
once.
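Here is a slightly more concrete C++ sketch of the marking step. The types
and helpers (Counter, BasicBlock, updateSitesOf) are stand-ins for
illustration, not the actual patch; this variant groups counters by their
update-site set instead of comparing pairs, which is equivalent for marking
purposes:

    #include <map>
    #include <set>
    #include <vector>

    struct BasicBlock; // stand-in for llvm::BasicBlock
    struct Counter;    // stand-in for a profile counter global

    using UpdateSites = std::set<const BasicBlock *>;

    // Assumed helper: the set of blocks in which C is updated.
    UpdateSites updateSitesOf(Counter *C);

    // Returns a map from each redundant counter to the essential counter it
    // aliases. Counters sharing an identical update-site set always hold
    // equal values, so only one counter per group needs real updates.
    std::map<Counter *, Counter *>
    markAliases(const std::vector<Counter *> &Counters) {
      std::map<UpdateSites, Counter *> EssentialBySites;
      std::map<Counter *, Counter *> AliasOf;
      for (Counter *C : Counters) {
        auto [It, Inserted] =
            EssentialBySites.try_emplace(updateSitesOf(C), C);
        if (!Inserted)
          AliasOf[C] = It->second; // C is redundant: alias the essential one
      }
      return AliasOf;
    }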
FE to IR Counter Remapping
==========================

I have a question about this plan:

> for each CFG edge:
>   record which FE counters have ended up associated with it
> remove FE counters
> run IR instrumentation pass
> emit a side table mapping IR instr counters to FE counters

Currently, -instrprof happens early in the pipeline. IIUC, this is done to
allow the optimizer to work with load+add+stores instead of profile update
intrinsics.

Say we introduce a counter remapping pass like the one Sean suggested. It
should be run before -instrprof so that we don't waste time lowering a bunch
of instrprof_increment intrinsics which we'll have to throw away later. But
that means that the CFGs that the counter remapping pass operates on won't
reflect changes made by the inliner (or any other optimizations which alter
the CFG), right?

ISTM the pruning pass I've proposed is useful whether we're doing FE-based
instrumentation _or_ late instrumentation. Since it operates on loads+stores
directly, it can clean up redundant counter increments at any point in the
pipeline (after -instrprof).

vedant