Xinliang David Li
2014-Apr-18 19:45 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Fri, Apr 18, 2014 at 12:13 AM, Dmitry Vyukov <dvyukov at google.com> wrote:

> Hi,
>
> This is a long thread, so I will combine several comments into a single email.
>
> >> - 8-bit per-thread counters, dumping into central counters on overflow.
> > The overflow will happen very quickly with an 8-bit counter.
>
> Yes, but it reduces contention by 256x (a thread must execute at least 256
> loop iterations between increments). In practice, if you reduce contention
> below some threshold, it no longer represents a problem.
>
> >> - per-thread counters. Solves the problem at huge cost in RAM per thread.
> > It is not practical. Especially for TLS counters -- it creates huge
> > pressure on stack memory.
>
> Do we have any numbers about the number of counters? If there are 100K
> 1-byte counters, I would consider it practical.

A medium-sized app I looked at has about 10M counters (arcs only). It is
also not uncommon to see such apps running with hundreds of threads.

> > In Google GCC, we implemented another technique which proves to be very
> > effective -- it is called FDO sampling.
> > Basically, counters will be updated every N samples.
>
> How does it work?

Similar to how occurrence-based PMU sampling works. Setting the sampling
period to 100 can reduce the instrumentation overhead by close to 100x
without introducing much precision loss.

> >> It seems to me like we're going to have a hard time getting good
> >> multithreaded performance without significant impact on the
> >> single-threaded behavior.
> > I don't really agree.
>
> > We are talking about developers here. Nobody would know the exact thread
> > counts, but developers know the ballpark number.
>
> I strongly believe that we must relieve developers from this choice at
> build time, and do our best to auto-tune (if the final scheme requires
> tuning).

That really depends. If the tuning space is small, it won't be a problem
for the developer/builder.

> First, such questions put an unnecessary burden on developers. We don't
> ask what register allocation algorithm to use for each function, right?

Crazy developers can actually do that via internal options, but that is a
totally different case. People just need one flag to turn sharding on/off.
When sharding is on, the compiler can pick the best 'N' according to some
heuristics at compile time.

> Second, there is a significant chance we will get a wrong answer, because
> e.g. the developer's view of how threaded the app is can differ from
> reality or from our classification.
> Third, the app can be built by a build engineer; or the procedure can be
> applied to a codebase with 10'000 apps; or the threaded-ness can change;
> or the app can work in different modes; or the app can have different
> phases.

We have forgotten to mention the benefit of implementation simplicity. If
the static/simple solution solves the problem for most of the use cases,
designing a fancy dynamic solution sounds like over-engineering to me. It
(the overhead reduction technique) may also get in the way of easier
functional enhancement in the future.

David
Kostya Serebryany
2014-Apr-23 14:31 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
I've run one proprietary benchmark that reflects a large portion of
Google's server-side code. -fprofile-instr-generate leads to a 14x slowdown
due to counter contention. That's serious. Admittedly, there is a single
hot function that accounts for half of that slowdown, but even if I rebuild
that function w/o -fprofile-instr-generate, the slowdown remains above 5x.
This is not toy code that I've written to prove my point -- this is real
code one may want to profile with -fprofile-instr-generate.
We need another approach for threaded code.

There is another ungood feature of the current instrumentation. Consider
this function:
  std::vector<int> v(1000);
  void foo() { v[0] = 42; }

Here we have a single basic block and a call, but since the coverage
instrumentation is emitted by the FE before inlining (and is also emitted
for the std::vector methods), we get this assembler at -O2:

0000000000400b90 <_Z3foov>:
  400b90: 48 ff 05 11 25 20 00  incq 0x202511(%rip)      # 6030a8 <__llvm_profile_counters__Z3foov>
  400b97: 48 ff 05 42 25 20 00  incq 0x202542(%rip)      # 6030e0 <__llvm_profile_counters__ZNSt6vectorIiSaIiEEixEm>
  400b9e: 48 8b 05 4b 26 20 00  mov  0x20264b(%rip),%rax # 6031f0 <v>
  400ba5: c7 00 2a 00 00 00     movl $0x2a,(%rax)
  400bab: c3                    retq

Suddenly, an innocent function that uses std::vector becomes a terrible
point of contention. Full test case below; -fprofile-instr-generate leads
to a 10x slowdown.

========================

Now, here is a more detailed proposal for the logarithmic self-cooling
counter mentioned before. Please comment.
The counter is a number of the form (2^k-1).
It starts at 0.
After the first update it is 1.
After *approximately* 1 more update it becomes 3.
After *approximately* 2 more updates it becomes 7.
After *approximately* 4 more updates it becomes 15.
...
After *approximately* 2^k more updates it becomes 2^(k+2)-1.

The code would look like this:
  if ((fast_thread_local_rand() & counter) == 0)
    counter = 2 * counter + 1;

A possible implementation of fast_thread_local_rand:
  long fast_thread_local_rand() {
    static __thread long r;
    return r++;
  }
Although I would try to find something cheaper than this. (Ideas?)

The counter is not precise (well, the current racy counters are not precise
either). But statistically it should be no more than 2x away from the real
counter. Will this accuracy be enough for the major use cases?

Moreover, this approach allows the counter increment to be implemented
using a callback:
  if ((fast_thread_local_rand() & counter) == 0) __cov_increment(&counter);
which in turn will let us use the same hack as in AsanCoverage: use the PC
to map the counter to the source code (= no need to create a separate
section in the objects).

Thoughts?

--kcc

% clang++ -O2 -lpthread coverage_mt_vec.cc && time ./a.out
TIME: real: 0.219; user: 0.430; system: 0.000
% clang++ -O2 -lpthread -fprofile-instr-generate coverage_mt_vec.cc && time ./a.out
TIME: real: 3.743; user: 7.280; system: 0.000

% cat coverage_mt_vec.cc
#include <pthread.h>
#include <vector>

std::vector<int> v(1000);

__attribute__((noinline)) void foo() { v[0] = 42; }

void *Thread1(void *) {
  for (int i = 0; i < 100000000; i++)
    foo();
  return 0;
}

__attribute__((noinline)) void bar() { v[999] = 66; }

void *Thread2(void *) {
  for (int i = 0; i < 100000000; i++)
    bar();
  return 0;
}

int main() {
  static const int kNumThreads = 16;
  pthread_t t[kNumThreads];
  pthread_create(&t[0], 0, Thread1, 0);
  pthread_create(&t[1], 0, Thread2, 0);
  pthread_join(t[0], 0);
  pthread_join(t[1], 0);
  return 0;
}

On Fri, Apr 18, 2014 at 11:45 PM, Xinliang David Li <xinliangli at gmail.com> wrote:

> [...]
Xinliang David Li
2014-Apr-23 16:13 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
This is very neat. The concern is that for a long-running process, the
precision loss can be huge. Run-to-run counter variance can also be very
large (without atomic updates).

> Moreover, this approach allows the counter increment to be implemented
> using a callback:
>   if ((fast_thread_local_rand() & counter) == 0) __cov_increment(&counter);
> which in turn will let us use the same hack as in AsanCoverage: use the PC
> to map the counter to the source code.
> (= no need to create a separate section in the objects.)

A similar callback mechanism (GCC based) is used in Google for testing
coverage.

David

> Thoughts?
>
> --kcc
>
> [...]
Bob Wilson
2014-Apr-23 18:48 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Apr 23, 2014, at 7:31 AM, Kostya Serebryany <kcc at google.com> wrote:

> [...]
>
> Thoughts?
>
> --kcc

I can see that the behavior of our current instrumentation is going to be
a problem for the kinds of applications that you're looking at. If you can
find a way to get the overhead down without losing accuracy and without
hurting the performance of applications without significant contention,
then we can just adopt that. If you can't do it without tradeoffs, then we
should have a separate option to let those who care switch between
different kinds of instrumentation.

Using the PC to map to the source code is simply not going to work with
-fprofile-instr-generate. The mapping from counter values to the
user-visible execution counts is complex and relies on detailed knowledge
of the clang ASTs.
Duncan P. N. Exon Smith
2014-Apr-25 17:11 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On 2014-Apr-23, at 7:31, Kostya Serebryany <kcc at google.com> wrote:

> [...]
> Suddenly, an innocent function that uses std::vector becomes a terrible
> point of contention.

I know you're just using std::vector<> as an example, but I think there
should be an option to avoid instrumenting the STL. STL implementations
have typically been hand-tuned already. I think the benefit of PGO is
extremely small there (although I have no numbers to back this up).

> [...]
> Although I would try to find something cheaper than this. (Ideas?)

Very cool.

> The counter is not precise (well, the current racy counters are not
> precise either).
> But statistically it should be no more than 2x away from the real counter.
> Will this accuracy be enough for the major use cases?

I really like this idea. I think the loss of accuracy should be opt-in
(-fprofile-instr-generate=fuzzy), but I think this should be available. I
think this could be cleanly integrated into -fprofile-instr-generate
without changing -fprofile-instr-use, compiler-rt, or llvm-profdata.
Having this option looks like a clear win to me.

> Moreover, this approach allows the counter increment to be implemented
> using a callback:
>   if ((fast_thread_local_rand() & counter) == 0) __cov_increment(&counter);
> which in turn will let us use the same hack as in AsanCoverage: use the PC
> to map the counter to the source code.
> (= no need to create a separate section in the objects.)

As Bob said already, the PC hack doesn't work for the frontend-based
instrumentation. I'm skeptical even about relying on debug info for
retrieving function names. But optimizing the binary size is a separate
topic anyway.