Bob Wilson
2014-Apr-25 16:44 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Apr 24, 2014, at 1:33 AM, Dmitry Vyukov <dvyukov at google.com> wrote:

>> I can see that the behavior of our current instrumentation is going to be a
>> problem for the kinds of applications that you’re looking at. If you can
>> find a way to get the overhead down without losing accuracy
>
> What are your requirements for accuracy?
> The current implementation does not provide 100% accuracy, so it's
> something less than 100%. What is it?
> What use cases for numeric counters (as opposed to a bool flag) do we
> need to support? Is it only feedback-driven optimizations?

That’s a fair point. The current implementation is potentially inaccurate
because the counter increments are not thread-safe. In a low-contention
situation, that won’t matter much, but the counts could become quite
inaccurate if there are multiple threads running the same code at the same
time.

I don’t have a specific goal in mind right now for accuracy. We plan to use
this instrumentation for both PGO and code coverage. Some coverage users only
care about a boolean check, but others want to see the actual execution
counts. If we display count values that are significantly different from the
real execution counts, that will lead to a lot of confusion.

>> and without hurting the performance for applications without significant
>> contention,
>
> What is the acceptable threshold? 0.01% would be fine, right? What is
> the maximum value that you are ready to agree with?

Like I said, I don’t have a specific value in mind. My sense is that most of
the applications we care about are quite different from the massively
parallel code that Google cares about. We may have many threads, but they’re
all doing different things, and contention is much less likely to be a
problem. I really think we need to see specifics before we can decide
anything here.

>> then we can just adopt that. If you can’t do it without tradeoffs, then we
>> should have a separate option to let those who care switch between
>> different kinds of instrumentation.
>>
>> Using the PC to map to the source code is simply not going to work with
>> -fprofile-instr-generate. The mapping from counter values to the
>> user-visible execution counts is complex and relies on detailed knowledge
>> of the clang ASTs.
>>
>> % clang++ -O2 -lpthread coverage_mt_vec.cc && time ./a.out
>> TIME: real: 0.219; user: 0.430; system: 0.000
>> % clang++ -O2 -lpthread -fprofile-instr-generate coverage_mt_vec.cc && time ./a.out
>> TIME: real: 3.743; user: 7.280; system: 0.000
>>
>> % cat coverage_mt_vec.cc
>> #include <pthread.h>
>> #include <vector>
>>
>> std::vector<int> v(1000);
>>
>> __attribute__((noinline)) void foo() { v[0] = 42; }
>>
>> void *Thread1(void *) {
>>   for (int i = 0; i < 100000000; i++)
>>     foo();
>>   return 0;
>> }
>>
>> __attribute__((noinline)) void bar() { v[999] = 66; }
>>
>> void *Thread2(void *) {
>>   for (int i = 0; i < 100000000; i++)
>>     bar();
>>   return 0;
>> }
>>
>> int main() {
>>   static const int kNumThreads = 16;
>>   pthread_t t[kNumThreads];
>>   pthread_create(&t[0], 0, Thread1, 0);
>>   pthread_create(&t[1], 0, Thread2, 0);
>>   pthread_join(t[0], 0);
>>   pthread_join(t[1], 0);
>>   return 0;
>> }
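[The slowdown in the benchmark above comes from contention on shared
counters, and the accuracy problem Bob describes comes from those same
increments being non-atomic: two threads can read-modify-write the same
counter and lose updates. A minimal standalone illustration of the racy
versus atomic variants -- hypothetical hand-written code, not the actual
instrumentation, which is emitted as LLVM IR by the compiler:

// build (assumed): clang++ -O2 -pthread racy_vs_atomic.cc
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>

uint64_t racy_counter = 0;                // plain increment, today's behavior
std::atomic<uint64_t> atomic_counter{0};  // thread-safe alternative

int main() {
  auto work = [] {
    for (int i = 0; i < 1000000; i++) {
      racy_counter++;  // non-atomic: updates can be lost under contention
      atomic_counter.fetch_add(1, std::memory_order_relaxed);  // never lost
    }
  };
  std::thread t1(work), t2(work);
  t1.join();
  t2.join();
  // The atomic counter is exactly 2000000; the racy one is typically lower.
  printf("racy=%llu atomic=%llu\n",
         (unsigned long long)racy_counter,
         (unsigned long long)atomic_counter.load());
  return 0;
}
]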
>>
>> On Fri, Apr 18, 2014 at 11:45 PM, Xinliang David Li <xinliangli at gmail.com> wrote:
>>>
>>> On Fri, Apr 18, 2014 at 12:13 AM, Dmitry Vyukov <dvyukov at google.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> This is a long thread, so I will combine several comments into a single
>>>> email.
>>>>
>>>>>> - 8-bit per-thread counters, dumping into central counters on
>>>>>> overflow.
>>>>> The overflow will happen very quickly with an 8-bit counter.
>>>>
>>>> Yes, but it reduces contention by 256x (a thread must execute at least
>>>> 256 loop iterations between increments). In practice, if you reduce
>>>> contention below some threshold, it no longer represents a problem.
>>>>
>>>>>> - per-thread counters. Solves the problem at huge cost in RAM
>>>>>> per thread.
>>>>> It is not practical. Especially for TLS counters -- it creates huge
>>>>> pressure on stack memory.
>>>>
>>>> Do we have any numbers about the number of counters? If there are 100K
>>>> 1-byte counters, I would consider it practical.
>>>
>>> A medium-sized app I looked at has about 10M counters (arcs only). It is
>>> also not uncommon to see such apps running with hundreds of threads.
>>>
>>>>> In Google GCC, we implemented another technique which proves to be very
>>>>> effective -- it is called FDO sampling.
>>>>> Basically, counters will be updated every N samples.
>>>>
>>>> How does it work?
>>>
>>> Similar to how occurrence-based PMU sampling works. Setting the sampling
>>> period to 100 can reduce the instrumentation overhead by close to 100x
>>> without introducing much precision loss.
>>>
>>>>>> It seems to me like we’re going to have a hard time getting good
>>>>>> multithreaded performance without significant impact on the
>>>>>> single-threaded behavior.
>>>>> I don't really agree.
>>>>
>>>>> We are talking about developers here. Nobody would know the exact
>>>>> thread counts, but developers know the ballpark number.
>>>>
>>>> I strongly believe that we must relieve developers of this choice at
>>>> build time, and do our best to auto-tune (if the final scheme requires
>>>> tuning).
>>>
>>> That really depends. If the tuning space is small, it won't be a problem
>>> for the developer/builder.
>>>
>>>> First, such questions put an unnecessary burden on developers. We don't
>>>> ask what register allocation algorithm to use for each function, right?
>>>
>>> Crazy developers can actually do that via internal options, but this is a
>>> totally different case. People just need one flag to turn sharding
>>> on/off. When sharding is on, the compiler can pick the best 'N' according
>>> to some heuristics at compile time.
>>>
>>>> Second, there is a significant chance we will get a wrong answer,
>>>> because e.g. the developer's view of how threaded the app is can differ
>>>> from reality or from our classification.
>>>> Third, the app can be built by a build engineer; or the procedure can be
>>>> applied to a base of 10'000 apps; or the threaded-ness can change; or
>>>> the app can work in different modes; or the app can have different
>>>> phases.
>>>
>>> We have forgotten to mention the benefit of implementation simplicity. If
>>> the static/simple solution solves the problem for most of the use cases,
>>> designing a fancy dynamic solution sounds like over-engineering to me. It
>>> (the overhead-reduction technique) may also get in the way of easier
>>> functional enhancement in the future.
>>>
>>> David
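[A sketch of the 8-bit per-thread counter scheme discussed in the quoted
thread, assuming a simple flush-to-central-array design (names and layout are
hypothetical): each thread increments a thread-local byte and only touches
the shared 64-bit counter when the byte wraps, cutting contention by up to
256x at the cost of one byte of TLS per counter per thread -- exactly the
memory-pressure concern raised above:

#include <atomic>
#include <cstdint>

constexpr int kNumCounters = 100000;  // 100K counters, per the discussion
std::atomic<uint64_t> central[kNumCounters];  // shared, touched rarely
thread_local uint8_t local[kNumCounters];     // 1 byte per counter per thread

inline void count(int id) {
  if (++local[id] == 0)  // wrapped: 256 increments since the last flush
    central[id].fetch_add(256, std::memory_order_relaxed);
}

// Note: the residual counts left in `local` when a thread exits would also
// need to be flushed; a real implementation would hook thread destruction.
]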
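[The FDO sampling technique David describes could look roughly like the
following -- a hypothetical sketch, since the Google GCC implementation
details are not shown in this thread. The counter update fires only once
every N events and each update is scaled by the period, so the common path is
just a thread-local decrement and branch:

#include <cstdint>

constexpr int kSamplePeriod = 100;  // 'N'; period 100 => ~100x less overhead
thread_local int sample_gate = kSamplePeriod;
uint64_t counters[100000];  // plain counters, as in the current scheme

inline void count_sampled(int id) {
  if (--sample_gate > 0) return;   // fast path: no shared-memory traffic
  sample_gate = kSamplePeriod;
  counters[id] += kSamplePeriod;   // attribute the whole period to this site
}
]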
Dmitry Vyukov
2014-Apr-25 17:22 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Fri, Apr 25, 2014 at 8:44 PM, Bob Wilson <bob.wilson at apple.com> wrote:
>
> On Apr 24, 2014, at 1:33 AM, Dmitry Vyukov <dvyukov at google.com> wrote:
>
>>> I can see that the behavior of our current instrumentation is going to
>>> be a problem for the kinds of applications that you’re looking at. If
>>> you can find a way to get the overhead down without losing accuracy
>>
>> What are your requirements for accuracy?
>> The current implementation does not provide 100% accuracy, so it's
>> something less than 100%. What is it?
>> What use cases for numeric counters (as opposed to a bool flag) do we
>> need to support? Is it only feedback-driven optimizations?
>
> That’s a fair point. The current implementation is potentially inaccurate
> because the counter increments are not thread-safe. In a low-contention
> situation, that won’t matter much, but the counts could become quite
> inaccurate if there are multiple threads running the same code at the same
> time.
>
> I don’t have a specific goal in mind right now for accuracy. We plan to
> use this instrumentation for both PGO and code coverage. Some coverage
> users only care about a boolean check, but others want to see the actual
> execution counts. If we display count values that are significantly
> different from the real execution counts, that will lead to a lot of
> confusion.

If I looked at a coverage report with exact numbers and saw two statements
that must have the same hit count showing different hit counts, I would be
highly confused.
The Go coverage profiler has three modes of operation: set, count, and
atomic:
http://golang.org/cmd/go/#hdr-Description_of_testing_flags
As you can guess, the last one (atomic) is intended for exactly this case --
when the user looks at exact numbers.

>> What is the acceptable threshold? 0.01% would be fine, right? What is
>> the maximum value that you are ready to agree with?
>
> Like I said, I don’t have a specific value in mind. My sense is that most
> of the applications we care about are quite different from the massively
> parallel code that Google cares about. We may have many threads but
> they’re all doing different things, and contention is much less likely to
> be a problem.

Note that contention can arise even if threads are doing different things
but use common components. E.g. if you include a custom malloc
implementation in your program, then everything that allocates becomes a
point of contention.
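[For comparison, the three Go coverage modes Dmitry references above,
sketched side by side in C++ for a single counter. In practice a compiler
would emit exactly one of these per counter depending on the selected mode;
showing all three in one function is purely illustrative:

#include <atomic>
#include <cstdint>

uint8_t set_flag;                      // "set":    covered or not, plain store
uint64_t count_counter;                // "count":  fast racy increment
std::atomic<uint64_t> atomic_counter;  // "atomic": exact counts, but slower

inline void on_edge_executed() {
  set_flag = 1;                                            // mode 1: set
  count_counter++;                                         // mode 2: count
  atomic_counter.fetch_add(1, std::memory_order_relaxed);  // mode 3: atomic
}
]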
Bob Wilson
2014-Apr-25 17:28 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Apr 25, 2014, at 10:22 AM, Dmitry Vyukov <dvyukov at google.com> wrote:
>
> If I looked at a coverage report with exact numbers and saw two statements
> that must have the same hit count showing different hit counts, I would be
> highly confused.
> The Go coverage profiler has three modes of operation: set, count, and
> atomic:
> http://golang.org/cmd/go/#hdr-Description_of_testing_flags
> As you can guess, the last one (atomic) is intended for exactly this case
> -- when the user looks at exact numbers.

I haven’t worked with Go, but that is exactly the kind of distinction that I
had in mind when I suggested that we may want to support different kinds of
instrumentation. It sounds like the “set” mode is pretty appealing for your
target applications, and I won’t be at all surprised if we end up wanting
something like “atomic” for some code coverage users. I’m guessing that the
current -fprofile-instr-generate approach is similar to the “count” mode.

> Note that contention can arise even if threads are doing different things
> but use common components. E.g. if you include a custom malloc
> implementation in your program, then everything that allocates becomes a
> point of contention.

Yes, of course.
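[Finally, the counter-sharding idea quoted earlier in the thread (one flag to
turn sharding on/off, with the compiler picking the best 'N') could be
sketched as follows -- a hypothetical illustration where each thread hashes
to one of N shard arrays and the profile writer sums the shards at dump time.
N = 1 degenerates to today's single-array scheme:

#include <cstdint>
#include <functional>
#include <thread>

constexpr int kShards = 8;  // the 'N' a compiler heuristic would choose
constexpr int kNumCounters = 100000;
uint64_t shards[kShards][kNumCounters];  // still racy within a shard

inline void count_sharded(int id) {
  // Cheap thread-to-shard mapping; a real scheme might cache the shard
  // index per thread instead of hashing on every increment.
  size_t s =
      std::hash<std::thread::id>{}(std::this_thread::get_id()) % kShards;
  shards[s][id]++;
}

uint64_t read_counter(int id) {  // at profile dump time
  uint64_t sum = 0;
  for (int s = 0; s < kShards; s++) sum += shards[s][id];
  return sum;
}
]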