Justin Bogner
2014-Apr-17 20:27 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
Chandler Carruth <chandlerc at google.com> writes:
> Having thought a bit about the best strategy to solve this, I think we
> should use a tradeoff of memory to reduce contention. I don't really like
> any of the other options as much, if we can get that one to work. Here is
> my specific suggestion:
>
> On Thu, Apr 17, 2014 at 5:21 AM, Kostya Serebryany <kcc at google.com> wrote:
> > - per-cpu counters (not portable, requires very modern kernel with lots
> >   of patches)
> > - sharded counters: each counter represented as N counters sitting in
> >   different cache lines. Every thread accesses the counter with index
> >   TID%N. Solves the problem partially, better with larger values of N,
> >   but then again it costs RAM.
>
> I think we should combine these somewhat.
>
> At an abstract level, I think we should design the profiling to support up
> to N shards of counters.
>
> I think we should have a dynamic number of shards if possible. The goal
> here is that if you never need multiple shards (single threaded) you pay
> essentially zero cost. I would have a global number of shards that changes
> rarely, and re-compute it on entry to each function with something along
> the lines of:
>
>   if (thread-ID != main's thread-ID && shard_count == 1) {
>     shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES));
>     // if shard_count changed with this, we can also call a library routine
>     // here that does the work of allocating the actual extra shards.
>   }

Is it possible to hook on something more clever than function entry?
Choosing to do this based on thread creation or something like that
could make this extra check very cheap.

> MAX is a fixed cap so even on systems with 100s of cores we don't do
> something silly. NUMBER_OF_THREADS, if supported on the OS, can limit the
> shards when we only have a small number of threads in the program.
> NUMBER_OF_CORES, if supported on the OS, can limit the shards. If we don't
> have the number of threads, we just use the number of cores. If we don't
> have the number of cores, we can just guess 8 (or something).

I like the general idea of this approach. The obvious TLS approach
worried me in that it explodes memory usage by much more than you'll
generally need.

> Then, we can gracefully fall back on the following strategies to pick an
> index into the shards:
>
> - Per-core non-atomic counter updates (if we support them) using
>   restartable sequences
> - Use a core-ID as the index into the shards to statistically minimize the
>   contention, and do the increments atomically so it remains correct even
>   if the core-ID load happens before a core migration and the counter
>   increment occurs afterward
> - Use (thread-ID % number of cores) if we don't have support for getting a
>   core-ID from the target OS. This will still have a reasonable
>   distribution I suspect, even if not perfect.
>
> Finally, I wouldn't merge on shutdown if possible. I would just write
> larger raw profile data for multithreaded runs, and let the profdata tool
> merge them.

Agreed. Doing the merging offline is a clear win. The space overhead is
short lived enough that it shouldn't be a big issue.

> If this is still too much memory, then I would suggest doing the above,
> but doing it independently for each function so that only those functions
> actually called via multithreaded code end up sharding their counters.

I'd worry that doing this per-function might be too fine of granularity,
but there are certainly use cases where large chunks of the code are
only exercised by one (of many) threads, so something along these lines
could be worthwhile.

> I think this would be reasonably straightforward to implement, not
> significantly grow the cost of single-threaded instrumentation, and
> largely mitigate the contention on the counters. It can benefit from
> advanced hooks into the OS when those are available, but seems pretty easy
> to implement on roughly any OS with reasonable tradeoffs. The only really
> hard requirement is the ability to get a thread-id, but I think that is
> fairly reasonable (C++ even makes this essentially mandatory).
>
> Thoughts?
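[Editor's note: the sharded-counter scheme discussed above can be illustrated with a minimal, self-contained sketch. The 64-byte cache-line size, the shard count, and the thread-ID-hash indexing policy are illustrative assumptions, not the proposed runtime implementation; a core-ID (e.g. sched_getcpu on Linux), where available, would be the preferred index.]

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// One counter per cache line so shards don't false-share.
struct alignas(64) Shard {
  std::atomic<uint64_t> count{0};
};

// Hypothetical per-counter state: N shards, each thread increments one of them.
class ShardedCounter {
  std::vector<Shard> shards_;

public:
  explicit ShardedCounter(size_t n) : shards_(n) {}

  // Pick a shard from the thread ID; the increment stays atomic so the
  // result is correct even if many threads land on the same shard.
  void increment() {
    size_t idx = std::hash<std::thread::id>{}(std::this_thread::get_id()) %
                 shards_.size();
    shards_[idx].count.fetch_add(1, std::memory_order_relaxed);
  }

  // Reading just sums the shards -- the same addition the offline
  // profdata merge would perform over a larger raw profile.
  uint64_t total() const {
    uint64_t sum = 0;
    for (const auto &s : shards_)
      sum += s.count.load(std::memory_order_relaxed);
    return sum;
  }
};
```

With shard count 1 this degenerates to today's single atomic counter, which is why the single-threaded case pays essentially nothing beyond the hash.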
Chandler Carruth
2014-Apr-17 21:04 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Thu, Apr 17, 2014 at 1:27 PM, Justin Bogner <mail at justinbogner.com> wrote:
> Chandler Carruth <chandlerc at google.com> writes:
> >   if (thread-ID != main's thread-ID &&
> >       shard_count < std::min(MAX, NUMBER_OF_CORES)) {
> >     shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES));
> >     // if shard_count changed with this, we can also call a library routine
> >     // here that does the work of allocating the actual extra shards.
> >   }

Note, I had a bug here. The first version of this only grew the shard count
once; later I thought it might be nice to grow it incrementally so that
applications that only need 2 or 3 threads don't pay for more. I fixed the
guard here, but naturally this is still hand-wavy. =D

> Is it possible to hook on something more clever than function entry?
> Choosing to do this based on thread creation or something like that
> could make this extra check very cheap.

Yea, *if* you can do it on thread creation, that would be awesome. But we
don't really have a good way to model that right now.

> I like the general idea of this approach. The obvious TLS approach
> worried me in that it explodes memory usage by much more than you'll
> generally need.
>
> [...]
>
> I'd worry that doing this per-function might be too fine of granularity,
> but there are certainly use cases where large chunks of the code are
> only exercised by one (of many) threads, so something along these lines
> could be worthwhile.

Yea. If we need to do it in a fine grained way (IE, not whole-program
sharding), I think my favorite granularity would be per-post-inliner-function.
The other thing to do is to set up function-local variables at that phase
that are used to locate the counters cheaply within the function body.

One way that might work would be to use essentially synthetic intrinsics to
introduce the *markers* for instrumentation (and the function names, frontend
hashes, etc., all the stuff that needs language-level semantics) into the
code, but have LLVM transform these into the actual instrumentation code
after inlining and optimization. Then it can also inject the necessary code
into the more coarse grained boundaries.
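[Editor's note: the corrected, incremental growth guard quoted above might look roughly like the following. MAX, the thread/core queries, and the grow routine are placeholders for whatever the runtime would actually provide, so this is a sketch of the shape of the check, not a proposed implementation.]

```cpp
#include <algorithm>
#include <atomic>
#include <thread>

static constexpr unsigned kMaxShards = 16;    // MAX: a fixed cap
static std::atomic<unsigned> shard_count{1};  // starts at 1: single-threaded pays nothing

// Hypothetical OS queries; guess 8 when the OS can't tell us.
unsigned number_of_cores() {
  unsigned n = std::thread::hardware_concurrency();
  return n ? n : 8;
}
unsigned number_of_threads() { return number_of_cores(); }  // stand-in fallback

// Placeholder for the library routine that allocates the extra shards.
void grow_shards_to(unsigned /*n*/) {}

// The fixed guard: grow incrementally whenever a non-main thread sees a
// shard count below the cap, instead of growing only once past 1.
void maybe_grow_shards(bool is_main_thread) {
  unsigned limit = std::min(kMaxShards, number_of_cores());
  if (!is_main_thread &&
      shard_count.load(std::memory_order_relaxed) < limit) {
    unsigned target = std::min(
        kMaxShards, std::max(number_of_threads(), number_of_cores()));
    shard_count.store(target, std::memory_order_relaxed);
    grow_shards_to(target);
  }
}
```

Hooking this on thread creation rather than function entry, as suggested above, would make the common-case check vanish entirely from instrumented function bodies.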
Bob Wilson
2014-Apr-18 00:19 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Apr 17, 2014, at 2:04 PM, Chandler Carruth <chandlerc at google.com> wrote:
> [...]
>
> Yea. If we need to do it in a fine grained way (IE, not whole-program
> sharding), I think my favorite granularity would be per-post-inliner-function.
> The other thing to do is to set up function-local variables at that phase
> that are used to locate the counters cheaply within the function body.
> One way that might work would be to use essentially synthetic intrinsics
> to introduce the *markers* for instrumentation (and the function names,
> frontend hashes, etc., all the stuff that needs language-level semantics)
> into the code, but have LLVM transform these into the actual
> instrumentation code after inlining and optimization. Then it can also
> inject the necessary code into the more coarse grained boundaries.

I really don't think there is a *right* answer here. If you can make this
work without a significant performance regression for cases where there is
little contention for the counters, then great, but I'm pretty skeptical of
that. I would love to be surprised. If someone wants to implement a scheme
like this, then we can get real performance numbers instead of just
speculation and we'll see.