Xinliang David Li
2014-Apr-17 20:51 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
Good thinking, but why do you think runtime selection of shard count is
better than compile-time selection? For single-threaded apps, shard count is
always 1, so why pay the penalty of checking the thread ID each time a
function is entered? For multi-threaded apps, I would expect MAX to be
smaller than NUM_OF_CORES to avoid excessive memory consumption, so you
always end up with N == MAX. If MAX is larger than NUM_OF_CORES, then for
large MT apps the number of threads tends to be larger than NUM_OF_CORES, so
it also ends up with N == MAX. In rare cases the shard count may switch
between MAX and NUM_OF_CORES, but then you also pay the penalty of
reallocating and memcpy'ing the counter arrays each time it changes.

Making N a non-compile-time constant also makes the indexing more expensive.
Of course we can ignore thread migration and do CSE on it.

David

On Thu, Apr 17, 2014 at 1:06 PM, Chandler Carruth <chandlerc at google.com> wrote:

> Having thought a bit about the best strategy to solve this, I think we
> should use a tradeoff of memory to reduce contention. I don't really like
> any of the other options as much, if we can get that one to work. Here is
> my specific suggestion:
>
> On Thu, Apr 17, 2014 at 5:21 AM, Kostya Serebryany <kcc at google.com> wrote:
>
>> - per-cpu counters (not portable, requires a very modern kernel with lots
>> of patches)
>> - sharded counters: each counter represented as N counters sitting in
>> different cache lines. Every thread accesses the counter with index
>> TID%N. Solves the problem partially, better with larger values of N, but
>> then again it costs RAM.
>
> I think we should combine these somewhat.
>
> At an abstract level, I think we should design the profiling to support up
> to N shards of counters.
>
> I think we should have a dynamic number of shards if possible. The goal
> here is that if you never need multiple shards (single-threaded) you pay
> essentially zero cost. I would have a global number of shards that changes
> rarely, and re-compute it on entry to each function with something along
> the lines of:
>
>   if (thread-ID != main's thread-ID && shard_count == 1) {
>     shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES));
>     // If shard_count changed with this, we can also call a library routine
>     // here that does the work of allocating the actual extra shards.
>   }
>
> MAX is a fixed cap so even on systems with 100s of cores we don't do
> something silly. NUMBER_OF_THREADS, if supported on the OS, can limit the
> shards when we only have a small number of threads in the program.
> NUMBER_OF_CORES, if supported on the OS, can limit the shards. If we don't
> have the number of threads, we just use the number of cores. If we don't
> have the number of cores, we can just guess 8 (or something).
>
> Then, we can gracefully fall back on the following strategies to pick an
> index into the shards:
>
> - Per-core non-atomic counter updates (if we support them) using
>   restartable sequences
> - Use a core-ID as the index into the shards to statistically minimize the
>   contention, and do the increments atomically so it remains correct even
>   if the core-ID load happens before a core migration and the counter
>   increment occurs afterward
> - Use (thread-ID % number of cores) if we don't have support for getting a
>   core-ID from the target OS. This will still have a reasonable
>   distribution I suspect, even if not perfect.
>
> Finally, I wouldn't merge on shutdown if possible. I would just write
> larger raw profile data for multithreaded runs, and let the profdata tool
> merge them.
>
> If this is still too much memory, then I would suggest doing the above,
> but doing it independently for each function so that only those functions
> actually called via multithreaded code end up sharding their counters.
>
> I think this would be reasonably straightforward to implement, would not
> significantly grow the cost of single-threaded instrumentation, and would
> largely mitigate the contention on the counters. It can benefit from
> advanced hooks into the OS when those are available, but seems pretty easy
> to implement on roughly any OS with reasonable tradeoffs. The only really
> hard requirement is the ability to get a thread-ID, but I think that is
> fairly reasonable (C++ even makes this essentially mandatory).
>
> Thoughts?
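To make the quoted proposal concrete, here is a minimal self-contained C++
sketch of the scheme. All names (kMaxShards, Shard, maybe_grow_shards, etc.)
are illustrative only, not part of any actual compiler-rt API, and a fixed
static shard array stands in for the dynamically allocated shards described
above:

  #include <algorithm>
  #include <atomic>
  #include <cstddef>
  #include <cstdint>
  #include <functional>
  #include <thread>

  // Illustrative sketch only -- not the actual instrumentation runtime.
  constexpr unsigned kMaxShards = 8;   // the fixed cap "MAX" from the proposal
  constexpr unsigned kCacheLine = 64;

  // Each shard lives in its own cache line to avoid false sharing.
  struct alignas(kCacheLine) Shard {
    std::atomic<uint64_t> value;
  };

  static std::atomic<unsigned> shard_count{1}; // 1 => single-threaded fast path
  static Shard counters[kMaxShards];           // statically sized for simplicity

  // Recomputed rarely, e.g. on function entry from a non-main thread.
  void maybe_grow_shards(bool on_main_thread) {
    if (!on_main_thread && shard_count.load(std::memory_order_relaxed) == 1) {
      unsigned cores = std::thread::hardware_concurrency();
      if (cores == 0) cores = 8;               // unknown? "just guess 8"
      shard_count.store(std::min(kMaxShards, cores), std::memory_order_relaxed);
    }
  }

  // The instrumented increment: thread ID modulo shard count picks the shard.
  void increment_counter() {
    unsigned n = shard_count.load(std::memory_order_relaxed);
    size_t tid = std::hash<std::thread::id>()(std::this_thread::get_id());
    counters[tid % n].value.fetch_add(1, std::memory_order_relaxed);
  }

  // Writing the raw profile sums all shards back into one logical counter.
  uint64_t read_counter() {
    uint64_t sum = 0;
    for (const Shard &s : counters)
      sum += s.value.load(std::memory_order_relaxed);
    return sum;
  }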
Chandler Carruth
2014-Apr-17 21:09 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Thu, Apr 17, 2014 at 1:51 PM, Xinliang David Li <xinliangli at gmail.com> wrote:

> Good thinking, but why do you think runtime selection of shard count is
> better than compile-time selection? For single-threaded apps, shard count
> is always 1, so why pay the penalty of checking the thread ID each time a
> function is entered?

Because extremely few applications statically decide how many threads to
use in the real world (in my experience). This is even more relevant if you
consider each <unit of code, maybe post-inlined function> independently,
where you might have many threads but near zero overlapping functions on
those threads. The number of cores also changes from machine to machine,
and can even change based on the particular OS mode in which your
application runs.

> For multi-threaded apps, I would expect MAX to be smaller than
> NUM_OF_CORES to avoid excessive memory consumption, so you always end up
> with N == MAX. If MAX is larger than NUM_OF_CORES, then for large MT apps
> the number of threads tends to be larger than NUM_OF_CORES, so it also
> ends up with N == MAX. In rare cases the shard count may switch between
> MAX and NUM_OF_CORES, but then you also pay the penalty of reallocating
> and memcpy'ing the counter arrays each time it changes.

Sorry, that was just pseudocode, and very rough at that.

The goal was to allow programs with more than one thread but significantly
fewer threads than cores to not pay (in memory) for all of the shards.
There are common patterns here, such as applications that are essentially
single-threaded but have one or two background threads. Also, while the
hard cap MAX is a compile-time constant, the number of cores isn't (see
above), so at least once per execution of the program we'll need to
dynamically take the min of the two.

> Making N a non-compile-time constant also makes the indexing more
> expensive. Of course we can ignore thread migration and do CSE on it.

Yes, and a certain amount of this is actually fine because the whole point
was to minimize contention rather than perfectly eliminate it.
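As a concrete reading of the fallback chain described earlier in the thread,
here is a hedged C++ sketch. sched_getcpu() is the Linux/glibc call for the
current core ID; everything else here is hypothetical naming rather than the
runtime's actual interface:

  #include <atomic>
  #include <cstddef>
  #include <cstdint>
  #include <functional>
  #include <thread>
  #if defined(__linux__)
  #include <sched.h>   // sched_getcpu()
  #endif

  // Hypothetical sketch of picking a shard index: prefer a core ID, and
  // fall back to thread-ID % shard count when the OS can't provide one.
  static unsigned pick_shard_index(unsigned shard_count) {
  #if defined(__linux__)
    int cpu = sched_getcpu();            // may be stale after a migration
    if (cpu >= 0)
      return static_cast<unsigned>(cpu) % shard_count;
  #endif
    size_t tid = std::hash<std::thread::id>()(std::this_thread::get_id());
    return tid % shard_count;            // still a reasonable distribution
  }

  // Per the CSE point above: compute the index once per function entry and
  // reuse it; the atomic increments keep stale indices correct, just slower.
  void instrumented_function(std::atomic<uint64_t> *shards,
                             unsigned shard_count) {
    unsigned idx = pick_shard_index(shard_count);
    shards[idx].fetch_add(1, std::memory_order_relaxed);  // entry counter
    // ...every other counter in this function reuses idx...
  }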
Xinliang David Li
2014-Apr-17 21:33 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Thu, Apr 17, 2014 at 2:09 PM, Chandler Carruth <chandlerc at google.com> wrote:

> On Thu, Apr 17, 2014 at 1:51 PM, Xinliang David Li <xinliangli at gmail.com> wrote:
>
>> Good thinking, but why do you think runtime selection of shard count is
>> better than compile-time selection? For single-threaded apps, shard
>> count is always 1, so why pay the penalty of checking the thread ID each
>> time a function is entered?
>
> Because extremely few applications statically decide how many threads to
> use in the real world (in my experience). This is even more relevant if
> you consider each <unit of code, maybe post-inlined function>
> independently, where you might have many threads but near zero
> overlapping functions on those threads. The number of cores also changes
> from machine to machine, and can even change based on the particular OS
> mode in which your application runs.

We are talking about developers here. Nobody knows the exact thread count,
but developers know the ballpark, which should be enough. E.g.:

1) my program is single-threaded;
2) my program is mostly single-threaded, with some lightweight helper
   threads;
3) my program is heavily threaded, without a single hotspot;
4) my program is heavily threaded, with hotspot contention.

Only 4) is of concern here. Besides, the user can always find out that the
instrumented build is too slow and decide which strategy to use. For apps
with distinct phases (e.g. ST->MT->ST), the proposed approach may be
useful, but such apps won't be the majority.

>> For multi-threaded apps, I would expect MAX to be smaller than
>> NUM_OF_CORES to avoid excessive memory consumption, so you always end up
>> with N == MAX. If MAX is larger than NUM_OF_CORES, then for large MT
>> apps the number of threads tends to be larger than NUM_OF_CORES, so it
>> also ends up with N == MAX. In rare cases the shard count may switch
>> between MAX and NUM_OF_CORES, but then you also pay the penalty of
>> reallocating and memcpy'ing the counter arrays each time it changes.
>
> Sorry, that was just pseudocode, and very rough at that.
>
> The goal was to allow programs with more than one thread but
> significantly fewer threads than cores to not pay (in memory) for all of
> the shards. There are common patterns here, such as applications that are
> essentially single-threaded but have one or two background threads. Also,
> while the hard cap MAX is a compile-time constant, the number of cores
> isn't (see above), so at least once per execution of the program we'll
> need to dynamically take the min of the two.

See above -- for such cases (scenario 2), the user normally has prior
knowledge.

>> Making N a non-compile-time constant also makes the indexing more
>> expensive. Of course we can ignore thread migration and do CSE on it.
>
> Yes, and a certain amount of this is actually fine because the whole
> point was to minimize contention rather than perfectly eliminate it.

Another danger of dynamically resizing the counter array is that it
requires a global or per-function lock to access the counters. The cost of
that can be really high.

David
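To make the resizing concern concrete, here is a hypothetical sketch of the
hazard being described (none of these names come from the actual runtime):
once the shard array can be reallocated, even the hot-path increment has to
synchronize with the reallocation, and a lock is the simplest, and most
expensive, way to do that.

  #include <cstdint>
  #include <cstring>
  #include <mutex>

  // Hypothetical illustration of the hazard, not a proposed implementation.
  static std::mutex counters_lock;       // the global (or per-function) lock
  static uint64_t *counters = nullptr;
  static unsigned num_shards = 0;

  void grow_shards(unsigned new_count) {
    std::lock_guard<std::mutex> guard(counters_lock);
    if (new_count <= num_shards)
      return;
    uint64_t *grown = new uint64_t[new_count]();       // zero-initialized
    if (counters)
      std::memcpy(grown, counters, num_shards * sizeof(uint64_t));
    delete[] counters;                   // the old array dies here...
    counters = grown;
    num_shards = new_count;
  }

  void increment(unsigned shard) {
    // ...so an unsynchronized increment could race with the reallocation
    // and write into freed memory; hence a lock on every single increment.
    std::lock_guard<std::mutex> guard(counters_lock);
    ++counters[shard];
  }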