Bob Wilson
2014-Apr-18 00:19 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Apr 17, 2014, at 2:04 PM, Chandler Carruth <chandlerc at google.com> wrote:

> On Thu, Apr 17, 2014 at 1:27 PM, Justin Bogner <mail at justinbogner.com> wrote:
> > Chandler Carruth <chandlerc at google.com> writes:
> > > if (thread-ID != main's thread-ID && shard_count < std::min(MAX, NUMBER_OF_CORES)) {
> > >   shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES));
> > >   // if shard_count changed with this, we can also call a library routine here
> > >   // that does the work of allocating the actual extra shards.
> > > }
>
> Note, I had a bug here. The first time around I had this grow the shard count only once, and later thought it might be nice to grow it incrementally so that applications that only need 2 or 3 threads don't pay for more. I fixed the guard here, but naturally this is still hand-wavy. =D
>
> > Is it possible to hook on something more clever than function entry?
> > Choosing to do this based on thread creation or something like that
> > could make this extra check very cheap.
>
> Yea, *if* you can do it on thread creation, that would be awesome. But we don't really have a good way to model that right now.
>
> > > MAX is a fixed cap so even on systems with 100s of cores we don't do something
> > > silly. NUMBER_OF_THREADS, if supported on the OS, can limit the shards when we
> > > only have a small number of threads in the program. NUMBER_OF_CORES, if
> > > supported on the OS, can limit the shards. If we don't have the number of
> > > threads, we just use the number of cores. If we don't have the number of
> > > cores, we can just guess 8 (or something).
> >
> > I like the general idea of this approach. The obvious TLS approach
> > worried me in that it explodes memory usage by much more than you'll
> > generally need.
>
> > > Then, we can gracefully fall back on the following strategies to pick an index
> > > into the shards:
> > >
> > > - Per-core non-atomic counter updates (if we support them) using restartable
> > >   sequences
> > > - Use a core-ID as the index into the shards to statistically minimize the
> > >   contention, and do the increments atomically so it remains correct even if the
> > >   core-ID load happens before a core migration and the counter increment occurs
> > >   afterward
> > > - Use (thread-ID % number of cores) if we don't have support for getting a
> > >   core-ID from the target OS. This will still have a reasonable distribution, I
> > >   suspect, even if not perfect.
> > >
> > > Finally, I wouldn't merge on shutdown if possible. I would just write larger
> > > raw profile data for multithreaded runs, and let the profdata tool merge them.
> >
> > Agreed. Doing the merging offline is a clear win. The space overhead is
> > short-lived enough that it shouldn't be a big issue.
>
> > > If this is still too much memory, then I would suggest doing the above, but
> > > doing it independently for each function so that only those functions actually
> > > called via multithreaded code end up sharding their counters.
> >
> > I'd worry that doing this per-function might be too fine a granularity,
> > but there are certainly use cases where large chunks of the code are
> > only exercised by one (of many) threads, so something along these lines
> > could be worthwhile.
>
> Yea. If we need to do it in a fine-grained way (i.e., not whole-program sharding), I think my favorite granularity would be per-post-inliner-function. The other thing to do is to set up function-local variables at that phase that are used to locate the counters cheaply within the function body.
>
> One way that might work would be to use essentially synthetic intrinsics to introduce the *markers* for instrumentation (and the function names, frontend hashes, etc. -- all the stuff that needs language-level semantics) into the code, but have LLVM transform these into the actual instrumentation code after inlining and optimization. Then it can also inject the necessary code at the more coarse-grained boundaries.

I really don't think there is a *right* answer here. If you can make this work without a significant performance regression for cases where there is little contention for the counters, then great, but I'm pretty skeptical of that. I would love to be surprised. If someone wants to implement a scheme like this, then we can get real performance numbers instead of just speculation, and we'll see.
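[For concreteness, a minimal sketch of the function-entry check discussed above. Everything here other than MAX, NUMBER_OF_THREADS, and NUMBER_OF_CORES is invented for illustration: current_thread_id(), main_thread_id, and grow_shards() -- a hypothetical library routine that allocates the extra shards and publishes the new count -- are not part of any actual proposal in the thread:

    static std::atomic<unsigned> shard_count{1};

    // Called on function entry (the point under discussion).
    inline void maybe_grow_shards()
    {
        // Only non-main threads grow the count, and only while it is below
        // the cap, so single-threaded programs keep a single shard.
        if (current_thread_id() != main_thread_id &&
            shard_count.load(std::memory_order_relaxed) < std::min(MAX, NUMBER_OF_CORES)) {
            // Chandler later corrects the inner std::max here to std::min;
            // see his follow-up below.
            grow_shards(std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES)));
        }
    }
]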
Dmitry Vyukov
2014-Apr-18 07:13 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
Hi,

This is a long thread, so I will combine several comments into a single email.

> > - 8-bit per-thread counters, dumping into central counters on overflow.
> The overflow will happen very quickly with an 8-bit counter.

Yes, but it reduces contention by 256x (a thread must execute at least 256 loop iterations between increments of the central counter). In practice, if you reduce contention below some threshold, it does not represent a problem anymore.

> > - per-thread counters. Solves the problem at huge cost in RAM per-thread.
> It is not practical. Especially for TLS counters -- it creates huge pressure on stack memory.

Do we have any numbers about the number of counters? If there are 100K 1-byte counters, I would consider it practical.

> - self-cooling logarithmic counters: if ((fast_random() % (1 << counter)) == 0) counter++;

This option did not receive any attention, yet it has O(1) memory consumption and reduces contention. Note that there are several tunables here: first, how to implement fast_random -- it can be a per-thread LCG, or a per-thread counter, or a set of counters, or something else; second, the frequency of increment -- e.g. it can actually be additive and/or bounded (because if we reduce the frequency of increments by 1000x, this must be enough).

> In Google GCC, we implemented another technique which proves to be very effective -- it is called FDO sampling.
> Basically counters will be updated every N samples.

How does it work?

> > It seems to me like we're going to have a hard time getting good multithreaded performance without significant impact on the single-threaded behavior.
> I don't really agree.

Disagreement seconded. For the single-threaded case we are talking about constant overhead, while for the multithreaded case -- about superlinear overhead.
Having said that, if there is a *simple* way to modify the final scheme to significantly reduce single-threaded overheads, then of course it's worth doing. But it's not the cornerstone.

> MAX is a fixed cap so even on systems with 100s of cores we don't do something silly.

Makes no sense to me. Why do you want only systems with up to 100 cores to be fast and scalable? 100+ core systems *especially* need good scalability (contention tends to be superlinear). What you are proposing is basically: if I have 10 engineers in my company, I probably want to give them 10 working desks as well. But let's not go insane -- if I have 1000 engineers, 100 desks must be enough for them; this must reduce costs.
The baseline memory consumption for systems (and the amount of RAM!) is O(NCORES), not O(1). In some read-mostly cases it's possible to achieve O(1) memory consumption, and that's great. But if it's not the case here, let it be so.

> shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES))

Threads do not produce contention; it's cores that produce contention.
The formula must be: shard_count = k*NCORES
And if you want less memory in the single-threaded case, then: shard_count = min(k*NCORES, c*NTHREADS)

> We are talking about developers here. Nobody would know the exact thread counts, but developers know the ballpark number.

I strongly believe that we must relieve developers from this choice at build time, and do our best to auto-tune (if the final scheme requires tuning).
First, such questions put an unnecessary burden on developers. We don't ask what register allocation algorithm to use for each function, right?
Second, there are significant chances we will get a wrong answer, because e.g. the developer's view of how threaded the app is can differ from reality or from our classification.
Third, the app can be built by a build engineer; or the procedure can be applied to a codebase with 10,000 apps; or the threaded-ness can change; or the app can work in different modes; or the app can have different phases.

> Another danger involved with dynamically resizing the counter is that it requires a global or per-function lock to access the counters. The cost of this can be really high.

It's possible to do it without a mutex; if implemented properly in the compiler, the fast path is a single, virtually zero-overhead atomic consume load:

const int maxshard = 4096;
uint64_t* shard[maxshard];   // each shard is a full copy of the counter array
atomic<int> shardmask;       // number of active shards minus one (power of two)

inline void inccounter(int idx)
{
  // Fast path: one consume load of the mask, then a plain increment
  // in this thread's shard.
  int shardidx = gettid() & atomic_load(&shardmask, memory_order_consume);
  shard[shardidx][idx]++;
}

int pthread_create(...)
{
  if (updateshardcount()) {      // racy check first...
    shardlock();
    if (updateshardcount()) {    // ...then re-check under the lock
      int newcount = computeshardcount();
      for (int i = oldcount; i < newcount; i++)
        shard[i] = malloc(ncounter * sizeof(uint64_t));
      // Publish the new shards before making them reachable via the mask.
      atomic_store(&shardmask, newcount - 1, memory_order_release);
    }
    shardunlock();
  }
  ...
}
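[Two of the schemes mentioned above are compact enough to sketch. First, the 8-bit per-thread counter with overflow dump, assuming each thread holds a TLS byte per counter and flushes 256 increments at a time to the shared counter; NCOUNTERS and the function names are invented for illustration:

    static __thread uint8_t local_count[NCOUNTERS];   // per-thread 1-byte counters
    static uint64_t central_count[NCOUNTERS];         // shared counters

    inline void inc8(int idx)
    {
        // 255 of every 256 increments touch only thread-local memory.
        if (++local_count[idx] == 0)   // wrapped: 256 increments accumulated
            __sync_fetch_and_add(&central_count[idx], 256);
    }

Second, the self-cooling logarithmic counter exactly as quoted above: the stored value approximates log2 of the true count, so updates become exponentially rarer as the counter grows. fast_random() is assumed to be a cheap per-thread generator, one of the tunables listed:

    inline void loginc(uint8_t* counter)
    {
        // Increment with probability 2^-counter; contention and write
        // traffic fall off exponentially with the counter value.
        // (A real implementation would cap *counter below the word width.)
        if ((fast_random() % (1u << *counter)) == 0)
            (*counter)++;
    }
]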
Dmitry Vyukov
2014-Apr-18 07:32 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Fri, Apr 18, 2014 at 11:13 AM, Dmitry Vyukov <dvyukov at google.com> wrote:

> [... full text of the previous message quoted ...]
>
> It's possible to do it without a mutex; if implemented properly in the compiler, the fast path is a single, virtually zero-overhead atomic consume load:
>
> inline void inccounter(int idx)
> {
>   int shardidx = gettid() & atomic_load(&shardmask, memory_order_consume);
>   shard[shardidx][idx]++;
> }

If we go with this scheme, for tid I would do:

__thread int threadid;   // cached, cheaply mutable per-thread id

int pthread_create(...)
{
  threadid = goodhash(gettid());   // seed from the real tid once, at thread start
  ...
}

void userfunc()
{
  // Advance the cached id on each profiled call, so a thread's accesses
  // drift across shards over time instead of pinning to one slot.
  int tid = threadid + 1;
  threadid = tid;
  // use tid in profiling
  ...
}

This avoids the problem of a potentially expensive gettid(), and avoids a possible persistent interference between tids.
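[Putting the two sketches together -- an illustration of how they appear intended to compose, not something spelled out in the thread -- the rotating per-thread id simply replaces the gettid() call in the fast path:

    inline void inccounter(int idx)
    {
        // Cheap TLS read-modify-write instead of a gettid() call.
        int tid = threadid + 1;
        threadid = tid;
        shard[tid & atomic_load(&shardmask, memory_order_consume)][idx]++;
    }
]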
Chandler Carruth
2014-Apr-18 09:04 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Fri, Apr 18, 2014 at 12:13 AM, Dmitry Vyukov <dvyukov at google.com> wrote:

> > MAX is a fixed cap so even on systems with 100s of cores we don't do something silly.
>
> Makes no sense to me.
> Why do you want only systems with up to 100 cores to be fast and scalable? 100+ core systems *especially* need good scalability (contention tends to be superlinear).

Please don't argue motives, and instead argue the technical merits. I do want systems with 100s of cores to be fast and scalable.

> What you are proposing is basically: if I have 10 engineers in my company, I probably want to give them 10 working desks as well. But let's not go insane -- if I have 1000 engineers, 100 desks must be enough for them; this must reduce costs.
> The baseline memory consumption for systems (and the amount of RAM!) is O(NCORES), not O(1). In some read-mostly cases it's possible to achieve O(1) memory consumption, and that's great. But if it's not the case here, let it be so.

I think you are drastically overstating what I am suggesting. The bad analogy isn't helping. The only way we get contention is if we have the same function accessing the same counters on multiple cores at the same time. It is entirely conceivable that programs which manage to do this for *every* core in a system with many hundreds of cores are rare. As a consequence, it might be a practical compromise to reduce the number of shards below the number of cores if the memory overhead is not providing commensurate performance. Clearly, measurements and such are needed here, but it is at least a tunable knob that we should know about and consider in our measurements.

> > shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES))
>
> Threads do not produce contention; it's cores that produce contention.

It is the *combination* of threads on cores which produces contention. The above 'max' should have been a 'min', sorry for that confusion. The point was to reduce the shards if *either* the number of cores is small or the number of threads is small.

> The formula must be: shard_count = k*NCORES
> And if you want less memory in the single-threaded case, then: shard_count = min(k*NCORES, c*NTHREADS)

Which is what I intended to write.
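[With that correction, the guard expression from the first message in this thread would read:

    // Shrink the shard count when *either* the thread count or the core
    // count is small; MAX still caps it on very large systems.
    shard_count = std::min(MAX, std::min(NUMBER_OF_THREADS, NUMBER_OF_CORES));

which matches the min(k*NCORES, c*NTHREADS) form Dmitry proposes, with k = c = 1.]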
Xinliang David Li
2014-Apr-18 19:45 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Fri, Apr 18, 2014 at 12:13 AM, Dmitry Vyukov <dvyukov at google.com> wrote:

> > > - per-thread counters. Solves the problem at huge cost in RAM per-thread.
> > It is not practical. Especially for TLS counters -- it creates huge pressure on stack memory.
>
> Do we have any numbers about the number of counters? If there are 100K 1-byte counters, I would consider it practical.

A medium-sized app I looked at has about 10M counters (arcs only). It is also not uncommon to see such apps running with hundreds of threads.

> > In Google GCC, we implemented another technique which proves to be very effective -- it is called FDO sampling.
> > Basically counters will be updated every N samples.
>
> How does it work?

Similar to how occurrence-based PMU sampling works. Setting the sampling period to 100 can reduce the instrumentation overhead by close to 100x without introducing much precision loss.

> I strongly believe that we must relieve developers from this choice at build time, and do our best to auto-tune (if the final scheme requires tuning).

That really depends. If the tuning space is small, it won't be a problem for the developer/builder.

> First, such questions put an unnecessary burden on developers. We don't ask what register allocation algorithm to use for each function, right?

Crazy developers can actually do that via internal options, but this is a totally different case. People just need one flag to turn sharding on/off. When sharding is on, the compiler can pick the best 'N' according to some heuristics at compile time.

> Second, there are significant chances we will get a wrong answer, because e.g. the developer's view of how threaded the app is can differ from reality or from our classification.
> Third, the app can be built by a build engineer; or the procedure can be applied to a codebase with 10,000 apps; or the threaded-ness can change; or the app can work in different modes; or the app can have different phases.

We have also forgotten to mention the benefit of implementation simplicity. If the static/simple solution solves the problem for most of the use cases, designing a fancy dynamic solution sounds like over-engineering to me. It (the overhead-reduction technique) may also get in the way of easier functional enhancement in the future.

David
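[A rough illustration of the sampled-update idea as described here; the actual Google GCC implementation is not shown in the thread, and the countdown variable, the atomic add, and the choice to scale the recorded value by the period are all assumptions:

    #define SAMPLE_PERIOD 100

    static __thread int sample_countdown = SAMPLE_PERIOD;

    inline void profile_inc(uint64_t* counter)
    {
        // Only every SAMPLE_PERIOD-th execution touches the shared counter,
        // cutting both instrumentation work and contention by ~SAMPLE_PERIOD.
        if (--sample_countdown <= 0) {
            sample_countdown = SAMPLE_PERIOD;
            // Scale the update so the counter still approximates the
            // true execution count.
            __sync_fetch_and_add(counter, SAMPLE_PERIOD);
        }
    }
]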