Dmitry Vyukov
2014-Apr-18 07:32 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Fri, Apr 18, 2014 at 11:13 AM, Dmitry Vyukov <dvyukov at google.com> wrote:

> Hi,
>
> This is a long thread, so I will combine several comments into a single email.
>
> >> - 8-bit per-thread counters, dumping into central counters on overflow.
> > The overflow will happen very quickly with an 8-bit counter.
>
> Yes, but it reduces contention by 256x (a thread must execute at least 256
> loop iterations between increments of the central counter). In practice, if
> you reduce contention below some threshold, it no longer represents a problem.
>
> >> - per-thread counters. Solves the problem at huge cost in RAM per-thread.
> > It is not practical. Especially for TLS counters -- it creates huge
> > pressure on stack memory.
>
> Do we have any numbers on the number of counters? If there are 100K 1-byte
> counters, I would consider it practical.
>
> > - self-cooling logarithmic counters: if ((fast_random() % (1 << counter)) == 0) counter++;
>
> This option did not receive any attention, although it has O(1) memory
> consumption and reduces contention.
> Note that there are several tunables here: first, how to implement
> fast_rand -- it can be a per-thread LCG, a per-thread counter, a set of
> counters, or something else; second, the frequency of increments -- it can
> actually be additive and/or bounded (because if we reduce the frequency of
> increments by 1000x, that must be enough).
>
> > In Google GCC, we implemented another technique which proves to be very
> > effective -- it is called FDO sampling.
> > Basically counters will be updated every N samples.
>
> How does it work?
>
> >> It seems to me like we're going to have a hard time getting good
> >> multithreaded performance without significant impact on the
> >> single-threaded behavior.
> > I don't really agree.
>
> Disagreement seconded.
> For the single-threaded case we are talking about constant overhead, while
> for the multithreaded case -- about superlinear overhead.
> Having said that, if there is a *simple* way to modify the final scheme to
> significantly reduce single-threaded overheads, then of course it's worth
> doing. But it's not the cornerstone.
>
> > MAX is a fixed cap so even on systems with 100s of cores we don't do
> > something silly.
>
> Makes no sense to me.
> Why do you want only systems with up to 100 cores to be fast and scalable?
> 100+ core systems *especially* need good scalability (contention tends to
> be superlinear).
> What you are proposing is basically: if I have 10 engineers in my company,
> I probably want to give them 10 working desks as well. But let's not go
> insane: if I have 1000 engineers, 100 desks must be enough for them. This
> must reduce costs.
> The baseline memory consumption of a system (and its amount of RAM!) is
> O(NCORES), not O(1). In some read-mostly cases it's possible to achieve
> O(1) memory consumption, and that's great. But if that's not the case here,
> let it be so.
>
> > shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES))
>
> Threads do not produce contention; it's cores that produce contention.
> The formula must be: shard_count = k*NCORES
> And if you want less memory in the single-threaded case, then:
> shard_count = min(k*NCORES, c*NTHREADS)
>
> > We are talking about developers here. Nobody would know the exact thread
> > counts, but developers know the ballpark number.
>
> I strongly believe that we must relieve developers of this choice at build
> time, and do our best to auto-tune (if the final scheme requires tuning).
> First, such questions put an unnecessary burden on developers. We don't ask
> what register allocation algorithm to use for each function, right?
> Second, there is a significant chance we will get a wrong answer, because
> e.g. the developer's view of how threaded the app is can differ from
> reality or from our classification.
> Third, the app can be built by a build engineer; or the procedure can be
> applied to a code base with 10'000 apps; or the threaded-ness can change;
> or the app can work in different modes; or the app can have different phases.
>
> > Another danger involved with dynamically resizing the counter is that it
> > requires a global or per-function lock to access the counters. The cost
> > of this can be really high.
>
> It's possible to do it w/o a mutex; the fast-path overhead is only an
> atomic consume load, which is virtually zero overhead (if implemented
> properly in the compiler):
>
> const int maxshard = 4096;
> uint64* shard[maxshard];
> atomic<int> shardmask;
>
> void inline inccounter(int idx)
> {
>   int shardidx = gettid() & atomic_load(&shardmask, memory_order_consume);
>   shard[shardidx][idx]++;
> }
>
> int pthread_create(...)
> {
>   if (updateshardcount()) {
>     shardlock();
>     if (updateshardcount()) {
>       int newcount = computeshardcount();
>       for (int i = oldcount; i < newcount; i++)
>         shard[i] = malloc(ncounter*sizeof(uint64));
>       atomic_store(&shardmask, newcount-1, memory_order_release);
>     }
>     shardunlock();
>   }
>   ...
> }

If we go with this scheme, for tid I would do:

__thread threadid;

int pthread_create(...)
{
  threadid = goohhash(gettid());
  ...
}

void userfunc()
{
  int tid = threadid+1;
  threadid = tid;
  // use tid in profiling
  ...
}

This avoids the problem of potentially expensive gettid(), and avoids a
possible persistent interference between tids.
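For concreteness, a minimal sketch of what the self-cooling logarithmic counter could look like, under the assumption that fast_random is a cheap per-thread xorshift generator (all names and constants below are illustrative, not something specified in the thread):

#include <cstdint>

// Per-thread PRNG state for fast_random(); seeded lazily from the TLS address.
static __thread uint64_t rng_state;

static inline uint64_t fast_random() {
  uint64_t x = rng_state;
  if (x == 0)
    x = (uint64_t)(uintptr_t)&rng_state | 1;  // any nonzero seed will do
  x ^= x << 13; x ^= x >> 7; x ^= x << 17;    // xorshift64 step
  rng_state = x;
  return x;
}

// Increment with probability 2^-counter: the stored value approximates
// log2(true count), so a single byte never overflows, and the shared
// byte is written exponentially less often as the counter grows.
static inline void log_counter_increment(uint8_t *counter) {
  unsigned c = *counter;
  if (c < 63 && (fast_random() % (1ULL << c)) == 0)
    (*counter)++;
}

The point is that writes to the shared byte become geometrically rarer, so contention fades as the counter grows, at the cost of recording only a logarithmic estimate of the true count.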
Dmitry Vyukov
2014-Apr-18 07:44 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Fri, Apr 18, 2014 at 11:32 AM, Dmitry Vyukov <dvyukov at google.com> wrote:

> [...]
> If we go with this scheme, for tid I would do:
>
> __thread threadid;
>
> int pthread_create(...)
> {
>   threadid = goohhash(gettid());
>   ...
> }
>
> void userfunc()
> {
>   int tid = threadid+1;
>   threadid = tid;

Kostya noted that this increment is a bad idea, because each thread will then
access all possible cache lines for each counter. I agree, this is a bad idea.

What bothers me is that if we do tid%something, there is a non-zero chance
that a significant number of active threads will be mapped onto the same
shard. And then we get back to where we started -- we have significant
contention.

Ideally it would be something like:

on_thread_rescheduling()
{
  tid = getcpuid();
}

but common OSes do not provide the necessary interfaces.

> // use tid in profiling
> ...
> }
>
> This avoids the problem of potentially expensive gettid(), and avoids a
> possible persistent interference between tids.
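As a rough approximation of the getcpuid-on-rescheduling idea: Linux does expose sched_getcpu(), so a sketch could cache the current CPU id and refresh it only occasionally instead of hooking rescheduling (the refresh period and names below are illustrative assumptions):

// Linux-only sketch: sched_getcpu() (glibc, <sched.h>) returns the CPU the
// caller is currently running on. Calling it on every increment would be
// too expensive, so the result is cached and refreshed every kRefreshPeriod
// increments; the shard index can therefore be briefly stale after migration.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>

static const int kRefreshPeriod = 1024;  // illustrative value

static __thread int cached_cpu;
static __thread int until_refresh;

static inline int shard_index(int shardmask) {
  if (--until_refresh <= 0) {
    until_refresh = kRefreshPeriod;
    int cpu = sched_getcpu();
    if (cpu >= 0)                        // -1 on error; keep the old value then
      cached_cpu = cpu;
  }
  return cached_cpu & shardmask;
}

This keeps threads running on different cores mostly on different shards, but a thread that migrates keeps writing to its old shard until the next refresh.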
Kostya Serebryany
2014-Apr-18 07:50 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Fri, Apr 18, 2014 at 11:44 AM, Dmitry Vyukov <dvyukov at google.com> wrote:

> [...]
> Kostya noted that this increment is a bad idea, because each thread will then
> access all possible cache lines for each counter. I agree, this is a bad idea.
>
> What bothers me is that if we do tid%something, there is a non-zero chance
> that a significant number of active threads will be mapped onto the same
> shard. And then we get back to where we started -- we have significant
> contention.
>
> Ideally it would be something like:
>
> on_thread_rescheduling()
> {
>   tid = getcpuid();
> }
>
> but common OSes do not provide the necessary interfaces.

Note also: if we store the tid in a thread-local variable, we'll either have
to load it on every counter increment or keep it in a register, which by
itself will bring an additional ~10% slowdown on x86_64. This is why taking
bits from %sp or %fs sounds attractive (it has problems too).
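For illustration, the "bits from %sp" variant could be sketched as follows; the shift amount assumes the usual multi-megabyte thread stacks, which is an assumption, not something measured in the thread:

#include <cstdint>

// Take the address of a local as a proxy for %sp. Thread stacks normally
// live in separate multi-megabyte mappings (glibc's default is RLIMIT_STACK,
// commonly 8 MB), so bits at and above bit 23 tend to differ between threads
// while staying stable within one thread -- and no TLS load is needed.
// Known problem: consecutive stacks can be mapped at regular offsets, so the
// resulting distribution over shards may be skewed.
static inline unsigned shard_index_from_sp(unsigned shardmask) {
  int probe;
  uintptr_t sp = (uintptr_t)&probe;
  return (unsigned)(sp >> 23) & shardmask;
}

This avoids both the TLS load and gettid() on the fast path, which is what makes it attractive, at the price of the skew problem noted above.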