Kostya Serebryany
2014-Apr-18 07:50 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Fri, Apr 18, 2014 at 11:44 AM, Dmitry Vyukov <dvyukov at google.com> wrote:> On Fri, Apr 18, 2014 at 11:32 AM, Dmitry Vyukov <dvyukov at google.com>wrote: > >> On Fri, Apr 18, 2014 at 11:13 AM, Dmitry Vyukov <dvyukov at google.com>wrote: >> >>> Hi, >>> >>> This is long thread, so I will combine several comments into single >>> email. >>> >>> >>> >> - 8-bit per-thread counters, dumping into central counters on >>> overflow. >>> >The overflow will happen very quickly with 8bit counter. >>> >>> Yes, but it reduces contention by 256x (a thread must execute at least >>> 256 loop iterations between increments). In practice, if you reduce >>> contention below some threshold, it does not represent a problem anymore. >>> >>> >>> >>> >> - per-thread counters. Solves the problem at huge cost in RAM >>> per-thread >>> >It is not practical. Especially for TLS counters -- it creates huge >>> pressure on stack memory. >>> >>> Do we have any numbers about number of counters? If there are 100K >>> 1-byte counters, I would consider it as practical. >>> >>> >>> >>> > - self-cooling logarithmic counters: if ((fast_random() % (1 << >>> counter)) == 0) counter++; >>> >>> This option did not receive any attention. While it has O(1) memory >>> consumption and reduces contention. >>> Note that there are several tunables here: first -- how implement >>> fast_rand: it can be a per-thread LCG, or per-thread counter or set of >>> counter, or something else; second -- frequency of increment, e.g. it >>> actually can be additive and/or bounded (because if we reduce frequency of >>> increments by 1000x, this must be enough). >>> >>> >>> >>> > In Google GCC, we implemented another technique which proves to be >>> very effective -- it is called FDO sampling. >>> > Basically counters will be updated every N samples. >>> >>> How does it work? >>> >>> >>> >>> >> It seems to me like we’re going to have a hard time getting good >>> multithreaded performance without significant impact on the single-threaded >>> behavior. >>> > I don't really agree. >>> >>> Disagreement seconded. >>> For single-threaded case we are talking about constant overhead, while >>> for multithreaded case -- about superlinear overhead. >>> Having said that, if there is a *simple* way to modify the final scheme >>> to significantly reduce single-threaded overheads, then or course it's >>> worth doing. But it's not the cornerstone. >>> >>> >>> >>> > MAX is a fixed cap so even on systems with 100s of cores we don't do >>> something silly. >>> >>> Makes not sense to me. >>> Why do you want only systems with up to 100 cores to be fast and >>> scalable? 100+ core system *especially* need good scalability (contention >>> tends to be superlinear). >>> What you are proposing is basically: If I have 10 engineers in my >>> company, I probably want to give them 10 working desks as well. But let's >>> not go insane. If I have 1000 engineers, 100 desks must be enough for them. >>> This must reduce costs. >>> The baseline memory consumption for systems (and amount of RAM!) is >>> O(NCORES), not O(1). In some read-mostly cases it's possible to achieve >>> O(1) memory consumption, and that's great. But if it's not the case here, >>> let it be so. >>> >>> >>> >>> > shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, >>> NUMBER_OF_CORES)) >>> >>> Threads do not produce contention, it's cores that produce contention. >>> The formula must be: shard_count = k*NCORES >>> And if you want less memory in single-threaded case, then: shard_count >>> min(k*NCORES, c*NTHREADS) >>> >>> >>> >>> >We are talking about developers here. Nobody would know the exact >>> thread counts, but developers know the ballpark number >>> >>> I strongly believe that we must relief developers from this choice >>> during build time, and do our best to auto-tune (if the final scheme >>> requires tuning). >>> First, such questions puts unnecessary burden on developers. We don't >>> ask what register allocation algorithm to use for each function, right? >>> Second, there are significant chances we will get a wrong answer, >>> because e.g. developer's view of how threaded is the app can differ from >>> reality or from our classification. >>> Third, the app can be build by a build engineer; or the procedure can be >>> applied to a base with 10'000 apps; or threaded-ness can change; or the app >>> can work in different modes; or the app can have different phases. >>> >>> >>> >>> >Another danger involved with dynamically resizing the counter is that >>> it requires a global or per function lock to access the counters. The cost >>> of this can be really high. >>> >>> It's possible to do it w/o a mutex, fast-path overhead is only a >>> virtually zero overhead (if implemented properly in the compiler) atomic >>> consume load: >>> >>> const int maxshard = 4096; >>> uint64* shard[maxshard]; >>> atomic<int> shardmask; >>> >>> void inline inccounter(int idx) >>> { >>> int shardidx = gettid() & atomic_load(&shardmask, memory_order_consume); >>> shard[shardidx][idx]++; >>> } >>> >>> int pthread_create(...) >>> { >>> if (updateshardcount()) { >>> shardlock(); >>> if (updateshardcount()) { >>> int newcount = computeshardcount(); >>> for (int i = oldcount; i < newcount; i++) >>> shard[i] = malloc(ncounter*sizeof(uint64)); >>> atomic_store(&shardmask, newcount-1, memory_order_release); >>> } >>> shardunlock(); >>> } >>> ... >>> } >>> >>> >>> >> If we go with this scheme, for tid I would do: >> >> __thread threadid; >> >> int pthread_create(...) >> { >> threadid = goohhash(gettid()); >> ... >> } >> >> void userfunc() >> { >> int tid = threadid+1; >> threadid = tid; >> > > Kostya noted that this increment is a bad idea. Because each thread will > access all possible cache lines for each counter. I agree, this is a bad > idea. > > What bothers me is that if we do tid%something, there is non-zero chance > that significant amount of active threads will be mapped onto the same > shard. And then get back to where we start -- we have significant > contention. > > Ideally it would be something like: > > on_thread_rescheduling() > { > tid = getcpuid(); > } > > but common OSes do not provide necessary interfaces. >Note also: if we store tid in a thread-local variable, we'll either have to load it on every counter increment or keep it in a register, which by itself will bring additional ~10% slowdown on x86_64. This is why taking bits from %sp or %fs sounds attractive (it has problems too)> > > > >> // use tid in profiling >> ... >> } >> >> This avoids the problem of potentially expensive gettid(), and avoids a >> possible persistent interference between tids. >> >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140418/04115e2b/attachment.html>
Kostya Serebryany
2014-Apr-18 09:10 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
One more proposal: simple per-thread counters allocated with mmap(MAP_NORESERVE), the same trick that works so well for asan/tsan/msan. Chrome has ~3M basic blocks instrumented for coverage, so even largest applications will hardly have more than, say, 10M basic blocks (number can be configurable at application start time). This gives us 80Mb for the array of 64-bit counters. That's a lot if multiplied by the number of threads, but the MAP_NORESERVE trick solves the problem -- each thread will only touch the pages where it actually increment the counters. On thread exit the whole 80Mb counter array are will be merged into a central array of counters and then discarded, but we can also postpone this until another new thread is created -- then we just reuse the counter array. This brings two challenges. #1. The basic blocks should be numbered sequentially. I see only one way to accomplish this: with the help of linker (and dynamic linker for DSOs). The compiler would emit code using offsets that will later be transformed into constants by the linker. Not sure if any existing linker support this kind of thing. Anyone? #2. How to access the per-thread counter array. If we simply store the pointer to the array in TLS, the instrumentation will be more expensive just because of need to load and keep this pointer. If the counter array is part of TLS itself, we'll have to intrude into the pthread library (or wrap it) so that this part of TLS is mapped with MAP_NORESERVE. This is all far from trivial, but it - has the same performance in single-threaded case compared to the current design - has no contention in multi-threaded case. - should use moderate amount of RAM due to the MAP_NORESERVE trick. ?? --kcc On Fri, Apr 18, 2014 at 11:50 AM, Kostya Serebryany <kcc at google.com> wrote:> > > > On Fri, Apr 18, 2014 at 11:44 AM, Dmitry Vyukov <dvyukov at google.com>wrote: > >> On Fri, Apr 18, 2014 at 11:32 AM, Dmitry Vyukov <dvyukov at google.com>wrote: >> >>> On Fri, Apr 18, 2014 at 11:13 AM, Dmitry Vyukov <dvyukov at google.com>wrote: >>> >>>> Hi, >>>> >>>> This is long thread, so I will combine several comments into single >>>> email. >>>> >>>> >>>> >> - 8-bit per-thread counters, dumping into central counters on >>>> overflow. >>>> >The overflow will happen very quickly with 8bit counter. >>>> >>>> Yes, but it reduces contention by 256x (a thread must execute at least >>>> 256 loop iterations between increments). In practice, if you reduce >>>> contention below some threshold, it does not represent a problem anymore. >>>> >>>> >>>> >>>> >> - per-thread counters. Solves the problem at huge cost in RAM >>>> per-thread >>>> >It is not practical. Especially for TLS counters -- it creates huge >>>> pressure on stack memory. >>>> >>>> Do we have any numbers about number of counters? If there are 100K >>>> 1-byte counters, I would consider it as practical. >>>> >>>> >>>> >>>> > - self-cooling logarithmic counters: if ((fast_random() % (1 << >>>> counter)) == 0) counter++; >>>> >>>> This option did not receive any attention. While it has O(1) memory >>>> consumption and reduces contention. >>>> Note that there are several tunables here: first -- how implement >>>> fast_rand: it can be a per-thread LCG, or per-thread counter or set of >>>> counter, or something else; second -- frequency of increment, e.g. it >>>> actually can be additive and/or bounded (because if we reduce frequency of >>>> increments by 1000x, this must be enough). >>>> >>>> >>>> >>>> > In Google GCC, we implemented another technique which proves to be >>>> very effective -- it is called FDO sampling. >>>> > Basically counters will be updated every N samples. >>>> >>>> How does it work? >>>> >>>> >>>> >>>> >> It seems to me like we’re going to have a hard time getting good >>>> multithreaded performance without significant impact on the single-threaded >>>> behavior. >>>> > I don't really agree. >>>> >>>> Disagreement seconded. >>>> For single-threaded case we are talking about constant overhead, while >>>> for multithreaded case -- about superlinear overhead. >>>> Having said that, if there is a *simple* way to modify the final scheme >>>> to significantly reduce single-threaded overheads, then or course it's >>>> worth doing. But it's not the cornerstone. >>>> >>>> >>>> >>>> > MAX is a fixed cap so even on systems with 100s of cores we don't do >>>> something silly. >>>> >>>> Makes not sense to me. >>>> Why do you want only systems with up to 100 cores to be fast and >>>> scalable? 100+ core system *especially* need good scalability (contention >>>> tends to be superlinear). >>>> What you are proposing is basically: If I have 10 engineers in my >>>> company, I probably want to give them 10 working desks as well. But let's >>>> not go insane. If I have 1000 engineers, 100 desks must be enough for them. >>>> This must reduce costs. >>>> The baseline memory consumption for systems (and amount of RAM!) is >>>> O(NCORES), not O(1). In some read-mostly cases it's possible to achieve >>>> O(1) memory consumption, and that's great. But if it's not the case here, >>>> let it be so. >>>> >>>> >>>> >>>> > shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, >>>> NUMBER_OF_CORES)) >>>> >>>> Threads do not produce contention, it's cores that produce contention. >>>> The formula must be: shard_count = k*NCORES >>>> And if you want less memory in single-threaded case, then: shard_count >>>> = min(k*NCORES, c*NTHREADS) >>>> >>>> >>>> >>>> >We are talking about developers here. Nobody would know the exact >>>> thread counts, but developers know the ballpark number >>>> >>>> I strongly believe that we must relief developers from this choice >>>> during build time, and do our best to auto-tune (if the final scheme >>>> requires tuning). >>>> First, such questions puts unnecessary burden on developers. We don't >>>> ask what register allocation algorithm to use for each function, right? >>>> Second, there are significant chances we will get a wrong answer, >>>> because e.g. developer's view of how threaded is the app can differ from >>>> reality or from our classification. >>>> Third, the app can be build by a build engineer; or the procedure can >>>> be applied to a base with 10'000 apps; or threaded-ness can change; or the >>>> app can work in different modes; or the app can have different phases. >>>> >>>> >>>> >>>> >Another danger involved with dynamically resizing the counter is that >>>> it requires a global or per function lock to access the counters. The cost >>>> of this can be really high. >>>> >>>> It's possible to do it w/o a mutex, fast-path overhead is only a >>>> virtually zero overhead (if implemented properly in the compiler) atomic >>>> consume load: >>>> >>>> const int maxshard = 4096; >>>> uint64* shard[maxshard]; >>>> atomic<int> shardmask; >>>> >>>> void inline inccounter(int idx) >>>> { >>>> int shardidx = gettid() & atomic_load(&shardmask, memory_order_consume); >>>> shard[shardidx][idx]++; >>>> } >>>> >>>> int pthread_create(...) >>>> { >>>> if (updateshardcount()) { >>>> shardlock(); >>>> if (updateshardcount()) { >>>> int newcount = computeshardcount(); >>>> for (int i = oldcount; i < newcount; i++) >>>> shard[i] = malloc(ncounter*sizeof(uint64)); >>>> atomic_store(&shardmask, newcount-1, memory_order_release); >>>> } >>>> shardunlock(); >>>> } >>>> ... >>>> } >>>> >>>> >>>> >>> If we go with this scheme, for tid I would do: >>> >>> __thread threadid; >>> >>> int pthread_create(...) >>> { >>> threadid = goohhash(gettid()); >>> ... >>> } >>> >>> void userfunc() >>> { >>> int tid = threadid+1; >>> threadid = tid; >>> >> >> Kostya noted that this increment is a bad idea. Because each thread will >> access all possible cache lines for each counter. I agree, this is a bad >> idea. >> >> What bothers me is that if we do tid%something, there is non-zero chance >> that significant amount of active threads will be mapped onto the same >> shard. And then get back to where we start -- we have significant >> contention. >> >> Ideally it would be something like: >> >> on_thread_rescheduling() >> { >> tid = getcpuid(); >> } >> >> but common OSes do not provide necessary interfaces. >> > > Note also: if we store tid in a thread-local variable, we'll either have > to load it on every counter increment or keep it in a register, > which by itself will bring additional ~10% slowdown on x86_64. This is why > taking bits from %sp or %fs sounds attractive (it has problems too) > > >> >> >> >> >>> // use tid in profiling >>> ... >>> } >>> >>> This avoids the problem of potentially expensive gettid(), and avoids a >>> possible persistent interference between tids. >>> >>> >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140418/10d67bab/attachment.html>
Chandler Carruth
2014-Apr-18 09:21 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Fri, Apr 18, 2014 at 2:10 AM, Kostya Serebryany <kcc at google.com> wrote:> One more proposal: simple per-thread counters allocated with > mmap(MAP_NORESERVE), the same trick that works so well for asan/tsan/msan. > > Chrome has ~3M basic blocks instrumented for coverage, > so even largest applications will hardly have more than, say, 10M basic > blocks >I think this is a *gross* underestimation. I work with applications more than one order of magnitude larger than Chrome.> (number can be configurable at application start time). This gives us 80Mb > for the array of 64-bit counters. > That's a lot if multiplied by the number of threads, but the MAP_NORESERVE > trick solves the problem -- > each thread will only touch the pages where it actually increment the > counters. > On thread exit the whole 80Mb counter array are will be merged into a > central array of counters and then discarded, > but we can also postpone this until another new thread is created -- then > we just reuse the counter array. > > This brings two challenges. > > #1. The basic blocks should be numbered sequentially. I see only one way > to accomplish this: with the help of linker (and dynamic linker for DSOs). > The compiler would emit code using offsets that will later be transformed > into constants by the linker. > Not sure if any existing linker support this kind of thing. Anyone? > > #2. How to access the per-thread counter array. If we simply store the > pointer to the array in TLS, the instrumentation will be more expensive > just because of need to load and keep this pointer. > If the counter array is part of TLS itself, we'll have to intrude into the > pthread library (or wrap it) so that this part of TLS is mapped with > MAP_NORESERVE. >#3. It essentially *requires* a complex merge on shutdown rather than a simple flush. I'm not even sure how to do the merge without dirtying still more pages of the no-reserve memory. It's not at all clear to me that this scales up (either in memory usage, memory reservation, or shutdown time) to larger applications. Chrome isn't a useful upper bound here. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140418/cb55e142/attachment.html>