Chandler Carruth
2014-Apr-18 09:21 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Fri, Apr 18, 2014 at 2:10 AM, Kostya Serebryany <kcc at google.com> wrote:

> One more proposal: simple per-thread counters allocated with
> mmap(MAP_NORESERVE), the same trick that works so well for asan/tsan/msan.
>
> Chrome has ~3M basic blocks instrumented for coverage,
> so even the largest applications will hardly have more than, say, 10M basic
> blocks

I think this is a *gross* underestimation. I work with applications more
than one order of magnitude larger than Chrome.

> (the number can be made configurable at application start time). This gives
> us 80Mb for the array of 64-bit counters.
> That's a lot if multiplied by the number of threads, but the MAP_NORESERVE
> trick solves the problem --
> each thread will only touch the pages where it actually increments the
> counters.
> On thread exit the whole 80Mb counter array will be merged into a
> central array of counters and then discarded,
> but we can also postpone this until another new thread is created -- then
> we just reuse the counter array.
>
> This brings two challenges.
>
> #1. The basic blocks should be numbered sequentially. I see only one way
> to accomplish this: with the help of the linker (and the dynamic linker for
> DSOs). The compiler would emit code using offsets that will later be
> transformed into constants by the linker.
> Not sure if any existing linker supports this kind of thing. Anyone?
>
> #2. How to access the per-thread counter array. If we simply store the
> pointer to the array in TLS, the instrumentation will be more expensive
> just because of the need to load and keep this pointer.
> If the counter array is part of TLS itself, we'll have to intrude into the
> pthread library (or wrap it) so that this part of TLS is mapped with
> MAP_NORESERVE.

#3. It essentially *requires* a complex merge on shutdown rather than a
simple flush. I'm not even sure how to do the merge without dirtying still
more pages of the no-reserve memory.

It's not at all clear to me that this scales up (either in memory usage,
memory reservation, or shutdown time) to larger applications. Chrome isn't
a useful upper bound here.
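
For concreteness, a minimal sketch of the per-thread MAP_NORESERVE counter
scheme under discussion might look like the following. The names, the
10M-counter cap, and the merge helper are assumptions for illustration only,
not actual compiler-rt code.

    // Sketch only: per-thread counter array backed by MAP_NORESERVE memory.
    // Pages the thread never writes are never committed, so a mostly idle
    // thread pays only for the pages holding counters it actually increments.
    #include <sys/mman.h>
    #include <cstddef>
    #include <cstdint>

    static const size_t kMaxCounters = 10 * 1000 * 1000;               // assumed cap
    static const size_t kArrayBytes  = kMaxCounters * sizeof(uint64_t); // ~80Mb

    static __thread uint64_t *ThreadCounters = nullptr;

    static bool AllocateThreadCounters() {
      void *P = mmap(nullptr, kArrayBytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
      if (P == MAP_FAILED)
        return false;
      ThreadCounters = static_cast<uint64_t *>(P);
      return true;
    }

    // The per-block instrumentation would then be a single non-atomic add:
    //   ThreadCounters[BlockId]++;

    // On thread exit: merge into the central array, then discard the mapping
    // (or keep it around for the next thread, as suggested above).
    static void MergeAndRelease(uint64_t *Central) {
      for (size_t I = 0; I < kMaxCounters; ++I)
        if (ThreadCounters[I])
          __atomic_fetch_add(&Central[I], ThreadCounters[I], __ATOMIC_RELAXED);
      munmap(ThreadCounters, kArrayBytes);
      ThreadCounters = nullptr;
    }

Note that this naive merge reads every page of the array, which is exactly the
kind of shutdown cost #3 above worries about.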
Kostya Serebryany
2014-Apr-18 09:29 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Fri, Apr 18, 2014 at 1:21 PM, Chandler Carruth <chandlerc at google.com> wrote:

> On Fri, Apr 18, 2014 at 2:10 AM, Kostya Serebryany <kcc at google.com> wrote:
>
>> One more proposal: simple per-thread counters allocated with
>> mmap(MAP_NORESERVE), the same trick that works so well for asan/tsan/msan.
>>
>> Chrome has ~3M basic blocks instrumented for coverage,
>> so even the largest applications will hardly have more than, say, 10M basic
>> blocks
>
> I think this is a *gross* underestimation. I work with applications more
> than one order of magnitude larger than Chrome.

Agree, Chrome is comparatively small. But the picture does not change even if
we have 100M basic blocks. The hypothesis (which needs to be checked) is that
every thread will touch only a small portion of the BBs => a small portion of
the pages in the counter array.

>> (the number can be made configurable at application start time). This gives
>> us 80Mb for the array of 64-bit counters.
>> That's a lot if multiplied by the number of threads, but the MAP_NORESERVE
>> trick solves the problem --
>> each thread will only touch the pages where it actually increments the
>> counters.
>> On thread exit the whole 80Mb counter array will be merged into a
>> central array of counters and then discarded,
>> but we can also postpone this until another new thread is created -- then
>> we just reuse the counter array.
>>
>> This brings two challenges.
>>
>> #1. The basic blocks should be numbered sequentially. I see only one way
>> to accomplish this: with the help of the linker (and the dynamic linker for
>> DSOs). The compiler would emit code using offsets that will later be
>> transformed into constants by the linker.
>> Not sure if any existing linker supports this kind of thing. Anyone?
>>
>> #2. How to access the per-thread counter array. If we simply store the
>> pointer to the array in TLS, the instrumentation will be more expensive
>> just because of the need to load and keep this pointer.
>> If the counter array is part of TLS itself, we'll have to intrude into
>> the pthread library (or wrap it) so that this part of TLS is mapped with
>> MAP_NORESERVE.
>
> #3. It essentially *requires* a complex merge on shutdown rather than a
> simple flush.

yep

> I'm not even sure how to do the merge without dirtying still more pages of
> the no-reserve memory.

and yep again. I don't know a way to check if an mmapped page is unused.

--kcc

> It's not at all clear to me that this scales up (either in memory usage,
> memory reservation, or shutdown time) to larger applications. Chrome isn't
> a useful upper bound here.
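
On the question of checking whether an mmapped page is unused: on Linux,
mincore(2) reports per-page residency, and a page of an anonymous
MAP_NORESERVE mapping that was never touched should not be resident. A hedged
sketch follows (hypothetical helper, not existing compiler-rt code), with the
caveat that a dirtied page which has since been swapped out would also report
as non-resident, so counts could be lost if the counter pages are ever swapped.

    // Sketch: skip pages of a per-thread counter array that were never
    // faulted in, using mincore(2). Linux-specific; see the swapping caveat
    // in the text above.
    #include <sys/mman.h>
    #include <unistd.h>
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    static void MergeResidentPagesOnly(uint64_t *ThreadCounters,
                                       uint64_t *Central, size_t NumCounters) {
      const size_t PageSize = sysconf(_SC_PAGESIZE);
      const size_t Bytes = NumCounters * sizeof(uint64_t);
      const size_t NumPages = (Bytes + PageSize - 1) / PageSize;
      std::vector<unsigned char> Resident(NumPages);
      if (mincore(ThreadCounters, Bytes, Resident.data()) != 0) {
        // Fall back to a full scan on error.
        for (size_t I = 0; I < NumCounters; ++I)
          Central[I] += ThreadCounters[I];
        return;
      }
      const size_t CountersPerPage = PageSize / sizeof(uint64_t);
      for (size_t Page = 0; Page < NumPages; ++Page) {
        if (!(Resident[Page] & 1))
          continue;  // never faulted in => all counters on this page are zero
        size_t Begin = Page * CountersPerPage;
        size_t End = std::min(Begin + CountersPerPage, NumCounters);
        for (size_t I = Begin; I < End; ++I)
          Central[I] += ThreadCounters[I];
      }
    }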
Dmitry Vyukov
2014-Apr-18 09:30 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Fri, Apr 18, 2014 at 1:21 PM, Chandler Carruth <chandlerc at google.com> wrote:

> On Fri, Apr 18, 2014 at 2:10 AM, Kostya Serebryany <kcc at google.com> wrote:
>
>> One more proposal: simple per-thread counters allocated with
>> mmap(MAP_NORESERVE), the same trick that works so well for asan/tsan/msan.
>>
>> Chrome has ~3M basic blocks instrumented for coverage,
>> so even the largest applications will hardly have more than, say, 10M basic
>> blocks
>
> I think this is a *gross* underestimation. I work with applications more
> than one order of magnitude larger than Chrome.
>
>> (the number can be made configurable at application start time). This gives
>> us 80Mb for the array of 64-bit counters.
>> That's a lot if multiplied by the number of threads, but the MAP_NORESERVE
>> trick solves the problem --
>> each thread will only touch the pages where it actually increments the
>> counters.
>> On thread exit the whole 80Mb counter array will be merged into a
>> central array of counters and then discarded,
>> but we can also postpone this until another new thread is created -- then
>> we just reuse the counter array.
>>
>> This brings two challenges.
>>
>> #1. The basic blocks should be numbered sequentially. I see only one way
>> to accomplish this: with the help of the linker (and the dynamic linker for
>> DSOs). The compiler would emit code using offsets that will later be
>> transformed into constants by the linker.
>> Not sure if any existing linker supports this kind of thing. Anyone?
>>
>> #2. How to access the per-thread counter array. If we simply store the
>> pointer to the array in TLS, the instrumentation will be more expensive
>> just because of the need to load and keep this pointer.
>> If the counter array is part of TLS itself, we'll have to intrude into
>> the pthread library (or wrap it) so that this part of TLS is mapped with
>> MAP_NORESERVE.
>
> #3. It essentially *requires* a complex merge on shutdown rather than a
> simple flush. I'm not even sure how to do the merge without dirtying still
> more pages of the no-reserve memory.
>
> It's not at all clear to me that this scales up (either in memory usage,
> memory reservation, or shutdown time) to larger applications. Chrome isn't
> a useful upper bound here.

Array processing is fast. Contention is slow. I would expect this to be a net
win. As for the additional memory consumption during the final merge, we can
process one per-thread array, unmap it, process the second array, unmap it,
and so on. This does not require bringing all the pages into memory at once.
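
The merge-then-unmap loop described here might look roughly as follows; the
bookkeeping structure and names are hypothetical, and the only point of the
sketch is that no more than one per-thread array needs to be live beyond the
central array at any moment.

    // Sketch of the shutdown merge: fold each per-thread array into the
    // central array, then unmap it before touching the next one, so peak
    // memory stays at roughly one per-thread array above the central array.
    #include <sys/mman.h>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct ThreadCounterArray {   // hypothetical bookkeeping record
      uint64_t *Counters;
      size_t NumCounters;
    };

    static void MergeAllOnShutdown(std::vector<ThreadCounterArray> &PerThread,
                                   uint64_t *Central) {
      for (ThreadCounterArray &T : PerThread) {
        for (size_t I = 0; I < T.NumCounters; ++I)
          Central[I] += T.Counters[I];
        // Release this thread's pages before moving on to the next array.
        munmap(T.Counters, T.NumCounters * sizeof(uint64_t));
        T.Counters = nullptr;
      }
    }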
Chandler Carruth
2014-Apr-18 09:41 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Fri, Apr 18, 2014 at 2:30 AM, Dmitry Vyukov <dvyukov at google.com> wrote:

>> It's not at all clear to me that this scales up (either in memory usage,
>> memory reservation, or shutdown time) to larger applications. Chrome isn't
>> a useful upper bound here.
>
> Array processing is fast. Contention is slow. I would expect this to be a
> net win.
> As for the additional memory consumption during the final merge, we can
> process one per-thread array, unmap it, process the second array, unmap it,
> and so on. This does not require bringing all the pages into memory at once.

Array processing is fast, but paging in a large percentage of the pages in
your address space is not at all fast. This will murder the kernel's page
tables and, I suspect, do other very slow things as well.