Kostya Serebryany
2014-Apr-17 14:13 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Thu, Apr 17, 2014 at 6:10 PM, Yaron Keren <yaron.keren at gmail.com> wrote:

> If accuracy is not critical, incrementing the counters without any guards
> might be good enough.

No. Contention on the counters leads to 5x-10x slowdown. This is never good
enough.

--kcc

> Hot areas will still be hot and cold areas will not be affected.
>
> Yaron
>
> 2014-04-17 15:21 GMT+03:00 Kostya Serebryany <kcc at google.com>:
>
>> Hi,
>>
>> The current design of -fprofile-instr-generate has the same fundamental
>> flaw as the old gcc's gcov instrumentation: it has contention on counters.
>> A trivial synthetic test case was described here:
>> http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-October/066116.html
>>
>> For the problem to appear we need to have a hot function that is
>> simultaneously executed by multiple threads -- then we will have high
>> contention on the racy profile counters.
>>
>> Such a situation is not necessarily very frequent, but when it happens
>> -fprofile-instr-generate becomes barely usable due to the huge slowdown
>> (5x-10x).
>>
>> An example is the multi-threaded vp9 video encoder.
>>
>>   git clone https://chromium.googlesource.com/webm/libvpx
>>   cd libvpx/
>>   F="-no-integrated-as -fprofile-instr-generate"; CC="clang $F" \
>>     CXX="clang++ $F" LD="clang++ $F" ./configure
>>   make -j32
>>   # get sample video from
>>   # https://media.xiph.org/video/derf/y4m/akiyo_cif.y4m
>>   time ./vpxenc -o /dev/null -j 8 akiyo_cif.y4m
>>
>> When running single-threaded, -fprofile-instr-generate adds reasonable
>> ~15% overhead (8.5 vs 10 seconds).
>> When running with 8 threads, it has 7x overhead (3.5 seconds vs 26
>> seconds).
>>
>> I am not saying that this flaw is a showstopper, but with the continued
>> move towards multithreading it will be hurting more and more users of
>> coverage and PGO.
>> AFAICT, most of our PGO users simply cannot run their software in
>> single-threaded mode, and some of them surely have hot functions running
>> in all threads at once.
>>
>> At the very least we should document this problem, but better try fixing
>> it.
>>
>> Some ideas:
>>
>> - per-thread counters. Solves the problem at huge cost in RAM per-thread
>> - 8-bit per-thread counters, dumping into central counters on overflow.
>> - per-cpu counters (not portable, requires very modern kernel with lots
>>   of patches)
>> - sharded counters: each counter represented as N counters sitting in
>>   different cache lines. Every thread accesses the counter with index
>>   TID%N. Solves the problem partially, better with larger values of N,
>>   but then again it costs RAM.
>> - reduce contention on hot counters by not incrementing them if they are
>>   big enough:
>>     if (counter < 65536) counter++;
>>   This reduces the accuracy though. Is that bad for PGO?
>> - self-cooling logarithmic counters:
>>     if ((fast_random() % (1 << counter)) == 0) counter++;
>>
>> Other thoughts?
>>
>> --kcc
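
[Editorial sketch] The "sharded counters" idea from the list above could look roughly like the following. This is a minimal illustration and not the layout used by the LLVM profile runtime. The names ShardedCounter, Shard, kShards, kCacheLine and CountFromThreads are hypothetical, the 64-byte cache line is an assumption, a hash of std::thread::id stands in for TID, and relaxed atomics are used only to keep the sketch well-defined C++ (the instrumentation discussed in the thread uses plain, racy increments).

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// All names here are hypothetical; this is not the LLVM profile runtime.
constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes
constexpr std::size_t kShards = 8;      // N in the "TID % N" scheme

// Each shard is padded to its own cache line to avoid false sharing.
struct alignas(kCacheLine) Shard {
  std::atomic<std::uint64_t> value{0};
};

class ShardedCounter {
 public:
  // Each thread updates only the shard selected by its (hashed) thread id.
  void Increment() {
    shards_[ShardIndex()].value.fetch_add(1, std::memory_order_relaxed);
  }

  // The logical counter value is the sum over all shards.
  std::uint64_t Read() const {
    std::uint64_t sum = 0;
    for (const Shard& s : shards_) {
      sum += s.value.load(std::memory_order_relaxed);
    }
    return sum;
  }

 private:
  // Computed once per thread; stands in for TID % N.
  static std::size_t ShardIndex() {
    thread_local const std::size_t index =
        std::hash<std::thread::id>{}(std::this_thread::get_id()) % kShards;
    return index;
  }

  std::array<Shard, kShards> shards_;
};

// Usage example: several threads hammer one logical counter.
std::uint64_t CountFromThreads(unsigned num_threads, std::uint64_t per_thread) {
  assert(num_threads > 0);
  ShardedCounter counter;
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < num_threads; ++t) {
    workers.emplace_back([&counter, per_thread] {
      for (std::uint64_t i = 0; i < per_thread; ++i) counter.Increment();
    });
  }
  for (std::thread& w : workers) w.join();
  return counter.Read();
}
```

As the original list notes, this only partially solves the problem: two threads that hash to the same shard still contend, and every logical counter now occupies kShards cache lines.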
Jonathan Roelofs
2014-Apr-17 16:37 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
How about per-thread if the counter is hot enough?

Jon

On 4/17/14, 7:13 AM, Kostya Serebryany wrote:
> On Thu, Apr 17, 2014 at 6:10 PM, Yaron Keren <yaron.keren at gmail.com> wrote:
>
>> If accuracy is not critical, incrementing the counters without any guards
>> might be good enough.
>
> No. Contention on the counters leads to 5x-10x slowdown. This is never good
> enough.
>
> --kcc
>
>> Hot areas will still be hot and cold areas will not be affected.
>>
>> Yaron

--
Jon Roelofs
jonathan at codesourcery.com
CodeSourcery / Mentor Embedded
Kostya Serebryany
2014-Apr-17 16:39 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Thu, Apr 17, 2014 at 8:37 PM, Jonathan Roelofs <jonathan at codesourcery.com> wrote:

> How about per-thread if the counter is hot enough?

Err. How do you know if the counter is hot w/o first profiling the app?
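
[Editorial sketch] Two of the ideas from the first message in this thread do not require knowing in advance which counters are hot, because they reduce writes at run time as a counter grows: the saturating counter and the "self-cooling" logarithmic counter. A minimal single-threaded sketch of both follows. The function names are hypothetical, FastRandom is an assumed xorshift32 stand-in for the fast_random() mentioned in the thread, and the cap of 31 on the logarithmic counter is an added guard to keep the shift well-defined. The logarithmic variant resembles what is usually called a Morris approximate counter, so the stored value is roughly log2 of the true count.

```cpp
#include <cassert>
#include <cstdint>

// Assumed stand-in for fast_random(): a xorshift32 step.
// The state must be nonzero, otherwise it stays at zero forever.
inline std::uint32_t FastRandom(std::uint32_t& state) {
  assert(state != 0);
  state ^= state << 13;
  state ^= state >> 17;
  state ^= state << 5;
  return state;
}

// Saturating counter: stop writing once the limit from the thread is reached.
// After saturation the counter is only read, never written.
inline void SaturatingIncrement(std::uint32_t& counter) {
  if (counter < 65536) ++counter;
}

// Logarithmic counter: increment with probability 2^-counter, so writes
// become exponentially rarer as the counter grows.
inline void LogIncrement(std::uint8_t& counter, std::uint32_t& rng_state) {
  if (counter < 31 && (FastRandom(rng_state) % (1u << counter)) == 0) {
    ++counter;
  }
}

// Rough estimate of the true event count from a logarithmic counter.
inline std::uint64_t LogEstimate(std::uint8_t counter) {
  return (std::uint64_t{1} << counter) - 1;
}
```

Both schemes trade accuracy for fewer writes. Whether that loss matters for PGO is the open question raised in the original message, and this sketch does not answer it.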