thr3ads.net - llvm dev - [LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters) [Apr 2014]

Kostya Serebryany

2014-Apr-17 14:13 UTC

[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

On Thu, Apr 17, 2014 at 6:10 PM, Yaron Keren <yaron.keren at gmail.com>
wrote:
> If accuracy is not critical, incrementing the counters without any guards
> might be good enough.
>
No.  Contention on the counters leads to 5x-10x slowdown. This is never
good enough.

--kcc

> Hot areas will still be hot and cold areas will not be affected.
>
> Yaron
>
>
>
> 2014-04-17 15:21 GMT+03:00 Kostya Serebryany <kcc at google.com>:
>
>> Hi,
>>
>> The current design of -fprofile-instr-generate has the same fundamental
>> flaw as the old gcc's gcov instrumentation: it has contention on counters.
>> A trivial synthetic test case was described here:
>> http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-October/066116.html
>>
>> For the problem to appear we need to have a hot function that is
>> simultaneously executed by multiple threads -- then we will have high
>> contention on the racy profile counters.
>>
>> Such a situation is not necessarily very frequent, but when it happens
>> -fprofile-instr-generate becomes barely usable due to a huge slowdown
>> (5x-10x).
>>
>> An example is the multi-threaded vp9 video encoder.
>>
>> git clone https://chromium.googlesource.com/webm/libvpx
>> cd libvpx/
>> F="-no-integrated-as -fprofile-instr-generate"
>> CC="clang $F" CXX="clang++ $F" LD="clang++ $F" ./configure
>> make -j32
>> # get sample video from
>> # https://media.xiph.org/video/derf/y4m/akiyo_cif.y4m
>> time ./vpxenc -o /dev/null -j 8 akiyo_cif.y4m
>>
>> When running single-threaded, -fprofile-instr-generate adds a reasonable
>> ~15% overhead (8.5 vs 10 seconds).
>> When running with 8 threads, it has 7x overhead (3.5 seconds vs 26
>> seconds).
>>
>> I am not saying that this flaw is a showstopper, but with the continued
>> move towards multithreading it will be hurting more and more users of
>> coverage and PGO.
>> AFAICT, most of our PGO users simply can not run their software in
>> single-threaded mode, and some of them surely have hot functions running
>> in all threads at once.
>>
>> At the very least we should document this problem, but better try fixing
>> it.
>>
>> Some ideas:
>>
>> - per-thread counters. Solves the problem at huge cost in RAM per thread.
>> - 8-bit per-thread counters, dumping into central counters on overflow.
>> - per-cpu counters (not portable, requires very modern kernel with lots
>>   of patches).
>> - sharded counters: each counter represented as N counters sitting in
>>   different cache lines. Every thread accesses the counter with index
>>   TID%N. Solves the problem partially, better with larger values of N,
>>   but then again it costs RAM.
>> - reduce contention on hot counters by not incrementing them if they are
>>   big enough:
>>     {if (counter < 65536) counter++};
>>   This reduces the accuracy though. Is that bad for PGO?
>> - self-cooling logarithmic counters:
>>     if ((fast_random() % (1 << counter)) == 0) counter++;
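
A minimal sketch of the sharded-counter idea from the list above, not taken from the thread or from the LLVM profile runtime. All names (ShardedCounter, Shard, run_workers), the shard count of 8 and the 64-byte cache line are assumptions for illustration. It also uses relaxed atomic increments so the result is deterministic, whereas the counters discussed in the thread are plain racy increments.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical names and constants; this is not the LLVM profile runtime.
constexpr std::size_t kShards = 8;      // "N" in the post; assumed value
constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes

// Each shard is aligned to its own cache line, so threads that map to
// different shards do not write to the same line.
struct alignas(kCacheLine) Shard {
  std::atomic<std::uint64_t> value{0};
};

class ShardedCounter {
 public:
  // Bump the shard selected by the caller's thread index (TID % N).
  void increment(std::size_t tid) {
    shards_[tid % kShards].value.fetch_add(1, std::memory_order_relaxed);
  }

  // Sum all shards; only needed when the profile is written out.
  std::uint64_t total() const {
    std::uint64_t sum = 0;
    for (const Shard& s : shards_) {
      sum += s.value.load(std::memory_order_relaxed);
    }
    return sum;
  }

 private:
  std::array<Shard, kShards> shards_{};
};

// Start `threads` workers that each bump the counter `iters` times,
// wait for them, and return the summed count.
std::uint64_t run_workers(ShardedCounter& c, std::size_t threads,
                          std::uint64_t iters) {
  assert(threads > 0 && "need at least one worker");
  std::vector<std::thread> pool;
  for (std::size_t t = 0; t < threads; ++t) {
    pool.emplace_back([&c, t, iters] {
      for (std::uint64_t i = 0; i < iters; ++i) c.increment(t);
    });
  }
  for (std::thread& th : pool) th.join();
  return c.total();
}
```

With more threads than shards, several threads share a shard again, which is the "solves the problem partially" caveat: contention drops roughly with N while memory per counter grows by a factor of N times the cache-line size.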
>>
>> Other thoughts?
>>
>> --kcc
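
The self-cooling logarithmic counter from the list of ideas can be sketched roughly as below. This is an illustration under stated assumptions, not code from the thread: FastRandom is a hypothetical xorshift64 stand-in for the post's fast_random(), LogCounter and estimate() are made-up names, and the cap at 63 is added here because shifting a 64-bit value by 64 or more is undefined in C++.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for fast_random(): a xorshift64 generator.
struct FastRandom {
  std::uint64_t state;
  // A zero state would stay zero forever, so fall back to 1.
  explicit FastRandom(std::uint64_t seed) : state(seed ? seed : 1) {}
  std::uint64_t next() {
    state ^= state << 13;
    state ^= state >> 7;
    state ^= state << 17;
    return state;
  }
};

// Stores roughly log2 of the number of events instead of the exact count.
// The hotter the counter gets, the less often it is written.
struct LogCounter {
  std::uint8_t exponent = 0;

  void increment(FastRandom& rng) {
    if (exponent >= 63) return;  // assumed cap; keeps the shift well defined
    // Write with probability 1 / 2^exponent, as in the post's one-liner.
    if ((rng.next() % (std::uint64_t{1} << exponent)) == 0) ++exponent;
  }

  // Approximate event count implied by the stored exponent.
  std::uint64_t estimate() const {
    return (std::uint64_t{1} << exponent) - 1;
  }
};
```

The first event always writes (anything modulo 1 is 0); after that the write probability halves with every step, so a hot counter stops generating cache-line traffic at the cost of only knowing the count to within a power of two or so.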
>>
>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>>

Jonathan Roelofs

2014-Apr-17 16:37 UTC

How about per-thread if the counter is hot enough?

Jon

-- 
Jon Roelofs
jonathan at codesourcery.com
CodeSourcery / Mentor Embedded

Kostya Serebryany

2014-Apr-17 16:39 UTC

On Thu, Apr 17, 2014 at 8:37 PM, Jonathan Roelofs
<jonathan at codesourcery.com> wrote:
> How about per-thread if the counter is hot enough?
>
Err. How do you know if the counter is hot w/o first profiling the app?