Bob Wilson
2014-Apr-17 18:22 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Apr 17, 2014, at 11:09 AM, Xinliang David Li <xinliangli at gmail.com> wrote:
> On Thu, Apr 17, 2014 at 10:58 AM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
>> On 2014-Apr-17, at 10:38, Xinliang David Li <xinliangli at gmail.com> wrote:
>>> Another idea is to use stack-local counters per function, synced up with the global counters on entry and exit. The problem with it is that for deeply recursive calls, stack pressure can be too high.
>>
>> I think they'd need to be synced with global counters before function calls as well, since any function call can call "exit()".
>
> Right -- but it might be better to handle this in other ways. For instance, a stack of counters is maintained, one set per frame, and at exit they are flushed in a batch. Or simply ignore it in the case of program exit.

It seems to me like we’re going to have a hard time getting good multithreaded performance without significant impact on the single-threaded behavior. We might need to add an option to choose between those. There’s a lot of room for improvement in the performance with the current instrumentation, so maybe we can find a way to make things incrementally better in a way that helps both, but avoiding the multithreaded cache conflicts seems like it’s going to be expensive in other ways.
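For readers following along, the stack-local counter idea quoted in this message might look roughly like the C++ sketch below. This is only an illustration under stated assumptions, not what clang emits for -fprofile-instr-generate: the names `g_counters`, `kNumCounters`, `LocalCounters`, and `sum_to` are all made up, and instead of copying the globals in on entry it starts each frame at zero and adds the deltas to the globals on exit, so that concurrent threads do not overwrite each other's totals.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the counters emitted for one function.
// The name and the size are assumptions made for this sketch.
constexpr std::size_t kNumCounters = 2;
std::atomic<std::uint64_t> g_counters[kNumCounters];

// Per-frame shadow counters. Increments touch only this frame's copy;
// the destructor merges the deltas into the shared counters on every
// normal return path (including exceptions that unwind this frame).
class LocalCounters {
public:
  void bump(std::size_t i) {
    assert(i < kNumCounters);
    ++local_[i];
  }

  // Add the pending deltas to the shared counters and reset them.
  // Could also be called by hand before a call that might not return.
  void flush() {
    for (std::size_t i = 0; i < kNumCounters; ++i) {
      if (local_[i] != 0) {
        g_counters[i].fetch_add(local_[i], std::memory_order_relaxed);
        local_[i] = 0;
      }
    }
  }

  ~LocalCounters() { flush(); }

private:
  std::uint64_t local_[kNumCounters] = {};
};

// Example "instrumented" function: counter 0 counts entries,
// counter 1 counts loop iterations.
std::uint64_t sum_to(std::uint64_t n) {
  LocalCounters counters;
  counters.bump(0);
  std::uint64_t total = 0;
  for (std::uint64_t i = 1; i <= n; ++i) {
    counters.bump(1);
    total += i;
  }
  return total;  // destructor flushes the deltas here
}
```

Calling `sum_to(10)` touches the shared counters only once per counter at function exit, rather than on every loop iteration. The sketch also shows both concerns raised in the quoted text: each live frame carries its own counter array, which is the stack-pressure cost for deep recursion, and because `std::exit()` does not run destructors of automatic objects, counts still held in live frames would be lost unless `flush()` is called before calls that may exit.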
Xinliang David Li
2014-Apr-17 18:33 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
Yes -- make it an option. It would be unwise to penalize single-threaded apps unconditionally.

David

On Thu, Apr 17, 2014 at 11:22 AM, Bob Wilson <bob.wilson at apple.com> wrote:
> It seems to me like we’re going to have a hard time getting good multithreaded performance without significant impact on the single-threaded behavior. We might need to add an option to choose between those. There’s a lot of room for improvement in the performance with the current instrumentation, so maybe we can find a way to make things incrementally better in a way that helps both, but avoiding the multithreaded cache conflicts seems like it’s going to be expensive in other ways.
Chandler Carruth
2014-Apr-17 18:41 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Thu, Apr 17, 2014 at 11:22 AM, Bob Wilson <bob.wilson at apple.com> wrote:
> It seems to me like we’re going to have a hard time getting good multithreaded performance without significant impact on the single-threaded behavior. We might need to add an option to choose between those. There’s a lot of room for improvement in the performance with the current instrumentation, so maybe we can find a way to make things incrementally better in a way that helps both, but avoiding the multithreaded cache conflicts seems like it’s going to be expensive in other ways.

I don't really agree.

First, multithreaded applications are going to be the majority soon, even if they aren't already. We should design for them and support them well by default. If, once we have that, we find single-threaded performance dramatically suffers, then maybe we should add a flag. But it doesn't make sense to do this before we even have data.
Bob Wilson
2014-Apr-17 18:47 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Apr 17, 2014, at 11:41 AM, Chandler Carruth <chandlerc at google.com> wrote:
> I don't really agree.
>
> First, multithreaded applications are going to be the majority soon, even if they aren't already. We should design for them and support them well by default. If, once we have that, we find single-threaded performance dramatically suffers, then maybe we should add a flag. But it doesn't make sense to do this before we even have data.

If someone wants to revise the instrumentation in a way that works better for multithreaded code, that’s great.

Before the change is committed, we should have performance data comparing it to the current code. If there is no regression, then fine. If it significantly hurts single-threaded performance, then we will need a flag.