> Since the scope is “opaque” and target specific, can you elaborate what
> kind of generic optimization can be performed?

Some optimizations that relate to a single thread can be done without
needing to know the actual memory scope. For example, an atomic acquire
prevents memory operations that follow it from being reordered before it,
but allows memory operations that precede it (other than another atomic
acquire) to be reordered after it, regardless of the memory scope.

Thanks,
-Tony
Hi,

[Sorry for chiming in so late.]

I understand why a straightforward metadata scheme won't work here, but
have you considered an alternate scheme that works in the following way:

 - We add an MD node called !nosynch that lists a set of "domains" a
   certain memory operation does *not* synchronize with.

 - Memory operations with !nosynch synchronize with memory operations
   without any !nosynch metadata (so dropping !nosynch is safe).

This will only work if your frontend knows, ahead of time, what the
possible set of synch-domains is, but it presumably already knows that
(otherwise how do you map domain names to integers)?

The other disadvantage of the scheme above is that memory operations on
the "normal CPU heap" (pardon my GPU n00b-ness here :) ) will synch with
the memory operations with !nosynch metadata. However, we can solve that
by modeling the "normal CPU heap" as "!nosynch !{!special_domain_a,
!special_domain_b, ... all domains except !cpu_heap_domain}".

Thanks,
-- Sanjoy
Hi,

Sanjoy Das wrote:
> I understand why a straightforward metadata scheme won't work here,
> but have you considered an alternate scheme that works in the
> following way:
>
> - We add an MD node called !nosynch that lists a set of "domains" a
>   certain memory operation does *not* synchronize with.
>
> - Memory operations with !nosynch synchronize with memory operations
>   without any !nosynch metadata (so dropping !nosynch is safe).

I missed a spot here ^: !nosynch metadata will also have to have a
sub-node for the kind of synch-domain *it* is in. The synchs-with
relation is then:

  bool SynchsWith(MemOp A, MemOp B) {
    MD_A = A.getMD(MD_nosynch);
    MD_B = B.getMD(MD_nosynch);
    // An op without !nosynch synchronizes with everything.
    if (!MD_A || !MD_B)
      return true;
    // They synchronize unless either op's !nosynch list names the
    // other op's domain.
    return !(MD_B.nosync_list.contains(MD_A.id) ||
             MD_A.nosync_list.contains(MD_B.id));
  }

I'm still not 100% convinced that the above works, but I think there are
advantages to expressing synch scopes as metadata. For instance, the
optimizer already "knows" what to do with the metadata on loads it
speculates.

-- Sanjoy
> On Aug 30, 2016, at 5:53 PM, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:
>
> Hi,
>
> [Sorry for chiming in so late.]
>
> I understand why a straightforward metadata scheme won't work here,
> but have you considered an alternate scheme that works in the
> following way:
>
> - We add an MD node called !nosynch that lists a set of "domains" a
>   certain memory operation does *not* synchronize with.
>
> - Memory operations with !nosynch synchronize with memory operations
>   without any !nosynch metadata (so dropping !nosynch is safe).

I'm not sure, but isn't the synchscope id (or "domains", as you seem to
call them) intended to change which instruction is actually codegen'd?
In that case I'm not sure dropping it is ever a good idea: even when it
does not affect correctness, it could dramatically affect performance.

— Mehdi

> This will only work if your frontend knows, ahead of time, what the
> possible set of synch-domains are, but it presumably already knows
> that (otherwise how do you map domain names to integers)?
>
> The other disadvantage with the scheme above is that memory operations
> on the "normal CPU heap" (pardon my GPU n00b-ness here :) ) will synch
> with the memory operations with !nosynch metadata. However, we can
> solve that by modeling the "normal CPU heap" as "!nosynch
> !{!special_domain_a, !special_domain_b, ... all domains except
> !cpu_heap_domain}".
>
> Thanks,
> -- Sanjoy
> Some optimizations that are related to a single thread could be done
> without needing to know the actual memory scope.

Right, it's clear to me that there exist optimizations that you cannot
do if we model these ops as target-specific intrinsics.

But what I think Mehdi and I were trying to get at is: How much of a
problem is this in practice? Are there real-world programs that suffer
because we miss these optimizations? If so, how much?

The reason I'm asking this question is, there's a real cost to adding
complexity to LLVM. Everyone in the project is going to pay that cost,
forever (or at least, until we remove the feature :). So I want to try
to evaluate whether paying that cost is actually worthwhile, as compared
to the simple alternative (i.e., intrinsics). Given the tepid response
to this proposal, I'm sort of thinking that now may not be the time to
start paying this cost. (We can always revisit this in the future.) But
I remain open to being convinced.

As a point of comparison, we have a rule of thumb that we'll add an
optimization that increases compilation time by x% if we have a
benchmark that is sped up by at least x%. Similarly here, I'd want to
weigh the added complexity against the improvements to user code.

-Justin

On Tue, Aug 23, 2016 at 2:28 PM, Tye, Tony via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>> Since the scope is “opaque” and target specific, can you elaborate what
>> kind of generic optimization can be performed?
>
> Some optimizations that are related to a single thread could be done without
> needing to know the actual memory scope. For example, an atomic acquire can
> restrict reordering memory operations after it, but allow reordering of
> memory operations (except another atomic acquire) before it, regardless of
> the memory scope.
>
> Thanks,
> -Tony
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
On Wed, Aug 31, 2016 at 12:23:34PM -0700, Justin Lebar via llvm-dev wrote:
> > Some optimizations that are related to a single thread could be done
> > without needing to know the actual memory scope.
>
> Right, it's clear to me that there exist optimizations that you cannot
> do if we model these ops as target-specific intrinsics.
>
> But what I think Mehdi and I were trying to get at is: How much of a
> problem is this in practice? Are there real-world programs that
> suffer because we miss these optimizations? If so, how much?
>
> The reason I'm asking this question is, there's a real cost to adding
> complexity in LLVM. Everyone in the project is going to pay that
> cost, forever (or at least, until we remove the feature :). So I want
> to try to evaluate whether paying that cost is actually worth while,
> as compared to the simple alternative (i.e., intrinsics). Given the
> tepid response to this proposal, I'm sort of thinking that now may not
> be the time to start paying this cost. (We can always revisit this in
> the future.) But I remain open to being convinced.

I think the cost of adding this information to the IR is really low.
There is already a synchronization scope field present on LLVM atomic
instructions, and it is already encoded as 32 bits, so it is possible to
represent the additional scopes using the existing bitcode format.
Optimization passes are already aware of this synchronization scope
field, so they know how to preserve it when transforming the IR.

The primary goal here is to pass synchronization scope information from
the frontend to the backend. We already have a mechanism for doing this,
so why not use it? That seems like the lowest-cost option to me.

-Tom

> As a point of comparison, we have a rule of thumb that we'll add an
> optimization that increases compilation time by x% if we have a
> benchmark that is sped up by at least x%. Similarly here, I'd want to
> weigh the added complexity against the improvements to user code.