On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote:> So smp_mb() provides transitivity, as do pairs of smp_store_release() > and smp_read_acquire(),But they provide different grades of transitivity, which is where all the confusion lays. smp_mb() is strongly/globally transitive, all CPUs will agree on the order. Whereas the RCpc release+acquire is weakly so, only the two cpus involved in the handover will agree on the order.
On Fri, Jan 15, 2016 at 09:55:54AM +0100, Peter Zijlstra wrote:> On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote: > > So smp_mb() provides transitivity, as do pairs of smp_store_release() > > and smp_read_acquire(), > > But they provide different grades of transitivity, which is where all > the confusion lays. > > smp_mb() is strongly/globally transitive, all CPUs will agree on the order. > > Whereas the RCpc release+acquire is weakly so, only the two cpus > involved in the handover will agree on the order.And the stuff we're confused about is how best to express the difference and guarantees of these two forms of transitivity and how exactly they interact. And smp_load_acquire()/smp_store_release() are RCpc because TSO archs and PPC. the atomic*_{acquire,release}() are RCpc because PPC and LOCK,UNLOCK are similarly RCpc because of PPC. Now we'd like PPC to stick a SYNC in either LOCK or UNLOCK so at least the locks are RCsc again, but they resist for performance reasons but waver because they don't want to be the ones finding all the nasty bugs because they're the only one. Now the thing I worry about, and still have not had an answer to is if weakly ordered MIPS will end up being RCsc or RCpc for their locks if they get implemented with SYNC_ACQUIRE and SYNC_RELEASE instead of the current SYNC.
On Fri, Jan 15, 2016 at 09:55:54AM +0100, Peter Zijlstra wrote:> On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote: > > So smp_mb() provides transitivity, as do pairs of smp_store_release() > > and smp_read_acquire(), > > But they provide different grades of transitivity, which is where all > the confusion lays. > > smp_mb() is strongly/globally transitive, all CPUs will agree on the order. > > Whereas the RCpc release+acquire is weakly so, only the two cpus > involved in the handover will agree on the order.Good point! Using grace periods in place of smp_mb() also provides strong/global transitivity, but also insanely high latencies. ;-) The patch below updates Documentation/memory-barriers.txt to define local vs. global transitivity. The corresponding ppcmem litmus test is included below as well. Should we start putting litmus tests for the various examples somewhere, perhaps in a litmus-tests directory within each participating architecture? I have a pile of powerpc-related litmus tests on my laptop, but they probably aren't doing all that much good there. Thanx, Paul ------------------------------------------------------------------------ PPC local-transitive "" { 0:r1=1; 0:r2=u; 0:r3=v; 0:r4=x; 0:r5=y; 0:r6=z; 1:r1=1; 1:r2=u; 1:r3=v; 1:r4=x; 1:r5=y; 1:r6=z; 2:r1=1; 2:r2=u; 2:r3=v; 2:r4=x; 2:r5=y; 2:r6=z; 3:r1=1; 3:r2=u; 3:r3=v; 3:r4=x; 3:r5=y; 3:r6=z; } P0 | P1 | P2 | P3 ; lwz r9,0(r4) | lwz r9,0(r5) | lwz r9,0(r6) | stw r1,0(r3) ; lwsync | lwsync | lwsync | sync ; stw r1,0(r2) | lwz r8,0(r3) | stw r1,0(r7) | lwz r9,0(r2) ; lwsync | lwz r7,0(r2) | | ; stw r1,0(r5) | lwsync | | ; | stw r1,0(r6) | | ; exists (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r8=0 /\ 3:r9=0) *) (* (0:r9=1 /\ 1:r9=1 /\ 2:r9=1) *) (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0) *) (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0) ------------------------------------------------------------------------ commit 2cb4e83a1b5c89c8e39b8a64bd89269d05913e41 Author: Paul E. McKenney <paulmck at linux.vnet.ibm.com> Date: Fri Jan 15 09:30:42 2016 -0800 documentation: Distinguish between local and global transitivity The introduction of smp_load_acquire() and smp_store_release() had the side effect of introducing a weaker notion of transitivity: The transitivity of full smp_mb() barriers is global, but that of smp_store_release()/smp_load_acquire() chains is local. This commit therefore introduces the notion of local transitivity and gives an example. Reported-by: Peter Zijlstra <peterz at infradead.org> Reported-by: Will Deacon <will.deacon at arm.com> Signed-off-by: Paul E. McKenney <paulmck at linux.vnet.ibm.com> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index c66ba46d8079..d8109ed99342 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -1318,8 +1318,82 @@ or a level of cache, CPU 2 might have early access to CPU 1's writes. General barriers are therefore required to ensure that all CPUs agree on the combined order of CPU 1's and CPU 2's accesses. -To reiterate, if your code requires transitivity, use general barriers -throughout. +General barriers provide "global transitivity", so that all CPUs will +agree on the order of operations. In contrast, a chain of release-acquire +pairs provides only "local transitivity", so that only those CPUs on +the chain are guaranteed to agree on the combined order of the accesses. +For example, switching to C code in deference to Herman Hollerith: + + int u, v, x, y, z; + + void cpu0(void) + { + r0 = smp_load_acquire(&x); + WRITE_ONCE(u, 1); + smp_store_release(&y, 1); + } + + void cpu1(void) + { + r1 = smp_load_acquire(&y); + r4 = READ_ONCE(v); + r5 = READ_ONCE(u); + smp_store_release(&z, 1); + } + + void cpu2(void) + { + r2 = smp_load_acquire(&z); + smp_store_release(&x, 1); + } + + void cpu3(void) + { + WRITE_ONCE(v, 1); + smp_mb(); + r3 = READ_ONCE(u); + } + +Because cpu0(), cpu1(), and cpu2() participate in a local transitive +chain of smp_store_release()/smp_load_acquire() pairs, the following +outcome is prohibited: + + r0 == 1 && r1 == 1 && r2 == 1 + +Furthermore, because of the release-acquire relationship between cpu0() +and cpu1(), cpu1() must see cpu0()'s writes, so that the following +outcome is prohibited: + + r1 == 1 && r5 == 0 + +However, the transitivity of release-acquire is local to the participating +CPUs and does not apply to cpu3(). Therefore, the following outcome +is possible: + + r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 + +Although cpu0(), cpu1(), and cpu2() will see their respective reads and +writes in order, CPUs not involved in the release-acquire chain might +well disagree on the order. This disagreement stems from the fact that +the weak memory-barrier instructions used to implement smp_load_acquire() +and smp_store_release() are not required to order prior stores against +subsequent loads in all cases. This means that cpu3() can see cpu0()'s +store to u as happening -after- cpu1()'s load from v, even though +both cpu0() and cpu1() agree that these two operations occurred in the +intended order. + +However, please keep in mind that smp_load_acquire() is not magic. +In particular, it simply reads from its argument with ordering. It does +-not- ensure that any particular value will be read. Therefore, the +following outcome is possible: + + r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0 + +Note that this outcome can happen even on a mythical sequentially +consistent system where nothing is ever reordered. + +To reiterate, if your code requires global transitivity, use general +barriers throughout. ========================
On Fri, Jan 15, 2016 at 10:13:48AM +0100, Peter Zijlstra wrote:> On Fri, Jan 15, 2016 at 09:55:54AM +0100, Peter Zijlstra wrote: > > On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote: > > > So smp_mb() provides transitivity, as do pairs of smp_store_release() > > > and smp_read_acquire(), > > > > But they provide different grades of transitivity, which is where all > > the confusion lays. > > > > smp_mb() is strongly/globally transitive, all CPUs will agree on the order. > > > > Whereas the RCpc release+acquire is weakly so, only the two cpus > > involved in the handover will agree on the order. > > And the stuff we're confused about is how best to express the difference > and guarantees of these two forms of transitivity and how exactly they > interact.Hoping my memory-barrier.txt patch helps here...> And smp_load_acquire()/smp_store_release() are RCpc because TSO archs > and PPC. the atomic*_{acquire,release}() are RCpc because PPC and > LOCK,UNLOCK are similarly RCpc because of PPC. > > Now we'd like PPC to stick a SYNC in either LOCK or UNLOCK so at least > the locks are RCsc again, but they resist for performance reasons but > waver because they don't want to be the ones finding all the nasty bugs > because they're the only one.I believe that the relevant proverb said something about starving to death between two bales of hay... ;-)> Now the thing I worry about, and still have not had an answer to is if > weakly ordered MIPS will end up being RCsc or RCpc for their locks if > they get implemented with SYNC_ACQUIRE and SYNC_RELEASE instead of the > current SYNC.It would be good to have better clarity on this, no two ways about it. Thanx, Paul
On Fri, Jan 15, 2016 at 09:39:12AM -0800, Paul E. McKenney wrote:> Should we start putting litmus tests for the various examples > somewhere, perhaps in a litmus-tests directory within each participating > architecture? I have a pile of powerpc-related litmus tests on my laptop, > but they probably aren't doing all that much good there.Yeah, or a version of them in C that we can 'compile'?> > commit 2cb4e83a1b5c89c8e39b8a64bd89269d05913e41 > Author: Paul E. McKenney <paulmck at linux.vnet.ibm.com> > Date: Fri Jan 15 09:30:42 2016 -0800 > > documentation: Distinguish between local and global transitivity > > The introduction of smp_load_acquire() and smp_store_release() had > the side effect of introducing a weaker notion of transitivity: > The transitivity of full smp_mb() barriers is global, but that > of smp_store_release()/smp_load_acquire() chains is local. This > commit therefore introduces the notion of local transitivity and > gives an example. > > Reported-by: Peter Zijlstra <peterz at infradead.org> > Reported-by: Will Deacon <will.deacon at arm.com> > Signed-off-by: Paul E. McKenney <paulmck at linux.vnet.ibm.com>I think it fails to mention smp_mb__after_release_acquire(), although I suspect we didn't actually introduce the primitive yet, which raises the point, do we want to?
Hi Paul, On Fri, Jan 15, 2016 at 09:39:12AM -0800, Paul E. McKenney wrote:> On Fri, Jan 15, 2016 at 09:55:54AM +0100, Peter Zijlstra wrote: > > On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote: > > > So smp_mb() provides transitivity, as do pairs of smp_store_release() > > > and smp_read_acquire(), > > > > But they provide different grades of transitivity, which is where all > > the confusion lays. > > > > smp_mb() is strongly/globally transitive, all CPUs will agree on the order. > > > > Whereas the RCpc release+acquire is weakly so, only the two cpus > > involved in the handover will agree on the order. > > Good point! > > Using grace periods in place of smp_mb() also provides strong/global > transitivity, but also insanely high latencies. ;-) > > The patch below updates Documentation/memory-barriers.txt to define > local vs. global transitivity. The corresponding ppcmem litmus test > is included below as well. > > Should we start putting litmus tests for the various examples > somewhere, perhaps in a litmus-tests directory within each participating > architecture? I have a pile of powerpc-related litmus tests on my laptop, > but they probably aren't doing all that much good there.I too would like to have the litmus tests in the kernel so that we can refer to them from memory-barriers.txt. Ideally they wouldn't be targetted to a particular arch, however.> PPC local-transitive > "" > { > 0:r1=1; 0:r2=u; 0:r3=v; 0:r4=x; 0:r5=y; 0:r6=z; > 1:r1=1; 1:r2=u; 1:r3=v; 1:r4=x; 1:r5=y; 1:r6=z; > 2:r1=1; 2:r2=u; 2:r3=v; 2:r4=x; 2:r5=y; 2:r6=z; > 3:r1=1; 3:r2=u; 3:r3=v; 3:r4=x; 3:r5=y; 3:r6=z; > } > P0 | P1 | P2 | P3 ; > lwz r9,0(r4) | lwz r9,0(r5) | lwz r9,0(r6) | stw r1,0(r3) ; > lwsync | lwsync | lwsync | sync ; > stw r1,0(r2) | lwz r8,0(r3) | stw r1,0(r7) | lwz r9,0(r2) ; > lwsync | lwz r7,0(r2) | | ; > stw r1,0(r5) | lwsync | | ; > | stw r1,0(r6) | | ; > exists > (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r8=0 /\ 3:r9=0) *) > (* (0:r9=1 /\ 1:r9=1 /\ 2:r9=1) *) > (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0) *) > (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0)i.e. we should rewrite this using READ_ONCE/WRITE_ONCE and smp_mb() etc.> ------------------------------------------------------------------------ > > commit 2cb4e83a1b5c89c8e39b8a64bd89269d05913e41 > Author: Paul E. McKenney <paulmck at linux.vnet.ibm.com> > Date: Fri Jan 15 09:30:42 2016 -0800 > > documentation: Distinguish between local and global transitivity > > The introduction of smp_load_acquire() and smp_store_release() had > the side effect of introducing a weaker notion of transitivity: > The transitivity of full smp_mb() barriers is global, but that > of smp_store_release()/smp_load_acquire() chains is local. This > commit therefore introduces the notion of local transitivity and > gives an example. > > Reported-by: Peter Zijlstra <peterz at infradead.org> > Reported-by: Will Deacon <will.deacon at arm.com> > Signed-off-by: Paul E. McKenney <paulmck at linux.vnet.ibm.com> > > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt > index c66ba46d8079..d8109ed99342 100644 > --- a/Documentation/memory-barriers.txt > +++ b/Documentation/memory-barriers.txt > @@ -1318,8 +1318,82 @@ or a level of cache, CPU 2 might have early access to CPU 1's writes. > General barriers are therefore required to ensure that all CPUs agree > on the combined order of CPU 1's and CPU 2's accesses. > > -To reiterate, if your code requires transitivity, use general barriers > -throughout. > +General barriers provide "global transitivity", so that all CPUs will > +agree on the order of operations. In contrast, a chain of release-acquire > +pairs provides only "local transitivity", so that only those CPUs on > +the chain are guaranteed to agree on the combined order of the accesses.Thanks for having a go at this. I tried defining something axiomatically, but got stuck pretty quickly. In my scheme, I used "data-directed transitivity" instead of "local transitivity", since the latter seems to be a bit of a misnomer.> +For example, switching to C code in deference to Herman Hollerith: > + > + int u, v, x, y, z; > + > + void cpu0(void) > + { > + r0 = smp_load_acquire(&x); > + WRITE_ONCE(u, 1); > + smp_store_release(&y, 1); > + } > + > + void cpu1(void) > + { > + r1 = smp_load_acquire(&y); > + r4 = READ_ONCE(v); > + r5 = READ_ONCE(u); > + smp_store_release(&z, 1); > + } > + > + void cpu2(void) > + { > + r2 = smp_load_acquire(&z); > + smp_store_release(&x, 1); > + } > + > + void cpu3(void) > + { > + WRITE_ONCE(v, 1); > + smp_mb(); > + r3 = READ_ONCE(u); > + } > + > +Because cpu0(), cpu1(), and cpu2() participate in a local transitive > +chain of smp_store_release()/smp_load_acquire() pairs, the following > +outcome is prohibited: > + > + r0 == 1 && r1 == 1 && r2 == 1 > + > +Furthermore, because of the release-acquire relationship between cpu0() > +and cpu1(), cpu1() must see cpu0()'s writes, so that the following > +outcome is prohibited: > + > + r1 == 1 && r5 == 0 > + > +However, the transitivity of release-acquire is local to the participating > +CPUs and does not apply to cpu3(). Therefore, the following outcome > +is possible: > + > + r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0I think you should be completely explicit and include r5 == 1 here, too. Also -- where would you add the smp_mb__after_release_acquire to fix (i.e. forbid) this? Immediately after cpu1()'s read of y?> +Although cpu0(), cpu1(), and cpu2() will see their respective reads and > +writes in order, CPUs not involved in the release-acquire chain might > +well disagree on the order. This disagreement stems from the fact that > +the weak memory-barrier instructions used to implement smp_load_acquire() > +and smp_store_release() are not required to order prior stores against > +subsequent loads in all cases. This means that cpu3() can see cpu0()'s > +store to u as happening -after- cpu1()'s load from v, even though > +both cpu0() and cpu1() agree that these two operations occurred in the > +intended order. > + > +However, please keep in mind that smp_load_acquire() is not magic. > +In particular, it simply reads from its argument with ordering. It does > +-not- ensure that any particular value will be read. Therefore, the > +following outcome is possible: > + > + r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0 > + > +Note that this outcome can happen even on a mythical sequentially > +consistent system where nothing is ever reordered.I'm not sure this last bit is strictly necessary. If somebody thinks that acquire/release involve some sort of implicit synchronisation, I think they may have bigger problems with memory-barriers.txt. Will