thr3ads.net - Virtualization - [v3,11/41] mips: reuse asm-generic/barrier.h [Jan 2016]

If this information is useful, please help other people find it:
Share via:

Peter Zijlstra

2016-Jan-15 08:55 UTC

[v3,11/41] mips: reuse asm-generic/barrier.h

On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney
wrote:> So smp_mb() provides transitivity, as do pairs of smp_store_release()
> and smp_read_acquire(), 
But they provide different grades of transitivity, which is where all
the confusion lays.

smp_mb() is strongly/globally transitive, all CPUs will agree on the order.

Whereas the RCpc release+acquire is weakly so, only the two cpus
involved in the handover will agree on the order.

Peter Zijlstra

2016-Jan-15 09:13 UTC

head link

[v3,11/41] mips: reuse asm-generic/barrier.h

On Fri, Jan 15, 2016 at 09:55:54AM +0100, Peter Zijlstra
wrote:> On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote:
> > So smp_mb() provides transitivity, as do pairs of smp_store_release()
> > and smp_read_acquire(), 
> 
> But they provide different grades of transitivity, which is where all
> the confusion lays.
> 
> smp_mb() is strongly/globally transitive, all CPUs will agree on the order.
> 
> Whereas the RCpc release+acquire is weakly so, only the two cpus
> involved in the handover will agree on the order.
And the stuff we're confused about is how best to express the difference
and guarantees of these two forms of transitivity and how exactly they
interact.

And smp_load_acquire()/smp_store_release() are RCpc because TSO archs
and PPC. the atomic*_{acquire,release}() are RCpc because PPC and
LOCK,UNLOCK are similarly RCpc because of PPC.

Now we'd like PPC to stick a SYNC in either LOCK or UNLOCK so at least
the locks are RCsc again, but they resist for performance reasons but
waver because they don't want to be the ones finding all the nasty bugs
because they're the only one.

Now the thing I worry about, and still have not had an answer to is if
weakly ordered MIPS will end up being RCsc or RCpc for their locks if
they get implemented with SYNC_ACQUIRE and SYNC_RELEASE instead of the
current SYNC.

Paul E. McKenney

2016-Jan-15 17:39 UTC

head link

[v3,11/41] mips: reuse asm-generic/barrier.h

On Fri, Jan 15, 2016 at 09:55:54AM +0100, Peter Zijlstra
wrote:> On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote:
> > So smp_mb() provides transitivity, as do pairs of smp_store_release()
> > and smp_read_acquire(), 
> 
> But they provide different grades of transitivity, which is where all
> the confusion lays.
> 
> smp_mb() is strongly/globally transitive, all CPUs will agree on the order.
> 
> Whereas the RCpc release+acquire is weakly so, only the two cpus
> involved in the handover will agree on the order.
Good point!

Using grace periods in place of smp_mb() also provides strong/global
transitivity, but also insanely high latencies.  ;-)

The patch below updates Documentation/memory-barriers.txt to define
local vs. global transitivity.  The corresponding ppcmem litmus test
is included below as well.

Should we start putting litmus tests for the various examples
somewhere, perhaps in a litmus-tests directory within each participating
architecture?  I have a pile of powerpc-related litmus tests on my laptop,
but they probably aren't doing all that much good there.

							Thanx, Paul

------------------------------------------------------------------------

PPC local-transitive
""
{
0:r1=1; 0:r2=u; 0:r3=v; 0:r4=x; 0:r5=y; 0:r6=z;
1:r1=1; 1:r2=u; 1:r3=v; 1:r4=x; 1:r5=y; 1:r6=z;
2:r1=1; 2:r2=u; 2:r3=v; 2:r4=x; 2:r5=y; 2:r6=z;
3:r1=1; 3:r2=u; 3:r3=v; 3:r4=x; 3:r5=y; 3:r6=z;
}
 P0           | P1           | P2           | P3           ;
 lwz r9,0(r4) | lwz r9,0(r5) | lwz r9,0(r6) | stw r1,0(r3) ;
 lwsync       | lwsync       | lwsync       | sync         ;
 stw r1,0(r2) | lwz r8,0(r3) | stw r1,0(r7) | lwz r9,0(r2) ;
 lwsync       | lwz r7,0(r2) |              |              ;
 stw r1,0(r5) | lwsync       |              |              ;
              | stw r1,0(r6) |              |              ;
exists
(* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r8=0 /\ 3:r9=0) *)
(* (0:r9=1 /\ 1:r9=1 /\ 2:r9=1) *)
(* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0) *)
(0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0)

------------------------------------------------------------------------

commit 2cb4e83a1b5c89c8e39b8a64bd89269d05913e41
Author: Paul E. McKenney <paulmck at linux.vnet.ibm.com>
Date:   Fri Jan 15 09:30:42 2016 -0800

    documentation: Distinguish between local and global transitivity
    
    The introduction of smp_load_acquire() and smp_store_release() had
    the side effect of introducing a weaker notion of transitivity:
    The transitivity of full smp_mb() barriers is global, but that
    of smp_store_release()/smp_load_acquire() chains is local.  This
    commit therefore introduces the notion of local transitivity and
    gives an example.
    
    Reported-by: Peter Zijlstra <peterz at infradead.org>
    Reported-by: Will Deacon <will.deacon at arm.com>
    Signed-off-by: Paul E. McKenney <paulmck at linux.vnet.ibm.com>

diff --git a/Documentation/memory-barriers.txt
b/Documentation/memory-barriers.txt
index c66ba46d8079..d8109ed99342 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1318,8 +1318,82 @@ or a level of cache, CPU 2 might have early access to CPU
1's writes.
 General barriers are therefore required to ensure that all CPUs agree
 on the combined order of CPU 1's and CPU 2's accesses.
 
-To reiterate, if your code requires transitivity, use general barriers
-throughout.
+General barriers provide "global transitivity", so that all CPUs will
+agree on the order of operations.  In contrast, a chain of release-acquire
+pairs provides only "local transitivity", so that only those CPUs on
+the chain are guaranteed to agree on the combined order of the accesses.
+For example, switching to C code in deference to Herman Hollerith:
+
+	int u, v, x, y, z;
+
+	void cpu0(void)
+	{
+		r0 = smp_load_acquire(&x);
+		WRITE_ONCE(u, 1);
+		smp_store_release(&y, 1);
+	}
+
+	void cpu1(void)
+	{
+		r1 = smp_load_acquire(&y);
+		r4 = READ_ONCE(v);
+		r5 = READ_ONCE(u);
+		smp_store_release(&z, 1);
+	}
+
+	void cpu2(void)
+	{
+		r2 = smp_load_acquire(&z);
+		smp_store_release(&x, 1);
+	}
+
+	void cpu3(void)
+	{
+		WRITE_ONCE(v, 1);
+		smp_mb();
+		r3 = READ_ONCE(u);
+	}
+
+Because cpu0(), cpu1(), and cpu2() participate in a local transitive
+chain of smp_store_release()/smp_load_acquire() pairs, the following
+outcome is prohibited:
+
+	r0 == 1 && r1 == 1 && r2 == 1
+
+Furthermore, because of the release-acquire relationship between cpu0()
+and cpu1(), cpu1() must see cpu0()'s writes, so that the following
+outcome is prohibited:
+
+	r1 == 1 && r5 == 0
+
+However, the transitivity of release-acquire is local to the participating
+CPUs and does not apply to cpu3().  Therefore, the following outcome
+is possible:
+
+	r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4
== 0
+
+Although cpu0(), cpu1(), and cpu2() will see their respective reads and
+writes in order, CPUs not involved in the release-acquire chain might
+well disagree on the order.  This disagreement stems from the fact that
+the weak memory-barrier instructions used to implement smp_load_acquire()
+and smp_store_release() are not required to order prior stores against
+subsequent loads in all cases.  This means that cpu3() can see cpu0()'s
+store to u as happening -after- cpu1()'s load from v, even though
+both cpu0() and cpu1() agree that these two operations occurred in the
+intended order.
+
+However, please keep in mind that smp_load_acquire() is not magic.
+In particular, it simply reads from its argument with ordering.  It does
+-not- ensure that any particular value will be read.  Therefore, the
+following outcome is possible:
+
+	r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0
+
+Note that this outcome can happen even on a mythical sequentially
+consistent system where nothing is ever reordered.
+
+To reiterate, if your code requires global transitivity, use general
+barriers throughout.
 
 
 ========================

Paul E. McKenney

2016-Jan-15 17:46 UTC

head link

[v3,11/41] mips: reuse asm-generic/barrier.h

On Fri, Jan 15, 2016 at 10:13:48AM +0100, Peter Zijlstra
wrote:> On Fri, Jan 15, 2016 at 09:55:54AM +0100, Peter Zijlstra wrote:
> > On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote:
> > > So smp_mb() provides transitivity, as do pairs of
smp_store_release()
> > > and smp_read_acquire(), 
> > 
> > But they provide different grades of transitivity, which is where all
> > the confusion lays.
> > 
> > smp_mb() is strongly/globally transitive, all CPUs will agree on the
order.
> > 
> > Whereas the RCpc release+acquire is weakly so, only the two cpus
> > involved in the handover will agree on the order.
> 
> And the stuff we're confused about is how best to express the
difference
> and guarantees of these two forms of transitivity and how exactly they
> interact.
Hoping my memory-barrier.txt patch helps here...
> And smp_load_acquire()/smp_store_release() are RCpc because TSO archs
> and PPC. the atomic*_{acquire,release}() are RCpc because PPC and
> LOCK,UNLOCK are similarly RCpc because of PPC.
> 
> Now we'd like PPC to stick a SYNC in either LOCK or UNLOCK so at least
> the locks are RCsc again, but they resist for performance reasons but
> waver because they don't want to be the ones finding all the nasty bugs
> because they're the only one.
I believe that the relevant proverb said something about starving to
death between two bales of hay...  ;-)
> Now the thing I worry about, and still have not had an answer to is if
> weakly ordered MIPS will end up being RCsc or RCpc for their locks if
> they get implemented with SYNC_ACQUIRE and SYNC_RELEASE instead of the
> current SYNC.
It would be good to have better clarity on this, no two ways about it.

							Thanx, Paul

Peter Zijlstra

2016-Jan-15 21:29 UTC

head link

[v3,11/41] mips: reuse asm-generic/barrier.h

On Fri, Jan 15, 2016 at 09:39:12AM -0800, Paul E. McKenney
wrote:> Should we start putting litmus tests for the various examples
> somewhere, perhaps in a litmus-tests directory within each participating
> architecture?  I have a pile of powerpc-related litmus tests on my laptop,
> but they probably aren't doing all that much good there.
Yeah, or a version of them in C that we can
'compile'?> 
> commit 2cb4e83a1b5c89c8e39b8a64bd89269d05913e41
> Author: Paul E. McKenney <paulmck at linux.vnet.ibm.com>
> Date:   Fri Jan 15 09:30:42 2016 -0800
> 
>     documentation: Distinguish between local and global transitivity
>     
>     The introduction of smp_load_acquire() and smp_store_release() had
>     the side effect of introducing a weaker notion of transitivity:
>     The transitivity of full smp_mb() barriers is global, but that
>     of smp_store_release()/smp_load_acquire() chains is local.  This
>     commit therefore introduces the notion of local transitivity and
>     gives an example.
>     
>     Reported-by: Peter Zijlstra <peterz at infradead.org>
>     Reported-by: Will Deacon <will.deacon at arm.com>
>     Signed-off-by: Paul E. McKenney <paulmck at linux.vnet.ibm.com>
I think it fails to mention smp_mb__after_release_acquire(), although I
suspect we didn't actually introduce the primitive yet, which raises the
point, do we want to?

Will Deacon

2016-Jan-25 18:02 UTC

head link

[v3,11/41] mips: reuse asm-generic/barrier.h

Hi Paul,

On Fri, Jan 15, 2016 at 09:39:12AM -0800, Paul E. McKenney
wrote:> On Fri, Jan 15, 2016 at 09:55:54AM +0100, Peter Zijlstra wrote:
> > On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote:
> > > So smp_mb() provides transitivity, as do pairs of
smp_store_release()
> > > and smp_read_acquire(), 
> > 
> > But they provide different grades of transitivity, which is where all
> > the confusion lays.
> > 
> > smp_mb() is strongly/globally transitive, all CPUs will agree on the
order.
> > 
> > Whereas the RCpc release+acquire is weakly so, only the two cpus
> > involved in the handover will agree on the order.
> 
> Good point!
> 
> Using grace periods in place of smp_mb() also provides strong/global
> transitivity, but also insanely high latencies.  ;-)
> 
> The patch below updates Documentation/memory-barriers.txt to define
> local vs. global transitivity.  The corresponding ppcmem litmus test
> is included below as well.
> 
> Should we start putting litmus tests for the various examples
> somewhere, perhaps in a litmus-tests directory within each participating
> architecture?  I have a pile of powerpc-related litmus tests on my laptop,
> but they probably aren't doing all that much good there.
I too would like to have the litmus tests in the kernel so that we can
refer to them from memory-barriers.txt. Ideally they wouldn't be targetted
to a particular arch, however.
> PPC local-transitive
> ""
> {
> 0:r1=1; 0:r2=u; 0:r3=v; 0:r4=x; 0:r5=y; 0:r6=z;
> 1:r1=1; 1:r2=u; 1:r3=v; 1:r4=x; 1:r5=y; 1:r6=z;
> 2:r1=1; 2:r2=u; 2:r3=v; 2:r4=x; 2:r5=y; 2:r6=z;
> 3:r1=1; 3:r2=u; 3:r3=v; 3:r4=x; 3:r5=y; 3:r6=z;
> }
>  P0           | P1           | P2           | P3           ;
>  lwz r9,0(r4) | lwz r9,0(r5) | lwz r9,0(r6) | stw r1,0(r3) ;
>  lwsync       | lwsync       | lwsync       | sync         ;
>  stw r1,0(r2) | lwz r8,0(r3) | stw r1,0(r7) | lwz r9,0(r2) ;
>  lwsync       | lwz r7,0(r2) |              |              ;
>  stw r1,0(r5) | lwsync       |              |              ;
>               | stw r1,0(r6) |              |              ;
> exists
> (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r8=0 /\ 3:r9=0) *)
> (* (0:r9=1 /\ 1:r9=1 /\ 2:r9=1) *)
> (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0) *)
> (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0)
i.e. we should rewrite this using READ_ONCE/WRITE_ONCE and smp_mb() etc.
> ------------------------------------------------------------------------
> 
> commit 2cb4e83a1b5c89c8e39b8a64bd89269d05913e41
> Author: Paul E. McKenney <paulmck at linux.vnet.ibm.com>
> Date:   Fri Jan 15 09:30:42 2016 -0800
> 
>     documentation: Distinguish between local and global transitivity
>     
>     The introduction of smp_load_acquire() and smp_store_release() had
>     the side effect of introducing a weaker notion of transitivity:
>     The transitivity of full smp_mb() barriers is global, but that
>     of smp_store_release()/smp_load_acquire() chains is local.  This
>     commit therefore introduces the notion of local transitivity and
>     gives an example.
>     
>     Reported-by: Peter Zijlstra <peterz at infradead.org>
>     Reported-by: Will Deacon <will.deacon at arm.com>
>     Signed-off-by: Paul E. McKenney <paulmck at linux.vnet.ibm.com>
> 
> diff --git a/Documentation/memory-barriers.txt
b/Documentation/memory-barriers.txt
> index c66ba46d8079..d8109ed99342 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1318,8 +1318,82 @@ or a level of cache, CPU 2 might have early access
to CPU 1's writes.
>  General barriers are therefore required to ensure that all CPUs agree
>  on the combined order of CPU 1's and CPU 2's accesses.
>  
> -To reiterate, if your code requires transitivity, use general barriers
> -throughout.
> +General barriers provide "global transitivity", so that all CPUs
will
> +agree on the order of operations.  In contrast, a chain of release-acquire
> +pairs provides only "local transitivity", so that only those
CPUs on
> +the chain are guaranteed to agree on the combined order of the accesses.
Thanks for having a go at this. I tried defining something axiomatically,
but got stuck pretty quickly. In my scheme, I used "data-directed
transitivity" instead of "local transitivity", since the latter
seems to
be a bit of a misnomer.
> +For example, switching to C code in deference to Herman Hollerith:
> +
> +	int u, v, x, y, z;
> +
> +	void cpu0(void)
> +	{
> +		r0 = smp_load_acquire(&x);
> +		WRITE_ONCE(u, 1);
> +		smp_store_release(&y, 1);
> +	}
> +
> +	void cpu1(void)
> +	{
> +		r1 = smp_load_acquire(&y);
> +		r4 = READ_ONCE(v);
> +		r5 = READ_ONCE(u);
> +		smp_store_release(&z, 1);
> +	}
> +
> +	void cpu2(void)
> +	{
> +		r2 = smp_load_acquire(&z);
> +		smp_store_release(&x, 1);
> +	}
> +
> +	void cpu3(void)
> +	{
> +		WRITE_ONCE(v, 1);
> +		smp_mb();
> +		r3 = READ_ONCE(u);
> +	}
> +
> +Because cpu0(), cpu1(), and cpu2() participate in a local transitive
> +chain of smp_store_release()/smp_load_acquire() pairs, the following
> +outcome is prohibited:
> +
> +	r0 == 1 && r1 == 1 && r2 == 1
> +
> +Furthermore, because of the release-acquire relationship between cpu0()
> +and cpu1(), cpu1() must see cpu0()'s writes, so that the following
> +outcome is prohibited:
> +
> +	r1 == 1 && r5 == 0
> +
> +However, the transitivity of release-acquire is local to the participating
> +CPUs and does not apply to cpu3().  Therefore, the following outcome
> +is possible:
> +
> +	r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0
&& r4 == 0
I think you should be completely explicit and include r5 == 1 here, too.

Also -- where would you add the smp_mb__after_release_acquire to fix
(i.e. forbid) this? Immediately after cpu1()'s read of y?
> +Although cpu0(), cpu1(), and cpu2() will see their respective reads and
> +writes in order, CPUs not involved in the release-acquire chain might
> +well disagree on the order.  This disagreement stems from the fact that
> +the weak memory-barrier instructions used to implement smp_load_acquire()
> +and smp_store_release() are not required to order prior stores against
> +subsequent loads in all cases.  This means that cpu3() can see
cpu0()'s
> +store to u as happening -after- cpu1()'s load from v, even though
> +both cpu0() and cpu1() agree that these two operations occurred in the
> +intended order.
> +
> +However, please keep in mind that smp_load_acquire() is not magic.
> +In particular, it simply reads from its argument with ordering.  It does
> +-not- ensure that any particular value will be read.  Therefore, the
> +following outcome is possible:
> +
> +	r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0
> +
> +Note that this outcome can happen even on a mythical sequentially
> +consistent system where nothing is ever reordered.
I'm not sure this last bit is strictly necessary. If somebody thinks that
acquire/release involve some sort of implicit synchronisation, I think
they may have bigger problems with memory-barriers.txt.

Will

Reasonably Related Threads

Search for more maybe matching threads

Virtualization - Jan 2016 - [v3,11/41] mips: reuse asm-generic/barrier.h

[v3,11/41] mips: reuse asm-generic/barrier.h

[v3,11/41] mips: reuse asm-generic/barrier.h

[v3,11/41] mips: reuse asm-generic/barrier.h

[v3,11/41] mips: reuse asm-generic/barrier.h

[v3,11/41] mips: reuse asm-generic/barrier.h

[v3,11/41] mips: reuse asm-generic/barrier.h

Reasonably Related Threads