thr3ads.net - Linux Virtualization - [v3,11/41] mips: reuse asm-generic/barrier.h [Jan 2016]

Leonid Yegoshin

2016-Jan-12 20:45 UTC

[v3,11/41] mips: reuse asm-generic/barrier.h

(I am trying to answer multiple mails in one.)

First of all, some general notes seem to be needed here:

1. The generic MIPS "SYNC" (aka "SYNC 0") instruction is very heavy on
some CPUs. On those CPUs it basically kills the pipelines in each CPU, can do
a special memory/IO bus transaction (similar to a "fence") and holds the
system until all R/W is completed. It is like the Big Kernel Lock but worse.
So the move to SMP_* kind of barriers is needed to improve performance,
especially on the newest CPUs with long pipelines.

2. The MIPS Arch document may be misleading because the words "ordering" and
"completion" mean something different there than in Linux; the SYNC
instruction description is written for HW engineers. I wrote about that in a
separate patch of the same patchset -
http://patchwork.linux-mips.org/patch/10505/ "MIPS: R6: Use lightweight
SYNC instruction in smp_* memory barriers":
> This instructions were specifically designed to work for smp_*() sort of
> memory barriers in MIPS R2/R3/R5 and R6.
>
> Unfortunately, it's description is very cryptic and is done in HW engineering
> style which prevents use of it by SW.
3. I bothered the MIPS Arch team for a long time until I completely understood
that MIPS SYNC_WMB, SYNC_MB, SYNC_RMB, SYNC_RELEASE and SYNC_ACQUIRE do
exactly what is required in Documentation/memory-barriers.txt.
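
As a rough sketch of point 3 (this is not the code from the patchset): the pairing below shows the kind of smp_wmb()/smp_rmb() message passing that SYNC_WMB and SYNC_RMB are meant to cover. The helper names sketch_wmb() and sketch_rmb(), the variables, and the portable C11 fallback are assumptions made for illustration. The stype values 0x04 and 0x13 are the SYNC_WMB and SYNC_RMB encodings from the MIPS instruction set manual, and the inline asm assumes a toolchain that targets MIPS32/MIPS64 R2 or later.

```c
#include <stdatomic.h>

#if defined(__mips__)
/* Lightweight SYNCs: stype 0x04 is SYNC_WMB, stype 0x13 is SYNC_RMB. */
static inline void sketch_wmb(void) { __asm__ __volatile__("sync 0x04" ::: "memory"); }
static inline void sketch_rmb(void) { __asm__ __volatile__("sync 0x13" ::: "memory"); }
#else
/* Portable stand-ins so the sketch builds elsewhere (an assumption). */
static inline void sketch_wmb(void) { atomic_thread_fence(memory_order_release); }
static inline void sketch_rmb(void) { atomic_thread_fence(memory_order_acquire); }
#endif

static int payload;        /* the data being handed over */
static atomic_int ready;   /* the flag announcing the data */

/* Writer: store the data, then the flag, ordered by a write barrier. */
void producer_publish(int value)
{
    payload = value;
    sketch_wmb();
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Reader: returns 1 and fills *out if the flag was seen, 0 otherwise. */
int consumer_try_read(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return 0;
    sketch_rmb();          /* pairs with the writer's write barrier */
    *out = payload;
    return 1;
}
```

The point of the sketch is only that neither side needs the heavy "SYNC 0" for this pattern; a write-only barrier on the writer and a read-only barrier on the reader are enough.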


In Peter Zijlstra's mail:
> 1) you do not make such things selectable; either the hardware needs
> them or it doesn't. If it does you _must_ use them, however unlikely.
It is selectable only for MIPS R2, not for MIPS R6. The reason is that most
MIPS R2 CPUs have a short pipeline, and there SYNC is just a waste of CPU
resources, especially taking into account that "lightweight syncs" are
converted to a heavy "SYNC 0" in many of those CPUs. However, the latest
MIPS/Imagination CPUs have a pipeline long enough to hit a problem:
the absence of SYNC at LL/SC inside atomics, barriers etc.
> And reading the MIPS64 v6.04 instruction set manual, I think 0x11/0x12
> are _NOT_ transitive and therefore cannot be used to implement the
> smp_mb__{before,after} stuff.
>
> That is, in MIPS speak, those SYNC types are Ordering Barriers, not
> Completion Barriers.
Please see above, point 2.
> That is, currently all architectures -- with exception of PPC -- have
> RCsc locks, but using these non-transitive things will get you RCpc
> locks.
>
> So yes, MIPS can go RCpc for its locks and share the burden of pain with
> PPC, but that needs to be a very concious decision.
I don't understand that - I tried hard but I can't find any term like
"RCsc" or "RCpc" in the Documents/ directory. A web search goes
nowhere, of course.


In Will Deacon's mail:
> The issue I have with the SYNC description in the text above is that it
> describes the single CPU (program order) and the dual-CPU (confusingly
> named global order) cases, but then doesn't generalise any further. That
> means we can't sensibly reason about transitivity properties when a third
> agent is involved. For example, the WRC+sync+addr test:
>
>
> P0:
> Wx = 1
>
> P1:
> Rx == 1
> SYNC
> Wy = 1
>
> P2:
> Ry == 1
> <address dep>
> Rx = 0
>
>
> I can't find anything to forbid that, given the text. The main problem
> is having the SYNC on P1 affect the write by P0.
As I understand that test, the visibility of P0: W[x] = 1 is identical
for P1 and P2 here. If P1 got X before the SYNC and writes Y after the SYNC,
then instruction source register dependency tracking in P2 prevents a
speculative load of X before P2 obtains Y from the same place as P0/P1
and calculates the address of X. If some load of X in P2 happens before
the address dependency calculation, its result is discarded.

Yes, you can't find that in the MIPS SYNC instruction description; it
more likely belongs in the CM (Coherence Manager) area. I just pointed this
out to our arch team member responsible for the documents, and he will think
about how to explain it.
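
For readers who want to experiment, the WRC+sync+addr shape quoted above can be written down roughly as follows in portable C11. This is only a sketch under assumptions: the function and variable names are made up, a seq_cst fence stands in for SYNC, and because C cannot reliably express a bare address dependency, an acquire load is used on P2 as a stronger stand-in. It says nothing about what a particular MIPS core does.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int x, y;
static int r1, r2, r3;   /* values observed by P1 (r1) and P2 (r2, r3) */

/* P0: Wx = 1 */
void p0(void)
{
    atomic_store_explicit(&x, 1, memory_order_relaxed);
}

/* P1: Rx; SYNC; Wy = 1 */
void p1(void)
{
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);            /* stands in for SYNC */
    atomic_store_explicit(&y, 1, memory_order_relaxed);
}

/* P2: Ry; <address dep>; Rx */
void p2(void)
{
    /* Acquire is a stand-in for the address dependency (assumption). */
    r2 = atomic_load_explicit(&y, memory_order_acquire);
    r3 = atomic_load_explicit(&x, memory_order_relaxed);
}

/* The outcome the litmus test asks about: P1 saw x, P2 saw y but not x. */
bool wrc_forbidden_outcome(void)
{
    return r1 == 1 && r2 == 1 && r3 == 0;
}
```

In a real litmus run each function would execute on its own CPU, many times over, and the question is whether wrc_forbidden_outcome() can ever become true.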

- Leonid.

Leonid Yegoshin

2016-Jan-13 00:21 UTC

[v3,11/41] mips: reuse asm-generic/barrier.h

On 01/12/2016 01:40 PM, Peter Zijlstra wrote:
>> It is selectable only for MIPS R2 but not MIPS R6. The reason is - most of
>> MIPS R2 CPUs have short pipeline and that SYNC is just waste of CPU
>> resource, especially taking into account that "lightweight syncs" are
>> converted to a heavy "SYNC 0" in many of that CPUs. However the latest
>> MIPS/Imagination CPU have a pipeline long enough to hit a problem - absence
>> of SYNC at LL/SC inside atomics, barriers etc.
> What ?! Are you saying that because R2 has short pipelines its unlikely
> to hit the reordering issues and we can omit barriers?
It was my guess at explaining why the barriers were not included originally.
You can check with Ralf; he knows more about the MIPS Linux code of that time.

I have been dealing with this for more than 2 years and I am just trying to
solve that issue: in recent CPUs a load after an LL/SC synchronization
instruction loop can definitely get ahead of the SC; it was tested.
>
>>> And reading the MIPS64 v6.04 instruction set manual, I think 0x11/0x12
>>> are _NOT_ transitive and therefore cannot be used to implement the
>>> smp_mb__{before,after} stuff.
>>>
>>> That is, in MIPS speak, those SYNC types are Ordering Barriers, not
>>> Completion Barriers.
>> Please see above, point 2.
> That did not in fact enlighten things. Are they transitive/multi-copy
> atomic or not?
Peter Zijlstra recently wrote: "In particular we're very much all
'confused' about the various notions of transitivity". I am actually
confused too and need some examples here.
>
> (and here Will will go into great detail on the differences between the
> two and make our collective brains explode :-)
>
>>> That is, currently all architectures -- with exception of PPC -- have
>>> RCsc locks, but using these non-transitive things will get you RCpc
>>> locks.
>>>
>>> So yes, MIPS can go RCpc for its locks and share the burden of pain with
>>> PPC, but that needs to be a very concious decision.
>> I don't understand that - I tried hard but I can't find any word like
>> "RCsc", "RCpc" in Documents/ directory. Web search goes nowhere, of course.
> From: lkml.kernel.org/r/20150828153921.GF19282@twins.programming.kicks-ass.net
>
> Yes, the difference between RCpc and RCsc is in the meaning of RELEASE +
> ACQUIRE. With RCsc that implies a full memory barrier, with RCpc it does
> not.
MIPS Arch starting from R2 requires that. If some CPU can't, it should 
execute a full "SYNC 0" instead, which is a full memory barrier.
>
> Currently PowerPC is the only arch that (can, and) does RCpc and gives a
> weaker RELEASE + ACQUIRE. Only the CPU who did the ACQUIRE is guaranteed
> to see the stores of the CPU which did the RELEASE in order.
Yes, it was a goal for SYNC_ACQUIRE and SYNC_RELEASE.

Caveats:

     - "Full memory barrier" on MIPS means a full barrier for any device
in the coherent domain. In MIPS Tech/Imagination Tech MIPS-based CPUs it is
"for any device connected to CM or IOCU + directly connected memory".

     - It is not applied to instruction fetch. However, I-Cache flushes
and SYNCI are consistent with that. There are also hazard barrier
instructions to clear the CPU pipeline to some extent, to help with this
limitation.

I don't think that these caveats prevent a correct Acquire/Release semantic.
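
To have an example on the table, Peter's definition quoted above (RELEASE + ACQUIRE implies a full memory barrier with RCsc, and does not with RCpc) can be pictured with the C11 sketch below. The test-and-set locks, the function names and the rcsc switch are all hypothetical, and C11 memory orders are only an analogy for the kernel's model: with rcsc false, the store to x and the later load of y may be seen out of order by a third CPU, while the extra full fence models the RCsc behaviour.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int x, y;
static atomic_bool lock_a, lock_b;   /* hypothetical test-and-set locks */

static void sketch_lock(atomic_bool *l)      /* ACQUIRE */
{
    while (atomic_exchange_explicit(l, true, memory_order_acquire))
        ;                                    /* spin until the lock is free */
}

static void sketch_unlock(atomic_bool *l)    /* RELEASE */
{
    atomic_store_explicit(l, false, memory_order_release);
}

/*
 * Store to x under lock A, then RELEASE(A) + ACQUIRE(B), then load y.
 * rcsc == true adds the full barrier that RCsc locks imply.
 */
int store_unlock_lock_load(bool rcsc)
{
    sketch_lock(&lock_a);
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    sketch_unlock(&lock_a);                          /* RELEASE */
    if (rcsc)
        atomic_thread_fence(memory_order_seq_cst);   /* full barrier */
    sketch_lock(&lock_b);                            /* ACQUIRE */
    int r = atomic_load_explicit(&y, memory_order_relaxed);
    sketch_unlock(&lock_b);
    return r;
}
```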

- Leonid.

Leonid Yegoshin

2016-Jan-13 19:02 UTC

[v3,11/41] mips: reuse asm-generic/barrier.h

On 01/13/2016 02:45 AM, Will Deacon wrote:
> On Tue, Jan 12, 2016 at 12:45:14PM -0800, Leonid Yegoshin wrote:
>>
> I don't think the address dependency is enough on its own. By that
> reasoning, the following variant (WRC+addr+addr) would work too:
>
>
> P0:
> Wx = 1
>
> P1:
> Rx == 1
> <address dep>
> Wy = 1
>
> P2:
> Ry == 1
> <address dep>
> Rx = 0
>
>
> So are you saying that this is also forbidden?
> Imagine that P0 and P1 are two threads that share a store buffer. What
> then?
>
I will ask the HW team about it, but I have a question: does it have any
relationship to replacing MIPS SYNC with lightweight SYNCs (SYNC_WMB etc.)?
You either use a barrier or you don't, and I am just voicing an intention to
use a more efficient instruction instead of a big hammer (the SYNC
instruction). If you don't use any barrier here then it is a different issue.

Maybe it makes sense to return to the original issue?

- Leonid

Leonid Yegoshin

2016-Jan-13 22:26 UTC

[v3,11/41] mips: reuse asm-generic/barrier.h

On 01/13/2016 02:45 AM, Will Deacon wrote:
>>
> I don't think the address dependency is enough on its own. By that
> reasoning, the following variant (WRC+addr+addr) would work too:
>
>
> P0:
> Wx = 1
>
> P1:
> Rx == 1
> <address dep>
> Wy = 1
>
> P2:
> Ry == 1
> <address dep>
> Rx = 0
>
>
> So are you saying that this is also forbidden?
> Imagine that P0 and P1 are two threads that share a store buffer. What
> then?
OK, I collected the answers, and here they are:

     In MIPS R6 this test passes OK, I mean P2: Rx = 1 if Ry is read
as 1. By design.

     However, it is unclear what happens in MIPS R2 1004K.

     Moreover, there are voices against guaranteeing that it will stay so in
the future, and those voices point me to the Documentation/memory-barriers.txt
section "DATA DEPENDENCY BARRIERS", whose examples require SYNC_RMB
between loading an address/index and using it to load shared data
based on that address or index (look at the CPU 2 pseudo-code):
> To deal with this, a data dependency barrier or better must be inserted
> between the address load and the data load:
>
>         CPU 1                 CPU 2
>         ===============       ===============
>         { A == 1, B == 2, C = 3, P == &A, Q == &C }
>         B = 4;
>         <write barrier>
>         WRITE_ONCE(P, &B);
>                               Q = READ_ONCE(P);
>                               <data dependency barrier>  <----------- SYNC_RMB is here
>                               D = *Q;
...
> Another example of where data dependency barriers might be required is where a
> number is read from memory and then used to calculate the index for an array
> access:
>
>         CPU 1                 CPU 2
>         ===============       ===============
>         { M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 }
>         M[1] = 4;
>         <write barrier>
>         WRITE_ONCE(P, 1);
>                               Q = READ_ONCE(P);
>                               <data dependency barrier>  <------------ SYNC_RMB is here
>                               D = M[Q];
Those voices say that there is a legitimate reason to relax the HW here for
performance if SYNC_RMB is needed anyway to work with this sequence of
shared data.
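
The first memory-barriers.txt example quoted above can be approximated in portable C11 as follows. It is a sketch under assumptions, not kernel code: the function names are invented, relaxed atomics stand in for WRITE_ONCE()/READ_ONCE(), a release fence stands in for the write barrier, and an acquire fence stands in for the data dependency barrier, which is the place where SYNC_RMB would sit in this discussion.

```c
#include <stdatomic.h>

static int a = 1, b = 2;
static _Atomic(int *) p = &a;          /* P == &A initially */

/* CPU 1: B = 4; <write barrier>; WRITE_ONCE(P, &B); */
void cpu1_publish(void)
{
    b = 4;
    atomic_thread_fence(memory_order_release);             /* <write barrier> */
    atomic_store_explicit(&p, &b, memory_order_relaxed);   /* WRITE_ONCE(P, &B) */
}

/* CPU 2: Q = READ_ONCE(P); <data dependency barrier>; D = *Q; */
int cpu2_read(void)
{
    int *q = atomic_load_explicit(&p, memory_order_relaxed);  /* Q = READ_ONCE(P) */
    atomic_thread_fence(memory_order_acquire);   /* <data dependency barrier> */
    return *q;                                   /* D = *Q */
}
```

With the barrier in place, a reader that sees the new pointer also sees B == 4; the argument reported above is that if this barrier is required anyway, the HW does not have to order dependent loads on its own.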


And all of that is off-topic here, in my mind. I just want to be sure
that this patchset still allows the use of specific lightweight SYNCs
on MIPS instead of the blunt and heavy generalized "SYNC 0" in any case.

- Leonid.
