On Tue, Jan 12, 2016 at 11:40:12AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 12, 2016 at 11:25:55AM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 12, 2016 at 10:27:11AM +0100, Peter Zijlstra wrote:
> > > 2) the changelog _completely_ fails to explain the sync 0x11 and sync
> > >    0x12 semantics nor does it provide a publicly accessible link to
> > >    documentation that does.
> >
> > Ralf pointed me at: https://imgtec.com/mips/architectures/mips64/
> >
> > > 3) it really should have explained what you did with
> > >    smp_llsc_mb/smp_mb__before_llsc() in _detail_.
> >
> > And reading the MIPS64 v6.04 instruction set manual, I think 0x11/0x12
> > are _NOT_ transitive and therefore cannot be used to implement the
> > smp_mb__{before,after} stuff.
> >
> > That is, in MIPS speak, those SYNC types are Ordering Barriers, not
> > Completion Barriers. They need not be globally performed.
>
> Which if true; and I know Will has some questions here; would also mean
> that you 'cannot' use the ACQUIRE/RELEASE barriers for your locks as was
> recently suggested by David Daney.

The issue I have with the SYNC description in the text above is that it
describes the single CPU (program order) and the dual-CPU (confusingly
named global order) cases, but then doesn't generalise any further. That
means we can't sensibly reason about transitivity properties when a third
agent is involved. For example, the WRC+sync+addr test:


P0:
Wx = 1

P1:
Rx == 1
SYNC
Wy = 1

P2:
Ry == 1
<address dep>
Rx == 0


I can't find anything to forbid that, given the text. The main problem
is having the SYNC on P1 affect the write by P0.

> That is, currently all architectures -- with exception of PPC -- have
> RCsc locks, but using these non-transitive things will get you RCpc
> locks.
>
> So yes, MIPS can go RCpc for its locks and share the burden of pain with
> PPC, but that needs to be a very conscious decision.

I think it's much worse than RCpc, given my interpretation of the wording.

Will
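As a baseline for the WRC+sync+addr test above, the following minimal sketch
enumerates every sequentially consistent interleaving of the three threads and
collects the reachable (r1, r2, r3) values, where r1 is P1's read of x, r2 is
P2's read of y and r3 is P2's read of x. The register names and the helper
functions are assumptions made for illustration. Under sequential consistency
the SYNC and the address dependency add nothing beyond program order, so the
sketch only shows what a fully transitive barrier would have to guarantee; it
says nothing about what MIPS SYNC 0x11/0x12 actually do.

```python
def interleavings(threads):
    """Yield every merge of the per-thread op lists that keeps program order."""
    if all(not t for t in threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            rest = threads[:i] + [t[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [t[0]] + tail


def run(schedule):
    """Execute one interleaving against a single shared memory (the SC model)."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for kind, var, arg in schedule:
        if kind == "W":
            mem[var] = arg          # arg is the value stored
        else:
            regs[arg] = mem[var]    # arg is the (hypothetical) register name
    return regs["r1"], regs["r2"], regs["r3"]


# Hypothetical encoding of the litmus test quoted above.
P0 = [("W", "x", 1)]
P1 = [("R", "x", "r1"), ("W", "y", 1)]     # SYNC between the two ops: implied by SC
P2 = [("R", "y", "r2"), ("R", "x", "r3")]  # address dependency: implied by SC


def wrc_outcomes_sc():
    """Return the set of (r1, r2, r3) results reachable under SC."""
    return {run(s) for s in interleavings([P0, P1, P2])}


if __name__ == "__main__":
    outcomes = wrc_outcomes_sc()
    print(sorted(outcomes))
    print("r1=1, r2=1, r3=0 reachable:", (1, 1, 0) in outcomes)
```

The outcome r1=1, r2=1, r3=0 would need Wx before P1's read, Wy before P2's
first read, and P2's second read before Wx, which is a cycle, so no
interleaving produces it. The open question in the thread is whether the
hardware barrier gives the same guarantee.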
(I'll try to answer multiple mails in one)

First of all, it seems like some generic notes should be given here:

1. The generic MIPS "SYNC" (aka "SYNC 0") instruction is very heavy on some
CPUs. On those CPUs it basically kills the pipelines in each CPU, can do a
special memory/IO bus transaction (similar to "fence") and holds the system
until all R/W is completed. It is like the Big Kernel Lock but worse. So, the
move to SMP_* kind of barriers is needed to improve performance, especially
on the newest CPUs with long pipelines.

2. The MIPS Arch document may be misleading because the words "ordering" and
"completion" mean something different there than in Linux; the SYNC
instruction description is written for HW engineers. I wrote that in a
separate patch of the same patchset -
http://patchwork.linux-mips.org/patch/10505/ "MIPS: R6: Use lightweight SYNC
instruction in smp_* memory barriers":

> These instructions were specifically designed to work for smp_*() sort of
> memory barriers in MIPS R2/R3/R5 and R6.
>
> Unfortunately, its description is very cryptic and is done in HW engineering
> style which prevents use of it by SW.

3. I bothered the MIPS Arch team for a long time until I completely
understood that MIPS SYNC_WMB, SYNC_MB, SYNC_RMB, SYNC_RELEASE and
SYNC_ACQUIRE do exactly what is required in Documentation/memory-barriers.txt

In Peter Zijlstra's mail:

> 1) you do not make such things selectable; either the hardware needs
> them or it doesn't. If it does you _must_ use them, however unlikely.

It is selectable only for MIPS R2 but not MIPS R6. The reason is - most
MIPS R2 CPUs have a short pipeline and that SYNC is just a waste of CPU
resources, especially taking into account that "lightweight syncs" are
converted to a heavy "SYNC 0" in many of those CPUs. However the latest
MIPS/Imagination CPUs have a pipeline long enough to hit a problem - absence
of SYNC at LL/SC inside atomics, barriers etc.

> And reading the MIPS64 v6.04 instruction set manual, I think 0x11/0x12
> are _NOT_ transitive and therefore cannot be used to implement the
> smp_mb__{before,after} stuff.
>
> That is, in MIPS speak, those SYNC types are Ordering Barriers, not
> Completion Barriers.

Please see above, point 2.

> That is, currently all architectures -- with exception of PPC -- have
> RCsc locks, but using these non-transitive things will get you RCpc
> locks.
>
> So yes, MIPS can go RCpc for its locks and share the burden of pain with
> PPC, but that needs to be a very conscious decision.

I don't understand that - I tried hard but I can't find any word like
"RCsc", "RCpc" in the Documents/ directory. Web search goes nowhere, of course.

In Will Deacon's mail:

> The issue I have with the SYNC description in the text above is that it
> describes the single CPU (program order) and the dual-CPU (confusingly
> named global order) cases, but then doesn't generalise any further. That
> means we can't sensibly reason about transitivity properties when a third
> agent is involved. For example, the WRC+sync+addr test:
>
>
> P0:
> Wx = 1
>
> P1:
> Rx == 1
> SYNC
> Wy = 1
>
> P2:
> Ry == 1
> <address dep>
> Rx == 0
>
>
> I can't find anything to forbid that, given the text. The main problem
> is having the SYNC on P1 affect the write by P0.

As I understand that test, the visibility of P0: W[x] = 1 is identical to P1
and P2 here. If P1 got X before SYNC and wrote to Y after SYNC then
instruction source register dependency tracking in P2 prevents a speculative
load of X before P2 obtains Y from the same place as P0/P1 and calculates
the address of X. If some load of X in P2 happens before the address
dependency calculation its result is discarded.

Yes, you can't find that in the MIPS SYNC instruction description, it is more
likely in the CM (Coherence Manager) area. I just pointed this out to our
arch team member responsible for documents and he will think how to explain
that.

- Leonid.
On Tue, Jan 12, 2016 at 12:45:14PM -0800, Leonid Yegoshin wrote:
> (I'll try to answer multiple mails in one)
>
> First of all, it seems like some generic notes should be given here:
>
> 1. The generic MIPS "SYNC" (aka "SYNC 0") instruction is very heavy on some
> CPUs. On those CPUs it basically kills the pipelines in each CPU, can do a
> special memory/IO bus transaction (similar to "fence") and holds the system
> until all R/W is completed. It is like the Big Kernel Lock but worse. So, the
> move to SMP_* kind of barriers is needed to improve performance, especially
> on the newest CPUs with long pipelines.

The MIPS SYNC isn't any worse than the PPC SYNC, x86 MFENCE or arm DSB SY,
yes they're heavy, so what.

> 2. The MIPS Arch document may be misleading because the words "ordering" and
> "completion" mean something different there than in Linux; the SYNC
> instruction description is written for HW engineers. I wrote that in a
> separate patch of the same patchset -
> http://patchwork.linux-mips.org/patch/10505/ "MIPS: R6: Use lightweight SYNC
> instruction in smp_* memory barriers":

Did you actually say anything here?

> > These instructions were specifically designed to work for smp_*() sort of
> > memory barriers in MIPS R2/R3/R5 and R6.
> >
> > Unfortunately, its description is very cryptic and is done in HW engineering
> > style which prevents use of it by SW.
>
> 3. I bothered the MIPS Arch team for a long time until I completely
> understood that MIPS SYNC_WMB, SYNC_MB, SYNC_RMB, SYNC_RELEASE and
> SYNC_ACQUIRE do exactly what is required in Documentation/memory-barriers.txt

Ha! and you think that document covers all the really fun details?

In particular we're very much all 'confused' about the various notions
of transitivity and what barriers imply how much of it.

> In Peter Zijlstra's mail:
>
> > 1) you do not make such things selectable; either the hardware needs
> > them or it doesn't. If it does you _must_ use them, however unlikely.
>
> It is selectable only for MIPS R2 but not MIPS R6. The reason is - most
> MIPS R2 CPUs have a short pipeline and that SYNC is just a waste of CPU
> resources, especially taking into account that "lightweight syncs" are
> converted to a heavy "SYNC 0" in many of those CPUs. However the latest
> MIPS/Imagination CPUs have a pipeline long enough to hit a problem - absence
> of SYNC at LL/SC inside atomics, barriers etc.

What ?! Are you saying that because R2 has short pipelines it's unlikely to
hit the reordering issues and we can omit barriers?

> > And reading the MIPS64 v6.04 instruction set manual, I think 0x11/0x12
> > are _NOT_ transitive and therefore cannot be used to implement the
> > smp_mb__{before,after} stuff.
> >
> > That is, in MIPS speak, those SYNC types are Ordering Barriers, not
> > Completion Barriers.
>
> Please see above, point 2.

That did not in fact enlighten things. Are they transitive/multi-copy
atomic or not?

(and here Will will go into great detail on the differences between the
two and make our collective brains explode :-)

> > That is, currently all architectures -- with exception of PPC -- have
> > RCsc locks, but using these non-transitive things will get you RCpc
> > locks.
> >
> > So yes, MIPS can go RCpc for its locks and share the burden of pain with
> > PPC, but that needs to be a very conscious decision.
>
> I don't understand that - I tried hard but I can't find any word like
> "RCsc", "RCpc" in the Documents/ directory. Web search goes nowhere, of course.

From: lkml.kernel.org/r/20150828153921.GF19282 at twins.programming.kicks-ass.net

Yes, the difference between RCpc and RCsc is in the meaning of RELEASE +
ACQUIRE. With RCsc that implies a full memory barrier, with RCpc it does
not.

Currently PowerPC is the only arch that (can, and) does RCpc and gives a
weaker RELEASE + ACQUIRE. Only the CPU who did the ACQUIRE is guaranteed
to see the stores of the CPU which did the RELEASE in order.

As it stands, RCU is the only _known_ codebase where this matters, but
we did in fact write code for a fair number of years 'assuming' RELEASE
+ ACQUIRE was a full barrier, so who knows what else is out there.


RCsc - release consistency sequential consistency
RCpc - release consistency processor consistency

https://en.wikipedia.org/wiki/Processor_consistency
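The statement above that RELEASE + ACQUIRE is a full barrier with RCsc but not
with RCpc can be illustrated with a store-buffering shaped test on a toy
machine. This is only a sketch under loud assumptions: each CPU gets a private
FIFO store buffer, a full barrier ("MB") stalls until the CPU's own buffer has
drained, and a back-to-back RELEASE + ACQUIRE ("REL_ACQ", a hypothetical op
name) either behaves like "MB" (the RCsc case) or does nothing in this model
(the RCpc case). It does not model PowerPC or MIPS, and it does not capture
the cross-CPU transitivity aspect mentioned above.

```python
def explore(threads, rel_acq_is_full_barrier):
    """Return every (r0, r1) result of the toy store-buffer machine."""
    start = ((0,) * len(threads), ((),) * len(threads),
             (("x", 0), ("y", 0)), ())
    seen, results, stack = set(), set(), [start]
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, bufs, mem, regs = state
        done = all(pc == len(t) for pc, t in zip(pcs, threads))
        if done and not any(bufs):
            results.add(tuple(val for _, val in regs))  # ordered by name
            continue
        for cpu, thread in enumerate(threads):
            # Transition 1: this CPU's oldest buffered store reaches memory.
            if bufs[cpu]:
                var, val = bufs[cpu][0]
                new_mem = tuple(sorted({**dict(mem), var: val}.items()))
                new_bufs = bufs[:cpu] + (bufs[cpu][1:],) + bufs[cpu + 1:]
                stack.append((pcs, new_bufs, new_mem, regs))
            # Transition 2: this CPU executes its next instruction.
            if pcs[cpu] == len(thread):
                continue
            op = thread[pcs[cpu]]
            new_pcs = pcs[:cpu] + (pcs[cpu] + 1,) + pcs[cpu + 1:]
            if op[0] == "W":
                entry = (op[1], op[2])
                new_bufs = bufs[:cpu] + (bufs[cpu] + (entry,),) + bufs[cpu + 1:]
                stack.append((new_pcs, new_bufs, mem, regs))
            elif op[0] == "R":
                val = dict(mem)[op[1]]
                for var, buffered in bufs[cpu]:  # youngest own store wins
                    if var == op[1]:
                        val = buffered
                new_regs = tuple(sorted(regs + ((op[2], val),)))
                stack.append((new_pcs, bufs, mem, new_regs))
            else:  # "MB" or "REL_ACQ"
                full = op[0] == "MB" or rel_acq_is_full_barrier
                if not full or not bufs[cpu]:  # a full barrier waits for drain
                    stack.append((new_pcs, bufs, mem, regs))
    return results


# P0 separates its store and load with RELEASE + ACQUIRE, P1 with a full barrier.
SB = [
    [("W", "x", 1), ("REL_ACQ",), ("R", "y", "r0")],
    [("W", "y", 1), ("MB",), ("R", "x", "r1")],
]

if __name__ == "__main__":
    print("RCsc-like:", sorted(explore(SB, rel_acq_is_full_barrier=True)))
    print("RCpc-like:", sorted(explore(SB, rel_acq_is_full_barrier=False)))
```

In this toy model the result r0=0, r1=0 only shows up in the RCpc-like run,
which is the kind of outcome code written 'assuming' a full barrier would not
expect.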
On Tue, Jan 12, 2016 at 12:45:14PM -0800, Leonid Yegoshin wrote:
> > The issue I have with the SYNC description in the text above is that it
> > describes the single CPU (program order) and the dual-CPU (confusingly
> > named global order) cases, but then doesn't generalise any further. That
> > means we can't sensibly reason about transitivity properties when a third
> > agent is involved. For example, the WRC+sync+addr test:
> >
> >
> > P0:
> > Wx = 1
> >
> > P1:
> > Rx == 1
> > SYNC
> > Wy = 1
> >
> > P2:
> > Ry == 1
> > <address dep>
> > Rx == 0
> >
> >
> > I can't find anything to forbid that, given the text. The main problem
> > is having the SYNC on P1 affect the write by P0.
>
> As I understand that test, the visibility of P0: W[x] = 1 is identical to P1
> and P2 here. If P1 got X before SYNC and wrote to Y after SYNC then
> instruction source register dependency tracking in P2 prevents a speculative
> load of X before P2 obtains Y from the same place as P0/P1 and calculates
> the address of X. If some load of X in P2 happens before the address
> dependency calculation its result is discarded.

I don't think the address dependency is enough on its own. By that
reasoning, the following variant (WRC+addr+addr) would work too:


P0:
Wx = 1

P1:
Rx == 1
<address dep>
Wy = 1

P2:
Ry == 1
<address dep>
Rx == 0


So are you saying that this is also forbidden?
Imagine that P0 and P1 are two threads that share a store buffer. What then?

> Yes, you can't find that in the MIPS SYNC instruction description, it is more
> likely in the CM (Coherence Manager) area. I just pointed this out to our
> arch team member responsible for documents and he will think how to explain
> that.

I tried grepping the linked documents for "coherence manager" but couldn't
find anything. Is the description you refer to available anywhere?

Will
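The shared store buffer thought experiment above can be sketched as a small
state-space search. Everything about the machine is an assumption made for
illustration: P0 and P1 share one store buffer whose entries are visible to
both of them immediately but reach memory (and therefore P2) one at a time in
any order, P2 reads memory in program order because of its address
dependency, and the optional barrier on P1 is modelled as "wait until the
shared buffer is empty". None of this describes a real MIPS core or the
documented behaviour of any SYNC type.

```python
def wrc_outcomes(p1_barrier_drains_buffer):
    """Return every reachable (r1, r2, r3) on the toy shared-buffer machine.

    r1 is P1's read of x, r2 is P2's read of y, r3 is P2's read of x.
    """
    # State: program counters, pending stores in the shared buffer,
    # memory contents (x, y), registers (r1, r2, r3).
    start = ((0, 0, 0), frozenset(), (0, 0), (None, None, None))
    seen, results, stack = set(), set(), [start]
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        (pc0, pc1, pc2), buf, (mx, my), (r1, r2, r3) = state
        pcs, mem, regs = (pc0, pc1, pc2), (mx, my), (r1, r2, r3)

        if pcs == (1, 2, 2):
            results.add(regs)  # all three threads have finished

        # Any pending store may reach memory next: the buffer is not FIFO.
        for var in buf:
            new_mem = (1, my) if var == "x" else (mx, 1)
            stack.append((pcs, buf - {var}, new_mem, regs))

        # P0: Wx = 1
        if pc0 == 0:
            stack.append(((1, pc1, pc2), buf | {"x"}, mem, regs))

        # P1: Rx, then Wy = 1 (optionally behind a buffer-draining barrier)
        if pc1 == 0:
            value = 1 if "x" in buf else mx  # forwarding from the shared buffer
            stack.append(((pc0, 1, pc2), buf, mem, (value, r2, r3)))
        elif pc1 == 1:
            if not p1_barrier_drains_buffer or not buf:
                stack.append(((pc0, 2, pc2), buf | {"y"}, mem, regs))

        # P2: Ry, <address dep>, Rx (both reads come from memory only)
        if pc2 == 0:
            stack.append(((pc0, pc1, 1), buf, mem, (r1, my, r3)))
        elif pc2 == 1:
            stack.append(((pc0, pc1, 2), buf, mem, (r1, r2, mx)))
    return results


if __name__ == "__main__":
    for drains in (False, True):
        hit = (1, 1, 0) in wrc_outcomes(drains)
        print(f"barrier drains buffer={drains}: r1=1, r2=1, r3=0 reachable: {hit}")
```

With only the dependency on P1, the store to y can leave the shared buffer
before the store to x, so P2 can observe y == 1 and then x == 0. The result
disappears in this model only when P1's barrier also covers the store by P0
that P1 has read, which is the property the thread is asking the SYNC
documentation to spell out.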