Michael S. Tsirkin
2016-Jan-12 22:10 UTC
[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks
mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's
2 to 3 times slower than the lock; addl $0,(%%e/rsp) that we use on older CPUs.

So let's use the locked variant everywhere - it helps keep the code simple as
well.

While I was at it, I found some inconsistencies in the comments in
arch/x86/include/asm/barrier.h

I hope I'm not splitting this up too much - the reason is that I wanted to
isolate the code changes (which people might want to test for performance)
from the comment changes approved by Linus, and from the (so far unreviewed)
comment change I came up with myself.

Lightly tested on my system.

Michael S. Tsirkin (3):
  x86: drop mfence in favor of lock+addl
  x86: drop a comment left over from X86_OOSTORE
  x86: tweak the comment about use of wmb for IO

 arch/x86/include/asm/barrier.h | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

--
MST
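For reference, a minimal user-space sketch of this kind of micro-benchmark -
my own illustration, not the harness that produced the numbers above - could
look like this:

/* Hypothetical benchmark: time mfence vs. lock; addl back to back.
 * Build with: gcc -O2 mb_bench.c -o mb_bench (x86-64 assumed).
 * Absolute numbers are meaningless; only the ratio is interesting. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>		/* __rdtsc() */

#define ITERS 100000000UL

int main(void)
{
	uint64_t t0, t1, t2;
	unsigned long i;

	t0 = __rdtsc();
	for (i = 0; i < ITERS; i++)
		asm volatile("mfence" ::: "memory");
	t1 = __rdtsc();
	for (i = 0; i < ITERS; i++)
		asm volatile("lock; addl $0,0(%%rsp)" ::: "memory");
	t2 = __rdtsc();

	printf("mfence:    %.2f cycles/iter\n", (double)(t1 - t0) / ITERS);
	printf("lock addl: %.2f cycles/iter\n", (double)(t2 - t1) / ITERS);
	return 0;
}

The exact ratio varies by microarchitecture, so treat any single number with
suspicion.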
Michael S. Tsirkin
2016-Jan-12 22:10 UTC
[PATCH v2 1/3] x86: drop mfence in favor of lock+addl
mfence appears to be way slower than a locked instruction - let's use
lock+add unconditionally, same as we always did on old 32-bit.

Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 arch/x86/include/asm/barrier.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index a584e1c..7f99726 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -15,11 +15,12 @@
  * Some non-Intel clones support out of order store. wmb() ceases to be a
  * nop for these.
  */
-#define mb() alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2)
+
+#define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
 #define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
 #define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
 #else
-#define mb() asm volatile("mfence":::"memory")
+#define mb() asm volatile("lock; addl $0,0(%%rsp)" ::: "memory")
 #define rmb() asm volatile("lfence":::"memory")
 #define wmb() asm volatile("sfence" ::: "memory")
 #endif
--
MST
Michael S. Tsirkin
2016-Jan-12 22:10 UTC
[PATCH v2 2/3] x86: drop a comment left over from X86_OOSTORE
The comment about wmb being non-nop is left over from before commit
09df7c4c8097 ("x86: Remove CONFIG_X86_OOSTORE").

It makes no sense now: if you have an SMP system with out of order
stores, making wmb not a nop will not help.

Additionally, wmb is not a nop even for regular Intel CPUs because of
weird use-cases, e.g. dealing with WC memory.

Drop this comment.

Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 arch/x86/include/asm/barrier.h | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 7f99726..eb220b8 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -11,11 +11,6 @@
  */
 
 #ifdef CONFIG_X86_32
-/*
- * Some non-Intel clones support out of order store. wmb() ceases to be a
- * nop for these.
- */
-
 #define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
 #define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
 #define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
--
MST
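To illustrate the WC-memory point: non-temporal stores are weakly ordered
even on Intel CPUs, so sfence (what wmb() becomes given X86_FEATURE_XMM) is
needed before publishing the data. A user-space sketch with made-up names,
using SSE2 intrinsics - illustrative only, not code from this patch:

/* Non-temporal (streaming) stores bypass the usual store ordering, so a
 * store fence must separate them from the flag that publishes the data.
 * 'payload' and 'ready' are hypothetical names for this example. */
#include <emmintrin.h>		/* _mm_stream_si32(), _mm_sfence() */

static int payload[16];
static volatile int ready;

void producer(void)
{
	int i;

	for (i = 0; i < 16; i++)
		_mm_stream_si32(&payload[i], i);	/* weakly ordered */

	_mm_sfence();	/* order streaming stores before the flag below */

	ready = 1;	/* consumer may now read payload */
}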
Michael S. Tsirkin
2016-Jan-12 22:10 UTC
[PATCH v2 3/3] x86: tweak the comment about use of wmb for IO
On x86, we *do* still use the non-nop rmb/wmb for IO barriers, but even that
is generally questionable. Leave them around as historical, unless somebody
can point to a case where they care about the performance, but tweak the
comment so people don't think they are strictly required in all cases.

Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 arch/x86/include/asm/barrier.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index eb220b8..924cd44 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -6,7 +6,7 @@
 
 /*
  * Force strict CPU ordering.
- * And yes, this is required on UP too when we're talking
+ * And yes, this might be required on UP too when we're talking
  * to devices.
  */
 
--
MST
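The UP-with-devices case that comment is about looks roughly like this - a
kernel-style sketch with hypothetical names (DESC_READY, post_buffer), not
taken from any real driver:

/* Even on a uniprocessor kernel, the descriptor stores must be visible
 * before the MMIO doorbell write that tells the device to start DMA. */
#include <linux/io.h>		/* writel() */
#include <asm/barrier.h>	/* wmb() */

#define DESC_READY 0x1		/* hypothetical descriptor flag */

struct desc {
	unsigned long addr;
	unsigned int len;
	unsigned int flags;
};

void post_buffer(struct desc *d, void __iomem *doorbell,
		 unsigned long buf, unsigned int len)
{
	d->addr = buf;
	d->len = len;
	d->flags = DESC_READY;

	wmb();			/* descriptor before doorbell */

	writel(1, doorbell);	/* device starts DMA past this point */
}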
H. Peter Anvin
2016-Jan-12 22:25 UTC
[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks
On 01/12/16 14:10, Michael S. Tsirkin wrote:
> mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's
> 2 to 3 times slower than the lock; addl $0,(%%e/rsp) that we use on older CPUs.
>
> So let's use the locked variant everywhere - it helps keep the code simple as
> well.
>
> While I was at it, I found some inconsistencies in the comments in
> arch/x86/include/asm/barrier.h
>
> I hope I'm not splitting this up too much - the reason is that I wanted to
> isolate the code changes (which people might want to test for performance)
> from the comment changes approved by Linus, and from the (so far unreviewed)
> comment change I came up with myself.
>
> Lightly tested on my system.
>
> Michael S. Tsirkin (3):
>   x86: drop mfence in favor of lock+addl
>   x86: drop a comment left over from X86_OOSTORE
>   x86: tweak the comment about use of wmb for IO
>

I would like to get feedback from the hardware team about the
implications of this change, first.

	-hpa
One Thousand Gnomes
2016-Jan-12 22:25 UTC
[PATCH v2 2/3] x86: drop a comment left over from X86_OOSTORE
On Wed, 13 Jan 2016 00:10:19 +0200
"Michael S. Tsirkin" <mst at redhat.com> wrote:

> The comment about wmb being non-nop is left over from before commit
> 09df7c4c8097 ("x86: Remove CONFIG_X86_OOSTORE").
>
> It makes no sense now: if you have an SMP system with out of order
> stores, making wmb not a nop will not help.

There were never any IDT Winchip systems with SMP support, and they were
the only systems that could enable OOSTORE (and it was worth up to 30% on
some workloads). The fencing it had was just for DMA devices.

Alan
Michael S. Tsirkin
2016-Jan-26 08:20 UTC
[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks
On Tue, Jan 12, 2016 at 02:25:24PM -0800, H. Peter Anvin wrote:
> On 01/12/16 14:10, Michael S. Tsirkin wrote:
> > mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's
> > 2 to 3 times slower than the lock; addl $0,(%%e/rsp) that we use on older CPUs.
> >
> > So let's use the locked variant everywhere - it helps keep the code simple as
> > well.
> >
> > While I was at it, I found some inconsistencies in the comments in
> > arch/x86/include/asm/barrier.h
> >
> > I hope I'm not splitting this up too much - the reason is that I wanted to
> > isolate the code changes (which people might want to test for performance)
> > from the comment changes approved by Linus, and from the (so far unreviewed)
> > comment change I came up with myself.
> >
> > Lightly tested on my system.
> >
> > Michael S. Tsirkin (3):
> >   x86: drop mfence in favor of lock+addl
> >   x86: drop a comment left over from X86_OOSTORE
> >   x86: tweak the comment about use of wmb for IO
> >
>
> I would like to get feedback from the hardware team about the
> implications of this change, first.
>
> 	-hpa

Hi hpa,
Any luck getting some feedback on this one?

Thanks,

--
MST
Michael S. Tsirkin
2018-Oct-11 18:11 UTC
[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks
On Thu, Oct 11, 2018 at 10:37:07AM -0700, Andres Freund wrote:
> Hi,
>
> On 2016-01-26 10:20:14 +0200, Michael S. Tsirkin wrote:
> > On Tue, Jan 12, 2016 at 02:25:24PM -0800, H. Peter Anvin wrote:
> > > On 01/12/16 14:10, Michael S. Tsirkin wrote:
> > > > mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's
> > > > 2 to 3 times slower than the lock; addl $0,(%%e/rsp) that we use on older CPUs.
> > > >
> > > > So let's use the locked variant everywhere - it helps keep the code simple as
> > > > well.
> > > >
> > > > While I was at it, I found some inconsistencies in the comments in
> > > > arch/x86/include/asm/barrier.h
> > > >
> > > > I hope I'm not splitting this up too much - the reason is that I wanted to
> > > > isolate the code changes (which people might want to test for performance)
> > > > from the comment changes approved by Linus, and from the (so far unreviewed)
> > > > comment change I came up with myself.
> > > >
> > > > Lightly tested on my system.
> > > >
> > > > Michael S. Tsirkin (3):
> > > >   x86: drop mfence in favor of lock+addl
> > > >   x86: drop a comment left over from X86_OOSTORE
> > > >   x86: tweak the comment about use of wmb for IO
> > > >
> > >
> > > I would like to get feedback from the hardware team about the
> > > implications of this change, first.
> >
> > Any luck getting some feedback on this one?
>
> Ping? I just saw a bunch of kernel fences in a benchmark, making me
> wonder why linux uses mfence rather than lock addl. Leading me to this
> thread.
>
> Greetings,
>
> Andres Freund

It doesn't do it for smp_mb() any longer:

commit 450cbdd0125cfa5d7bbf9e2a6b6961cc48d29730
Author: Michael S. Tsirkin <mst at redhat.com>
Date:   Fri Oct 27 19:14:31 2017 +0300

    locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE

I didn't bother with mb() since I didn't think it's performance
critical, and one needs to worry about drivers possibly doing
non-temporals etc., which do need mfence.

Do you see mb() in a benchmark then?

--
MST
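A sketch of the distinction being drawn here - illustrative definitions with
my_ prefixes, not the literal kernel source:

/* Ordinary (write-back cacheable) memory ordering is fully served by a
 * LOCK-prefixed RMW, so smp_mb() can avoid the slower mfence: */
#define my_smp_mb()	asm volatile("lock; addl $0,0(%%rsp)" ::: "memory")

/* mb() keeps mfence because mfence additionally orders weakly-ordered
 * (non-temporal/WC) accesses that drivers talking to devices may issue: */
#define my_mb()		asm volatile("mfence" ::: "memory")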