Michael S. Tsirkin
2016-Jan-12 22:10 UTC
[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks
mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's
2 to 3 times slower than lock; addl $0,(%%e/rsp) that we use on older CPUs.

So let's use the locked variant everywhere - helps keep the code simple as
well.

While I was at it, I found some inconsistencies in comments in
arch/x86/include/asm/barrier.h

I hope I'm not splitting this up too much - the reason is I wanted to isolate
the code changes (that people might want to test for performance) from comment
changes approved by Linus, from (so far unreviewed) comment change I came up
with myself.

Lightly tested on my system.

Michael S. Tsirkin (3):
  x86: drop mfence in favor of lock+addl
  x86: drop a comment left over from X86_OOSTORE
  x86: tweak the comment about use of wmb for IO

 arch/x86/include/asm/barrier.h | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

--
MST
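For readers who want to reproduce the comparison, a minimal userspace sketch
of this kind of micro-benchmark could look as follows. This is illustrative
code under stated assumptions (x86-64, gcc, TSC-based timing, no attempt to
control for frequency scaling), not the benchmark MST actually ran:

/* Hypothetical reproduction of the micro-benchmark described above.
 * Compile with: gcc -O2 bench.c; x86-64 only. */
#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	/* rdtsc returns the timestamp counter in edx:eax */
	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	enum { ITERS = 10000000 };
	uint64_t t0, t1, t2;
	int i;

	t0 = rdtsc();
	for (i = 0; i < ITERS; i++)
		asm volatile("mfence" ::: "memory");
	t1 = rdtsc();
	for (i = 0; i < ITERS; i++)
		asm volatile("lock; addl $0,0(%%rsp)" ::: "memory");
	t2 = rdtsc();

	printf("mfence:    %.2f cycles/op\n", (double)(t1 - t0) / ITERS);
	printf("lock addl: %.2f cycles/op\n", (double)(t2 - t1) / ITERS);
	return 0;
}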
Michael S. Tsirkin
2016-Jan-12 22:10 UTC
[PATCH v2 1/3] x86: drop mfence in favor of lock+addl
mfence appears to be way slower than a locked instruction - let's use
lock+add unconditionally, same as we always did on old 32-bit.

Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 arch/x86/include/asm/barrier.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index a584e1c..7f99726 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -15,11 +15,12 @@
  * Some non-Intel clones support out of order store. wmb() ceases to be a
  * nop for these.
  */
-#define mb() alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2)
+
+#define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
 #define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
 #define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
 #else
-#define mb() asm volatile("mfence":::"memory")
+#define mb() asm volatile("lock; addl $0,0(%%rsp)" ::: "memory")
 #define rmb() asm volatile("lfence":::"memory")
 #define wmb() asm volatile("sfence" ::: "memory")
 #endif
--
MST
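Why a lock-prefixed read-modify-write works as a full barrier can be seen in
the classic store-buffering litmus test: without a fence, x86's store buffer
allows both threads to read the stale 0. The sketch below is hypothetical
demonstration code (the full_mb macro is made up here), not part of the
patch:

/* Store-buffering litmus test: with full_mb() in place, the outcome
 * "r1 == 0 && r2 == 0" should never be observed; comment the barrier
 * out to see violations. Compile with: gcc -O2 -pthread sb.c */
#include <pthread.h>
#include <stdio.h>

#define full_mb() asm volatile("lock; addl $0,0(%%rsp)" ::: "memory")

volatile int x, y, r1, r2;
pthread_barrier_t start;

static void *t0(void *arg)
{
	pthread_barrier_wait(&start);	/* line both threads up */
	x = 1;
	full_mb();
	r1 = y;
	return NULL;
}

static void *t1(void *arg)
{
	pthread_barrier_wait(&start);
	y = 1;
	full_mb();
	r2 = x;
	return NULL;
}

int main(void)
{
	int i, violations = 0;

	for (i = 0; i < 100000; i++) {
		pthread_t a, b;

		x = y = 0;
		pthread_barrier_init(&start, NULL, 2);
		pthread_create(&a, NULL, t0, NULL);
		pthread_create(&b, NULL, t1, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		pthread_barrier_destroy(&start);
		if (r1 == 0 && r2 == 0)
			violations++;
	}
	printf("r1 == r2 == 0 seen %d times\n", violations);
	return 0;
}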
Michael S. Tsirkin
2016-Jan-12 22:10 UTC
[PATCH v2 2/3] x86: drop a comment left over from X86_OOSTORE
The comment about wmb being non-nop is a left over from before commit
09df7c4c8097 ("x86: Remove CONFIG_X86_OOSTORE").

It makes no sense now: if you have an SMP system with out of order
stores, making wmb not a nop will not help.

Additionally, wmb is not a nop even for regular intel CPUs because
of weird use-cases e.g. dealing with WC memory.

Drop this comment.

Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 arch/x86/include/asm/barrier.h | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 7f99726..eb220b8 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -11,11 +11,6 @@
  */

 #ifdef CONFIG_X86_32
-/*
- * Some non-Intel clones support out of order store. wmb() ceases to be a
- * nop for these.
- */
-
 #define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
 #define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
 #define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
--
MST
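The WC-memory case mentioned in the log has a convenient userspace analogue
in non-temporal (streaming) stores, which are weakly ordered in the same
way. A minimal sketch, assuming SSE2 intrinsics; this is illustrative code,
not from the patch:

/* Streaming stores bypass the cache and may drain from the
 * write-combining buffers out of program order, so an sfence --
 * the non-nop wmb() -- is needed before publishing the data. */
#include <emmintrin.h>	/* _mm_stream_si32, _mm_sfence */

void publish(int *buf, int n, volatile int *flag)
{
	int i;

	for (i = 0; i < n; i++)
		_mm_stream_si32(&buf[i], i);	/* non-temporal stores */

	/* Without this fence, another CPU could observe *flag == 1
	 * while the streamed data is still in flight. */
	_mm_sfence();

	*flag = 1;
}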
Michael S. Tsirkin
2016-Jan-12 22:10 UTC
[PATCH v2 3/3] x86: tweak the comment about use of wmb for IO
On x86, we *do* still use the non-nop rmb/wmb for IO barriers, but even that
is generally questionable.

Leave them around as historical unless somebody can point to a case where
they care about the performance, but tweak the comment so people don't
think they are strictly required in all cases.

Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 arch/x86/include/asm/barrier.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index eb220b8..924cd44 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -6,7 +6,7 @@

 /*
  * Force strict CPU ordering.
- * And yes, this is required on UP too when we're talking
+ * And yes, this might be required on UP too when we're talking
  * to devices.
  */

--
MST
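The IO use of wmb() that this comment is about is the usual
descriptor-then-doorbell pattern. The fragment below is schematic, with
made-up device, field, and register names (my_dev, my_desc, MY_DESC_READY,
MY_DOORBELL); it is not a compilable driver:

/* The descriptor writes must be visible to the device before the
 * doorbell MMIO write triggers the DMA read. */
static void ring_doorbell(struct my_dev *dev, struct my_desc *desc,
			  dma_addr_t addr)
{
	desc->addr = addr;		/* 1. fill the DMA descriptor ... */
	desc->flags = MY_DESC_READY;	/*    ... in coherent memory */

	wmb();				/* 2. order the stores above ... */

	writel(1, dev->mmio + MY_DOORBELL);	/* 3. ... before the doorbell */
}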
H. Peter Anvin
2016-Jan-12 22:25 UTC
[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks
On 01/12/16 14:10, Michael S. Tsirkin wrote:
> mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's
> 2 to 3 times slower than lock; addl $0,(%%e/rsp) that we use on older CPUs.
>
> So let's use the locked variant everywhere - helps keep the code simple as
> well.
>
> While I was at it, I found some inconsistencies in comments in
> arch/x86/include/asm/barrier.h
>
> I hope I'm not splitting this up too much - the reason is I wanted to isolate
> the code changes (that people might want to test for performance) from comment
> changes approved by Linus, from (so far unreviewed) comment change I came up
> with myself.
>
> Lightly tested on my system.
>
> Michael S. Tsirkin (3):
>   x86: drop mfence in favor of lock+addl
>   x86: drop a comment left over from X86_OOSTORE
>   x86: tweak the comment about use of wmb for IO
>

I would like to get feedback from the hardware team about the
implications of this change, first.

	-hpa
One Thousand Gnomes
2016-Jan-12 22:25 UTC
[PATCH v2 2/3] x86: drop a comment left over from X86_OOSTORE
On Wed, 13 Jan 2016 00:10:19 +0200
"Michael S. Tsirkin" <mst at redhat.com> wrote:

> The comment about wmb being non-nop is a left over from before commit
> 09df7c4c8097 ("x86: Remove CONFIG_X86_OOSTORE").
>
> It makes no sense now: if you have an SMP system with out of order
> stores, making wmb not a nop will not help.

There were never any IDT Winchip systems with SMP support, and they were
the one system that could enable OOSTORE (and it was worth up to 30% on
some workloads). The fencing it had was just for DMA devices.

Alan
Michael S. Tsirkin
2016-Jan-26 08:20 UTC
[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks
On Tue, Jan 12, 2016 at 02:25:24PM -0800, H. Peter Anvin wrote:
> On 01/12/16 14:10, Michael S. Tsirkin wrote:
> > mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's
> > 2 to 3 times slower than lock; addl $0,(%%e/rsp) that we use on older CPUs.
> >
> > So let's use the locked variant everywhere - helps keep the code simple as
> > well.
> >
> > While I was at it, I found some inconsistencies in comments in
> > arch/x86/include/asm/barrier.h
> >
> > I hope I'm not splitting this up too much - the reason is I wanted to isolate
> > the code changes (that people might want to test for performance) from comment
> > changes approved by Linus, from (so far unreviewed) comment change I came up
> > with myself.
> >
> > Lightly tested on my system.
> >
> > Michael S. Tsirkin (3):
> >   x86: drop mfence in favor of lock+addl
> >   x86: drop a comment left over from X86_OOSTORE
> >   x86: tweak the comment about use of wmb for IO
> >
>
> I would like to get feedback from the hardware team about the
> implications of this change, first.
>
> 	-hpa
>

Hi hpa,
Any luck getting some feedback on this one?

Thanks,

--
MST
Michael S. Tsirkin
2018-Oct-11 18:11 UTC
[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks
On Thu, Oct 11, 2018 at 10:37:07AM -0700, Andres Freund wrote:
> Hi,
>
> On 2016-01-26 10:20:14 +0200, Michael S. Tsirkin wrote:
> > On Tue, Jan 12, 2016 at 02:25:24PM -0800, H. Peter Anvin wrote:
> > > On 01/12/16 14:10, Michael S. Tsirkin wrote:
> > > > mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's
> > > > 2 to 3 times slower than lock; addl $0,(%%e/rsp) that we use on older CPUs.
> > > >
> > > > So let's use the locked variant everywhere - helps keep the code simple as
> > > > well.
> > > >
> > > > While I was at it, I found some inconsistencies in comments in
> > > > arch/x86/include/asm/barrier.h
> > > >
> > > > I hope I'm not splitting this up too much - the reason is I wanted to isolate
> > > > the code changes (that people might want to test for performance) from comment
> > > > changes approved by Linus, from (so far unreviewed) comment change I came up
> > > > with myself.
> > > >
> > > > Lightly tested on my system.
> > > >
> > > > Michael S. Tsirkin (3):
> > > >   x86: drop mfence in favor of lock+addl
> > > >   x86: drop a comment left over from X86_OOSTORE
> > > >   x86: tweak the comment about use of wmb for IO
> > > >
> > >
> > > I would like to get feedback from the hardware team about the
> > > implications of this change, first.
>
> > Any luck getting some feedback on this one?
>
> Ping? I just saw a bunch of kernel fences in a benchmark, making me
> wonder why linux uses mfence rather than lock addl. Leading me to this
> thread.
>
> Greetings,
>
> Andres Freund

It doesn't do it for smp_mb any longer:

commit 450cbdd0125cfa5d7bbf9e2a6b6961cc48d29730
Author: Michael S. Tsirkin <mst at redhat.com>
Date:   Fri Oct 27 19:14:31 2017 +0300

    locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE

I didn't bother with mb() since I didn't think it's performance
critical, and one needs to worry about drivers possibly doing
non-temporals etc which do need mfence.

Do you see mb() in a benchmark then?

--
MST
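The non-temporal case mentioned here is the reason mb() kept mfence in that
commit: per this message's reasoning, a weakly-ordered streaming store is
not necessarily fenced by a LOCK-prefixed access, while mfence does order
it. A hypothetical userspace illustration of the distinction (intrinsics,
not kernel code):

/* nt_publish: the NT store must be drained before *ready is set. */
#include <emmintrin.h>	/* _mm_stream_si32, _mm_mfence */

void nt_publish(int *slot, volatile int *ready)
{
	_mm_stream_si32(slot, 42);	/* weakly-ordered NT store */

	/* An smp_mb()-style "lock; addl $0,0(%rsp)" orders ordinary
	 * loads and stores, but is not documented to drain the NT
	 * store above; mfence is, which is why mb() stayed mfence. */
	_mm_mfence();

	*ready = 1;
}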