thr3ads.net - similar to: "[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks"

[PATCH v3 0/4] x86: faster mb()+documentation tweaks

2016 Jan 13

6

[PATCH v3 0/4] x86: faster mb()+documentation tweaks

mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than lock; addl that we use on older CPUs. So let's use the locked variant everywhere. While I was at it, I found some inconsistencies in comments in arch/x86/include/asm/barrier.h The documentation fixes are included first - I verified that they do not change the generated code at all.

[PATCH v3 0/4] x86: faster mb()+documentation tweaks

2016 Jan 13

6

[PATCH v3 0/4] x86: faster mb()+documentation tweaks

mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than lock; addl that we use on older CPUs. So let's use the locked variant everywhere. While I was at it, I found some inconsistencies in comments in arch/x86/include/asm/barrier.h The documentation fixes are included first - I verified that they do not change the generated code at all.

[PATCH v5 0/5] x86: faster smp_mb()+documentation tweaks

2016 Jan 28

10

[PATCH v5 0/5] x86: faster smp_mb()+documentation tweaks

mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than lock; addl that we use on older CPUs. So we really should use the locked variant everywhere, except that intel manual says that clflush is only ordered by mfence, so we can't. Note: some callers of clflush seems to assume sfence will order it, so there could be existing bugs around

[PATCH v5 0/5] x86: faster smp_mb()+documentation tweaks

2016 Jan 28

10

[PATCH v5 0/5] x86: faster smp_mb()+documentation tweaks

mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than lock; addl that we use on older CPUs. So we really should use the locked variant everywhere, except that intel manual says that clflush is only ordered by mfence, so we can't. Note: some callers of clflush seems to assume sfence will order it, so there could be existing bugs around

[PATCH v4 0/5] x86: faster smp_mb()+documentation tweaks

2016 Jan 27

6

[PATCH v4 0/5] x86: faster smp_mb()+documentation tweaks

mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than lock; addl that we use on older CPUs. So we really should use the locked variant everywhere, except that intel manual says that clflush is only ordered by mfence, so we can't. Note: some callers of clflush seems to assume sfence will order it, so there could be existing bugs around

[PATCH v4 0/5] x86: faster smp_mb()+documentation tweaks

2016 Jan 27

6

[PATCH v4 0/5] x86: faster smp_mb()+documentation tweaks

mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than lock; addl that we use on older CPUs. So we really should use the locked variant everywhere, except that intel manual says that clflush is only ordered by mfence, so we can't. Note: some callers of clflush seems to assume sfence will order it, so there could be existing bugs around

[PATCH v6] x86: use lock+addl for smp_mb()

2017 Oct 27

1

[PATCH v6] x86: use lock+addl for smp_mb()

mfence appears to be way slower than a locked instruction - let's use lock+add unconditionally, as we always did on old 32-bit. Results: perf stat -r 10 -- ./virtio_ring_0_9 --sleep --host-affinity 0 --guest-affinity 0 Before: 0.922565990 seconds time elapsed ( +- 1.15% ) After: 0.578667024 seconds time elapsed

[PATCH v6] x86: use lock+addl for smp_mb()

2017 Oct 27

1

[PATCH v6] x86: use lock+addl for smp_mb()

mfence appears to be way slower than a locked instruction - let's use lock+add unconditionally, as we always did on old 32-bit. Results: perf stat -r 10 -- ./virtio_ring_0_9 --sleep --host-affinity 0 --guest-affinity 0 Before: 0.922565990 seconds time elapsed ( +- 1.15% ) After: 0.578667024 seconds time elapsed

[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks

2016 Jan 26

2

[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks

On Tue, Jan 12, 2016 at 02:25:24PM -0800, H. Peter Anvin wrote: > On 01/12/16 14:10, Michael S. Tsirkin wrote: > > mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's > > 2 to 3 times slower than lock; addl $0,(%%e/rsp) that we use on older CPUs. > > > > So let's use the locked variant everywhere - helps keep the code simple as >

[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks

2016 Jan 26

2

[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks

On Tue, Jan 12, 2016 at 02:25:24PM -0800, H. Peter Anvin wrote: > On 01/12/16 14:10, Michael S. Tsirkin wrote: > > mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's > > 2 to 3 times slower than lock; addl $0,(%%e/rsp) that we use on older CPUs. > > > > So let's use the locked variant everywhere - helps keep the code simple as >

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

2016 Jan 12

3

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

On Mon, Nov 02, 2015 at 04:06:46PM -0800, Linus Torvalds wrote: > On Mon, Nov 2, 2015 at 12:15 PM, Davidlohr Bueso <dave at stgolabs.net> wrote: > > > > So I ran some experiments on an IvyBridge (2.8GHz) and the cost of XCHG is > > constantly cheaper (by at least half the latency) than MFENCE. While there > > was a decent amount of variation, this difference

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

2016 Jan 12

3

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

On Mon, Nov 02, 2015 at 04:06:46PM -0800, Linus Torvalds wrote: > On Mon, Nov 2, 2015 at 12:15 PM, Davidlohr Bueso <dave at stgolabs.net> wrote: > > > > So I ran some experiments on an IvyBridge (2.8GHz) and the cost of XCHG is > > constantly cheaper (by at least half the latency) than MFENCE. While there > > was a decent amount of variation, this difference

[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks

2018 Oct 11

0

[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks

On Thu, Oct 11, 2018 at 10:37:07AM -0700, Andres Freund wrote: > Hi, > > On 2016-01-26 10:20:14 +0200, Michael S. Tsirkin wrote: > > On Tue, Jan 12, 2016 at 02:25:24PM -0800, H. Peter Anvin wrote: > > > On 01/12/16 14:10, Michael S. Tsirkin wrote: > > > > mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's > > > > 2

[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks

2016 Jan 12

0

[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks

On 01/12/16 14:10, Michael S. Tsirkin wrote: > mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's > 2 to 3 times slower than lock; addl $0,(%%e/rsp) that we use on older CPUs. > > So let's use the locked variant everywhere - helps keep the code simple as > well. > > While I was at it, I found some inconsistencies in comments in >

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

2016 Jan 12

5

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

On Tue, Jan 12, 2016 at 12:54 PM, Linus Torvalds <torvalds at linux-foundation.org> wrote: > On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto at kernel.org> wrote: >> >> I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64) >> was better because it avoided stomping on very-likely-to-be-hot write >> buffers. > > I suspect it

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

2016 Jan 12

5

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

On Tue, Jan 12, 2016 at 12:54 PM, Linus Torvalds <torvalds at linux-foundation.org> wrote: > On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto at kernel.org> wrote: >> >> I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64) >> was better because it avoided stomping on very-likely-to-be-hot write >> buffers. > > I suspect it

[PATCH v2 2/3] x86: drop a comment left over from X86_OOSTORE

2016 Jan 12

0

[PATCH v2 2/3] x86: drop a comment left over from X86_OOSTORE

The comment about wmb being non-nop is a left over from before commit 09df7c4c8097 ("x86: Remove CONFIG_X86_OOSTORE"). It makes no sense now: if you have an SMP system with out of order stores, making wmb not a nop will not help. Additionally, wmb is not a nop even for regular intel CPUs because of weird use-cases e.g. dealing with WC memory. Drop this comment. Signed-off-by: Michael

[PATCH v3 0/4] x86: faster mb()+documentation tweaks

2016 Jan 14

0

[PATCH v3 0/4] x86: faster mb()+documentation tweaks

On Wed, Jan 13, 2016 at 10:12:22PM +0200, Michael S. Tsirkin wrote: > mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's > 2 to 3 times slower than lock; addl that we use on older CPUs. > > So let's use the locked variant everywhere. > > While I was at it, I found some inconsistencies in comments in > arch/x86/include/asm/barrier.h >

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

2016 Jan 12

1

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

On Tue, Jan 12, 2016 at 09:20:06AM -0800, Linus Torvalds wrote: > On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst at redhat.com> wrote: > > #ifdef xchgrz > > /* same as xchg but poking at gcc red zone */ > > #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); }

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

2016 Jan 12

1

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

On Tue, Jan 12, 2016 at 09:20:06AM -0800, Linus Torvalds wrote: > On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst at redhat.com> wrote: > > #ifdef xchgrz > > /* same as xchg but poking at gcc red zone */ > > #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); }

similar to: [PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks