thr3ads.net - similar to: "[PATCH 3/4] x86,asm: Re-work smp_store

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

2016 Jan 12

1

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

On Tue, Jan 12, 2016 at 09:20:06AM -0800, Linus Torvalds wrote: > On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst at redhat.com> wrote: > > #ifdef xchgrz > > /* same as xchg but poking at gcc red zone */ > > #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); }

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

2016 Jan 12

1

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

On Tue, Jan 12, 2016 at 09:20:06AM -0800, Linus Torvalds wrote: > On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst at redhat.com> wrote: > > #ifdef xchgrz > > /* same as xchg but poking at gcc red zone */ > > #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); }

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

2016 Jan 12

0

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst at redhat.com> wrote: > #ifdef xchgrz > /* same as xchg but poking at gcc red zone */ > #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0) > #endif That's not safe in general. gcc might be using its

[PATCH v6] x86: use lock+addl for smp_mb()

2017 Oct 27

1

[PATCH v6] x86: use lock+addl for smp_mb()

mfence appears to be way slower than a locked instruction - let's use lock+add unconditionally, as we always did on old 32-bit. Results: perf stat -r 10 -- ./virtio_ring_0_9 --sleep --host-affinity 0 --guest-affinity 0 Before: 0.922565990 seconds time elapsed ( +- 1.15% ) After: 0.578667024 seconds time elapsed

[PATCH v6] x86: use lock+addl for smp_mb()

2017 Oct 27

1

[PATCH v6] x86: use lock+addl for smp_mb()

mfence appears to be way slower than a locked instruction - let's use lock+add unconditionally, as we always did on old 32-bit. Results: perf stat -r 10 -- ./virtio_ring_0_9 --sleep --host-affinity 0 --guest-affinity 0 Before: 0.922565990 seconds time elapsed ( +- 1.15% ) After: 0.578667024 seconds time elapsed

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

2016 Jan 12

5

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

On Tue, Jan 12, 2016 at 12:54 PM, Linus Torvalds <torvalds at linux-foundation.org> wrote: > On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto at kernel.org> wrote: >> >> I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64) >> was better because it avoided stomping on very-likely-to-be-hot write >> buffers. > > I suspect it

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

2016 Jan 12

5

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

On Tue, Jan 12, 2016 at 12:54 PM, Linus Torvalds <torvalds at linux-foundation.org> wrote: > On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto at kernel.org> wrote: >> >> I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64) >> was better because it avoided stomping on very-likely-to-be-hot write >> buffers. > > I suspect it

[PATCH v4 0/5] x86: faster smp_mb()+documentation tweaks

2016 Jan 27

6

[PATCH v4 0/5] x86: faster smp_mb()+documentation tweaks

mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than lock; addl that we use on older CPUs. So we really should use the locked variant everywhere, except that intel manual says that clflush is only ordered by mfence, so we can't. Note: some callers of clflush seems to assume sfence will order it, so there could be existing bugs around

[PATCH v4 0/5] x86: faster smp_mb()+documentation tweaks

2016 Jan 27

6

[PATCH v4 0/5] x86: faster smp_mb()+documentation tweaks

mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than lock; addl that we use on older CPUs. So we really should use the locked variant everywhere, except that intel manual says that clflush is only ordered by mfence, so we can't. Note: some callers of clflush seems to assume sfence will order it, so there could be existing bugs around

RFC: non-temporal fencing in LLVM IR

2016 Jan 13

4

RFC: non-temporal fencing in LLVM IR

Hello, fencing enthusiasts! *TL;DR:* We'd like to propose an addition to the LLVM memory model requiring non-temporal accesses be surrounded by non-temporal load barriers and non-temporal store barriers, and we'd like to add such orderings to the fence IR opcode. We are open to different approaches, hence this email instead of a patch. *Who's "we"?* Philip Reames brought

RFC: non-temporal fencing in LLVM IR

2016 Jan 14

2

RFC: non-temporal fencing in LLVM IR

Hi JF, Philip, Clang currently has __builtin_nontemporal_store and __builtin_nontemporal_load. How will the usage model for those change? Thanks again, Hal ----- Original Message ----- > From: "Philip Reames via llvm-dev" <llvm-dev at lists.llvm.org> > To: "JF Bastien" <jfb at google.com>, "llvm-dev" > <llvm-dev at lists.llvm.org> >

RFC: non-temporal fencing in LLVM IR

2016 Jan 14

2

RFC: non-temporal fencing in LLVM IR

On Thu, Jan 14, 2016 at 1:35 PM, David Majnemer <david.majnemer at gmail.com> wrote: > > > On Thu, Jan 14, 2016 at 1:13 PM, JF Bastien <jfb at google.com> wrote: > >> On Thu, Jan 14, 2016 at 1:10 PM, David Majnemer via llvm-dev < >> llvm-dev at lists.llvm.org> wrote: >> >>> >>> >>> On Wed, Jan 13, 2016 at 7:00 PM, Hans

[PATCH v5 0/5] x86: faster smp_mb()+documentation tweaks

2016 Jan 28

10

[PATCH v5 0/5] x86: faster smp_mb()+documentation tweaks

mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than lock; addl that we use on older CPUs. So we really should use the locked variant everywhere, except that intel manual says that clflush is only ordered by mfence, so we can't. Note: some callers of clflush seems to assume sfence will order it, so there could be existing bugs around

[PATCH v5 0/5] x86: faster smp_mb()+documentation tweaks

2016 Jan 28

10

[PATCH v5 0/5] x86: faster smp_mb()+documentation tweaks

mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than lock; addl that we use on older CPUs. So we really should use the locked variant everywhere, except that intel manual says that clflush is only ordered by mfence, so we can't. Note: some callers of clflush seems to assume sfence will order it, so there could be existing bugs around

RFC: non-temporal fencing in LLVM IR

2016 Jan 14

2

RFC: non-temporal fencing in LLVM IR

On Thu, Jan 14, 2016 at 1:10 PM, David Majnemer via llvm-dev < llvm-dev at lists.llvm.org> wrote: > > > On Wed, Jan 13, 2016 at 7:00 PM, Hans Boehm via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > >> I agree with Tim's assessment for ARM. That's interesting; I wasn't >> previously aware of that instruction. >> >> My

[PATCH v3 0/4] x86: faster mb()+documentation tweaks

2016 Jan 13

6

[PATCH v3 0/4] x86: faster mb()+documentation tweaks

mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than lock; addl that we use on older CPUs. So let's use the locked variant everywhere. While I was at it, I found some inconsistencies in comments in arch/x86/include/asm/barrier.h The documentation fixes are included first - I verified that they do not change the generated code at all.

[PATCH v3 0/4] x86: faster mb()+documentation tweaks

2016 Jan 13

6

[PATCH v3 0/4] x86: faster mb()+documentation tweaks

mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than lock; addl that we use on older CPUs. So let's use the locked variant everywhere. While I was at it, I found some inconsistencies in comments in arch/x86/include/asm/barrier.h The documentation fixes are included first - I verified that they do not change the generated code at all.

[PATCH v4 5/5] x86: drop mfence in favor of lock+addl

2016 Jan 27

0

[PATCH v4 5/5] x86: drop mfence in favor of lock+addl

mfence appears to be way slower than a locked instruction - let's use lock+add unconditionally, as we always did on old 32-bit. Just poking at SP would be the most natural, but if we then read the value from SP, we get a false dependency which will slow us down. This was noted in this article: http://shipilev.net/blog/2014/on-the-fence-with-dependencies/ And is easy to reproduce by sticking

RFC: non-temporal fencing in LLVM IR

2016 Jan 15

3

RFC: non-temporal fencing in LLVM IR

On 01/14/2016 04:05 PM, Hans Boehm via llvm-dev wrote: > > > On Thu, Jan 14, 2016 at 1:37 PM, JF Bastien <jfb at google.com > <mailto:jfb at google.com>> wrote: > > On Thu, Jan 14, 2016 at 1:35 PM, David Majnemer > <david.majnemer at gmail.com <mailto:david.majnemer at gmail.com>> wrote: > > > > On Thu, Jan 14, 2016 at 1:13

[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks

2016 Jan 12

7

[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks

mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than lock; addl $0,(%%e/rsp) that we use on older CPUs. So let's use the locked variant everywhere - helps keep the code simple as well. While I was at it, I found some inconsistencies in comments in arch/x86/include/asm/barrier.h I hope I'm not splitting this up too much - the reason

similar to: [PATCH 3/4] x86,asm: Re-work smp_store_mb()