thr3ads.net - search: "mfenc"

2008 Oct 17

2

[LLVMdev] MFENCE encoding

Hi, I have a problem with creating a MFENCE on X86 with SSE. In X86InstrSSE.td, a MFENCE is: def MFENCE : I<0xAE, MRM6m, (outs), (ins), "mfence", [(int_x86_sse2_mfence)]>, TB, Requires<[HasSSE2]>; In X86CodeEmitter.cpp, in emitInstruction: case X86II::MRM6m: case X86II::MRM7m: { intptr_t PCAdj = (...
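The snippet above is cut off before it states the actual problem, so the following is only a sketch of the encoding involved, not the fix discussed in the thread. The architectural encoding of mfence is the three bytes 0F AE F0: the 0F escape (TB in the definition), the opcode 0xAE, and a ModRM byte with mod = 11b, reg = 6, rm = 0. MRM6m, by contrast, names a form whose ModRM byte addresses memory. The helper names modrm and encode_fence are hypothetical and are not part of LLVM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a ModRM byte from its fields: mod (2 bits), reg (3 bits), rm (3 bits). */
static uint8_t modrm(unsigned mod, unsigned reg, unsigned rm)
{
    assert(mod < 4 && reg < 8 && rm < 8);
    return (uint8_t)((mod << 6) | (reg << 3) | rm);
}

/* Hypothetical helper: write the 3-byte encoding of the fence whose opcode
 * extension is `reg` (5 = lfence, 6 = mfence, 7 = sfence) into out[0..2].
 * Returns the number of bytes written. */
static size_t encode_fence(unsigned reg, uint8_t out[3])
{
    out[0] = 0x0F;              /* two-byte opcode escape (TB in the .td file) */
    out[1] = 0xAE;              /* opcode byte from the MFENCE definition */
    out[2] = modrm(3, reg, 0);  /* mod = 11b: register form, no memory operand */
    return 3;
}
```

Calling encode_fence(6, buf) fills buf with 0F AE F0. The same routine yields 0F AE E8 for lfence and 0F AE F8 for sfence, since only the reg field differs.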

[LLVMdev] MFENCE encoding

2008 Oct 17

0

[LLVMdev] MFENCE encoding

Hmm. mfence and lfence needs special handling. I'll take a look. Evan On Oct 16, 2008, at 10:46 PM, Mon Ping Wang wrote: > Hi, > > I have a problem with creating a MFENCE on X86 with SSE > > In X86InstrSSE.td, a MFENCE is > def MFENCE : I<0xAE, MRM6m, (outs), (ins), >...

[LLVMdev] MFENCE encoding

2008 Oct 17

1

[LLVMdev] MFENCE encoding

I've fixed this (untested though). http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20081013/068611.html Evan On Oct 17, 2008, at 9:51 AM, Evan Cheng wrote: > Hmm. mfence and lfence needs special handling. I'll take a look. > > Evan > > On Oct 16, 2008, at 10:46 PM, Mon Ping Wang wrote: > >> Hi, >> >> I have a problem with creating a MFENCE on X86 with SSE >> >> In X86InstrSSE.td, a MFENCE is >> def MFENCE : I<...

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

2016 Jan 12

3

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

..., 2015 at 04:06:46PM -0800, Linus Torvalds wrote: > On Mon, Nov 2, 2015 at 12:15 PM, Davidlohr Bueso <dave at stgolabs.net> wrote: > > > > So I ran some experiments on an IvyBridge (2.8GHz) and the cost of XCHG is > > constantly cheaper (by at least half the latency) than MFENCE. While there > > was a decent amount of variation, this difference remained rather constant. > > Mind testing "lock addq $0,0(%rsp)" instead of mfence? That's what we > use on old cpu's without one (ie 32-bit). > > I'm not actually convinced that mfence...

[PATCH v4 5/5] x86: drop mfence in favor of lock+addl

2016 Jan 27

0

[PATCH v4 5/5] x86: drop mfence in favor of lock+addl

mfence appears to be way slower than a locked instruction - let's use lock+add unconditionally, as we always did on old 32-bit. Just poking at SP would be the most natural, but if we then read the value from SP, we get a false dependency which will slow us down. This was noted in this article: http...
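A minimal sketch of the idea in the snippet above, assuming GCC or Clang inline assembly. The locked add of zero targets a word just below the stack pointer rather than the word at SP itself, so that a later read of the value at SP does not pick up a dependency on the locked instruction. The -4 offset and the names full_barrier_locked_add and store_then_load are assumptions made for illustration; the snippet is truncated before it shows the patch's actual code.

```c
#include <assert.h>

/* Hypothetical name; not the kernel's macro. Full barrier via a locked
 * no-op add to the word just below the stack pointer. Adding 0 leaves the
 * memory contents unchanged. */
static inline void full_barrier_locked_add(void)
{
#if defined(__x86_64__)
    __asm__ __volatile__("lock; addl $0,-4(%%rsp)" ::: "memory", "cc");
#elif defined(__i386__)
    __asm__ __volatile__("lock; addl $0,-4(%%esp)" ::: "memory", "cc");
#else
    /* Non-x86 fallback so the sketch still compiles elsewhere. */
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

/* Store to *flag, issue the barrier, then load *other: the store-then-load
 * pattern that needs a full barrier on x86. */
static int store_then_load(volatile int *flag, const volatile int *other)
{
    assert(flag && other);
    *flag = 1;
    full_barrier_locked_add();
    return *other;
}
```

The "memory" clobber also stops the compiler from moving memory accesses across the barrier, which the hardware instruction alone does not guarantee.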

[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks

2016 Jan 12

7

[PATCH v2 0/3] x86: faster mb()+other barrier.h tweaks

mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than lock; addl $0,(%%e/rsp) that we use on older CPUs. So let's use the locked variant everywhere - helps keep the code simple as well. While I was at it, I found some inconsistencies in comments in arch/x86/includ...
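The micro-benchmark itself is not shown in this excerpt. The following is only a rough sketch of how such a comparison could be set up, assuming GCC or Clang inline assembly; it is not the author's benchmark. The function names are hypothetical, and the indirect call adds the same overhead to both variants.

```c
#include <assert.h>
#include <time.h>

/* Barrier variant 1: mfence (needs SSE2 on 32-bit x86). */
static void barrier_mfence(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("mfence" ::: "memory");
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST);  /* non-x86 fallback */
#endif
}

/* Barrier variant 2: locked add of zero to the word at the stack pointer,
 * the form quoted in the cover letter. */
static void barrier_lock_add(void)
{
#if defined(__x86_64__)
    __asm__ __volatile__("lock; addl $0,(%%rsp)" ::: "memory", "cc");
#elif defined(__i386__)
    __asm__ __volatile__("lock; addl $0,(%%esp)" ::: "memory", "cc");
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST);  /* non-x86 fallback */
#endif
}

/* Run `iters` back-to-back calls of `fn` and return the CPU time used,
 * in seconds, as measured by clock(). */
static double time_barrier(void (*fn)(void), long iters)
{
    assert(fn && iters > 0);
    clock_t start = clock();
    for (long i = 0; i < iters; i++)
        fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Comparing time_barrier(barrier_mfence, n) with time_barrier(barrier_lock_add, n) for a large n gives a first impression only. A tight loop of barriers with idle store buffers is a best case, and results vary by CPU model.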

[PATCH v6] x86: use lock+addl for smp_mb()

2017 Oct 27

1

[PATCH v6] x86: use lock+addl for smp_mb()

mfence appears to be way slower than a locked instruction - let's use lock+add unconditionally, as we always did on old 32-bit. Results: perf stat -r 10 -- ./virtio_ring_0_9 --sleep --host-affinity 0 --guest-affinity 0 Before: 0.922565990 seconds time elapsed...

[PATCH v4 0/5] x86: faster smp_mb()+documentation tweaks

2016 Jan 27

6

[PATCH v4 0/5] x86: faster smp_mb()+documentation tweaks

mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than lock; addl that we use on older CPUs. So we really should use the locked variant everywhere, except that intel manual says that clflush is only ordered by mfence, so we can't. Note: some callers of clflush seems...

[PATCH v5 0/5] x86: faster smp_mb()+documentation tweaks

2016 Jan 28

10

[PATCH v5 0/5] x86: faster smp_mb()+documentation tweaks

mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than lock; addl that we use on older CPUs. So we really should use the locked variant everywhere, except that intel manual says that clflush is only ordered by mfence, so we can't. Note: some callers of clflush seems...

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

2016 Jan 12

0

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

...uffers are busy etc), but as a baseline for "how fast can things go" the stupid raw loop is fine. And while the xchg into the redzone wouldn't be acceptable as a real implementation, for timing testing it's likely fine (ie you aren't hitting the problem it can cause). > So mfence is more expensive than locked instructions/xchg, but sfence/lfence > are slightly faster, and xchg and locked instructions are very close if > not the same. Note that we never actually *use* lfence/sfence. They are pointless instructions when looking at CPU memory ordering, because for pure...

[PATCH v3 0/4] x86: faster mb()+documentation tweaks

2016 Jan 13

6

[PATCH v3 0/4] x86: faster mb()+documentation tweaks

mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than lock; addl that we use on older CPUs. So let's use the locked variant everywhere. While I was at it, I found some inconsistencies in comments in arch/x86/include/asm/barrier.h The documentation fixes are inclu...

RFC: non-temporal fencing in LLVM IR

2016 Jan 14

2

RFC: non-temporal fencing in LLVM IR

...rising if they did occur. >>>> >>> >>> Today's LLVM already emits 'lock or %eax, (%esp)' for 'fence >>> seq_cst'/__sync_synchronize/__atomic_thread_fence(__ATOMIC_SEQ_CST) when >>> targeting 32-bit x86 machines which do not support mfence. What >>> instruction sequence should we be using instead? >>> >> >> Do they have non-temporal accesses in the ISA? >> > > I thought not but there appear to be instructions like movntps. mfence > was introduced in SSE2 while movntps and sfence were in...

[LLVMdev] confused about llvm.memory.barrier

2008 Sep 25

5

[LLVMdev] confused about llvm.memory.barrier

When I request a write-before-read memory barrier on x86 I would expect to get an assembly instruction that would enforce this ordering (mfence, xchg, cas), but it just turns into a nop.
; ModuleID = 'test.bc'
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
target triple = "i686-apple-darwin9"
@a...
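To make "write-before-read" concrete, here is a small sketch in portable C11 atomics rather than the llvm.memory.barrier intrinsic from the post, whose test case is truncated above. Each side stores to its own flag and then loads the other's. Without a full barrier between the two, x86 may let the load complete before the earlier store to a different location becomes visible. The variable and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Two shared flags; static storage, so both start at zero. */
static atomic_int x, y;

/* Store 1 to *mine, issue a full fence, then load *theirs.
 * If two threads each run this with the flags swapped, the sequentially
 * consistent fence rules out the outcome where both return 0. */
static int writer_then_reader(atomic_int *mine, atomic_int *theirs)
{
    assert(mine && theirs);
    atomic_store_explicit(mine, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* write-before-read barrier */
    return atomic_load_explicit(theirs, memory_order_relaxed);
}
```

One thread would call writer_then_reader(&x, &y) and the other writer_then_reader(&y, &x). For this fence, x86 compilers typically emit mfence or a locked instruction, which are the instructions the post expected to see.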

[LLVMdev] confused about llvm.memory.barrier

2008 Sep 25

0

[LLVMdev] confused about llvm.memory.barrier

On Thu, 2008-09-25 at 10:28 -0400, Luke Dalessandro wrote: > When I request a write-before-read memory barrier on x86 I would expect > to get an assembly instruction that would enforce this ordering (mfence, > xchg, cas), but it just turns into a nop. In its usual configuration, an x86 family CPU implements a strong memory ordering constraint for all loads and stores, so as long as the ordering of the read and write operations is preserved no atomic operation is required between them. XCHG and C...
