search for: lfenc

Displaying 20 results from an estimated 83 matches for "lfenc".

Did you mean: lfence
2020 Jul 28
2
_mm_lfence in both paths of an if/else are hoisted by SimplifyCFG, potentially breaking its use as a speculation barrier
_mm_lfence was originally documented as a load fence. But in light of speculative execution vulnerabilities it has started being advertised as a way to prevent speculative execution. The current Intel Software Developer's Manual documents it as "Specifically, LFENCE does not execute until all prior instruct...
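The pattern being reported is small enough to sketch in C. The sketch below is illustrative only (the function, array, and bound names are invented, not taken from the report); it shows the shape of code where a CFG-simplifying pass could merge the two identical fence calls and hoist them above the branch:

```
#include <emmintrin.h>

int table[256];

/* Illustrative only: _mm_lfence() appears in both arms of the branch, so a
 * pass that merges identical leading instructions of the two successors can
 * hoist the fence above the conditional. */
int read_checked(int idx, int len)
{
    if (idx < len) {
        _mm_lfence();      /* intended to fence *after* the bounds check */
        return table[idx];
    } else {
        _mm_lfence();
        return -1;
    }
}
```

Once the fences are hoisted into the common predecessor, they no longer order the bounds check against the dependent load, which is exactly the speculation-barrier property the report says is lost.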
2020 Aug 09
2
_mm_lfence in both paths of an if/else are hoisted by SimplifyCFG, potentially breaking its use as a speculation barrier
...stack). From a pragmatic perspective, the constraints added to program transforms there are sufficient for what you need. You'd produce IR such as: %token = call token @llvm.experimental.convergence.anchor() br i1 %c, label %then, label %else then: call void @llvm.x86.sse2.lfence() convergent [ "convergencectrl"(token %token) ] ... else: call void @llvm.x86.sse2.lfence() convergent [ "convergencectrl"(token %token) ] ... ... and this would prevent the hoisting of the lfences. The puzzle to me is whether one can justify this use of the c...
2007 Oct 16
1
LFENCE instruction (was: [rfc][patch 3/3] x86: optimise barriers)
Nick Piggin <npiggin@suse.de> wrote: > > Also, for non-wb memory. I don't think the Intel document referenced > says anything about this, but the AMD document says that loads can pass > loads (page 8, rule b). > > This is why our rmb() is still an lfence. BTW, Xen (in particular, the code in drivers/xen) uses mb/rmb/wmb instead of smp_mb/smp_rmb/smp_wmb when it accesses memory that's shared with other Xen domains or the hypervisor. The reason this is necessary is because even if a Xen domain is UP the hypervisor might be SMP. It would be ni...
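A minimal sketch of the point being made, assuming an invented ring layout and a hand-rolled rmb() (neither is Xen's actual code): even on a uniprocessor guest, memory shared with the hypervisor needs a real read barrier rather than an smp_* barrier that compiles away on UP.

```
/* Hypothetical shared ring; field names and size are made up. */
#define rmb()  __asm__ __volatile__("lfence" ::: "memory")

struct shared_ring {
    volatile unsigned int prod;   /* advanced by the hypervisor/backend */
    volatile unsigned int cons;   /* advanced by this guest */
    int slots[64];
};

static int ring_consume(struct shared_ring *r)
{
    unsigned int prod = r->prod;

    rmb();                        /* read prod before reading slot contents,
                                     even if this guest is UP */
    if (r->cons == prod)
        return -1;                /* ring empty */
    return r->slots[r->cons++ % 64];
}
```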
2017 Feb 14
2
[PATCH v2 0/3] x86/vdso: Add Hyper-V TSC page clocksource support
...e completed before the > second read of the sequence counter. I am working with the Windows team to correctly > reflect this algorithm in the Hyper-V specification. Thank you, do I get it right that combining the above I only need to replace virt_rmb() barriers with plain rmb() to get 'lfence' in hv_read_tsc_page (PATCH 2)? As members of struct ms_hyperv_tsc_page are volatile we don't need READ_ONCE(), compilers are not allowed to merge accesses. The resulting code looks good to me: (gdb) disassemble read_hv_clock_tsc Dump of assembler code for function read_hv_clock_tsc:...
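A rough userspace sketch of the read loop under discussion. The field names and the scale/offset computation are assumptions about the Hyper-V TSC page rather than a copy of the kernel code, and the invalid-sequence fallback is omitted; the point is the placement of rmb(), which is simply lfence here:

```
#include <stdint.h>
#include <x86intrin.h>

#define rmb()  __asm__ __volatile__("lfence" ::: "memory")

/* Assumed layout of the Hyper-V reference TSC page. */
struct ms_hyperv_tsc_page {
    volatile uint32_t tsc_sequence;
    uint32_t reserved;
    volatile uint64_t tsc_scale;
    volatile int64_t  tsc_offset;
};

static uint64_t read_hv_clock_tsc(const struct ms_hyperv_tsc_page *tsc_pg)
{
    uint32_t seq;
    uint64_t scale, tsc;
    int64_t offset;

    do {
        seq = tsc_pg->tsc_sequence;
        rmb();                          /* order the loads below after seq */
        scale  = tsc_pg->tsc_scale;
        offset = tsc_pg->tsc_offset;
        tsc    = __rdtsc();
        rmb();                          /* ...and before the re-read of seq */
    } while (seq != tsc_pg->tsc_sequence);

    return (uint64_t)(((__uint128_t)tsc * scale) >> 64) + offset;
}
```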
2016 Jan 12
3
[PATCH 3/4] x86,asm: Re-work smp_store_mb()
...xchg but poking at gcc red zone */ #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0) #endif #ifdef mfence #define barrier() asm("mfence" ::: "memory") #endif #ifdef lfence #define barrier() asm("lfence" ::: "memory") #endif #ifdef sfence #define barrier() asm("sfence" ::: "memory") #endif int main(int argc, char **argv) { int i; int j = 1234; /* * Test barrier in a loop. We also poke at a volatile variable in an * att...
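Reflowed into something buildable, a benchmark of that shape looks roughly like the following. The loop body, iteration count, and the hardcoded %rsp (the quoted code uses an SP macro to cover 32-bit as well) are assumptions, not the original source:

```
#include <stdio.h>

#if defined(mfence)
#define barrier() __asm__ __volatile__("mfence" ::: "memory")
#elif defined(lfence)
#define barrier() __asm__ __volatile__("lfence" ::: "memory")
#elif defined(sfence)
#define barrier() __asm__ __volatile__("sfence" ::: "memory")
#else
/* default: xchg into the red zone, roughly as in the quoted snippet */
#define barrier() do { int ret = 0;                                   \
    __asm__ __volatile__("xchgl %0, -4(%%rsp)"                        \
                         : "+r"(ret) :: "memory", "cc");              \
    } while (0)
#endif

int main(void)
{
    volatile int j = 1234;

    /* Test barrier in a loop, poking a volatile so the loop survives
     * optimization; time the whole run with time(1). */
    for (long i = 0; i < 100000000L; i++) {
        j++;
        barrier();
    }
    printf("%d\n", j);
    return 0;
}
```

Built as, say, cc -O2 -Dlfence bench.c, this gives the kind of per-barrier timing comparison the thread goes on to discuss.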
2016 Jan 13
6
[PATCH v3 0/4] x86: faster mb()+documentation tweaks
mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than the lock; addl that we use on older CPUs. So let's use the locked variant everywhere. While I was at it, I found some inconsistencies in comments in arch/x86/include/asm/barrier.h. The documentation fixes are included first - I verified that they do not change the generated code at all.
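For reference, the two mb() flavours being compared look roughly like this; the macro names, the $0 operand, and the -4(%rsp) offset are chosen here for illustration and are not necessarily what the patch itself uses:

```
#include <stdio.h>

#define mb_mfence()  __asm__ __volatile__("mfence" ::: "memory")
#define mb_locked()  __asm__ __volatile__("lock; addl $0,-4(%%rsp)" \
                                          ::: "memory", "cc")

int main(void)
{
    /* Both act as full memory barriers on x86; the micro-benchmark above
     * reports the locked form as 2 to 3 times faster than mfence. */
    mb_mfence();
    mb_locked();
    puts("ok");
    return 0;
}
```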
2017 Feb 14
0
[PATCH v2 0/3] x86/vdso: Add Hyper-V TSC page clocksource support
...ond read of the sequence counter. I am working with the Windows team to correctly >> reflect this algorithm in the Hyper-V specification. > > > Thank you, > > do I get it right that combining the above I only need to replace > virt_rmb() barriers with plain rmb() to get 'lfence' in hv_read_tsc_page > (PATCH 2)? As members of struct ms_hyperv_tsc_page are volatile we don't > need READ_ONCE(), compilers are not allowed to merge accesses. The > resulting code looks good to me: No, on multiple counts, unfortunately. 1. LFENCE is basically useless except fo...
2018 Feb 03
0
retpoline mitigation and 6.0
...re somewhat reluctant to guarantee an ABI here. At least I > am. While we don't *expect* rampant divergence here, I don't want > this to become something we cannot change if there are good reasons > to do so. We've already changed the thunks once based on feedback > (putting LFENCE after the PAUSE). Surely adding the lfence was changing your implementation, not the ABI? And if we really are talking about the *ABI* not the implementation, I'm not sure I understand your concern. The ABI for each thunk is that it is identical in all respects, apart from speculation, to &...
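For context, a retpoline-style thunk with the "LFENCE after the PAUSE" change mentioned above looks roughly like the following. The name my_indirect_thunk_rax is invented and real compiler- or kernel-generated thunks differ in details; this is exposition only:

```
/* Indirect call/jump through %rax via a speculation-trapping thunk. */
__asm__(
    ".text\n"
    ".globl my_indirect_thunk_rax\n"
    "my_indirect_thunk_rax:\n"
    "    call 2f\n"          /* push a return address into the trap below */
    "1:  pause\n"
    "    lfence\n"           /* the LFENCE added after the PAUSE */
    "    jmp 1b\n"
    "2:  mov %rax, (%rsp)\n" /* overwrite the pushed address with the target */
    "    ret\n"
);
```

The architectural path replaces the return address pushed by the inner call with the real target held in %rax, while any speculative execution of the ret is caught in the pause/lfence loop rather than running attacker-controlled code.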
2016 Jan 27
6
[PATCH v4 0/5] x86: faster smp_mb()+documentation tweaks
mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than the lock; addl that we use on older CPUs. So we really should use the locked variant everywhere, except that the Intel manual says that clflush is only ordered by mfence, so we can't. Note: some callers of clflush seem to assume sfence will order it, so there could be existing bugs around.
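A sketch of the clflush ordering concern, with invented variable names: if the flushed line has to be visible before a flag is published, the fence between them must be mfence, since the manual text cited in the thread says sfence does not order clflush.

```
#include <emmintrin.h>

volatile int data;
volatile int flag;

void publish(int v)
{
    data = v;
    _mm_clflush((const void *)&data);  /* write the cache line back */
    _mm_mfence();                      /* mfence, not sfence: per the concern
                                          above, only mfence orders clflush */
    flag = 1;                          /* safe to announce only after this */
}
```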
2016 Jan 28
10
[PATCH v5 0/5] x86: faster smp_mb()+documentation tweaks
mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's 2 to 3 times slower than the lock; addl that we use on older CPUs. So we really should use the locked variant everywhere, except that the Intel manual says that clflush is only ordered by mfence, so we can't. Note: some callers of clflush seem to assume sfence will order it, so there could be existing bugs around.
2016 Jan 12
1
[PATCH 3/4] x86,asm: Re-work smp_store_mb()
...w loop is fine. And while the xchg into > the redzone wouldn't be acceptable as a real implementation, for > timing testing it's likely fine (i.e. you aren't hitting the problem it > can cause). > > > So mfence is more expensive than locked instructions/xchg, but sfence/lfence > > are slightly faster, and xchg and locked instructions are very close if > > not the same. > > Note that we never actually *use* lfence/sfence. They are pointless > instructions when looking at CPU memory ordering, because for pure CPU > memory ordering stores and loads...
2018 Mar 23
5
RFC: Speculative Load Hardening (a Spectre variant #1 mitigation)
...movq (%rcx), %rdi # Hardened load. movl (%rdi), %edx # Unhardened load due to dependent addr. ``` This doesn't check the load through `%rdi` as that pointer is dependent on a checked load already. ###### Protect large, load-heavy blocks with a single lfence It may be worth using a single `lfence` instruction at the start of a block which begins with a (very) large number of loads that require independent protection *and* which require hardening the address of the load. However, this is unlikely to be profitable in practice. The latency hit of the ha...
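As a rough illustration of that alternative (written by hand here purely for exposition; SLH is a compiler transformation and the names below are invented), the single fence sits at the block entry and the loads that follow need no per-address hardening:

```
#include <emmintrin.h>

int sum_block(const int *a, const int *b, const int *c)
{
    _mm_lfence();   /* one speculation barrier for the whole load-heavy block */
    return a[0] + b[0] + c[0];   /* loads need no individual address hardening */
}
```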
2016 Jan 12
0
[PATCH 3/4] x86,asm: Re-work smp_store_mb()
...gs go" the stupid raw loop is fine. And while the xchg into the redzone wouldn't be acceptable as a real implementation, for timing testing it's likely fine (i.e. you aren't hitting the problem it can cause). > So mfence is more expensive than locked instructions/xchg, but sfence/lfence > are slightly faster, and xchg and locked instructions are very close if > not the same. Note that we never actually *use* lfence/sfence. They are pointless instructions when looking at CPU memory ordering, because for pure CPU memory ordering stores and loads are already ordered. The onl...
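The message is cut off above, but the standard exception usually raised in this context, given here as general background rather than as a quote, is weakly ordered stores such as non-temporal ones, which do need sfence before a subsequent flag store becomes meaningful:

```
#include <emmintrin.h>

/* Illustrative names; dst and src must be 16-byte aligned. */
void publish_nt(__m128i *dst, const __m128i *src, volatile int *done)
{
    _mm_stream_si128(dst, _mm_load_si128(src));  /* non-temporal (WC) store */
    _mm_sfence();      /* make the streaming store globally visible first */
    *done = 1;
}
```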