Displaying 20 results from an estimated 396 matches for "xchg".
2019 Aug 20
1
Slow XCHG in arch/i386/libgcc/__ashrdi3.S and arch/i386/libgcc/__lshrdi3.S
...libc/arch/i386/libgcc/__lshrdi3.S
>> use the following code sequences for shift counts greater than 31:
>>
>> 1: 1:
>> xorl %edx,%edx shrl %cl,%edx
>> shl %cl,%eax xorl %eax,%eax
>> ^
>> xchgl %edx,%eax xchgl %edx,%eax
>> ret ret
>>
>> At least and especially on Intel processors XCHG was and
>> still is a rather slow instruction and should be avoided.
>> Use the following better code sequences instead:
>>
>&...
2007 Apr 18
1
[PATCH 3/4] Pte xchg optimization.patch
In situations where page table updates need only be made locally, and there
are no cross-processor A/D bit races involved, we need not use the heavyweight
xchg instruction to atomically fetch and clear page table entries. Instead,
we can just read and clear them directly.
This introduces a neat optimization for non-SMP kernels; drop the atomic
xchg operations from page table updates.
Thanks to Michel Lespinasse for noting this potential optimization....
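The optimization described above can be sketched in C11 (a hedged sketch with simplified types, not the actual kernel ptep_get_and_clear):

```c
#include <stdatomic.h>
#include <stdint.h>

typedef uint32_t pte_t;  /* simplified PTE, for illustration only */

/* SMP case: atomically fetch and clear, so a concurrent CPU setting
 * Accessed/Dirty bits cannot race with the clear. On x86 this compiles
 * to an (implicitly locked) XCHG. */
static pte_t pte_get_and_clear_smp(_Atomic pte_t *ptep)
{
    return atomic_exchange(ptep, 0);
}

/* UP case: no cross-processor A/D races are possible, so a plain read
 * followed by a plain write is enough -- no locked instruction needed. */
static pte_t pte_get_and_clear_up(pte_t *ptep)
{
    pte_t pte = *ptep;
    *ptep = 0;
    return pte;
}
```

Both return the old entry; only the SMP variant pays for atomicity.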
2019 Aug 15
2
Slow XCHG in arch/i386/libgcc/__ashrdi3.S and arch/i386/libgcc/__lshrdi3.S
...https://git.kernel.org/pub/scm/libs/klibc/klibc.git/plain/usr/klibc/arch/i386/libgcc/__lshrdi3.S
use the following code sequences for shift counts greater than 31:
1: 1:
xorl %edx,%edx shrl %cl,%edx
shl %cl,%eax xorl %eax,%eax
^
xchgl %edx,%eax xchgl %edx,%eax
ret ret
At least and especially on Intel processors XCHG was and
still is a rather slow instruction and should be avoided.
Use the following better code sequences instead:
1: 1:
shll %cl,%eax...
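The improved sequences quoted above avoid the register swap entirely by producing the result in the right registers to begin with. A hedged C sketch of what __lshrdi3 computes (the name and structure are illustrative, not the klibc code):

```c
#include <stdint.h>

/* Sketch of a 64-bit logical shift right on a 32-bit target, mirroring
 * __lshrdi3. For counts > 31 the low word receives the high word shifted
 * by (count - 32) and the high word becomes zero -- no xchg is needed. */
static uint64_t lshrdi3_sketch(uint64_t value, unsigned count)
{
    uint32_t lo = (uint32_t)value;
    uint32_t hi = (uint32_t)(value >> 32);

    if (count >= 32) {
        lo = hi >> (count - 32);  /* high word shifted into the low word */
        hi = 0;
    } else if (count > 0) {
        lo = (lo >> count) | (hi << (32 - count));
        hi >>= count;
    }
    return ((uint64_t)hi << 32) | lo;
}
```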
2008 Jun 10
1
[PATCH] xen: Use wmb instead of rmb in xen_evtchn_do_upcall().
...hanged, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/xen/events.c b/drivers/xen/events.c
index 73d78dc..332dd63 100644
--- a/drivers/xen/events.c
+++ b/drivers/xen/events.c
@@ -529,7 +529,7 @@ void xen_evtchn_do_upcall(struct pt_regs *regs)
#ifndef CONFIG_X86 /* No need for a barrier -- XCHG is a barrier on x86. */
/* Clear master flag /before/ clearing selector flag. */
- rmb();
+ wmb();
#endif
pending_words = xchg(&vcpu_info->evtchn_pending_sel, 0);
while (pending_words != 0) {
--
1.5.3
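The xchg() in the hunk above doubles as a full barrier on x86, which is why the explicit barrier is compiled out there. A hedged C11 sketch of the pattern (field names borrowed from the diff; types simplified):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch of the upcall pattern: clear the per-vcpu master flag, then
 * atomically fetch-and-clear the selector word. */
static uint32_t fetch_pending_words(_Atomic uint8_t *evtchn_upcall_pending,
                                    _Atomic uint32_t *evtchn_pending_sel)
{
    /* Clear master flag /before/ clearing selector flag. */
    atomic_store_explicit(evtchn_upcall_pending, 0, memory_order_relaxed);

    /* The wmb() from the patch: order the two clears. On x86 the
     * exchange below is itself a full barrier, so the kernel compiles
     * this fence out there. */
    atomic_thread_fence(memory_order_release);

    return atomic_exchange(evtchn_pending_sel, 0);
}
```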
2016 Jan 12
3
[PATCH 3/4] x86,asm: Re-work smp_store_mb()
On Mon, Nov 02, 2015 at 04:06:46PM -0800, Linus Torvalds wrote:
> On Mon, Nov 2, 2015 at 12:15 PM, Davidlohr Bueso <dave at stgolabs.net> wrote:
> >
> > So I ran some experiments on an IvyBridge (2.8GHz) and the cost of XCHG is
> > constantly cheaper (by at least half the latency) than MFENCE. While there
> > was a decent amount of variation, this difference remained rather constant.
>
> Mind testing "lock addq $0,0(%rsp)" instead of mfence? That's what we
> use on old cpu's with...
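The trade-off being measured above is how to implement a store followed by a full barrier. A hedged C11 sketch of the two strategies (illustrative only; the kernel's smp_store_mb() is arch-specific assembly):

```c
#include <stdatomic.h>

/* Strategy 1: plain store, then a standalone full fence. On x86 this
 * becomes MOV + MFENCE (or MOV + "lock addq $0,0(%rsp)" on older CPUs). */
static void store_mb_fence(_Atomic int *p, int v)
{
    atomic_store_explicit(p, v, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
}

/* Strategy 2: fold the barrier into the store itself by exchanging and
 * discarding the old value. XCHG with memory is implicitly locked and
 * therefore a full barrier on x86 -- the variant the thread above
 * measured as cheaper than MFENCE on IvyBridge. */
static void store_mb_xchg(_Atomic int *p, int v)
{
    (void)atomic_exchange(p, v);
}
```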
2014 Jun 17
3
[PATCH 04/11] qspinlock: Extract out the exchange of tail code word
On Sun, Jun 15, 2014 at 02:47:01PM +0200, Peter Zijlstra wrote:
> From: Waiman Long <Waiman.Long at hp.com>
>
> This patch extracts the logic for the exchange of new and previous tail
> code words into a new xchg_tail() function which can be optimized in a
> later patch.
And also adds a third try on acquiring the lock. That I think should
be a separate patch.
And instead of saying 'later patch' you should spell out the name
of the patch. Especially as this might not be obvious from somebody
doi...
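The extracted helper amounts to an unconditional atomic exchange of the tail field. A hedged sketch (simplified 32-bit lock word; the field layout here is hypothetical, not the qspinlock encoding):

```c
#include <stdatomic.h>
#include <stdint.h>

#define TAIL_SHIFT 16  /* hypothetical: tail code word in the upper 16 bits */

/* Sketch of xchg_tail(): publish our node as the new queue tail and
 * return the previous tail code word, using one atomic exchange on the
 * 16-bit tail halfword instead of a cmpxchg loop on the whole word. */
static uint32_t xchg_tail_sketch(_Atomic uint16_t *tail_half, uint32_t tail)
{
    uint16_t old = atomic_exchange(tail_half, (uint16_t)(tail >> TAIL_SHIFT));
    return (uint32_t)old << TAIL_SHIFT;
}
```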
2019 Aug 19
0
Slow XCHG in arch/i386/libgcc/__ashrdi3.S and arch/i386/libgcc/__lshrdi3.S
.../klibc/klibc.git/plain/usr/klibc/arch/i386/libgcc/__lshrdi3.S
> use the following code sequences for shift counts greater than 31:
>
> 1: 1:
> xorl %edx,%edx shrl %cl,%edx
> shl %cl,%eax xorl %eax,%eax
> ^
> xchgl %edx,%eax xchgl %edx,%eax
> ret ret
>
> At least and especially on Intel processors XCHG was and
> still is a rather slow instruction and should be avoided.
> Use the following better code sequences instead:
>
> 1:...
2006 May 26
0
[PATCH]: Fixing the xchg emulation bug
The following patch fixes a bug in the emulation of the xchg
instruction. This bug has prevented us from booting fully virtualized
SMP guests that write to the APIC using the xchg instruction (when
CONFIG_X86_GOOD_APIC is not set). On 32-bit platforms, SLES 10 kernels
are built with CONFIG_X86_GOOD_APIC not set, and hence we have had
problems booting full...
2015 Apr 02
3
[PATCH 8/9] qspinlock: Generic paravirt support
...but can probe until we find the entry.
> >This will be bound in cost to the same number of probes we required for
> >insertion and avoids the full array scan.
> >
> >Now I think we can indeed do this, if as said earlier we do not clear
> >the bucket on insert if the cmpxchg succeeds, in that case the unlock
> >will observe _Q_SLOW_VAL and do the lookup, the lookup will then find
> >the entry. And we then need the unlock to clear the entry.
> >_Q_SLOW_VAL
> >Does that explain this? Or should I try again with code?
>
> OK, I got your propo...
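The scheme under discussion can be sketched as follows (a hedged sketch with hypothetical names and a fixed-size table, not the pvqspinlock code): insert with cmpxchg into an empty bucket, deliberately do not clear the bucket on success, and let the unlock side look the entry up and clear it.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NBUCKETS 64  /* hypothetical open-addressed hash of halted waiters */
static _Atomic(void *) buckets[NBUCKETS];

/* Insert: probe linearly for an empty slot; the cmpxchg claims it.
 * The inserter does NOT clear the slot on success. */
static int pv_hash_insert(size_t h, void *node)
{
    for (size_t i = 0; i < NBUCKETS; i++) {
        void *expected = NULL;
        size_t idx = (h + i) % NBUCKETS;
        if (atomic_compare_exchange_strong(&buckets[idx], &expected, node))
            return (int)idx;
    }
    return -1;  /* table full */
}

/* Unlock side: probe from the hash point until the node is found --
 * bounded by the same number of probes insertion used -- then clear. */
static void *pv_hash_find_and_clear(size_t h, void *node)
{
    for (size_t i = 0; i < NBUCKETS; i++) {
        size_t idx = (h + i) % NBUCKETS;
        if (atomic_load(&buckets[idx]) == node) {
            atomic_store(&buckets[idx], NULL);
            return node;
        }
    }
    return NULL;
}
```

This avoids the full array scan the thread is worried about, at the cost of moving the clear to the unlock path.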
2014 Jun 18
0
[PATCH 04/11] qspinlock: Extract out the exchange of tail code word
...17/06/2014 22:55, Konrad Rzeszutek Wilk ha scritto:
> On Sun, Jun 15, 2014 at 02:47:01PM +0200, Peter Zijlstra wrote:
>> From: Waiman Long <Waiman.Long at hp.com>
>>
>> This patch extracts the logic for the exchange of new and previous tail
>> code words into a new xchg_tail() function which can be optimized in a
>> later patch.
>
> And also adds a third try on acquiring the lock. That I think should
> be a separate patch.
It doesn't really add a new try, the old code is:
- for (;;) {
- new = _Q_LOCKED_VAL;
- if (val)
- new = tail | (val...
2009 Apr 24
1
Bugs in pxelinux.asm - syslinux 3.75
In pxelinux.asm
the xchg instruction in
xchg ax,ax
.
data_on_top:
should be xchg ax,dx, I think
At the end of
pxe_get_cached_info routine, there is
and ax,ax
jnz .err
It is supposed to test for AX status, but
since pxenv does pushad and popad, AX doesn't contain statu...
2015 Apr 29
4
[PATCH v16 13/14] pvqspinlock: Improve slowpath performance by avoiding cmpxchg
On Fri, Apr 24, 2015 at 02:56:42PM -0400, Waiman Long wrote:
> In the pv_scan_next() function, the slow cmpxchg atomic operation is
> performed even if the other CPU is not even close to being halted. This
> extra cmpxchg can harm slowpath performance.
>
> This patch introduces the new mayhalt flag to indicate if the other
> spinning CPU is close to being halted or not. The current threshold...
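The optimization under review amounts to gating the expensive cmpxchg on a cheap flag read. A hedged sketch (mayhalt modeled as a plain atomic flag; names and thresholds are illustrative, not the patch's code):

```c
#include <stdatomic.h>
#include <stdbool.h>

enum pv_state { PV_RUNNING = 0, PV_HALTED = 1 };

/* Sketch: only attempt the cmpxchg handoff when the spinning CPU has
 * signalled, via mayhalt, that it is close to halting; otherwise skip
 * the atomic entirely, keeping the common slowpath cheap. */
static bool pv_scan_next_sketch(_Atomic int *state, _Atomic bool *mayhalt)
{
    if (!atomic_load_explicit(mayhalt, memory_order_acquire))
        return false;  /* other CPU still spinning: no cmpxchg needed */

    int expected = PV_HALTED;
    /* cmpxchg: claim the kick only if the CPU really halted */
    return atomic_compare_exchange_strong(state, &expected, PV_RUNNING);
}
```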
2014 May 08
1
[PATCH v10 03/19] qspinlock: Add pending bit
...e_tail(smp_processor_id(), idx);
> @@ -119,15 +196,18 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> node->next = NULL;
>
> /*
> + * we already touched the queueing cacheline; don't bother with pending
> + * stuff.
> + *
> * trylock || xchg(lock, node)
> *
> - * 0,0 -> 0,1 ; trylock
> - * p,x -> n,x ; prev = xchg(lock, node)
> + * 0,0,0 -> 0,0,1 ; trylock
> + * p,y,x -> n,y,x ; prev = xchg(lock, node)
> */
And any value of @val we might have had here is completely out-dated.
The only thing that...
2020 Feb 10
3
Atomic ops are optimized with incorrect semantics.
Hi All,
With the "https://gcc.godbolt.org/z/yBYTrd" case,
the atomic op is converted to a non-atomic op for x86, like
from
xchg dword ptr [100], eax
to
mov dword ptr [100], 1
The pass responsible for this transformation is InstCombine,
i.e. InstCombiner::visitAtomicRMWInst,
which converts IR like
%0 = atomicrmw xchg i32* inttoptr (i64 100 to i32*), i32 1 monotonic
to
store atomic i32 1, i32* inttoptr (i64 100 to i32...
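In C terms, the transformation turns an exchange whose result is unused into a plain atomic store at the same ordering; the complaint is about what that does to the emitted x86. A hedged sketch:

```c
#include <stdatomic.h>

static _Atomic int shared;

/* Before: an exchange whose old value is discarded. On x86 this emits
 * XCHG, which is implicitly locked and thus a serializing full barrier. */
static void write_with_xchg(int v)
{
    (void)atomic_exchange_explicit(&shared, v, memory_order_relaxed);
}

/* After InstCombine: a monotonic (relaxed) atomic store has the same
 * memory-model semantics, but on x86 it is just a MOV -- still atomic,
 * no longer a serializing instruction. */
static void write_with_store(int v)
{
    atomic_store_explicit(&shared, v, memory_order_relaxed);
}
```

Both are valid relaxed atomics at the language level; the thread's question is whether dropping the incidental hardware barrier is acceptable.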