thr3ads.net - search: "xchg8"

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Mar 04

1

...ocks# LOCK=./qspinlock-pending-opt ./test-4.sh ; LOCK=./qspinlock-pending-opt2 ./test-4.sh

 4: 50783.509653
 8: 146295.875715
16: 332942.964709

 4: 51033.341441
 8: 146320.656285
16: 332586.355194

And the difference between opt and opt2 is that opt2 replaces 2 cmpxchg loops with unconditional ops (xchg8 and xchg16). And I'd think that 4 CPUs x 4 Nodes would be heavy contention. I'll have another poke tomorrow, including verifying the asm; need to go sleep now.
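For readers unfamiliar with the distinction between a cmpxchg loop and an unconditional xchg, here is a minimal sketch in C11 atomics. It is not the code from the patch: the helper names `set_byte_cmpxchg_loop`, `set_byte_xchg` and `set_half_xchg` are hypothetical, and the memory orderings are an assumption chosen for a lock-acquire path.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: store newval with a compare-and-swap retry loop.
 * Each failed attempt refreshes `old` and goes around again, so under
 * contention the loop may run several times. Returns the previous value. */
static uint8_t set_byte_cmpxchg_loop(_Atomic uint8_t *p, uint8_t newval)
{
    uint8_t old = atomic_load_explicit(p, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(p, &old, newval,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        ; /* `old` now holds the current value; retry */
    return old;
}

/* Hypothetical helper: the same effect with one unconditional exchange
 * on an 8-bit field (in the spirit of "xchg8"). No retry loop needed. */
static uint8_t set_byte_xchg(_Atomic uint8_t *p, uint8_t newval)
{
    return atomic_exchange_explicit(p, newval, memory_order_acquire);
}

/* Hypothetical helper: the 16-bit counterpart (in the spirit of "xchg16"). */
static uint16_t set_half_xchg(_Atomic uint16_t *p, uint16_t newval)
{
    return atomic_exchange_explicit(p, newval, memory_order_acquire);
}
```

Both byte helpers leave the same value behind and return the same old value; the difference is that the exchange completes in a single atomic operation, whereas the compare-and-swap version may have to retry when another CPU changes the value in between.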

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Mar 03

5

Hi, Here are some numbers for my version -- also attached is the test code. I found that booting big machines is tediously slow, so I lifted the whole lot to userspace. I measure the cycles spent in arch_spin_lock() + arch_spin_unlock(). The machines used are a 4-node (2-socket) AMD Interlagos and a 2-node (2-socket) Intel Westmere-EP. AMD (ticket) AMD (qspinlock + pending + opt) Local:
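The attached test code is not reproduced in this excerpt. As a rough illustration of timing a lock + unlock pair in userspace, a minimal sketch might look like the following. Everything in it is an assumption: `demo_lock_t` is a plain test-and-set spinlock standing in for the ticket lock and qspinlock, the helper names are hypothetical, and it reports nanoseconds via `clock_gettime()` rather than CPU cycles so that it stays portable.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() under strict modes */
#endif
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in lock: a simple test-and-set spinlock,
 * not the ticket lock or qspinlock discussed in the thread. */
typedef struct { atomic_flag locked; } demo_lock_t;

static void demo_lock(demo_lock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

static void demo_unlock(demo_lock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average cost, in nanoseconds, of one uncontended lock + unlock pair. */
static double avg_lock_unlock_ns(demo_lock_t *l, unsigned long iters)
{
    assert(iters > 0);
    uint64_t start = now_ns();
    for (unsigned long i = 0; i < iters; i++) {
        demo_lock(l);
        demo_unlock(l);
    }
    return (double)(now_ns() - start) / (double)iters;
}
```

This only covers the uncontended single-thread case; measuring contention across CPUs or nodes, as the thread does, would additionally need multiple threads pinned to specific CPUs.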
