On Tue, Jan 12, 2016 at 12:54 PM, Linus Torvalds
<torvalds at linux-foundation.org> wrote:
> On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto at kernel.org> wrote:
>>
>> I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64)
>> was better because it avoided stomping on very-likely-to-be-hot write
>> buffers.
>
> I suspect it could go either way. You want a small constant (for the
> instruction size), but any small constant is likely to be within the
> current stack frame anyway. I don't think 0(%rsp) is particularly
> likely to have a spill on it right then and there, but who knows..
>
> And 64(%rsp) is possibly going to be cold in the L1 cache, especially
> if it's just after a deep function call. Which it might be. So it
> might work the other way.
>
> So my guess would be that you wouldn't be able to measure the
> difference. It might be there, but probably too small to really see in
> any noise.
>
> But numbers talk, bullshit walks. It would be interesting to be proven wrong.

Here's an article with numbers:

http://shipilev.net/blog/2014/on-the-fence-with-dependencies/

I think they're suggesting using a negative offset, which is safe as
long as it doesn't page fault, even though we have the redzone
disabled.

--Andy
On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto at amacapital.net> wrote:
>
> Here's an article with numbers:
>
> http://shipilev.net/blog/2014/on-the-fence-with-dependencies/

Well, that's with the busy loop and one set of code generation. It
doesn't show the "oops, deeper stack isn't even in the cache any more
due to call chains" issue.

But yes:

> I think they're suggesting using a negative offset, which is safe as
> long as it doesn't page fault, even though we have the redzone
> disabled.

I think a negative offset might work very well. Partly exactly
*because* we have the redzone disabled: we know that inside the
kernel, we'll never have any live stack frame accesses under the stack
pointer, so "-4(%rsp)" sounds good to me. There should never be any
pending writes in the write buffer, because even if it *was* live, it
would have been read off first.

Yeah, it potentially does extend the stack cache footprint by another
4 bytes, but that sounds very benign.

So perhaps it might be worth trying to switch the "mfence" to "lock ;
addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate
for x86-32.

I'd still want to see somebody try to benchmark it. I doubt it's
noticeable, but making changes because you think it might save a few
cycles without then even measuring it is just wrong.

               Linus
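For reference, a minimal, compilable userspace sketch of the two
sequences being compared here: the plain MFENCE barrier (assuming
x86-64's mb() of that era was just "mfence") and the "lock; addl
$0,-4(%rsp)" variant suggested above. The macro names are illustrative
only, not the kernel's mb()/smp_mb() definitions, and the "cc" clobber
on the lock-based variant is an assumption:

#include <stdio.h>

/* full barrier via MFENCE */
#define mb_mfence()	asm volatile("mfence" ::: "memory")

/*
 * The suggested variant: a LOCKed read-modify-write of the dword just
 * below the stack pointer. Adding 0 leaves the memory unchanged, and in
 * the kernel (built with -mno-red-zone) -4(%rsp) is never live data.
 */
#define mb_lock_addl()	asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")

int main(void)
{
	volatile int a = 1;
	int b;

	a = 2;
	mb_lock_addl();		/* order the store to a before the load below */
	b = a;
	mb_mfence();
	printf("%d\n", b);
	return 0;
}

In userspace the write to -4(%rsp) lands in the red zone, which belongs
to the current function anyway, so adding 0 there is harmless.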
On Tue, Jan 12, 2016 at 01:37:38PM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto at amacapital.net> wrote:
> >
> > Here's an article with numbers:
> >
> > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
>
> Well, that's with the busy loop and one set of code generation. It
> doesn't show the "oops, deeper stack isn't even in the cache any more
> due to call chains" issue.
>
> But yes:
>
> > I think they're suggesting using a negative offset, which is safe as
> > long as it doesn't page fault, even though we have the redzone
> > disabled.
>
> I think a negative offset might work very well. Partly exactly
> *because* we have the redzone disabled: we know that inside the
> kernel, we'll never have any live stack frame accesses under the stack
> pointer, so "-4(%rsp)" sounds good to me. There should never be any
> pending writes in the write buffer, because even if it *was* live, it
> would have been read off first.
>
> Yeah, it potentially does extend the stack cache footprint by another
> 4 bytes, but that sounds very benign.
>
> So perhaps it might be worth trying to switch the "mfence" to "lock ;
> addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate
> for x86-32.
>
> I'd still want to see somebody try to benchmark it. I doubt it's
> noticeable, but making changes because you think it might save a few
> cycles without then even measuring it is just wrong.
>
> Linus

Oops, I posted v2 with just offset 0 before reading the rest of this
thread. I did try with offset 0 and didn't measure any change on any
perf bench test, or on a kernel build. I wonder which benchmark
stresses smp_mb the most. I'll look into using a negative offset.

-- 
MST
On Tue, Jan 12, 2016 at 12:59:58PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 12, 2016 at 12:54 PM, Linus Torvalds
> <torvalds at linux-foundation.org> wrote:
> > On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto at kernel.org> wrote:
> >>
> >> I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64)
> >> was better because it avoided stomping on very-likely-to-be-hot write
> >> buffers.
> >
> > I suspect it could go either way. You want a small constant (for the
> > instruction size), but any small constant is likely to be within the
> > current stack frame anyway. I don't think 0(%rsp) is particularly
> > likely to have a spill on it right then and there, but who knows..
> >
> > And 64(%rsp) is possibly going to be cold in the L1 cache, especially
> > if it's just after a deep function call. Which it might be. So it
> > might work the other way.
> >
> > So my guess would be that you wouldn't be able to measure the
> > difference. It might be there, but probably too small to really see in
> > any noise.
> >
> > But numbers talk, bullshit walks. It would be interesting to be proven wrong.
>
> Here's an article with numbers:
>
> http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
>
> I think they're suggesting using a negative offset, which is safe as
> long as it doesn't page fault, even though we have the redzone
> disabled.
>
> --Andy

OK, so I'll have to tweak the test to put something on the stack to
measure the difference: my test tweaks a global variable instead.
I'll try that by tomorrow.

I couldn't measure any difference between mfence and lock+addl except
in a micro-benchmark, but hey, since we are tweaking this, let's do
the optimal thing.

-- 
MST
On 01/12/16 14:21, Michael S. Tsirkin wrote:
>
> OK, so I'll have to tweak the test to put something on the stack to
> measure the difference: my test tweaks a global variable instead.
> I'll try that by tomorrow.
>
> I couldn't measure any difference between mfence and lock+addl except
> in a micro-benchmark, but hey, since we are tweaking this, let's do
> the optimal thing.
>

Be careful with this: if it only shows up in a microbenchmark, we may
introduce a hard-to-debug regression for no real benefit.

	-hpa
On Tue, Jan 12, 2016 at 01:37:38PM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto at amacapital.net> wrote:
> >
> > Here's an article with numbers:
> >
> > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
>
> Well, that's with the busy loop and one set of code generation. It
> doesn't show the "oops, deeper stack isn't even in the cache any more
> due to call chains" issue.

It's an interesting read, thanks! So the stack pointer is read on
return from a function, I think. I added a function call and, sure
enough, it slows the "add $0,0(sp)" variant down. It's still faster
than mfence for me, though! Testing code + results below. Reaching
below the stack pointer, or allocating an extra 4 bytes above the
stack pointer, gives us back the performance.

> But yes:
>
> > I think they're suggesting using a negative offset, which is safe as
> > long as it doesn't page fault, even though we have the redzone
> > disabled.
>
> I think a negative offset might work very well. Partly exactly
> *because* we have the redzone disabled: we know that inside the
> kernel, we'll never have any live stack frame accesses under the stack
> pointer, so "-4(%rsp)" sounds good to me. There should never be any
> pending writes in the write buffer, because even if it *was* live, it
> would have been read off first.
>
> Yeah, it potentially does extend the stack cache footprint by another
> 4 bytes, but that sounds very benign.
>
> So perhaps it might be worth trying to switch the "mfence" to "lock ;
> addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate
> for x86-32.
>
> I'd still want to see somebody try to benchmark it. I doubt it's
> noticeable, but making changes because you think it might save a few
> cycles without then even measuring it is just wrong.
>
> Linus

I'll try this in the kernel now and will report, though I'm not
optimistic that a high-level benchmark can show this kind of thing.

---------------
main.c:
---------------

extern volatile int x;
volatile int x;

#ifdef __x86_64__
#define SP "rsp"
#else
#define SP "esp"
#endif

#ifdef lock
#define barrier() do { int p; asm volatile ("lock; addl $0,%0" ::"m"(p): "memory"); } while (0)
#endif

#ifdef locksp
#define barrier() asm("lock; addl $0,0(%%" SP ")" ::: "memory")
#endif

#ifdef lockrz
#define barrier() asm("lock; addl $0,-4(%%" SP ")" ::: "memory")
#endif

#ifdef xchg
#define barrier() do { int p; int ret; asm volatile ("xchgl %0, %1;": "=r"(ret) : "m"(p): "memory", "cc"); } while (0)
#endif

#ifdef xchgrz
/* same as xchg but poking at the gcc red zone */
#define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
#endif

#ifdef mfence
#define barrier() asm("mfence" ::: "memory")
#endif

#ifdef lfence
#define barrier() asm("lfence" ::: "memory")
#endif

#ifdef sfence
#define barrier() asm("sfence" ::: "memory")
#endif

void __attribute__ ((noinline)) test(int i, int *j)
{
	/*
	 * Test the barrier in a loop. We also poke at a volatile variable in
	 * an attempt to make it a bit more realistic - this way there's
	 * something in the store-buffer.
	 */
	x = i - *j;
	barrier();
	*j = x;
}

int main(int argc, char **argv)
{
	int i;
	int j = 1234;

	for (i = 0; i < 10000000; ++i)
		test(i, &j);
	return 0;
}

---------------
Makefile:
---------------

ALL = xchg xchgrz lock locksp lockrz mfence lfence sfence

CC = gcc
CFLAGS += -Wall -O2 -ggdb
PERF = perf stat -r 10 --log-fd 1 --
TIME = /usr/bin/time -f %e
FILTER = cat

all: ${ALL}

clean:
	rm -f ${ALL}

run: all
	for file in ${ALL}; do echo ${RUN} ./$$file "|" ${FILTER}; ${RUN} ./$$file | ${FILTER}; done

perf time: run

time: RUN=${TIME}
perf: RUN=${PERF}
perf: FILTER=grep elapsed

.PHONY: all clean run perf time

xchgrz: CFLAGS += -mno-red-zone

${ALL}: main.c
	${CC} ${CFLAGS} -D$@ -o $@ main.c

--------------------------------------------

perf stat -r 10 --log-fd 1 -- ./xchg | grep elapsed
       0.080420565 seconds time elapsed                ( +-  2.31% )
perf stat -r 10 --log-fd 1 -- ./xchgrz | grep elapsed
       0.087798571 seconds time elapsed                ( +-  2.58% )
perf stat -r 10 --log-fd 1 -- ./lock | grep elapsed
       0.083023724 seconds time elapsed                ( +-  2.44% )
perf stat -r 10 --log-fd 1 -- ./locksp | grep elapsed
       0.102880750 seconds time elapsed                ( +-  0.13% )
perf stat -r 10 --log-fd 1 -- ./lockrz | grep elapsed
       0.084917420 seconds time elapsed                ( +-  3.28% )
perf stat -r 10 --log-fd 1 -- ./mfence | grep elapsed
       0.156014715 seconds time elapsed                ( +-  0.16% )
perf stat -r 10 --log-fd 1 -- ./lfence | grep elapsed
       0.077731443 seconds time elapsed                ( +-  0.12% )
perf stat -r 10 --log-fd 1 -- ./sfence | grep elapsed
       0.036655741 seconds time elapsed                ( +-  0.21% )
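For anyone wanting to reproduce these numbers: the output above appears
to come from the Makefile's "perf" target, which builds one binary per
barrier variant (selected via -Dxchg, -Dlock, -Dlockrz, etc.) and runs
"perf stat -r 10" on each, keeping only the elapsed-time line. Assuming
gcc, GNU make and perf are installed, roughly:

	make
	make perf

"make time" does the same using /usr/bin/time instead of perf, and
"make run" just runs each binary without timing.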