On Tue, Jan 12, 2016 at 12:54 PM, Linus Torvalds
<torvalds at linux-foundation.org> wrote:
> On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto at kernel.org> wrote:
>>
>> I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64)
>> was better because it avoided stomping on very-likely-to-be-hot write
>> buffers.
>
> I suspect it could go either way. You want a small constant (for the
> instruction size), but any small constant is likely to be within the
> current stack frame anyway. I don't think 0(%rsp) is particularly
> likely to have a spill on it right then and there, but who knows..
>
> And 64(%rsp) is possibly going to be cold in the L1 cache, especially
> if it's just after a deep function call. Which it might be. So it
> might work the other way.
>
> So my guess would be that you wouldn't be able to measure the
> difference. It might be there, but probably too small to really see in
> any noise.
>
> But numbers talk, bullshit walks. It would be interesting to be proven wrong.

Here's an article with numbers:

http://shipilev.net/blog/2014/on-the-fence-with-dependencies/

I think they're suggesting using a negative offset, which is safe as
long as it doesn't page fault, even though we have the redzone
disabled.

--Andy
On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto at amacapital.net> wrote:
>
> Here's an article with numbers:
>
> http://shipilev.net/blog/2014/on-the-fence-with-dependencies/

Well, that's with the busy loop and one set of code generation. It
doesn't show the "oops, deeper stack isn't even in the cache any more
due to call chains" issue.

But yes:

> I think they're suggesting using a negative offset, which is safe as
> long as it doesn't page fault, even though we have the redzone
> disabled.

I think a negative offset might work very well. Partly exactly
*because* we have the redzone disabled: we know that inside the
kernel, we'll never have any live stack frame accesses under the stack
pointer, so "-4(%rsp)" sounds good to me. There should never be any
pending writes in the write buffer, because even if it *was* live, it
would have been read off first.

Yeah, it potentially does extend the stack cache footprint by another
4 bytes, but that sounds very benign.

So perhaps it might be worth trying to switch the "mfence" to "lock ;
addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate
for x86-32.

I'd still want to see somebody try to benchmark it. I doubt it's
noticeable, but making changes because you think it might save a few
cycles without then even measuring it is just wrong.

               Linus
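For reference, a minimal, compilable userspace sketch of the two
sequences being compared here: the plain MFENCE barrier (assuming
x86-64's mb() of that era was just "mfence") and the "lock; addl
$0,-4(%rsp)" variant suggested above. The macro names are illustrative
only, not the kernel's mb()/smp_mb() definitions, and the "cc" clobber
on the lock-based variant is an assumption:

#include <stdio.h>

/* full barrier via MFENCE */
#define mb_mfence()	asm volatile("mfence" ::: "memory")

/*
 * The suggested variant: a LOCKed read-modify-write of the dword just
 * below the stack pointer. Adding 0 leaves the memory unchanged, and in
 * the kernel (built with -mno-red-zone) -4(%rsp) is never live data.
 */
#define mb_lock_addl()	asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")

int main(void)
{
	volatile int a = 1;
	int b;

	a = 2;
	mb_lock_addl();		/* order the store to a before the load below */
	b = a;
	mb_mfence();
	printf("%d\n", b);
	return 0;
}

In userspace the write to -4(%rsp) lands in the red zone, which belongs
to the current function anyway, so adding 0 there is harmless.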
On Tue, Jan 12, 2016 at 01:37:38PM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto at amacapital.net> wrote:
> >
> > Here's an article with numbers:
> >
> > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
>
> Well, that's with the busy loop and one set of code generation. It
> doesn't show the "oops, deeper stack isn't even in the cache any more
> due to call chains" issue.
>
> But yes:
>
> > I think they're suggesting using a negative offset, which is safe as
> > long as it doesn't page fault, even though we have the redzone
> > disabled.
>
> I think a negative offset might work very well. Partly exactly
> *because* we have the redzone disabled: we know that inside the
> kernel, we'll never have any live stack frame accesses under the stack
> pointer, so "-4(%rsp)" sounds good to me. There should never be any
> pending writes in the write buffer, because even if it *was* live, it
> would have been read off first.
>
> Yeah, it potentially does extend the stack cache footprint by another
> 4 bytes, but that sounds very benign.
>
> So perhaps it might be worth trying to switch the "mfence" to "lock ;
> addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate
> for x86-32.
>
> I'd still want to see somebody try to benchmark it. I doubt it's
> noticeable, but making changes because you think it might save a few
> cycles without then even measuring it is just wrong.
>
> Linus

Oops, I posted v2 with just offset 0 before reading the rest of this
thread. I did try with offset 0 and didn't measure any change on any
perf bench test, or on a kernel build. I wonder which benchmark
stresses smp_mb the most. I'll look into using a negative offset.

-- 
MST
On Tue, Jan 12, 2016 at 12:59:58PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 12, 2016 at 12:54 PM, Linus Torvalds
> <torvalds at linux-foundation.org> wrote:
> > On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto at kernel.org> wrote:
> >>
> >> I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64)
> >> was better because it avoided stomping on very-likely-to-be-hot write
> >> buffers.
> >
> > I suspect it could go either way. You want a small constant (for the
> > instruction size), but any small constant is likely to be within the
> > current stack frame anyway. I don't think 0(%rsp) is particularly
> > likely to have a spill on it right then and there, but who knows..
> >
> > And 64(%rsp) is possibly going to be cold in the L1 cache, especially
> > if it's just after a deep function call. Which it might be. So it
> > might work the other way.
> >
> > So my guess would be that you wouldn't be able to measure the
> > difference. It might be there, but probably too small to really see in
> > any noise.
> >
> > But numbers talk, bullshit walks. It would be interesting to be proven wrong.
>
> Here's an article with numbers:
>
> http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
>
> I think they're suggesting using a negative offset, which is safe as
> long as it doesn't page fault, even though we have the redzone
> disabled.
>
> --Andy

OK, so I'll have to tweak the test to put something on the stack to
measure the difference: my test tweaks a global variable instead.
I'll try that by tomorrow.

I couldn't measure any difference between mfence and lock+addl except
in a micro-benchmark, but hey, since we are tweaking this, let's do
the optimal thing.

-- 
MST
On 01/12/16 14:21, Michael S. Tsirkin wrote:
>
> OK, so I'll have to tweak the test to put something on the stack to
> measure the difference: my test tweaks a global variable instead.
> I'll try that by tomorrow.
>
> I couldn't measure any difference between mfence and lock+addl except
> in a micro-benchmark, but hey, since we are tweaking this, let's do
> the optimal thing.
>

Be careful with this: if it only shows up in a microbenchmark, we may
introduce a hard-to-debug regression for no real benefit.

	-hpa
On Tue, Jan 12, 2016 at 01:37:38PM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto at amacapital.net> wrote:
> >
> > Here's an article with numbers:
> >
> > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
>
> Well, that's with the busy loop and one set of code generation. It
> doesn't show the "oops, deeper stack isn't even in the cache any more
> due to call chains" issue.

It's an interesting read, thanks! So the stack pointer is read on
return from a function, I think. I added a function call and, sure
enough, it slows the "add $0,0(sp)" variant down. It's still faster
than mfence for me, though! Testing code + results below. Reaching
below the stack pointer, or allocating an extra 4 bytes above the
stack pointer, gives us back the performance.

> But yes:
>
> > I think they're suggesting using a negative offset, which is safe as
> > long as it doesn't page fault, even though we have the redzone
> > disabled.
>
> I think a negative offset might work very well. Partly exactly
> *because* we have the redzone disabled: we know that inside the
> kernel, we'll never have any live stack frame accesses under the stack
> pointer, so "-4(%rsp)" sounds good to me. There should never be any
> pending writes in the write buffer, because even if it *was* live, it
> would have been read off first.
>
> Yeah, it potentially does extend the stack cache footprint by another
> 4 bytes, but that sounds very benign.
>
> So perhaps it might be worth trying to switch the "mfence" to "lock ;
> addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate
> for x86-32.
>
> I'd still want to see somebody try to benchmark it. I doubt it's
> noticeable, but making changes because you think it might save a few
> cycles without then even measuring it is just wrong.
>
> Linus

I'll try this in the kernel now and will report, though I'm not
optimistic that a high-level benchmark can show this kind of thing.

---------------
main.c:
---------------

extern volatile int x;
volatile int x;

#ifdef __x86_64__
#define SP "rsp"
#else
#define SP "esp"
#endif

#ifdef lock
#define barrier() do { int p; asm volatile ("lock; addl $0,%0" ::"m"(p): "memory"); } while (0)
#endif

#ifdef locksp
#define barrier() asm("lock; addl $0,0(%%" SP ")" ::: "memory")
#endif

#ifdef lockrz
#define barrier() asm("lock; addl $0,-4(%%" SP ")" ::: "memory")
#endif

#ifdef xchg
#define barrier() do { int p; int ret; asm volatile ("xchgl %0, %1;": "=r"(ret) : "m"(p): "memory", "cc"); } while (0)
#endif

#ifdef xchgrz
/* same as xchg but poking at the gcc red zone */
#define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
#endif

#ifdef mfence
#define barrier() asm("mfence" ::: "memory")
#endif

#ifdef lfence
#define barrier() asm("lfence" ::: "memory")
#endif

#ifdef sfence
#define barrier() asm("sfence" ::: "memory")
#endif

void __attribute__ ((noinline)) test(int i, int *j)
{
	/*
	 * Test the barrier in a loop. We also poke at a volatile variable in
	 * an attempt to make it a bit more realistic - this way there's
	 * something in the store-buffer.
	 */
	x = i - *j;
	barrier();
	*j = x;
}

int main(int argc, char **argv)
{
	int i;
	int j = 1234;

	for (i = 0; i < 10000000; ++i)
		test(i, &j);
	return 0;
}

---------------
Makefile:
---------------

ALL = xchg xchgrz lock locksp lockrz mfence lfence sfence

CC = gcc
CFLAGS += -Wall -O2 -ggdb
PERF = perf stat -r 10 --log-fd 1 --
TIME = /usr/bin/time -f %e
FILTER = cat

all: ${ALL}

clean:
	rm -f ${ALL}

run: all
	for file in ${ALL}; do echo ${RUN} ./$$file "|" ${FILTER}; ${RUN} ./$$file | ${FILTER}; done

perf time: run

time: RUN=${TIME}
perf: RUN=${PERF}
perf: FILTER=grep elapsed

.PHONY: all clean run perf time

xchgrz: CFLAGS += -mno-red-zone

${ALL}: main.c
	${CC} ${CFLAGS} -D$@ -o $@ main.c

--------------------------------------------

perf stat -r 10 --log-fd 1 -- ./xchg | grep elapsed
       0.080420565 seconds time elapsed                ( +-  2.31% )
perf stat -r 10 --log-fd 1 -- ./xchgrz | grep elapsed
       0.087798571 seconds time elapsed                ( +-  2.58% )
perf stat -r 10 --log-fd 1 -- ./lock | grep elapsed
       0.083023724 seconds time elapsed                ( +-  2.44% )
perf stat -r 10 --log-fd 1 -- ./locksp | grep elapsed
       0.102880750 seconds time elapsed                ( +-  0.13% )
perf stat -r 10 --log-fd 1 -- ./lockrz | grep elapsed
       0.084917420 seconds time elapsed                ( +-  3.28% )
perf stat -r 10 --log-fd 1 -- ./mfence | grep elapsed
       0.156014715 seconds time elapsed                ( +-  0.16% )
perf stat -r 10 --log-fd 1 -- ./lfence | grep elapsed
       0.077731443 seconds time elapsed                ( +-  0.12% )
perf stat -r 10 --log-fd 1 -- ./sfence | grep elapsed
       0.036655741 seconds time elapsed                ( +-  0.21% )
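For anyone wanting to reproduce these numbers: the output above appears
to come from the Makefile's "perf" target, which builds one binary per
barrier variant (selected via -Dxchg, -Dlock, -Dlockrz, etc.) and runs
"perf stat -r 10" on each, keeping only the elapsed-time line. Assuming
gcc, GNU make and perf are installed, roughly:

	make
	make perf

"make time" does the same using /usr/bin/time instead of perf, and
"make run" just runs each binary without timing.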