thr3ads.net - Linux Virtualization - [PATCH 3/4] x86,asm: Re-work smp_store

If this information is useful, please help other people find it:
Share via:

Michael S. Tsirkin

2016-Jan-12 13:57 UTC

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

On Mon, Nov 02, 2015 at 04:06:46PM -0800, Linus Torvalds
wrote:> On Mon, Nov 2, 2015 at 12:15 PM, Davidlohr Bueso <dave at
stgolabs.net> wrote:
> >
> > So I ran some experiments on an IvyBridge (2.8GHz) and the cost of
XCHG is
> > constantly cheaper (by at least half the latency) than MFENCE. While
there
> > was a decent amount of variation, this difference remained rather
constant.
> 
> Mind testing "lock addq $0,0(%rsp)" instead of mfence? That's
what we
> use on old cpu's without one (ie 32-bit).
> 
> I'm not actually convinced that mfence is necessarily a good idea. I
> could easily see it being microcode, for example.
> 
> At least on my Haswell, the "lock addq" is pretty much exactly
half
> the cost of "mfence".
> 
>                      Linus
mfence was high on some traces I was seeing, so I got curious, too:

---->
main.c
---->


extern volatile int x;
volatile int x;

#ifdef __x86_64__
#define SP "rsp"
#else
#define SP "esp"
#endif
#ifdef lock
#define barrier() asm("lock; addl $0,0(%%" SP ")" :::
"memory")
#endif
#ifdef xchg
#define barrier() do { int p; int ret; asm volatile ("xchgl %0, %1;":
"=r"(ret) : "m"(p): "memory", "cc"); }
while (0)
#endif
#ifdef xchgrz
/* same as xchg but poking at gcc red zone */
#define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP
");": "=r"(ret) :: "memory", "cc"); }
while (0)
#endif
#ifdef mfence
#define barrier() asm("mfence" ::: "memory")
#endif
#ifdef lfence
#define barrier() asm("lfence" ::: "memory")
#endif
#ifdef sfence
#define barrier() asm("sfence" ::: "memory")
#endif

int main(int argc, char **argv)
{
	int i;
	int j = 1234;

	/*
	 * Test barrier in a loop. We also poke at a volatile variable in an
	 * attempt to make it a bit more realistic - this way there's something
	 * in the store-buffer.
	 */
	for (i = 0; i < 10000000; ++i) {
		x = i - j;
		barrier();
		j = x;
	}

	return 0;
}
---->
Makefile:
---->

ALL = xchg xchgrz lock mfence lfence sfence

CC = gcc
CFLAGS += -Wall -O2 -ggdb
PERF = perf stat -r 10 --log-fd 1 --

all: ${ALL}
clean:
	rm -f ${ALL}
run: all
	for file in ${ALL}; do echo ${PERF} ./$$file ; ${PERF} ./$$file; done

.PHONY: all clean run

${ALL}: main.c
	${CC} ${CFLAGS} -D$@ -o $@ main.c

----->

Is this a good way to test it?

E.g. on my laptop I get:

perf stat -r 10 --log-fd 1 -- ./xchg

 Performance counter stats for './xchg' (10 runs):

         53.236967 task-clock                #    0.992 CPUs utilized           
( +-  0.09% )
                10 context-switches          #    0.180 K/sec                   
( +-  1.70% )
                 0 CPU-migrations            #    0.000 K/sec                  
                37 page-faults               #    0.691 K/sec                   
( +-  1.13% )
       190,997,612 cycles                    #    3.588 GHz                     
( +-  0.04% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,654,850 instructions              #    0.42  insns per cycle         
( +-  0.01% )
        10,122,372 branches                  #  190.138 M/sec                   
( +-  0.01% )
             4,514 branch-misses             #    0.04% of all branches         
( +-  3.37% )

       0.053642809 seconds time elapsed                                         
( +-  0.12% )

perf stat -r 10 --log-fd 1 -- ./xchgrz

 Performance counter stats for './xchgrz' (10 runs):

         53.189533 task-clock                #    0.997 CPUs utilized           
( +-  0.22% )
                 0 context-switches          #    0.000 K/sec                  
                 0 CPU-migrations            #    0.000 K/sec                  
                37 page-faults               #    0.694 K/sec                   
( +-  0.75% )
       190,785,621 cycles                    #    3.587 GHz                     
( +-  0.03% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,602,086 instructions              #    0.42  insns per cycle         
( +-  0.00% )
        10,112,154 branches                  #  190.115 M/sec                   
( +-  0.01% )
             3,743 branch-misses             #    0.04% of all branches         
( +-  4.02% )

       0.053343693 seconds time elapsed                                         
( +-  0.23% )

perf stat -r 10 --log-fd 1 -- ./lock

 Performance counter stats for './lock' (10 runs):

         53.096434 task-clock                #    0.997 CPUs utilized           
( +-  0.16% )
                 0 context-switches          #    0.002 K/sec                   
( +-100.00% )
                 0 CPU-migrations            #    0.000 K/sec                  
                37 page-faults               #    0.693 K/sec                   
( +-  0.98% )
       190,796,621 cycles                    #    3.593 GHz                     
( +-  0.02% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,601,376 instructions              #    0.42  insns per cycle         
( +-  0.01% )
        10,112,074 branches                  #  190.447 M/sec                   
( +-  0.01% )
             3,475 branch-misses             #    0.03% of all branches         
( +-  1.33% )

       0.053252678 seconds time elapsed                                         
( +-  0.16% )

perf stat -r 10 --log-fd 1 -- ./mfence

 Performance counter stats for './mfence' (10 runs):

        126.376473 task-clock                #    0.999 CPUs utilized           
( +-  0.21% )
                 0 context-switches          #    0.002 K/sec                   
( +- 66.67% )
                 0 CPU-migrations            #    0.000 K/sec                  
                36 page-faults               #    0.289 K/sec                   
( +-  0.84% )
       456,147,770 cycles                    #    3.609 GHz                     
( +-  0.01% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,892,416 instructions              #    0.18  insns per cycle         
( +-  0.00% )
        10,163,220 branches                  #   80.420 M/sec                   
( +-  0.01% )
             4,653 branch-misses             #    0.05% of all branches         
( +-  1.27% )

       0.126539273 seconds time elapsed                                         
( +-  0.21% )

perf stat -r 10 --log-fd 1 -- ./lfence

 Performance counter stats for './lfence' (10 runs):

         47.617861 task-clock                #    0.997 CPUs utilized           
( +-  0.06% )
                 0 context-switches          #    0.002 K/sec                   
( +-100.00% )
                 0 CPU-migrations            #    0.000 K/sec                  
                36 page-faults               #    0.764 K/sec                   
( +-  0.45% )
       170,767,856 cycles                    #    3.586 GHz                     
( +-  0.03% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,581,607 instructions              #    0.47  insns per cycle         
( +-  0.00% )
        10,108,508 branches                  #  212.284 M/sec                   
( +-  0.00% )
             3,320 branch-misses             #    0.03% of all branches         
( +-  1.12% )

       0.047768505 seconds time elapsed                                         
( +-  0.07% )

perf stat -r 10 --log-fd 1 -- ./sfence

 Performance counter stats for './sfence' (10 runs):

         20.156676 task-clock                #    0.988 CPUs utilized           
( +-  0.45% )
                 3 context-switches          #    0.159 K/sec                   
( +- 12.15% )
                 0 CPU-migrations            #    0.000 K/sec                  
                36 page-faults               #    0.002 M/sec                   
( +-  0.87% )
        72,212,225 cycles                    #    3.583 GHz                     
( +-  0.33% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,479,149 instructions              #    1.11  insns per cycle         
( +-  0.00% )
        10,090,785 branches                  #  500.618 M/sec                   
( +-  0.01% )
             3,626 branch-misses             #    0.04% of all branches         
( +-  3.59% )

       0.020411208 seconds time elapsed                                         
( +-  0.52% )


So mfence is more expensive than locked instructions/xchg, but sfence/lfence
are slightly faster, and xchg and locked instructions are very close if
not the same.

I poked at some 10 intel and AMD machines and the numbers are different
but the results seem more or less consistent with this.
>From size point of view xchg is longer and xchgrz pokes at the red zonewhich seems unnecessarily hacky, so good old lock+addl is probably the
best.

There isn't any extra magic behind mfence, is there?
E.g. I think lock orders accesses to WC memory as well,
so apparently mb() can be redefined unconditionally, without
looking at XMM2:

--->
x86: drop mfence in favor of lock+addl

mfence appears to be way slower than a locked instruction - let's use
lock+add unconditionally, same as we always did on old 32-bit.

Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---

I'll play with this some more before posting this as a
non-stand alone patch. Is there a macro-benchmark where mb
is prominent?

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index a584e1c..f0d36e2 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -15,15 +15,15 @@
  * Some non-Intel clones support out of order store. wmb() ceases to be a
  * nop for these.
  */
-#define mb() alternative("lock; addl $0,0(%%esp)",
"mfence", X86_FEATURE_XMM2)
+#define mb() asm volatile("lock; addl
$0,0(%%esp)":::"memory")
 #define rmb() alternative("lock; addl $0,0(%%esp)",
"lfence", X86_FEATURE_XMM2)
 #define wmb() alternative("lock; addl $0,0(%%esp)",
"sfence", X86_FEATURE_XMM)
 #else
+#define mb()	asm volatile("lock; addl
$0,0(%%rsp)":::"memory")
 #define rmb()	asm volatile("lfence":::"memory")
 #define wmb()	asm volatile("sfence" ::: "memory")
 #endif
 
 #ifdef CONFIG_X86_PPRO_FENCE
 #define dma_rmb()	rmb()
 #else
> --
> To unsubscribe from this list: send the line "unsubscribe
linux-kernel" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

Linus Torvalds

2016-Jan-12 17:20 UTC

head link

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst at redhat.com>
wrote:> #ifdef xchgrz
> /* same as xchg but poking at gcc red zone */
> #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%"
SP ");": "=r"(ret) :: "memory", "cc"); }
while (0)
> #endif
That's not safe in general. gcc might be using its redzone, so doing
xchg into it is unsafe.

But..
> Is this a good way to test it?
.. it's fine for some basic testing. It doesn't show any subtle
interactions (ie some operations may have different dynamic behavior
when the write buffers are busy etc), but as a baseline for "how fast
can things go" the stupid raw loop is fine. And while the xchg into
the redzoen wouldn't be acceptable as a real implementation, for
timing testing it's likely fine (ie you aren't hitting the problem it
can cause).
> So mfence is more expensive than locked instructions/xchg, but
sfence/lfence
> are slightly faster, and xchg and locked instructions are very close if
> not the same.
Note that we never actually *use* lfence/sfence. They are pointless
instructions when looking at CPU memory ordering, because for pure CPU
memory ordering stores and loads are already ordered.

The only reason to use lfence/sfence is after you've used nontemporal
stores for IO. That's very very rare in the kernel. So I wouldn't
worry about those.

But yes, it does sound like mfence is just a bad idea too.
> There isn't any extra magic behind mfence, is there?
No.

I think the only issue is that there has never been any real reason
for CPU designers to try to make mfence go particularly fast. Nobody
uses it, again with the exception of some odd loops that use
nontemporal stores, and for those the cost tends to always be about
the nontemporal accesses themselves (often to things like GPU memory
over PCIe), and the mfence cost of a few extra cycles is negligible.

The reason "lock ; add $0" has generally been the fastest we've
found
is simply that locked ops have been important for CPU designers.

So I think the patch is fine, and we should likely drop the use of mfence..

                      Linus

Michael S. Tsirkin

2016-Jan-12 17:45 UTC

head link

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

On Tue, Jan 12, 2016 at 09:20:06AM -0800, Linus Torvalds
wrote:> On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst at
redhat.com> wrote:
> > #ifdef xchgrz
> > /* same as xchg but poking at gcc red zone */
> > #define barrier() do { int ret; asm volatile ("xchgl %0,
-4(%%" SP ");": "=r"(ret) :: "memory",
"cc"); } while (0)
> > #endif
> 
> That's not safe in general. gcc might be using its redzone, so doing
> xchg into it is unsafe.
> 
> But..
> 
> > Is this a good way to test it?
> 
> .. it's fine for some basic testing. It doesn't show any subtle
> interactions (ie some operations may have different dynamic behavior
> when the write buffers are busy etc), but as a baseline for "how fast
> can things go" the stupid raw loop is fine. And while the xchg into
> the redzoen wouldn't be acceptable as a real implementation, for
> timing testing it's likely fine (ie you aren't hitting the problem
it
> can cause).
> 
> > So mfence is more expensive than locked instructions/xchg, but
sfence/lfence
> > are slightly faster, and xchg and locked instructions are very close
if
> > not the same.
> 
> Note that we never actually *use* lfence/sfence. They are pointless
> instructions when looking at CPU memory ordering, because for pure CPU
> memory ordering stores and loads are already ordered.
> 
> The only reason to use lfence/sfence is after you've used nontemporal
> stores for IO.

By the way, the comment in barrier.h says:

/*
 * Some non-Intel clones support out of order store. wmb() ceases to be
 * a nop for these.
 */

and while the 1st sentence may well be true, if you have
an SMP system with out of order stores, making wmb
not a nop will not help.

Additionally as you point out, wmb is not a nop even
for regular intel CPUs because of these weird use-cases.

Drop this comment?
> That's very very rare in the kernel. So I wouldn't
> worry about those.
Right - I'll leave these alone, whoever wants to optimize this path will
have to do the necessary research.
> But yes, it does sound like mfence is just a bad idea too.
> 
> > There isn't any extra magic behind mfence, is there?
> 
> No.
> 
> I think the only issue is that there has never been any real reason
> for CPU designers to try to make mfence go particularly fast. Nobody
> uses it, again with the exception of some odd loops that use
> nontemporal stores, and for those the cost tends to always be about
> the nontemporal accesses themselves (often to things like GPU memory
> over PCIe), and the mfence cost of a few extra cycles is negligible.
> 
> The reason "lock ; add $0" has generally been the fastest
we've found
> is simply that locked ops have been important for CPU designers.
> 
> So I think the patch is fine, and we should likely drop the use of mfence..
> 
>                       Linus
OK so should I repost after a bit more testing?  I don't believe this
will affect the kernel build benchmark, but I'll try :)


-- 
MST

Andy Lutomirski

2016-Jan-12 20:30 UTC

head link

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

On 01/12/2016 09:20 AM, Linus Torvalds wrote:> On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst at
redhat.com> wrote:
>> #ifdef xchgrz
>> /* same as xchg but poking at gcc red zone */
>> #define barrier() do { int ret; asm volatile ("xchgl %0,
-4(%%" SP ");": "=r"(ret) :: "memory",
"cc"); } while (0)
>> #endif
>
> That's not safe in general. gcc might be using its redzone, so doing
> xchg into it is unsafe.
>
> But..
>
>> Is this a good way to test it?
>
> .. it's fine for some basic testing. It doesn't show any subtle
> interactions (ie some operations may have different dynamic behavior
> when the write buffers are busy etc), but as a baseline for "how fast
> can things go" the stupid raw loop is fine. And while the xchg into
> the redzoen wouldn't be acceptable as a real implementation, for
> timing testing it's likely fine (ie you aren't hitting the problem
it
> can cause).
I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 
64) was better because it avoided stomping on very-likely-to-be-hot 
write buffers.

--Andy

Maybe Matching Threads

Search for more apparently analagous threads

Linux Virtualization - Jan 2016 - [PATCH 3/4] x86,asm: Re-work smp_store_mb()

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

[PATCH 3/4] x86,asm: Re-work smp_store_mb()

Maybe Matching Threads