Herbert Xu
2017-Mar-20 13:27 UTC
[Bridge] [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Mon, Mar 20, 2017 at 02:23:57PM +0100, Peter Zijlstra wrote:
> 
> So what bench/setup do you want ran?

You can start by counting how many cycles an atomic op takes
vs. how many cycles this new code takes.

Cheers,
-- 
Email: Herbert Xu <herbert at gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Peter Zijlstra
2017-Mar-20 13:40 UTC
[Bridge] [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Mon, Mar 20, 2017 at 09:27:13PM +0800, Herbert Xu wrote:
> On Mon, Mar 20, 2017 at 02:23:57PM +0100, Peter Zijlstra wrote:
> >
> > So what bench/setup do you want ran?
>
> You can start by counting how many cycles an atomic op takes
> vs. how many cycles this new code takes.

On what uarch?

I think I tested hand coded asm version and it ended up about double the
cycles for a cmpxchg loop vs the direct instruction on an IVB-EX (until
the memory bus saturated, at which point they took the same). Newer
parts will of course have different numbers,

Can't we run some iperf on a 40gbe fiber loop or something? It would be
very useful to have an actual workload we can run.
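[Editorial sketch, not part of the original message.] The comparison discussed above, a direct atomic instruction versus a cmpxchg loop, can be approximated in user space with C11 atomics. Everything below is an assumption made for illustration: `plain_inc`, `checked_inc`, `time_plain` and `time_checked` are hypothetical names, not kernel APIs, and the harness reports wall-clock nanoseconds rather than cycles. It measures only the uncontended, single-threaded case, so it says nothing about behaviour once the memory bus saturates.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a plain atomic increment: one read-modify-write. */
static inline void plain_inc(atomic_uint *v)
{
    atomic_fetch_add_explicit(v, 1, memory_order_relaxed);
}

/*
 * Hypothetical stand-in for a checked increment built on a cmpxchg loop.
 * Refuses to increment from 0 and stays saturated at UINT_MAX.
 */
static inline bool checked_inc(atomic_uint *v)
{
    unsigned int old = atomic_load_explicit(v, memory_order_relaxed);

    do {
        if (old == 0)
            return false;
        if (old == UINT_MAX)
            return true;
        /* On failure, 'old' is reloaded with the current value. */
    } while (!atomic_compare_exchange_weak_explicit(
                 v, &old, old + 1,
                 memory_order_relaxed, memory_order_relaxed));
    return true;
}

/* Wall-clock time in nanoseconds (C11 timespec_get; not a cycle counter). */
static uint64_t now_ns(void)
{
    struct timespec ts;

    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Average nanoseconds per plain_inc() over 'iters' uncontended calls. */
static double time_plain(atomic_uint *v, unsigned long iters)
{
    assert(iters > 0);
    uint64_t t0 = now_ns();
    for (unsigned long i = 0; i < iters; i++)
        plain_inc(v);
    return (double)(now_ns() - t0) / (double)iters;
}

/* Average nanoseconds per checked_inc() over 'iters' uncontended calls. */
static double time_checked(atomic_uint *v, unsigned long iters)
{
    assert(iters > 0);
    uint64_t t0 = now_ns();
    for (unsigned long i = 0; i < iters; i++)
        (void)checked_inc(v);
    return (double)(now_ns() - t0) / (double)iters;
}
```

A caller would initialise a counter to 1 with `atomic_init()`, run both `time_plain()` and `time_checked()` with the same iteration count, and compare the two averages. Absolute numbers depend heavily on the microarchitecture and compiler flags, which is the concern raised in the message.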
Eric Dumazet
2017-Mar-20 14:51 UTC
[Bridge] [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Mon, 2017-03-20 at 14:40 +0100, Peter Zijlstra wrote:
> On Mon, Mar 20, 2017 at 09:27:13PM +0800, Herbert Xu wrote:
> > On Mon, Mar 20, 2017 at 02:23:57PM +0100, Peter Zijlstra wrote:
> > >
> > > So what bench/setup do you want ran?
> >
> > You can start by counting how many cycles an atomic op takes
> > vs. how many cycles this new code takes.
>
> On what uarch?
>
> I think I tested hand coded asm version and it ended up about double the
> cycles for a cmpxchg loop vs the direct instruction on an IVB-EX (until
> the memory bus saturated, at which point they took the same). Newer
> parts will of course have different numbers,
>
> Can't we run some iperf on a 40gbe fiber loop or something? It would be
> very useful to have an actual workload we can run.

If atomic ops are converted one by one, it is likely that results will
be noise.

We cannot start a global conversion without having a way to have
selective debugging?

Then, adopting this fine infra would really not be a problem.

Some arches have an efficient atomic_inc() (no full barriers), while
"load + test + atomic_cmpxchg() + test + loop" is more expensive.

PowerPC has no efficient atomic_inc() and this definitely shows on
network intensive workloads involving concurrent cores/threads.

atomic_cmpxchg() on PowerPC is horribly more expensive because of the
added two SYNC instructions.

Networking performance is quite poor on PowerPC as of today.
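[Editorial sketch, not part of the original message.] One way to read the "selective debugging" request is a build-time switch that chooses between a plain atomic increment and the checked "load + test + cmpxchg + test + loop" sequence. The following is a minimal user-space illustration with C11 atomics under that assumption. `ref_t`, `ref_get`, `ref_put` and the `REF_DEBUG` macro are hypothetical names invented here; this is not the kernel's `refcount_t` implementation, and the relaxed orderings used do not model the barrier cost described for PowerPC.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference counter; not the kernel's refcount_t. */
typedef struct {
    atomic_uint refs;
} ref_t;

#ifdef REF_DEBUG /* hypothetical build-time switch, off by default */
/* Checked path: load + test + cmpxchg + test + loop. */
static inline bool ref_get(ref_t *r)
{
    unsigned int old = atomic_load_explicit(&r->refs, memory_order_relaxed);

    do {
        if (old == 0)
            return false;   /* object already released */
        if (old == UINT_MAX)
            return true;    /* saturated: leak instead of wrapping */
    } while (!atomic_compare_exchange_weak_explicit(
                 &r->refs, &old, old + 1,
                 memory_order_relaxed, memory_order_relaxed));
    return true;
}
#else
/* Fast path: a single relaxed atomic increment, no checks. */
static inline bool ref_get(ref_t *r)
{
    atomic_fetch_add_explicit(&r->refs, 1, memory_order_relaxed);
    return true;
}
#endif

/* Drop one reference; returns true when the caller released the last one. */
static inline bool ref_put(ref_t *r)
{
    unsigned int old = atomic_fetch_sub_explicit(&r->refs, 1,
                                                 memory_order_release);

    assert(old != 0);       /* underflow would mean an unbalanced put */
    if (old != 1)
        return false;
    /* Order earlier accesses to the object before the caller frees it. */
    atomic_thread_fence(memory_order_acquire);
    return true;
}
```

Building with `-DREF_DEBUG` selects the checked path for every user of `ref_get()`. A per-subsystem switch would need a finer-grained mechanism, which this sketch does not attempt.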