Kees Cook
2017-Mar-21 20:49 UTC
[Bridge] [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Mon, Mar 20, 2017 at 6:40 AM, Peter Zijlstra <peterz at infradead.org> wrote:
> On Mon, Mar 20, 2017 at 09:27:13PM +0800, Herbert Xu wrote:
>> On Mon, Mar 20, 2017 at 02:23:57PM +0100, Peter Zijlstra wrote:
>> >
>> > So what bench/setup do you want ran?
>>
>> You can start by counting how many cycles an atomic op takes
>> vs. how many cycles this new code takes.
>
> On what uarch?
>
> I think I tested hand coded asm version and it ended up about double the
> cycles for a cmpxchg loop vs the direct instruction on an IVB-EX (until
> the memory bus saturated, at which point they took the same). Newer
> parts will of course have different numbers,
>
> Can't we run some iperf on a 40gbe fiber loop or something? It would be
> very useful to have an actual workload we can run.

Yeah, this is exactly what I'd like to find as well. Just comparing
cycles between refcount implementations, while interesting, doesn't
show us real-world performance changes, which is what we need to
measure.

Is Eric's "20 concurrent 'netperf -t UDP_STREAM'" example (from
elsewhere in this email thread) real-world meaningful enough?

-Kees

--
Kees Cook
Pixel Security
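
For concreteness, here is a minimal userspace sketch (not from the thread)
of the cycle comparison being discussed: a plain atomic increment vs. a
cmpxchg loop with an overflow check, roughly what the refcount_t conversion
does on each increment. It assumes x86 (rdtsc via __rdtsc) and GCC/Clang
__atomic builtins; the iteration count and saturation constant are
illustrative only, not taken from the patches.

/*
 * Sketch only: uncontended cycles/op for a direct atomic increment vs. a
 * cmpxchg loop with a saturation check. Assumes x86 and __atomic builtins.
 */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define ITERS 10000000UL
#define SATURATED UINT32_MAX	/* illustrative saturation value */

static unsigned int counter;

/* Direct atomic increment: compiles to a single lock-prefixed inc/add. */
static void plain_inc(void)
{
	__atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
}

/* cmpxchg loop with a saturation check, roughly what a checked inc does. */
static void checked_inc(void)
{
	unsigned int old = __atomic_load_n(&counter, __ATOMIC_RELAXED);

	do {
		if (old == SATURATED)
			return;		/* refuse to overflow */
	} while (!__atomic_compare_exchange_n(&counter, &old, old + 1, 0,
					      __ATOMIC_RELAXED,
					      __ATOMIC_RELAXED));
}

static uint64_t bench(void (*fn)(void))
{
	uint64_t start = __rdtsc();

	for (unsigned long i = 0; i < ITERS; i++)
		fn();

	return (__rdtsc() - start) / ITERS;
}

int main(void)
{
	counter = 1;
	printf("plain atomic inc : ~%llu cycles/op\n",
	       (unsigned long long)bench(plain_inc));
	counter = 1;
	printf("cmpxchg-loop inc : ~%llu cycles/op\n",
	       (unsigned long long)bench(checked_inc));
	return 0;
}

Under contention the gap narrows as cache-line traffic dominates both
variants, which matches Peter's observation that the two converge once the
memory bus saturates; either way, such numbers say little about end-to-end
network throughput, which is the point being made here.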
Eric Dumazet
2017-Mar-21 21:23 UTC
[Bridge] [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Tue, 2017-03-21 at 13:49 -0700, Kees Cook wrote:
> Yeah, this is exactly what I'd like to find as well. Just comparing
> cycles between refcount implementations, while interesting, doesn't
> show us real-world performance changes, which is what we need to
> measure.
>
> Is Eric's "20 concurrent 'netperf -t UDP_STREAM'" example (from
> elsewhere in this email thread) real-world meaningful enough?

Not at all ;) This was targeting the specific change I had in mind for
ip_idents_reserve(), which is not used by TCP flows.

Unfortunately there is no good test simulating real-world workloads,
which mostly use TCP flows. Most synthetic tools you can find do not
use epoll(), and very often hit bottlenecks in other layers.

It looks like our suggestion to get kernel builds where atomic_inc() is
exactly an atomic_inc() has not even been discussed or implemented.
Coding this would take less time than running a typical Google kernel
qualification (roughly one month, thousands of hosts..., days of SWE).
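
A sketch (not from the thread) of what the "atomic_inc() being exactly an
atomic_inc()" build could look like: a config switch selecting either the
checked, cmpxchg-based implementation or a plain atomic fallback, so the
two kernels can be benchmarked against each other. The
CONFIG_REFCOUNT_CHECKED symbol is hypothetical here; mainline later
adopted a similar checked-vs-plain split.

/*
 * Sketch only: config-gated refcount_inc(), so an unchecked build costs
 * exactly one atomic_inc() per increment.
 */
#include <linux/atomic.h>

typedef struct refcount_struct {
	atomic_t refs;
} refcount_t;

#ifdef CONFIG_REFCOUNT_CHECKED
/* Checked variant: cmpxchg loop that saturates instead of overflowing. */
extern void refcount_inc(refcount_t *r);
#else
/* Unchecked variant: compiles to exactly an atomic_inc(). */
static inline void refcount_inc(refcount_t *r)
{
	atomic_inc(&r->refs);
}
#endif

With such a split in place, producing the comparison baseline described
above becomes a one-line Kconfig change rather than a new benchmark
harness or a full qualification run.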