Eric Blake
2019-Jun-06  22:48 UTC
[Libguestfs] [nbdkit PATCH 0/2] Reduce network overhead with corking
Slightly RFC, as I need more time to investigate why Unix sockets appeared to degrade with this patch. But as TCP sockets (over loopback to localhost) and TLS sessions (regardless of underlying Unix or TCP) both showed improvements, this looks like a worthwhile series. Eric Blake (2): server: Add support for corking server: Cork around grouped transmission send()s server/internal.h | 3 +++ server/connections.c | 27 +++++++++++++++++++++++ server/crypto.c | 19 ++++++++++++++++ server/protocol.c | 52 ++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 101 insertions(+) -- 2.20.1
Eric Blake
2019-Jun-06  22:48 UTC
[Libguestfs] [nbdkit PATCH 1/2] server: Add support for corking
Any time we reply to NBD_CMD_READ or NBD_CMD_BLOCK_STATUS, we end up
calling conn->send() more than once. Now that we've disabled Nagle's
algorithm, this implies that we try harder to send the small header
immediately, rather than batching it with the rest of the payload,
which causes more overhead in the amount of actual network traffic.
For interfaces that support corking (gnutls, or Linux TCP sockets), we
can give a hint that the separate send() calls should be batched into
a single network packet where practical.
This patch just wires up support; the next one will actually use it
and provide performance measurements.
Signed-off-by: Eric Blake <eblake@redhat.com>
---
 server/internal.h    |  3 +++
 server/connections.c | 25 +++++++++++++++++++++++++
 server/crypto.c      | 19 +++++++++++++++++++
 3 files changed, 47 insertions(+)
diff --git a/server/internal.h b/server/internal.h
index 2ee5e23..cb34323 100644
--- a/server/internal.h
+++ b/server/internal.h
@@ -145,6 +145,8 @@ typedef int (*connection_recv_function) (struct connection
*,
 typedef int (*connection_send_function) (struct connection *,
                                          const void *buf, size_t len)
   __attribute__((__nonnull__ (1, 2)));
+typedef int (*connection_cork_function) (struct connection *, bool)
+  __attribute__((__nonnull__ (1)));
 typedef void (*connection_close_function) (struct connection *)
   __attribute__((__nonnull__ (1)));
@@ -180,6 +182,7 @@ struct connection {
   int sockin, sockout;
   connection_recv_function recv;
   connection_send_function send;
+  connection_cork_function cork;
   connection_close_function close;
 };
diff --git a/server/connections.c b/server/connections.c
index b7d9a6a..9b0b75c 100644
--- a/server/connections.c
+++ b/server/connections.c
@@ -38,6 +38,8 @@
 #include <inttypes.h>
 #include <string.h>
 #include <unistd.h>
+#include <netinet/ip.h>
+#include <netinet/tcp.h>
 #include "internal.h"
@@ -51,6 +53,7 @@ static void free_connection (struct connection *conn);
 /* Don't call these raw socket functions directly.  Use conn->recv etc.
*/
 static int raw_recv (struct connection *, void *buf, size_t len);
 static int raw_send (struct connection *, const void *buf, size_t len);
+static int raw_cork (struct connection *conn, bool cork);
 static void raw_close (struct connection *);
 int
@@ -288,6 +291,15 @@ new_connection (int sockin, int sockout, int nworkers)
   conn->send = raw_send;
   conn->close = raw_close;
+  /* Install a cork handler, but only if corking works */
+#ifdef TCP_CORK
+  {
+    int opt = 0;
+    if (setsockopt (sockout, IPPROTO_TCP, TCP_CORK, &opt, sizeof opt) == 0)
+      conn->cork = raw_cork;
+  }
+#endif
+
   return conn;
 }
@@ -344,6 +356,19 @@ raw_send (struct connection *conn, const void *vbuf, size_t
len)
   return 0;
 }
+/* Change the cork status to batch a group of send calls, and either succeed
+ * completely (returns 0) or fail (returns -1).
+ */
+static int
+raw_cork (struct connection *conn, bool cork)
+{
+  int opt = cork;
+
+  /* Ignore failure; new_connection() checked that uncork should work */
+  setsockopt (conn->sockout, IPPROTO_TCP, TCP_CORK, &opt, sizeof opt);
+  return 0;
+}
+
 /* Read buffer from conn->sockin and either succeed completely
  * (returns > 0), read an EOF (returns 0), or fail (returns -1).
  */
diff --git a/server/crypto.c b/server/crypto.c
index 978a843..e4abca2 100644
--- a/server/crypto.c
+++ b/server/crypto.c
@@ -371,6 +371,24 @@ crypto_send (struct connection *conn, const void *vbuf,
size_t len)
   return 0;
 }
+/* Change the cork status to batch a group of send calls, and either succeed
+ * completely (returns 0) or fail (returns -1).
+ */
+static int
+crypto_cork (struct connection *conn, bool cork)
+{
+  gnutls_session_t session = conn->crypto_session;
+
+  assert (session != NULL);
+
+  if (cork)
+    gnutls_record_cork (session);
+  else if (gnutls_record_uncork (session, GNUTLS_RECORD_WAIT) < 0)
+    return -1;
+
+  return 0;
+}
+
 /* There's no place in the NBD protocol to send back errors from
  * close, so this function ignores errors.
  */
@@ -504,6 +522,7 @@ crypto_negotiate_tls (struct connection *conn, int sockin,
int sockout)
    */
   conn->crypto_session = session;
   conn->recv = crypto_recv;
+  conn->cork = crypto_cork;
   conn->send = crypto_send;
   conn->close = crypto_close;
   return 0;
-- 
2.20.1
Eric Blake
2019-Jun-06  22:48 UTC
[Libguestfs] [nbdkit PATCH 2/2] server: Cork around grouped transmission send()s
As mentioned in the previous patch, sending small packets as soon as
possible leads to network packet overhead.  Grouping related writes
under corking appears to help everything but unencrypted Unix plain
sockets. I tested with appropriate combinations from:
$ nbdkit {-p 10810,-U -} \
  {--tls=require --tls-verify-peer --tls-psk=./keys.psk,} memory size=64m \
  --run './aio-parallel-load{-tls,} {$unixsocket,nbd://localhost:10810}'
with the following IOPS measurements:
               pre       post
unix plain:    81003.5   78538.8
unix tls:      11678.0   12256.1
tcp plain:     65702.6   68010.9
tcp tls:       11058.5   11762.0
I intentionally did not bother with using corking during handshake
phase - that part of the protocol may benefit, but it is not the hot
spot in overall execution.
Signed-off-by: Eric Blake <eblake@redhat.com>
---
I'm not sure why Unix sockets degraded...
I also probably out to repeat my timings more than a single run...
---
 server/connections.c |  2 ++
 server/protocol.c    | 52 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)
diff --git a/server/connections.c b/server/connections.c
index 9b0b75c..8c92dc5 100644
--- a/server/connections.c
+++ b/server/connections.c
@@ -356,6 +356,7 @@ raw_send (struct connection *conn, const void *vbuf, size_t
len)
   return 0;
 }
+#ifdef TCP_CORK
 /* Change the cork status to batch a group of send calls, and either succeed
  * completely (returns 0) or fail (returns -1).
  */
@@ -368,6 +369,7 @@ raw_cork (struct connection *conn, bool cork)
   setsockopt (conn->sockout, IPPROTO_TCP, TCP_CORK, &opt, sizeof opt);
   return 0;
 }
+#endif /* TCP_CORK */
 /* Read buffer from conn->sockin and either succeed completely
  * (returns > 0), read an EOF (returns 0), or fail (returns -1).
diff --git a/server/protocol.c b/server/protocol.c
index 6d519e7..0057aed 100644
--- a/server/protocol.c
+++ b/server/protocol.c
@@ -398,6 +398,11 @@ send_simple_reply (struct connection *conn,
   reply.handle = handle;
   reply.error = htobe32 (nbd_errno (error, false));
+  if (conn->cork) {
+    r = conn->cork (conn, true);
+    assert (r == 0); /* For now, only uncorking can fail */
+  }
+
   r = conn->send (conn, &reply, sizeof reply);
   if (r == -1) {
     nbdkit_error ("write reply: %s: %m", name_of_nbd_cmd (cmd));
@@ -413,6 +418,14 @@ send_simple_reply (struct connection *conn,
     }
   }
+  if (conn->cork) {
+    r = conn->cork (conn, false);
+    if (r == -1) {
+      nbdkit_error ("write uncork: %s: %m", name_of_nbd_cmd (cmd));
+      return connection_set_status (conn, -1);
+    }
+  }
+
   return 1;                     /* command processed ok */
 }
@@ -433,6 +446,11 @@ send_structured_reply_read (struct connection *conn,
   assert (cmd == NBD_CMD_READ);
+  if (conn->cork) {
+    r = conn->cork (conn, true);
+    assert (r == 0); /* For now, only uncorking can fail */
+  }
+
   reply.magic = htobe32 (NBD_STRUCTURED_REPLY_MAGIC);
   reply.handle = handle;
   reply.flags = htobe16 (NBD_REPLY_FLAG_DONE);
@@ -459,6 +477,14 @@ send_structured_reply_read (struct connection *conn,
     return connection_set_status (conn, -1);
   }
+  if (conn->cork) {
+    r = conn->cork (conn, false);
+    if (r == -1) {
+      nbdkit_error ("write uncork: %s: %m", name_of_nbd_cmd (cmd));
+      return connection_set_status (conn, -1);
+    }
+  }
+
   return 1;                     /* command processed ok */
 }
@@ -566,6 +592,11 @@ send_structured_reply_block_status (struct connection
*conn,
   if (blocks == NULL)
     return connection_set_status (conn, -1);
+  if (conn->cork) {
+    r = conn->cork (conn, true);
+    assert (r == 0); /* For now, only uncorking can fail */
+  }
+
   reply.magic = htobe32 (NBD_STRUCTURED_REPLY_MAGIC);
   reply.handle = handle;
   reply.flags = htobe16 (NBD_REPLY_FLAG_DONE);
@@ -596,6 +627,14 @@ send_structured_reply_block_status (struct connection
*conn,
     }
   }
+  if (conn->cork) {
+    r = conn->cork (conn, false);
+    if (r == -1) {
+      nbdkit_error ("write uncork: %s: %m", name_of_nbd_cmd (cmd));
+      return connection_set_status (conn, -1);
+    }
+  }
+
   return 1;                     /* command processed ok */
 }
@@ -609,6 +648,11 @@ send_structured_reply_error (struct connection *conn,
   struct structured_reply_error error_data;
   int r;
+  if (conn->cork) {
+    r = conn->cork (conn, true);
+    assert (r == 0); /* For now, only uncorking can fail */
+  }
+
   reply.magic = htobe32 (NBD_STRUCTURED_REPLY_MAGIC);
   reply.handle = handle;
   reply.flags = htobe16 (NBD_REPLY_FLAG_DONE);
@@ -631,6 +675,14 @@ send_structured_reply_error (struct connection *conn,
   }
   /* No human readable error message at the moment. */
+  if (conn->cork) {
+    r = conn->cork (conn, false);
+    if (r == -1) {
+      nbdkit_error ("write uncork: %s: %m", name_of_nbd_cmd (cmd));
+      return connection_set_status (conn, -1);
+    }
+  }
+
   return 1;                     /* command processed ok */
 }
-- 
2.20.1
Eric Blake
2019-Jun-07  01:40 UTC
Re: [Libguestfs] [nbdkit PATCH 2/2] server: Cork around grouped transmission send()s
On 6/6/19 5:48 PM, Eric Blake wrote:> As mentioned in the previous patch, sending small packets as soon as > possible leads to network packet overhead. Grouping related writes > under corking appears to help everything but unencrypted Unix plain > sockets. I tested with appropriate combinations from: > > $ nbdkit {-p 10810,-U -} \ > {--tls=require --tls-verify-peer --tls-psk=./keys.psk,} memory size=64m \ > --run './aio-parallel-load{-tls,} {$unixsocket,nbd://localhost:10810}' > > with the following IOPS measurements: > > pre post > unix plain: 81003.5 78538.8 > unix tls: 11678.0 12256.1 > tcp plain: 65702.6 68010.9 > tcp tls: 11058.5 11762.0 > > I intentionally did not bother with using corking during handshake > phase - that part of the protocol may benefit, but it is not the hot > spot in overall execution. > > Signed-off-by: Eric Blake <eblake@redhat.com> > > --- > I'm not sure why Unix sockets degraded... > I also probably out to repeat my timings more than a single run...I've rerun things 10 times and tossed out the first run (using it to warm up the cache); this time my numbers were: pre post gain: avg worst best unix plain: 809534.6 812175.4 0.3% -0.6% 1.9% unix tls: 124926.4 122546.2 -1.9% -4.6% 1.4% tcp plain: 663690.0 679575.8 2.4% -0.4% 4.1% tcp tls: 114133.0 117745.4 3.2% -2.3% 7.9% So, the numbers vary, and there's quite a bit of noise (perhaps from the random number sequence). In general, it seems that TCP benefits more (the effect may be even more pronounced across the network, rather than when using loopback on localhost). I confirmed that for plain Unix sockets, the new cork code is NOT called (so the numbers should be in the same ballback, giving us an estimate of how much noise the other columns have). At any rate, I'm leaning towards declaring that the idea makes sense even if it is nowhere near as dramatic as the Nagle's algorithm speedups.> --- > server/connections.c | 2 ++ > server/protocol.c | 52 ++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 54 insertions(+) > > diff --git a/server/connections.c b/server/connections.c > index 9b0b75c..8c92dc5 100644 > --- a/server/connections.c > +++ b/server/connections.c > @@ -356,6 +356,7 @@ raw_send (struct connection *conn, const void *vbuf, size_t len) > return 0; > } > > +#ifdef TCP_CORK > /* Change the cork status to batch a group of send calls, and either succeed > * completely (returns 0) or fail (returns -1). > */ > @@ -368,6 +369,7 @@ raw_cork (struct connection *conn, bool cork) > setsockopt (conn->sockout, IPPROTO_TCP, TCP_CORK, &opt, sizeof opt); > return 0; > } > +#endif /* TCP_CORK */Whoops - this part belongs in patch 1.> > /* Read buffer from conn->sockin and either succeed completely > * (returns > 0), read an EOF (returns 0), or fail (returns -1). > diff --git a/server/protocol.c b/server/protocol.c > index 6d519e7..0057aed 100644 > --- a/server/protocol.c > +++ b/server/protocol.c > @@ -398,6 +398,11 @@ send_simple_reply (struct connection *conn, > reply.handle = handle; > reply.error = htobe32 (nbd_errno (error, false)); > > + if (conn->cork) { > + r = conn->cork (conn, true); > + assert (r == 0); /* For now, only uncorking can fail */ > + }One possibility might be to always install a conn->cork (and write a no-op version for Unix/-s use), rather than checking if the cork callback is NULL (although my gut feel is a null check beats a no-op function indirection).> + > r = conn->send (conn, &reply, sizeof reply); > if (r == -1) { > nbdkit_error ("write reply: %s: %m", name_of_nbd_cmd (cmd)); > @@ -413,6 +418,14 @@ send_simple_reply (struct connection *conn, > } > } > > + if (conn->cork) { > + r = conn->cork (conn, false); > + if (r == -1) { > + nbdkit_error ("write uncork: %s: %m", name_of_nbd_cmd (cmd)); > + return connection_set_status (conn, -1); > + } > + } > +I probably also ought to document in the commit message that on any failure, I left the socket in corked mode - but I assumed that's okay, since once we fail, we're not sending anything else anyway. On the other hand, I just re-read docs: for TCP sockets, the Linux man page says a cork lasts at most 200ms, then auto-uncorks; but for gnutls, if you don't manually uncork, we could strand data in the gnutls buffers, and thereby possibly interfere with gnutls_bye, so maybe I should stick an uncork in connection_set_status() to catch all error paths. Still, it's probably nicer to have a generic uncork in the cleanup paths rather than worrying about a compiler-assisted attribute cleanup function to uncork on all early exit paths. For libnbd as it currently stands, only NBD_OPT (where I argued that handshake is not the hot path) and NBD_CMD_WRITE would benefit from cork code (but the benefit for writes may still be worth it) -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Eric Blake
2019-Jun-07  11:32 UTC
Re: [Libguestfs] [nbdkit PATCH 1/2] server: Add support for corking
On 6/6/19 5:48 PM, Eric Blake wrote:> Any time we reply to NBD_CMD_READ or NBD_CMD_BLOCK_STATUS, we end up > calling conn->send() more than once. Now that we've disabled Nagle's > algorithm, this implies that we try harder to send the small header > immediately, rather than batching it with the rest of the payload, > which causes more overhead in the amount of actual network traffic. > For interfaces that support corking (gnutls, or Linux TCP sockets), we > can give a hint that the separate send() calls should be batched into > a single network packet where practical. > > This patch just wires up support; the next one will actually use it > and provide performance measurements. > > Signed-off-by: Eric Blake <eblake@redhat.com> > --- > server/internal.h | 3 +++ > server/connections.c | 25 +++++++++++++++++++++++++ > server/crypto.c | 19 +++++++++++++++++++ > 3 files changed, 47 insertions(+) > > diff --git a/server/internal.h b/server/internal.h > index 2ee5e23..cb34323 100644 > --- a/server/internal.h > +++ b/server/internal.h > @@ -145,6 +145,8 @@ typedef int (*connection_recv_function) (struct connection *, > typedef int (*connection_send_function) (struct connection *, > const void *buf, size_t len) > __attribute__((__nonnull__ (1, 2))); > +typedef int (*connection_cork_function) (struct connection *, bool) > + __attribute__((__nonnull__ (1))); > typedef void (*connection_close_function) (struct connection *) > __attribute__((__nonnull__ (1)));After thinking more on it, I'm not sure I like this interface. Linux also has MSG_MORE, which achieves the same effect as TCP_CORK but with fewer syscalls: setsockopt(TCP_CORK, 1) send() send() setsockopt(TCP_CORK, 0) vs. send(,MSG_MORE) send(,0) We can't use send() on a non-socket (think nbdkit -s), so we'd need two different handlers for raw_send (one for anything that passes an initial fstat()/S_ISSCK and/or getsockopt(), the other for all else), but I'm thinking that using send-like semantics everywhere will be just as easy to implement the same semantics: if flag NBDKIT_MORE is set, conn->send does either send(MSG_MORE) (if MSG_MORE is available) or the pair set_cork/write; if flag NBDKIT_MORE is not set, conn->send does either send(,0) (if MSG_MORE is available) or the pair write/uncork. I'll post a v2 along those lines, and see if it makes any difference on performance numbers (avoiding syscalls is always a good thing). Using sendmsg() could avoid even more syscalls by using a vectored send, but that's more work to the callers to prepare the vector (our reply to MSG_CMD_BLOCK_STATUS would require the most work), and we'd still have to write fallback code for non-sockets to unvector. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Seemingly Similar Threads
- [nbdkit PATCH v2 0/2] Reduce network overhead with MSG_MORE/corking
- [nbdkit PATCH 2/2] server: Cork around grouped transmission send()s
- [nbdkit PATCH 1/2] server: Add support for corking
- [nbdkit PATCH] crypto: Tweak handling of SEND_MORE
- [PATCH nbdkit 0/2] server: Split out NBD protocol code from connections code.