Eric Blake
2019-Jun-19 18:18 UTC
[Libguestfs] [libnbd PATCH] states: Never block state machine inside REPLY
When processing a server reply within the REPLY subgroup, we will often hit a situation where recv() requires us to block until the next NotifyRead. But since NotifyRead is the only permitted external action while in this group, we are effectively blocking CmdIssue and NotifyWrite events from happening until the server finishes the in-progress reply, even though those events have no strict dependence on the server's progress.

The solution is similar to commit dd101bde: any time we need to pause the reply cycle, we save enough information to recall where we left off and return to the READY state; REPLY.START is then taught to check whether we are starting a fresh reply or resuming an earlier one. The state machine is still blocked until the next POLLIN, but is now in a position to also accept CmdIssue and NotifyWrite without delay.

With this patch in place, the only time is_state_processing is true is during the ISSUE_COMMAND group, when it is blocked on NotifyWrite. Thus, once handshaking is complete, we can reliably equate nbd_aio_get_direction() == DIRECTION_READ with is_ready(), and nbd_aio_get_direction() == DIRECTION_BOTH with is_processing() in the ISSUE_COMMAND substate.

Oddly enough, I am not getting any measurable performance difference with this patch applied and using examples/threaded-reads-and-writes coupled with nbdkit. My explanation is that in the common case, once a server has something to send, it is going to send the entire reply as fast as it can, rather than sending a partial reply and waiting; and our attempts to keep up to 64 commands in flight mean that our real bottleneck is not sending our next command (even if we have to wait for the server's reply to finish), but the server itself (if we are saturating the server's limit of in-flight requests, the server won't read our next request until it has finished its current reply). Still, I'm sure it is possible to construct corner cases where this shows more of an effect.
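The save-and-resume pattern the commit describes can be sketched in miniature. All names below are invented for illustration (this is not the libnbd API): a two-stage reply parser that, when its byte source runs dry, records which state to re-enter and hands control back to READY instead of blocking inside the parse.

```c
#include <stddef.h>

/* Miniature of the pattern in this patch (names invented; not the
 * libnbd API): a reply parser that never blocks.  When the byte
 * source runs dry -- our stand-in for recv() returning EAGAIN -- it
 * records the state to resume in and returns to READY; the next
 * "POLLIN" picks up exactly where it left off. */
enum state { READY, RECV_HEADER, RECV_PAYLOAD, DONE };

struct handle {
  enum state resume;    /* state to re-enter on next POLLIN; READY = fresh */
  char hdr[4], payload[4];
  char *rbuf;           /* current receive buffer */
  size_t rlen, off;     /* bytes still wanted / already stored */
};

/* Copy up to rlen bytes from *src; return 1 if we would block. */
static int
recv_into_rbuf (struct handle *h, const char **src, size_t *avail)
{
  while (h->rlen && *avail) {
    h->rbuf[h->off++] = *(*src)++;
    h->rlen--;
    (*avail)--;
  }
  return h->rlen ? 1 : 0;
}

/* Process as much of one reply as the available bytes allow. */
static enum state
crank (struct handle *h, const char **src, size_t *avail)
{
  enum state s = h->resume;

  if (s == READY) {             /* fresh reply: start on the header */
    h->rbuf = h->hdr;
    h->rlen = sizeof h->hdr;
    h->off = 0;
    s = RECV_HEADER;
  }
  for (;;) {
    switch (s) {
    case RECV_HEADER:
      if (recv_into_rbuf (h, src, avail)) {
        h->resume = s;          /* save_reply_state () analogue */
        return READY;           /* go block in READY, not here */
      }
      h->rbuf = h->payload;
      h->rlen = sizeof h->payload;
      h->off = 0;
      s = RECV_PAYLOAD;
      break;
    case RECV_PAYLOAD:
      if (recv_into_rbuf (h, src, avail)) {
        h->resume = s;
        return READY;
      }
      h->resume = READY;        /* reply complete */
      return DONE;
    default:
      return DONE;
    }
  }
}
```

While parked in READY the caller is free to issue new commands or continue a partial send, which is exactly the freedom the patch buys for CmdIssue and NotifyWrite.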
---
Applies on top of my series to add nbd_aio_pread_callback.

 generator/generator                 | 22 +++++++++---------
 generator/states-reply-simple.c     |  4 ++++
 generator/states-reply-structured.c | 32 +++++++++++++++++++++++++
 generator/states-reply.c            | 36 ++++++++++++++++++++++++++++-
 lib/internal.h                      |  1 +
 5 files changed, 83 insertions(+), 12 deletions(-)

diff --git a/generator/generator b/generator/generator
index 630260b..68a4fdf 100755
--- a/generator/generator
+++ b/generator/generator
@@ -682,14 +682,14 @@ and reply_state_machine = [
     default_state with
     name = "START";
     comment = "Prepare to receive a reply from the remote server";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
     default_state with
     name = "RECV_REPLY";
     comment = "Receive a reply from the remote server";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
@@ -723,7 +723,7 @@ and simple_reply_state_machine = [
     default_state with
     name = "RECV_READ_PAYLOAD";
     comment = "Receiving the read payload for a simple reply";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 ]
 
@@ -740,7 +740,7 @@ and structured_reply_state_machine = [
     default_state with
     name = "RECV_REMAINING";
     comment = "Receiving the remaining part of a structured reply";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
@@ -754,49 +754,49 @@ and structured_reply_state_machine = [
     default_state with
     name = "RECV_ERROR";
     comment = "Receive a structured reply error header";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
     default_state with
     name = "RECV_ERROR_MESSAGE";
     comment = "Receive a structured reply error message";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
     default_state with
     name = "RECV_ERROR_TAIL";
     comment = "Receive a structured reply error tail";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
     default_state with
     name = "RECV_OFFSET_DATA";
     comment = "Receive a structured reply offset-data header";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
     default_state with
     name = "RECV_OFFSET_DATA_DATA";
     comment = "Receive a structured reply offset-data block of data";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
     default_state with
     name = "RECV_OFFSET_HOLE";
     comment = "Receive a structured reply offset-hole header";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
     default_state with
     name = "RECV_BS_ENTRIES";
     comment = "Receive a structured reply block-status payload";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
diff --git a/generator/states-reply-simple.c b/generator/states-reply-simple.c
index ddc91ce..3b63d07 100644
--- a/generator/states-reply-simple.c
+++ b/generator/states-reply-simple.c
@@ -53,6 +53,10 @@
 
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1:
+    save_reply_state (h);
+    SET_NEXT_STATE (%.READY);
+    return 0;
   case 0:
     /* guaranteed by START */
     assert (cmd);
diff --git a/generator/states-reply-structured.c b/generator/states-reply-structured.c
index 00659de..594525e 100644
--- a/generator/states-reply-structured.c
+++ b/generator/states-reply-structured.c
@@ -38,6 +38,10 @@
 REPLY.STRUCTURED_REPLY.RECV_REMAINING:
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1:
+    save_reply_state (h);
+    SET_NEXT_STATE (%.READY);
+    return 0;
   case 0: SET_NEXT_STATE (%CHECK);
   }
   return 0;
@@ -154,6 +158,10 @@
 
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1:
+    save_reply_state (h);
+    SET_NEXT_STATE (%.READY);
+    return 0;
   case 0:
     length = be32toh (h->sbuf.sr.structured_reply.length);
     msglen = be16toh (h->sbuf.sr.payload.error.error.len);
@@ -176,6 +184,10 @@
 
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1:
+    save_reply_state (h);
+    SET_NEXT_STATE (%.READY);
+    return 0;
   case 0:
     length = be32toh (h->sbuf.sr.structured_reply.length);
     msglen = be16toh (h->sbuf.sr.payload.error.error.len);
@@ -219,6 +231,10 @@
 
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1:
+    save_reply_state (h);
+    SET_NEXT_STATE (%.READY);
+    return 0;
   case 0:
     error = be32toh (h->sbuf.sr.payload.error.error.error);
     type = be16toh (h->sbuf.sr.structured_reply.type);
@@ -268,6 +284,10 @@
 
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1:
+    save_reply_state (h);
+    SET_NEXT_STATE (%.READY);
+    return 0;
   case 0:
     length = be32toh (h->sbuf.sr.structured_reply.length);
     offset = be64toh (h->sbuf.sr.payload.offset_data.offset);
@@ -324,6 +344,10 @@
 
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1:
+    save_reply_state (h);
+    SET_NEXT_STATE (%.READY);
+    return 0;
   case 0:
     length = be32toh (h->sbuf.sr.structured_reply.length);
     offset = be64toh (h->sbuf.sr.payload.offset_data.offset);
@@ -349,6 +373,10 @@
 
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1:
+    save_reply_state (h);
+    SET_NEXT_STATE (%.READY);
+    return 0;
   case 0:
     offset = be64toh (h->sbuf.sr.payload.offset_hole.offset);
     length = be32toh (h->sbuf.sr.payload.offset_hole.length);
@@ -410,6 +438,10 @@
 
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1:
+    save_reply_state (h);
+    SET_NEXT_STATE (%.READY);
+    return 0;
   case 0:
     length = be32toh (h->sbuf.sr.structured_reply.length);
 
diff --git a/generator/states-reply.c b/generator/states-reply.c
index 54f98c5..99e54f6 100644
--- a/generator/states-reply.c
+++ b/generator/states-reply.c
@@ -16,10 +16,43 @@
  * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
  */
 
-/* State machine for receiving reply messages from the server. */
+#include <assert.h>
+
+/* State machine for receiving reply messages from the server.
+ *
+ * Note that we never block while in this sub-group.  If there is
+ * insufficient data to finish parsing a reply, requiring us to block
+ * until POLLIN, we instead track where in the state machine we left
+ * off, then return to READY to actually block.  Then, on entry to
+ * REPLY.START, we can tell if this is the start of a new reply (rlen
+ * is 0, stay put), a continuation of the preamble (reply_cmd is NULL,
+ * resume with RECV_REPLY), or a continuation from any other location
+ * (reply_cmd contains the state to jump to).
+ */
+
+static void
+save_reply_state (struct nbd_handle *h)
+{
+  assert (h->reply_cmd);
+  assert (h->rlen);
+  h->reply_cmd->state = get_next_state (h);
+}
+
+/*----- End of prologue. -----*/
 
 /* STATE MACHINE */ {
  REPLY.START:
+  /* If rlen is non-zero, we are resuming an earlier reply cycle. */
+  if (h->rlen > 0) {
+    if (h->reply_cmd) {
+      assert (nbd_internal_is_state_processing (h->reply_cmd->state));
+      SET_NEXT_STATE (h->reply_cmd->state);
+    }
+    else
+      SET_NEXT_STATE (%RECV_REPLY);
+    return 0;
+  }
+
   /* This state is entered when a read notification is received in the
    * READY state.  Therefore we know the socket is readable here.
    * Reading a zero length now would indicate that the socket has been
@@ -66,6 +99,7 @@
 REPLY.RECV_REPLY:
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1: SET_NEXT_STATE (%.READY); return 0;
   case 0: SET_NEXT_STATE (%CHECK_SIMPLE_OR_STRUCTURED_REPLY);
   }
   return 0;
diff --git a/lib/internal.h b/lib/internal.h
index a1e27df..662ff7a 100644
--- a/lib/internal.h
+++ b/lib/internal.h
@@ -253,6 +253,7 @@ struct command_in_flight {
   uint32_t count;
   void *data;           /* Buffer for read/write */
   struct command_cb cb;
+  enum state state;     /* State to resume with on next POLLIN */
   bool data_seen;       /* For read, true if at least one data chunk seen */
   uint32_t error;       /* Local errno value */
 };
-- 
2.20.1
Eric Blake
2019-Jun-20 02:11 UTC
[Libguestfs] [libnbd PATCH 2/1] states: Avoid wasted send() when REPLY interrupts request
When we are blocked waiting for POLLOUT during a request, and happen to receive notice of POLLIN instead, we know that the work done in response to POLLIN will be non-blocking (it returns to %.READY as soon as it would block, which in turn jumps right back into ISSUE_COMMAND because we have a pending request not fully sent yet). Since the jaunt through REPLY was non-blocking, it is unlikely that the POLLOUT situation has changed in the meantime, so if we use SET_NEXT_STATE() to step back into SEND_REQUEST, our send() call will likely fail with EAGAIN, once again blocking us until our next POLLOUT. Although the wasted syscall is not on the hot path (after all, we can't progress until data arrives from the server), it's slightly cleaner if we instead declare that we are already blocked.

I tested with:
  $ nbdkit -U - null 16M --run 'examples/threaded-reads-and-writes $unixsocket'

There was no real noticeable difference in timing, but I did observe that pre-patch the run encountered 168825 pre-emptions and 136976 send() EAGAIN failures (81%), while post-patch the run encountered 166066 pre-emptions and only 552 EAGAIN failures (0.3%).
---
 generator/states-issue-command.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/generator/states-issue-command.c b/generator/states-issue-command.c
index 821b28a..f746f80 100644
--- a/generator/states-issue-command.c
+++ b/generator/states-issue-command.c
@@ -25,12 +25,17 @@
   assert (h->cmds_to_issue != NULL);
   cmd = h->cmds_to_issue;
 
-  /* Were we interrupted by reading a reply to an earlier command? */
+  /* Were we interrupted by reading a reply to an earlier command?  If
+   * so, we can only get back here after a non-blocking jaunt through
+   * the REPLY engine, which means we are unlikely to be unblocked for
+   * writes yet; we want to advance back to the correct state but
+   * without trying a send_from_wbuf that will likely return 1.
+   */
   if (h->wlen) {
     if (h->in_write_payload)
-      SET_NEXT_STATE(%SEND_WRITE_PAYLOAD);
+      *next_state = %SEND_WRITE_PAYLOAD;
     else
-      SET_NEXT_STATE(%SEND_REQUEST);
+      *next_state = %SEND_REQUEST;
     return 0;
   }
-- 
2.20.1
Richard W.M. Jones
2019-Jun-25 09:03 UTC
Re: [Libguestfs] [libnbd PATCH] states: Never block state machine inside REPLY
On Wed, Jun 19, 2019 at 01:18:01PM -0500, Eric Blake wrote:
> Oddly enough, I am not getting any measurable performance difference
> with this patch applied and using examples/threaded-reads-and-writes
> coupled with nbdkit.  My explanation is that in the common case, once
> a server has something to send, it is going to send the entire reply
> as fast as it can, rather than sending a partial reply and waiting;
> and our attempts to keep up to 64 commands in flight mean that our
> real bottleneck is not sending our next command (even if we have to
> wait for the server's reply to finish), but at the server (if we are
> saturating the server's amount of in-flight requests, the server won't
> read our next request until it has finished its current reply).
> Still, I'm sure that it is possible to construct corner cases where
> this shows more of an effect.

What I'm going to say is I think stating the obvious, but my intuition is that there are going to be two types of load. In the first and easiest case you want to read or write sequentially over the whole or a large section of the image (think: qemu-img convert). In this case you know well in advance what parts of the image you want to read/write and can keep the in-flight queue full at all times. This is what threaded-reads-and-writes actually tests.

The harder case is random access (think: qemu with a database guest). There is a short queue, for example because of data dependencies between the requests, so the issued-commands buffer is always short. Also there may be server latency because of the random access pattern. This kind of workload should show the benefit of this commit, but we don't really have a test for this kind of workload.

[...]

> /* STATE MACHINE */ {
>  REPLY.START:
> +  /* If rlen is non-zero, we are resuming an earlier reply cycle. */
> +  if (h->rlen > 0) {
> +    if (h->reply_cmd) {
> +      assert (nbd_internal_is_state_processing (h->reply_cmd->state));
> +      SET_NEXT_STATE (h->reply_cmd->state);

This is essentially the "stack of states" idea that I had early on, but with a maximum stack depth of 1.

I originally rejected the idea then because it means that the generator would no longer have a complete view of all state transitions in the state machine, and thus wouldn't be able to enforce invariants such as not jumping to a state in the middle of a state group. (In fact with this patch that is now the case -- I thought the generator would give an error about this, but maybe my test is wrong.)

Nevertheless, given that we need to do this, maybe it's better to drop the idea that the generator needs to have a complete view.

In my original plan we would have had "push state" and "pop state" operations (pop state is a general jump to "any state", which is why it breaks the assumption in the generator).

[...]

> +++ b/lib/internal.h
> @@ -253,6 +253,7 @@ struct command_in_flight {
>    uint32_t count;
>    void *data;           /* Buffer for read/write */
>    struct command_cb cb;
> +  enum state state;     /* State to resume with on next POLLIN */
>    bool data_seen;       /* For read, true if at least one data chunk seen */
>    uint32_t error;       /* Local errno value */
>  };

The patch seems reasonable. Does this obviate any need to split the state machine?

ACK

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any software inside the virtual machine. Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
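The push/pop operations Rich describes can be sketched as follows, with the depth-1 cap that this patch effectively adopts. Everything here is invented for illustration: push saves where to resume, while pop is a jump back to "any state" (or a fallback when nothing was saved), which is exactly what defeats the generator's whole-machine view of transitions.

```c
/* Sketch of a bounded "stack of states" (hypothetical; not libnbd
 * code).  With MAX_DEPTH == 1 this degenerates into the single saved
 * state the patch stores in h->reply_cmd->state. */
#define MAX_DEPTH 1

struct state_stack {
  int states[MAX_DEPTH];
  int depth;
};

/* Save a state to resume later; fails when the stack is full. */
static int
push_state (struct state_stack *st, int s)
{
  if (st->depth == MAX_DEPTH)
    return -1;                  /* depth 1: at most one saved state */
  st->states[st->depth++] = s;
  return 0;
}

/* Resume the saved state, or take the fallback if nothing is saved
 * (the "fresh reply" case in REPLY.START). */
static int
pop_state (struct state_stack *st, int fallback)
{
  if (st->depth == 0)
    return fallback;
  return st->states[--st->depth];
}
```

The depth-1 restriction is what keeps the scheme tractable: only one reply can be mid-parse at a time, so only one resume point ever needs saving.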
Richard W.M. Jones
2019-Jun-25 09:06 UTC
Re: [Libguestfs] [libnbd PATCH 2/1] states: Avoid wasted send() when REPLY interrupts request
On Wed, Jun 19, 2019 at 09:11:52PM -0500, Eric Blake wrote:
> When we are blocked waiting for POLLOUT during a request, and happen
> to receive notice of POLLIN instead, we know that the work done in
> response to POLLIN will be non-blocking (it returns to %.READY as soon
> as it would block, which in turn jumps right back into ISSUE_COMMAND
> because we have a pending request not fully sent yet).  Since the
> jaunt through REPLY was non-blocking, it is unlikely that the POLLOUT
> situation has changed in the meantime, so if we use SET_NEXT_STATE()
> to step back into SEND_REQUEST, our send() call will likely fail with
> EAGAIN, once again blocking us until our next POLLOUT.  Although the
> wasted syscall is not on the hot path (after all, we can't progress
> until data arrives from the server), it's slightly cleaner if we
> instead declare that we are already blocked.
>
> I tested with:
>   $ nbdkit -U - null 16M --run 'examples/threaded-reads-and-writes $unixsocket'
>
> There was no real noticeable difference in timing, but I did observe
> that pre-patch the run encountered 168825 pre-emptions and 136976
> send() EAGAIN failures (81%), while post-patch the run encountered
> 166066 pre-emptions and 552 EAGAIN failures (0.3%).
>
> [...]
>
>    if (h->wlen) {
>      if (h->in_write_payload)
> -      SET_NEXT_STATE(%SEND_WRITE_PAYLOAD);
> +      *next_state = %SEND_WRITE_PAYLOAD;
>      else
> -      SET_NEXT_STATE(%SEND_REQUEST);
> +      *next_state = %SEND_REQUEST;

It would be nice to do this without fiddling with what is essentially an internal detail of the generated code. Could we add another macro, something like "SET_NEXT_STATE_AND_BLOCK"?

On the other hand, if it's not on the hot path, maybe we shouldn't do this at all?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
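One possible shape for the suggested macro is sketched below. All names are hypothetical (the real generator's plumbing may differ): both macros record the next state, but only the plain one asks the engine to run it immediately, so the blocking variant skips the send() attempt that would just fail with EAGAIN.

```c
#include <stdbool.h>

/* Hypothetical sketch of SET_NEXT_STATE_AND_BLOCK (not the actual
 * generated code): pair the next-state assignment with a flag telling
 * the engine whether to re-enter the state before the next event. */
enum state { STATE_READY, STATE_SEND_REQUEST, STATE_SEND_WRITE_PAYLOAD };

#define SET_NEXT_STATE(s)           (*next_state = (s), *blocked = false)
#define SET_NEXT_STATE_AND_BLOCK(s) (*next_state = (s), *blocked = true)

/* ISSUE_COMMAND.START analogue: if a reply interrupted a partial
 * send, return to the right send state but declare we are already
 * blocked, deferring the send() until the next POLLOUT. */
static int
issue_command_start (enum state *next_state, bool *blocked,
                     bool interrupted, bool in_write_payload)
{
  if (interrupted) {
    if (in_write_payload)
      SET_NEXT_STATE_AND_BLOCK (STATE_SEND_WRITE_PAYLOAD);
    else
      SET_NEXT_STATE_AND_BLOCK (STATE_SEND_REQUEST);
    return 0;
  }
  SET_NEXT_STATE (STATE_SEND_REQUEST);  /* fresh command: try to send now */
  return 0;
}
```

Spelled this way, the state handler never touches the raw `*next_state` assignment directly, keeping the generated-code internals hidden behind the macro pair.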
Eric Blake
2019-Jun-25 14:37 UTC
Re: [Libguestfs] [libnbd PATCH] states: Never block state machine inside REPLY
On 6/25/19 4:03 AM, Richard W.M. Jones wrote:
> On Wed, Jun 19, 2019 at 01:18:01PM -0500, Eric Blake wrote:
>> Oddly enough, I am not getting any measurable performance difference
>> with this patch applied and using examples/threaded-reads-and-writes
>> coupled with nbdkit.  [...]
>
> What I'm going to say is I think stating the obvious, but my intuition
> is there are going to be two types of load.  In the first and easiest
> case you want to read or write sequentially over the whole or a large
> section of the image (think: qemu-img convert).  In this case you know
> well in advance what parts of the image you want to read/write and can
> keep the in flight queue full at all times.  This is what
> threaded-reads-and-writes actually tests.
>
> The harder case is random access (think: qemu with a database guest).
> There is a short queue, for example because of data dependencies
> between the requests, so the issued commands buffer is always short.
> Also there may be server latency because of the random access pattern.
> This kind of workload should show the benefit of this commit, but we
> don't really have a test for this kind of workload.

Yes, that's a fair assessment.

>> /* STATE MACHINE */ {
>>  REPLY.START:
>> +  /* If rlen is non-zero, we are resuming an earlier reply cycle. */
>> +  if (h->rlen > 0) {
>> +    if (h->reply_cmd) {
>> +      assert (nbd_internal_is_state_processing (h->reply_cmd->state));
>> +      SET_NEXT_STATE (h->reply_cmd->state);
>
> This is essentially the "stack of states" idea that I had early on,
> but with a maximum stack depth of 1.
>
> [...]
>
> In my original plan we would have had "push state" and "pop state"
> operations (pop state is a general jump to "any state", which is why
> it breaks the assumption in the generator).

In addition to only stacking a queue of at most 1 state, this also limits the stacking to occur during the REPLY sub-state. But I was also surprised that the generator did not balk at me setting the state to the contents of a variable rather than to the usual situation of setting it to an all-caps %FOO magic identifier.

>> +++ b/lib/internal.h
>> @@ -253,6 +253,7 @@ struct command_in_flight {
>>    uint32_t count;
>>    void *data;           /* Buffer for read/write */
>>    struct command_cb cb;
>> +  enum state state;     /* State to resume with on next POLLIN */
>>    bool data_seen;       /* For read, true if at least one data chunk seen */
>>    uint32_t error;       /* Local errno value */
>>  };
>
> The patch seems reasonable.  Does this obviate any need to split the
> state machine?

Yes, as far as I can see, this is all the more we ever have to worry about. It is possible to determine the entirety of wstate (either we're blocked under the ISSUE_COMMAND sub-state, or h->request and h->in_write_payload are sufficient to see what remains on a current request), and the entirety of rstate (either we're under the REPLY sub-state, or h->reply_cmd coupled with h->rlen and h->reply_cmd->state determine what remains on a current reply).

> ACK

Okay, I'll go ahead and push this one today.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org