Eric Blake
2019-Jun-19 18:18 UTC
[Libguestfs] [libnbd PATCH] states: Never block state machine inside REPLY
When processing a server reply within the REPLY subgroup, we will often hit a situation where recv() requires us to block until the next NotifyRead. But since NotifyRead is the only permitted external action while in this group, we are effectively blocking CmdIssue and NotifyWrite events from happening until the server finishes the in-progress reply, even though those events have no strict dependence on the server's progress.

The solution is similar to commit dd101bde: any time we need to pause the reply cycle, we save enough information to recall where we left off and return to the READY state; REPLY.START is then taught to check whether we are starting a fresh reply or resuming an earlier one. The state machine is still blocked until the next POLLIN, but is now in a position to also accept CmdIssue and NotifyWrite without delay.

With this patch in place, the only time is_state_processing is true is during the ISSUE_COMMAND group, when it is blocked on NotifyWrite. Thus, once handshaking is complete, we can reliably equate nbd_aio_get_direction() == DIRECTION_READ with is_ready(), and nbd_aio_get_direction() == DIRECTION_BOTH with is_processing() in the ISSUE_COMMAND substate.

Oddly enough, I am not getting any measurable performance difference with this patch applied and using examples/threaded-reads-and-writes coupled with nbdkit. My explanation is that in the common case, once a server has something to send, it is going to send the entire reply as fast as it can, rather than sending a partial reply and waiting; and our attempts to keep up to 64 commands in flight mean that our real bottleneck is not sending our next command (even if we have to wait for the server's reply to finish), but the server itself (if we are saturating the server's limit of in-flight requests, the server won't read our next request until it has finished its current reply). Still, I'm sure it is possible to construct corner cases where this shows more of an effect.
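The save-and-resume pattern the commit describes can be sketched in miniature. All names below are invented for illustration (this is not the libnbd API): a two-stage reply parser that, when its byte source runs dry, records which state to re-enter and hands control back to READY instead of blocking inside the parse.

```c
#include <stddef.h>

/* Miniature of the pattern in this patch (names invented; not the
 * libnbd API): a reply parser that never blocks.  When the byte
 * source runs dry -- our stand-in for recv() returning EAGAIN -- it
 * records the state to resume in and returns to READY; the next
 * "POLLIN" picks up exactly where it left off. */
enum state { READY, RECV_HEADER, RECV_PAYLOAD, DONE };

struct handle {
  enum state resume;    /* state to re-enter on next POLLIN; READY = fresh */
  char hdr[4], payload[4];
  char *rbuf;           /* current receive buffer */
  size_t rlen, off;     /* bytes still wanted / already stored */
};

/* Copy up to rlen bytes from *src; return 1 if we would block. */
static int
recv_into_rbuf (struct handle *h, const char **src, size_t *avail)
{
  while (h->rlen && *avail) {
    h->rbuf[h->off++] = *(*src)++;
    h->rlen--;
    (*avail)--;
  }
  return h->rlen ? 1 : 0;
}

/* Process as much of one reply as the available bytes allow. */
static enum state
crank (struct handle *h, const char **src, size_t *avail)
{
  enum state s = h->resume;

  if (s == READY) {             /* fresh reply: start on the header */
    h->rbuf = h->hdr;
    h->rlen = sizeof h->hdr;
    h->off = 0;
    s = RECV_HEADER;
  }
  for (;;) {
    switch (s) {
    case RECV_HEADER:
      if (recv_into_rbuf (h, src, avail)) {
        h->resume = s;          /* save_reply_state () analogue */
        return READY;           /* go block in READY, not here */
      }
      h->rbuf = h->payload;
      h->rlen = sizeof h->payload;
      h->off = 0;
      s = RECV_PAYLOAD;
      break;
    case RECV_PAYLOAD:
      if (recv_into_rbuf (h, src, avail)) {
        h->resume = s;
        return READY;
      }
      h->resume = READY;        /* reply complete */
      return DONE;
    default:
      return DONE;
    }
  }
}
```

While parked in READY the caller is free to issue new commands or continue a partial send, which is exactly the freedom the patch buys for CmdIssue and NotifyWrite.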
---
Applies on top of my series to add nbd_aio_pread_callback.

 generator/generator                 | 22 +++++++++---------
 generator/states-reply-simple.c     |  4 ++++
 generator/states-reply-structured.c | 32 +++++++++++++++++++++++++
 generator/states-reply.c            | 36 ++++++++++++++++++++++++++++-
 lib/internal.h                      |  1 +
 5 files changed, 83 insertions(+), 12 deletions(-)

diff --git a/generator/generator b/generator/generator
index 630260b..68a4fdf 100755
--- a/generator/generator
+++ b/generator/generator
@@ -682,14 +682,14 @@ and reply_state_machine = [
     default_state with
     name = "START";
     comment = "Prepare to receive a reply from the remote server";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
     default_state with
     name = "RECV_REPLY";
     comment = "Receive a reply from the remote server";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
@@ -723,7 +723,7 @@ and simple_reply_state_machine = [
     default_state with
     name = "RECV_READ_PAYLOAD";
     comment = "Receiving the read payload for a simple reply";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 ]
 
@@ -740,7 +740,7 @@ and structured_reply_state_machine = [
     default_state with
     name = "RECV_REMAINING";
     comment = "Receiving the remaining part of a structured reply";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
@@ -754,49 +754,49 @@ and structured_reply_state_machine = [
     default_state with
     name = "RECV_ERROR";
     comment = "Receive a structured reply error header";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
     default_state with
     name = "RECV_ERROR_MESSAGE";
     comment = "Receive a structured reply error message";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
     default_state with
     name = "RECV_ERROR_TAIL";
     comment = "Receive a structured reply error tail";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
     default_state with
     name = "RECV_OFFSET_DATA";
     comment = "Receive a structured reply offset-data header";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
     default_state with
     name = "RECV_OFFSET_DATA_DATA";
     comment = "Receive a structured reply offset-data block of data";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
     default_state with
     name = "RECV_OFFSET_HOLE";
     comment = "Receive a structured reply offset-hole header";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
     default_state with
     name = "RECV_BS_ENTRIES";
     comment = "Receive a structured reply block-status payload";
-    external_events = [ NotifyRead, "" ];
+    external_events = [];
   };
 
   State {
diff --git a/generator/states-reply-simple.c b/generator/states-reply-simple.c
index ddc91ce..3b63d07 100644
--- a/generator/states-reply-simple.c
+++ b/generator/states-reply-simple.c
@@ -53,6 +53,10 @@
 
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1:
+    save_reply_state (h);
+    SET_NEXT_STATE (%.READY);
+    return 0;
   case 0:
     /* guaranteed by START */
     assert (cmd);
diff --git a/generator/states-reply-structured.c b/generator/states-reply-structured.c
index 00659de..594525e 100644
--- a/generator/states-reply-structured.c
+++ b/generator/states-reply-structured.c
@@ -38,6 +38,10 @@
 REPLY.STRUCTURED_REPLY.RECV_REMAINING:
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1:
+    save_reply_state (h);
+    SET_NEXT_STATE (%.READY);
+    return 0;
   case 0: SET_NEXT_STATE (%CHECK);
   }
   return 0;
@@ -154,6 +158,10 @@
 
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1:
+    save_reply_state (h);
+    SET_NEXT_STATE (%.READY);
+    return 0;
   case 0:
     length = be32toh (h->sbuf.sr.structured_reply.length);
     msglen = be16toh (h->sbuf.sr.payload.error.error.len);
@@ -176,6 +184,10 @@
 
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1:
+    save_reply_state (h);
+    SET_NEXT_STATE (%.READY);
+    return 0;
   case 0:
     length = be32toh (h->sbuf.sr.structured_reply.length);
     msglen = be16toh (h->sbuf.sr.payload.error.error.len);
@@ -219,6 +231,10 @@
 
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1:
+    save_reply_state (h);
+    SET_NEXT_STATE (%.READY);
+    return 0;
   case 0:
     error = be32toh (h->sbuf.sr.payload.error.error.error);
     type = be16toh (h->sbuf.sr.structured_reply.type);
@@ -268,6 +284,10 @@
 
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1:
+    save_reply_state (h);
+    SET_NEXT_STATE (%.READY);
+    return 0;
   case 0:
     length = be32toh (h->sbuf.sr.structured_reply.length);
     offset = be64toh (h->sbuf.sr.payload.offset_data.offset);
@@ -324,6 +344,10 @@
 
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1:
+    save_reply_state (h);
+    SET_NEXT_STATE (%.READY);
+    return 0;
   case 0:
     length = be32toh (h->sbuf.sr.structured_reply.length);
     offset = be64toh (h->sbuf.sr.payload.offset_data.offset);
@@ -349,6 +373,10 @@
 
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1:
+    save_reply_state (h);
+    SET_NEXT_STATE (%.READY);
+    return 0;
   case 0:
     offset = be64toh (h->sbuf.sr.payload.offset_hole.offset);
     length = be32toh (h->sbuf.sr.payload.offset_hole.length);
@@ -410,6 +438,10 @@
 
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1:
+    save_reply_state (h);
+    SET_NEXT_STATE (%.READY);
+    return 0;
   case 0:
     length = be32toh (h->sbuf.sr.structured_reply.length);
 
diff --git a/generator/states-reply.c b/generator/states-reply.c
index 54f98c5..99e54f6 100644
--- a/generator/states-reply.c
+++ b/generator/states-reply.c
@@ -16,10 +16,43 @@
  * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
  */
 
-/* State machine for receiving reply messages from the server. */
+#include <assert.h>
+
+/* State machine for receiving reply messages from the server.
+ *
+ * Note that we never block while in this sub-group.  If there is
+ * insufficient data to finish parsing a reply, requiring us to block
+ * until POLLIN, we instead track where in the state machine we left
+ * off, then return to READY to actually block.  Then, on entry to
+ * REPLY.START, we can tell if this is the start of a new reply (rlen
+ * is 0, stay put), a continuation of the preamble (reply_cmd is NULL,
+ * resume with RECV_REPLY), or a continuation from any other location
+ * (reply_cmd contains the state to jump to).
+ */
+
+static void
+save_reply_state (struct nbd_handle *h)
+{
+  assert (h->reply_cmd);
+  assert (h->rlen);
+  h->reply_cmd->state = get_next_state (h);
+}
+
+/*----- End of prologue. -----*/
 
 /* STATE MACHINE */ {
  REPLY.START:
+  /* If rlen is non-zero, we are resuming an earlier reply cycle. */
+  if (h->rlen > 0) {
+    if (h->reply_cmd) {
+      assert (nbd_internal_is_state_processing (h->reply_cmd->state));
+      SET_NEXT_STATE (h->reply_cmd->state);
+    }
+    else
+      SET_NEXT_STATE (%RECV_REPLY);
+    return 0;
+  }
+
   /* This state is entered when a read notification is received in the
    * READY state.  Therefore we know the socket is readable here.
    * Reading a zero length now would indicate that the socket has been
@@ -66,6 +99,7 @@
 REPLY.RECV_REPLY:
   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return -1;
+  case 1: SET_NEXT_STATE (%.READY); return 0;
   case 0: SET_NEXT_STATE (%CHECK_SIMPLE_OR_STRUCTURED_REPLY);
   }
   return 0;
diff --git a/lib/internal.h b/lib/internal.h
index a1e27df..662ff7a 100644
--- a/lib/internal.h
+++ b/lib/internal.h
@@ -253,6 +253,7 @@ struct command_in_flight {
   uint32_t count;
   void *data;           /* Buffer for read/write */
   struct command_cb cb;
+  enum state state;     /* State to resume with on next POLLIN */
   bool data_seen;       /* For read, true if at least one data chunk seen */
   uint32_t error;       /* Local errno value */
 };
-- 
2.20.1
Eric Blake
2019-Jun-20 02:11 UTC
[Libguestfs] [libnbd PATCH 2/1] states: Avoid wasted send() when REPLY interrupts request
When we are blocked waiting for POLLOUT during a request, and happen to receive notice of POLLIN instead, we know that the work done in response to POLLIN will be non-blocking (it returns to %.READY as soon as it would block, which in turn jumps right back into ISSUE_COMMAND because we have a pending request not fully sent yet). Since the jaunt through REPLY was non-blocking, it is unlikely that the POLLOUT situation has changed in the meantime, so if we use SET_NEXT_STATE() to step back into SEND_REQUEST, our send() call will likely fail with EAGAIN, once again blocking us until our next POLLOUT. Although the wasted syscall is not on the hot path (after all, we can't progress until data arrives from the server), it's slightly cleaner if we instead declare that we are already blocked.

I tested with:
  $ nbdkit -U - null 16M --run 'examples/threaded-reads-and-writes $unixsocket'

There was no real noticeable difference in timing, but I did observe that pre-patch the run encountered 168825 pre-emptions and 136976 send() EAGAIN failures (81%), while post-patch the run encountered 166066 pre-emptions and only 552 EAGAIN failures (0.3%).
---
 generator/states-issue-command.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/generator/states-issue-command.c b/generator/states-issue-command.c
index 821b28a..f746f80 100644
--- a/generator/states-issue-command.c
+++ b/generator/states-issue-command.c
@@ -25,12 +25,17 @@
   assert (h->cmds_to_issue != NULL);
   cmd = h->cmds_to_issue;
 
-  /* Were we interrupted by reading a reply to an earlier command? */
+  /* Were we interrupted by reading a reply to an earlier command?  If
+   * so, we can only get back here after a non-blocking jaunt through
+   * the REPLY engine, which means we are unlikely to be unblocked for
+   * writes yet; we want to advance back to the correct state but
+   * without trying a send_from_wbuf that will likely return 1.
+   */
   if (h->wlen) {
     if (h->in_write_payload)
-      SET_NEXT_STATE(%SEND_WRITE_PAYLOAD);
+      *next_state = %SEND_WRITE_PAYLOAD;
     else
-      SET_NEXT_STATE(%SEND_REQUEST);
+      *next_state = %SEND_REQUEST;
     return 0;
   }
-- 
2.20.1
Richard W.M. Jones
2019-Jun-25 09:03 UTC
Re: [Libguestfs] [libnbd PATCH] states: Never block state machine inside REPLY
On Wed, Jun 19, 2019 at 01:18:01PM -0500, Eric Blake wrote:
> Oddly enough, I am not getting any measurable performance difference
> with this patch applied and using examples/threaded-reads-and-writes
> coupled with nbdkit.  My explanation is that in the common case, once
> a server has something to send, it is going to send the entire reply
> as fast as it can, rather than sending a partial reply and waiting;
> and our attempts to keep up to 64 commands in flight mean that our
> real bottleneck is not sending our next command (even if we have to
> wait for the server's reply to finish), but at the server (if we are
> saturating the server's amount of in-flight requests, the server won't
> read our next request until it has finished its current reply).
> Still, I'm sure that it is possible to construct corner cases where
> this shows more of an effect.

What I'm going to say is I think stating the obvious, but my intuition is that there are going to be two types of load. In the first and easiest case you want to read or write sequentially over the whole or a large section of the image (think: qemu-img convert). In this case you know well in advance what parts of the image you want to read/write and can keep the in-flight queue full at all times. This is what threaded-reads-and-writes actually tests.

The harder case is random access (think: qemu with a database guest). There is a short queue, for example because of data dependencies between the requests, so the issued-commands buffer is always short. Also there may be server latency because of the random access pattern. This kind of workload should show the benefit of this commit, but we don't really have a test for this kind of workload.

[...]

> /* STATE MACHINE */ {
>  REPLY.START:
> +  /* If rlen is non-zero, we are resuming an earlier reply cycle. */
> +  if (h->rlen > 0) {
> +    if (h->reply_cmd) {
> +      assert (nbd_internal_is_state_processing (h->reply_cmd->state));
> +      SET_NEXT_STATE (h->reply_cmd->state);

This is essentially the "stack of states" idea that I had early on, but with a maximum stack depth of 1.

I originally rejected the idea then because it means that the generator would no longer have a complete view of all state transitions in the state machine, and thus wouldn't be able to enforce invariants such as not jumping to a state in the middle of a state group. (In fact with this patch that is now the case -- I thought the generator would give an error about this, but maybe my test is wrong.)

Nevertheless, given that we need to do this, maybe it's better to drop the idea that the generator needs to have a complete view.

In my original plan we would have had "push state" and "pop state" operations (pop state is a general jump to "any state", which is why it breaks the assumption in the generator).

[...]

> +++ b/lib/internal.h
> @@ -253,6 +253,7 @@ struct command_in_flight {
>    uint32_t count;
>    void *data;           /* Buffer for read/write */
>    struct command_cb cb;
> +  enum state state;     /* State to resume with on next POLLIN */
>    bool data_seen;       /* For read, true if at least one data chunk seen */
>    uint32_t error;       /* Local errno value */
>  };

The patch seems reasonable. Does this obviate any need to split the state machine?

ACK

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any software inside the virtual machine. Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
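The push/pop operations Rich describes can be sketched as follows, with the depth-1 cap that this patch effectively adopts. Everything here is invented for illustration: push saves where to resume, while pop is a jump back to "any state" (or a fallback when nothing was saved), which is exactly what defeats the generator's whole-machine view of transitions.

```c
/* Sketch of a bounded "stack of states" (hypothetical; not libnbd
 * code).  With MAX_DEPTH == 1 this degenerates into the single saved
 * state the patch stores in h->reply_cmd->state. */
#define MAX_DEPTH 1

struct state_stack {
  int states[MAX_DEPTH];
  int depth;
};

/* Save a state to resume later; fails when the stack is full. */
static int
push_state (struct state_stack *st, int s)
{
  if (st->depth == MAX_DEPTH)
    return -1;                  /* depth 1: at most one saved state */
  st->states[st->depth++] = s;
  return 0;
}

/* Resume the saved state, or take the fallback if nothing is saved
 * (the "fresh reply" case in REPLY.START). */
static int
pop_state (struct state_stack *st, int fallback)
{
  if (st->depth == 0)
    return fallback;
  return st->states[--st->depth];
}
```

The depth-1 restriction is what keeps the scheme tractable: only one reply can be mid-parse at a time, so only one resume point ever needs saving.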
Richard W.M. Jones
2019-Jun-25 09:06 UTC
Re: [Libguestfs] [libnbd PATCH 2/1] states: Avoid wasted send() when REPLY interrupts request
On Wed, Jun 19, 2019 at 09:11:52PM -0500, Eric Blake wrote:
> When we are blocked waiting for POLLOUT during a request, and happen
> to receive notice of POLLIN instead, we know that the work done in
> response to POLLIN will be non-blocking (it returns to %.READY as soon
> as it would block, which in turn jumps right back into ISSUE_COMMAND
> because we have a pending request not fully sent yet).  Since the
> jaunt through REPLY was non-blocking, it is unlikely that the POLLOUT
> situation has changed in the meantime, so if we use SET_NEXT_STATE()
> to step back into SEND_REQUEST, our send() call will likely fail with
> EAGAIN, once again blocking us until our next POLLOUT.  Although the
> wasted syscall is not on the hot path (after all, we can't progress
> until data arrives from the server), it's slightly cleaner if we
> instead declare that we are already blocked.
>
> I tested with:
>   $ nbdkit -U - null 16M --run 'examples/threaded-reads-and-writes $unixsocket'
>
> There was no real noticeable difference in timing, but I did observe
> that pre-patch the run encountered 168825 pre-emptions and 136976
> send() EAGAIN failures (81%), while post-patch the run encountered
> 166066 pre-emptions and 552 EAGAIN failures (0.3%).
>
> [...]
>
>    if (h->wlen) {
>      if (h->in_write_payload)
> -      SET_NEXT_STATE(%SEND_WRITE_PAYLOAD);
> +      *next_state = %SEND_WRITE_PAYLOAD;
>      else
> -      SET_NEXT_STATE(%SEND_REQUEST);
> +      *next_state = %SEND_REQUEST;

It would be nice to do this without fiddling with what is essentially an internal detail of the generated code. Could we add another macro, something like "SET_NEXT_STATE_AND_BLOCK"?

On the other hand, if it's not on the hot path, maybe we shouldn't do this at all?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
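One possible shape for the suggested macro is sketched below. All names are hypothetical (the real generator's plumbing may differ): both macros record the next state, but only the plain one asks the engine to run it immediately, so the blocking variant skips the send() attempt that would just fail with EAGAIN.

```c
#include <stdbool.h>

/* Hypothetical sketch of SET_NEXT_STATE_AND_BLOCK (not the actual
 * generated code): pair the next-state assignment with a flag telling
 * the engine whether to re-enter the state before the next event. */
enum state { STATE_READY, STATE_SEND_REQUEST, STATE_SEND_WRITE_PAYLOAD };

#define SET_NEXT_STATE(s)           (*next_state = (s), *blocked = false)
#define SET_NEXT_STATE_AND_BLOCK(s) (*next_state = (s), *blocked = true)

/* ISSUE_COMMAND.START analogue: if a reply interrupted a partial
 * send, return to the right send state but declare we are already
 * blocked, deferring the send() until the next POLLOUT. */
static int
issue_command_start (enum state *next_state, bool *blocked,
                     bool interrupted, bool in_write_payload)
{
  if (interrupted) {
    if (in_write_payload)
      SET_NEXT_STATE_AND_BLOCK (STATE_SEND_WRITE_PAYLOAD);
    else
      SET_NEXT_STATE_AND_BLOCK (STATE_SEND_REQUEST);
    return 0;
  }
  SET_NEXT_STATE (STATE_SEND_REQUEST);  /* fresh command: try to send now */
  return 0;
}
```

Spelled this way, the state handler never touches the raw `*next_state` assignment directly, keeping the generated-code internals hidden behind the macro pair.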
Eric Blake
2019-Jun-25 14:37 UTC
Re: [Libguestfs] [libnbd PATCH] states: Never block state machine inside REPLY
On 6/25/19 4:03 AM, Richard W.M. Jones wrote:
> On Wed, Jun 19, 2019 at 01:18:01PM -0500, Eric Blake wrote:
>> Oddly enough, I am not getting any measurable performance difference
>> with this patch applied and using examples/threaded-reads-and-writes
>> coupled with nbdkit.  [...]
>
> What I'm going to say is I think stating the obvious, but my intuition
> is there are going to be two types of load.  In the first and easiest
> case you want to read or write sequentially over the whole or a large
> section of the image (think: qemu-img convert).  In this case you know
> well in advance what parts of the image you want to read/write and can
> keep the in flight queue full at all times.  This is what
> threaded-reads-and-writes actually tests.
>
> The harder case is random access (think: qemu with a database guest).
> There is a short queue, for example because of data dependencies
> between the requests, so the issued commands buffer is always short.
> Also there may be server latency because of the random access pattern.
> This kind of workload should show the benefit of this commit, but we
> don't really have a test for this kind of workload.

Yes, that's a fair assessment.

>> /* STATE MACHINE */ {
>>  REPLY.START:
>> +  /* If rlen is non-zero, we are resuming an earlier reply cycle. */
>> +  if (h->rlen > 0) {
>> +    if (h->reply_cmd) {
>> +      assert (nbd_internal_is_state_processing (h->reply_cmd->state));
>> +      SET_NEXT_STATE (h->reply_cmd->state);
>
> This is essentially the "stack of states" idea that I had early on,
> but with a maximum stack depth of 1.
>
> [...]
>
> In my original plan we would have had "push state" and "pop state"
> operations (pop state is a general jump to "any state", which is why
> it breaks the assumption in the generator).

In addition to only stacking a queue of at most 1 state, this also limits the stacking to occur during the REPLY sub-state. But I was also surprised that the generator did not balk at me setting the state to the contents of a variable rather than to the usual situation of setting it to an all-caps %FOO magic identifier.

>> +++ b/lib/internal.h
>> @@ -253,6 +253,7 @@ struct command_in_flight {
>>    uint32_t count;
>>    void *data;           /* Buffer for read/write */
>>    struct command_cb cb;
>> +  enum state state;     /* State to resume with on next POLLIN */
>>    bool data_seen;       /* For read, true if at least one data chunk seen */
>>    uint32_t error;       /* Local errno value */
>>  };
>
> The patch seems reasonable.  Does this obviate any need to split the
> state machine?

Yes, as far as I can see, this is all the more we ever have to worry about. It is possible to determine the entirety of wstate (either we're blocked under the ISSUE_COMMAND sub-state, or h->request and h->in_write_payload are sufficient to see what remains on a current request), and the entirety of rstate (either we're under the REPLY sub-state, or h->reply_cmd coupled with h->rlen and h->reply_cmd->state determine what remains on a current reply).

> ACK

Okay, I'll go ahead and push this one today.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org