Eric Blake
2019-May-21 15:09 UTC
[Libguestfs] [libnbd PATCH 0/3] Avoid deadlock with in-flight commands
This might not be the final solution, but it certainly seems to solve a deadlock for me that I could trigger by using 'nbdkit --filter=noparallel memory 512k' and calling nbd_aio_pread for a request larger than 256k (enough for the Linux kernel to block the server until libnbd read()s), immediately followed by nbd_aio_pwrite for a request larger than 256k (enough to block libnbd until the server read()s, but the serialized server won't read until we parse off the reply). My solution was to allow a notifyread at any time we are in the middle of writing a request, at which point we pause the current write, force the state machine to completely receive the reply, then resume where we left off writing the request. Eric Blake (3): commands: Preserve FIFO ordering states: Split ISSUE_COMMAND.SEND_REQUEST states: Allow in-flight read while writing next command generator/generator | 34 +++++++++++++++- generator/states-issue-command.c | 68 ++++++++++++++++++++++++-------- generator/states-reply.c | 18 +++++++-- lib/internal.h | 1 + lib/rw.c | 13 ++++-- 5 files changed, 108 insertions(+), 26 deletions(-) -- 2.20.1
Eric Blake
2019-May-21 15:09 UTC
[Libguestfs] [libnbd PATCH 1/3] commands: Preserve FIFO ordering
A generic client exploiting multiple in-flight commands should be prepared for out-of-order responses (and should probably ensure that there are no overlaps between parallel in-flight commands to avoid unspecified disk contents if the server acts on commands in an arbitrary order or even exposing non-atomic splicing effects). But a specific client aware of a specific server's behavior of being fully serialized may depend on commands being processed in strict FIFO order, and we should not get in the way of that. When adding commands to be issued, and when moving a server's reply into commands to inform the client about, we need to insert at the end rather than the head of the appropriate list. Only the cmds_in_flight list does not have to care about maintaining FIFO ordering. --- generator/states-reply.c | 13 ++++++++++--- lib/rw.c | 13 ++++++++++--- 2 files changed, 20 insertions(+), 6 deletions(-) diff --git a/generator/states-reply.c b/generator/states-reply.c index 93f6cda..45362d4 100644 --- a/generator/states-reply.c +++ b/generator/states-reply.c @@ -103,13 +103,20 @@ } assert (cmd != NULL); - /* Move it to the cmds_done list. */ + /* Move it to the end of the cmds_done list. */ if (prev_cmd != NULL) prev_cmd->next = cmd->next; else conn->cmds_in_flight = cmd->next; - cmd->next = conn->cmds_done; - conn->cmds_done = cmd; + cmd->next = NULL; + if (conn->cmds_done) { + prev_cmd = conn->cmds_done; + while (prev_cmd->next) + prev_cmd = prev_cmd->next; + prev_cmd->next = cmd; + } + else + conn->cmds_done = cmd; SET_NEXT_STATE (%.READY); return 0; diff --git a/lib/rw.c b/lib/rw.c index 9dfce97..fa7dc52 100644 --- a/lib/rw.c +++ b/lib/rw.c @@ -246,7 +246,7 @@ command_common (struct nbd_connection *conn, uint16_t flags, uint16_t type, uint64_t offset, uint64_t count, void *data) { - struct command_in_flight *cmd; + struct command_in_flight *cmd, *prev_cmd; switch (type) { /* Commands which send or receive data are limited to MAX_REQUEST_SIZE. */ @@ -296,8 +296,15 @@ command_common (struct nbd_connection *conn, if (conn->structured_replies && cmd->data && type == NBD_CMD_READ) memset (cmd->data, 0, cmd->count); - cmd->next = conn->cmds_to_issue; - conn->cmds_to_issue = cmd; + /* Stick the command at the end of the list */ + if (conn->cmds_to_issue != NULL) { + prev_cmd = conn->cmds_to_issue; + while (prev_cmd->next) + prev_cmd = prev_cmd->next; + prev_cmd->next = cmd; + } + else + conn->cmds_to_issue = cmd; return cmd; } -- 2.20.1
Eric Blake
2019-May-21 15:09 UTC
[Libguestfs] [libnbd PATCH 2/3] states: Split ISSUE_COMMAND.SEND_REQUEST
In order to handle reading an in-flight response while in the middle of sending a second command, we'll need a way to jump back into the middle of a command being sent. This is easier if the state that sets wbuf is distinct from the state that reads into wbuf, and if we don't move the command to the in-flight queue until after the writes finish. --- generator/generator | 14 ++++++++++ generator/states-issue-command.c | 45 ++++++++++++++++++++------------ 2 files changed, 42 insertions(+), 17 deletions(-) diff --git a/generator/generator b/generator/generator index a1bf41d..a4ad362 100755 --- a/generator/generator +++ b/generator/generator @@ -637,12 +637,26 @@ and issue_command_state_machine = [ external_events = [ NotifyWrite, "" ]; }; + State { + default_state with + name = "PREPARE_WRITE_PAYLOAD"; + comment = "Prepare the write payload to send to the remote server"; + external_events = []; + }; + State { default_state with name = "SEND_WRITE_PAYLOAD"; comment = "Sending the write payload to the remote server"; external_events = [ NotifyWrite, "" ]; }; + +State { + default_state with + name = "FINISH"; + comment = "Finish issuing a command"; + external_events = []; + }; ] (* Receiving a reply from the server. *) diff --git a/generator/states-issue-command.c b/generator/states-issue-command.c index a57f40f..e24ea34 100644 --- a/generator/states-issue-command.c +++ b/generator/states-issue-command.c @@ -24,9 +24,6 @@ assert (conn->cmds_to_issue != NULL); cmd = conn->cmds_to_issue; - conn->cmds_to_issue = cmd->next; - cmd->next = conn->cmds_in_flight; - conn->cmds_in_flight = cmd; conn->sbuf.request.magic = htobe32 (NBD_REQUEST_MAGIC); conn->sbuf.request.flags = htobe16 (cmd->flags); @@ -40,29 +37,43 @@ return 0; ISSUE_COMMAND.SEND_REQUEST: - struct command_in_flight *cmd; - switch (send_from_wbuf (conn)) { case -1: SET_NEXT_STATE (%.DEAD); return -1; - case 0: - assert (conn->cmds_in_flight != NULL); - cmd = conn->cmds_in_flight; - assert (cmd->handle == be64toh (conn->sbuf.request.handle)); - if (cmd->type == NBD_CMD_WRITE) { - conn->wbuf = cmd->data; - conn->wlen = cmd->count; - SET_NEXT_STATE (%SEND_WRITE_PAYLOAD); - } - else - SET_NEXT_STATE (%.READY); + case 0: SET_NEXT_STATE (%PREPARE_WRITE_PAYLOAD); } return 0; + ISSUE_COMMAND.PREPARE_WRITE_PAYLOAD: + struct command_in_flight *cmd; + + assert (conn->cmds_to_issue != NULL); + cmd = conn->cmds_to_issue; + assert (cmd->handle == be64toh (conn->sbuf.request.handle)); + if (cmd->type == NBD_CMD_WRITE) { + conn->wbuf = cmd->data; + conn->wlen = cmd->count; + SET_NEXT_STATE (%SEND_WRITE_PAYLOAD); + } + else + SET_NEXT_STATE (%FINISH); + return 0; + ISSUE_COMMAND.SEND_WRITE_PAYLOAD: switch (send_from_wbuf (conn)) { case -1: SET_NEXT_STATE (%.DEAD); return -1; - case 0: SET_NEXT_STATE (%.READY); + case 0: SET_NEXT_STATE (%FINISH); } return 0; + ISSUE_COMMAND.FINISH: + struct command_in_flight *cmd; + + assert (conn->cmds_to_issue != NULL); + cmd = conn->cmds_to_issue; + conn->cmds_to_issue = cmd->next; + cmd->next = conn->cmds_in_flight; + conn->cmds_in_flight = cmd; + SET_NEXT_STATE (%.READY); + return 0; + } /* END STATE MACHINE */ -- 2.20.1
Eric Blake
2019-May-21 15:09 UTC
[Libguestfs] [libnbd PATCH 3/3] states: Allow in-flight read while writing next command
As already noted in our state machine, a client that batches up a large read followed by large writes, coupled with a server that only processes commands in order, can result in deadlock (the server won't read more until we unblock its ability to write out its reply to our first command; but we aren't willing to read until we are done writing out our second command). Break the deadlock by teaching the generator that while we are in the middle of writing a command, we must remain responsive to read_notify events; if the server has data for us to read, we should consume that before jumping back into the middle of our command issue (and consuming a reply can invalidate sbuf, so we have to drop an assertion in PREPARE_WRITE_PAYLOAD). --- generator/generator | 20 ++++++++++++++++++-- generator/states-issue-command.c | 25 ++++++++++++++++++++++++- generator/states-reply.c | 5 ++++- lib/internal.h | 1 + 4 files changed, 47 insertions(+), 4 deletions(-) diff --git a/generator/generator b/generator/generator index a4ad362..23b3cbf 100755 --- a/generator/generator +++ b/generator/generator @@ -634,7 +634,15 @@ and issue_command_state_machine = [ default_state with name = "SEND_REQUEST"; comment = "Sending a request to the remote server"; - external_events = [ NotifyWrite, "" ]; + external_events = [ NotifyWrite, ""; + NotifyRead, "PAUSE_SEND_REQUEST" ]; + }; + + State { + default_state with + name = "PAUSE_SEND_REQUEST"; + comment = "Interrupt send request to receive an earlier command's reply"; + external_events = []; }; State { @@ -648,7 +656,15 @@ and issue_command_state_machine = [ default_state with name = "SEND_WRITE_PAYLOAD"; comment = "Sending the write payload to the remote server"; - external_events = [ NotifyWrite, "" ]; + external_events = [ NotifyWrite, ""; + NotifyRead, "PAUSE_WRITE_PAYLOAD" ]; + }; + +State { + default_state with + name = "PAUSE_WRITE_PAYLOAD"; + comment = "Interrupt write payload to receive an earlier command's reply"; + external_events = []; }; State { diff --git a/generator/states-issue-command.c b/generator/states-issue-command.c index e24ea34..3a5980d 100644 --- a/generator/states-issue-command.c +++ b/generator/states-issue-command.c @@ -25,6 +25,15 @@ assert (conn->cmds_to_issue != NULL); cmd = conn->cmds_to_issue; + /* Were we interrupted by reading a reply to an earlier command? */ + if (conn->wlen) { + if (conn->in_write_payload) + SET_NEXT_STATE(%SEND_WRITE_PAYLOAD); + else + SET_NEXT_STATE(%SEND_REQUEST); + return 0; + } + conn->sbuf.request.magic = htobe32 (NBD_REQUEST_MAGIC); conn->sbuf.request.flags = htobe16 (cmd->flags); conn->sbuf.request.type = htobe16 (cmd->type); @@ -43,12 +52,18 @@ } return 0; + ISSUE_COMMAND.PAUSE_SEND_REQUEST: + assert (conn->wlen); + assert (conn->cmds_to_issue != NULL); + conn->in_write_payload = false; + SET_NEXT_STATE (%^REPLY.START); + return 0; + ISSUE_COMMAND.PREPARE_WRITE_PAYLOAD: struct command_in_flight *cmd; assert (conn->cmds_to_issue != NULL); cmd = conn->cmds_to_issue; - assert (cmd->handle == be64toh (conn->sbuf.request.handle)); if (cmd->type == NBD_CMD_WRITE) { conn->wbuf = cmd->data; conn->wlen = cmd->count; @@ -65,9 +80,17 @@ } return 0; + ISSUE_COMMAND.PAUSE_WRITE_PAYLOAD: + assert (conn->wlen); + assert (conn->cmds_to_issue != NULL); + conn->in_write_payload = true; + SET_NEXT_STATE (%^REPLY.START); + return 0; + ISSUE_COMMAND.FINISH: struct command_in_flight *cmd; + assert (!conn->wlen); assert (conn->cmds_to_issue != NULL); cmd = conn->cmds_to_issue; conn->cmds_to_issue = cmd->next; diff --git a/generator/states-reply.c b/generator/states-reply.c index 45362d4..6bb503a 100644 --- a/generator/states-reply.c +++ b/generator/states-reply.c @@ -118,7 +118,10 @@ else conn->cmds_done = cmd; - SET_NEXT_STATE (%.READY); + if (conn->cmds_to_issue) + SET_NEXT_STATE (%^ISSUE_COMMAND.START); + else + SET_NEXT_STATE (%.READY); return 0; } /* END STATE MACHINE */ diff --git a/lib/internal.h b/lib/internal.h index 3f2b729..466af9d 100644 --- a/lib/internal.h +++ b/lib/internal.h @@ -182,6 +182,7 @@ struct nbd_connection { * acknowledge them. */ struct command_in_flight *cmds_to_issue, *cmds_in_flight, *cmds_done; + bool in_write_payload; }; struct meta_context { -- 2.20.1
--- This is what I used to provoke the deadlocks; before my patch series, it was succeeding for a fully-parallel server: nbdkit -U - memory 2M --run './deadlock $unixsocket' as well as for a serialized server that didn't trip up Linux kernel buffering limits: nbdkit -U - --filter=noparallel memory 256k --run './deadlock $unixsocket' but was reliably hanging with both client and server on larger buffers: nbdkit -U - --filter=noparallel memory 512k --run './deadlock $unixsocket' Post-patch, you can see the state machine now shift through ISSUE_COMMAND.PAUSE_WRITE_PAYLOAD through an entire REPLY.START..REPLY.FINISH sequence, and then resume in the middle of the ISSUE_COMMAND.SEND_WRITE_PAYLOAD. .gitignore | 1 + examples/Makefile.am | 10 +++ examples/deadlock.c | 200 +++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 211 insertions(+) create mode 100644 examples/deadlock.c diff --git a/.gitignore b/.gitignore index 66ff811..c135c26 100644 --- a/.gitignore +++ b/.gitignore @@ -30,6 +30,7 @@ Makefile.in /docs/libnbd.3 /docs/libnbd-api.3 /docs/libnbd-api.pod +/examples/deadlock /examples/threaded-reads-and-writes /examples/simple-fetch-first-sector /examples/simple-reads-and-writes diff --git a/examples/Makefile.am b/examples/Makefile.am index be3f21d..16b3804 100644 --- a/examples/Makefile.am +++ b/examples/Makefile.am @@ -18,6 +18,7 @@ include $(top_srcdir)/subdir-rules.mk noinst_PROGRAMS = \ + deadlock \ simple-fetch-first-sector \ simple-reads-and-writes \ threaded-reads-and-writes @@ -50,3 +51,12 @@ threaded_reads_and_writes_CFLAGS = \ threaded_reads_and_writes_LDADD = \ $(top_builddir)/lib/libnbd.la \ $(PTHREAD_LIBS) + +deadlock_SOURCES = \ + deadlock.c +deadlock_CPPFLAGS = \ + -I$(top_srcdir)/include +deadlock_CFLAGS = \ + $(WARNINGS_CFLAGS) +deadlock_LDADD = \ + $(top_builddir)/lib/libnbd.la diff --git a/examples/deadlock.c b/examples/deadlock.c new file mode 100644 index 0000000..1c9be8d --- /dev/null +++ b/examples/deadlock.c @@ -0,0 +1,200 @@ +/* This example can be copied, used and modified for any purpose + * without restrictions. + * + * Example usage with nbdkit: + * + * nbdkit -U - --filter=noparallel memory 2M --run './deadlock $unixsocket' + * + * This will attempt to create a deadlock by sending a large read + * request, immediately followed by a large write request, prior to + * waiting for any command replies from the server. If the server does + * not support reading a second command until after the response to + * the first one has been sent, then this could deadlock with the + * server waiting for libnbd to finish reading the read response, + * while libnbd is waiting for the server to finish reading the write + * request. Fixing the deadlock requires that the client prioritize + * reads over writes when more than one command is in flight. + * + * To run it against a remote server over TCP: + * + * ./deadlock hostname port + */ + +#include <stdio.h> +#include <stdlib.h> +#include <stdbool.h> +#include <stdint.h> +#include <string.h> +#include <unistd.h> +#include <errno.h> +#include <poll.h> +#include <time.h> +#include <assert.h> + +#include <libnbd.h> + +/* The single NBD handle. */ +static struct nbd_handle *nbd; + +/* Buffers used for the test. */ +static char *in, *out; +static int64_t packetsize; + +static int +try_deadlock (void *arg) +{ + struct pollfd fds[1]; + struct nbd_connection *conn; + char buf[512]; + size_t i, j; + int64_t handles[2]; + size_t in_flight; /* counts number of requests in flight */ + int dir, r, cmd; + bool want_to_send; + + /* The single thread "owns" the connection. */ + nbd_set_debug (nbd, true); + conn = nbd_get_connection (nbd, 0); + + /* Issue commands. */ + in_flight = 0; + fprintf (stderr, " * before aio_pread\n"); + handles[0] = nbd_aio_pread (conn, in, packetsize, 0); + if (handles[0] == -1) { + fprintf (stderr, "%s\n", nbd_get_error ()); + goto error; + } + fprintf (stderr, " * after aio_pread\n"); + in_flight++; + fprintf (stderr, " * before aio_pwrite\n"); + handles[1] = nbd_aio_pwrite (conn, out, packetsize, packetsize, 0); + if (handles[1] == -1) { + fprintf (stderr, "%s\n", nbd_get_error ()); + goto error; + } + fprintf (stderr, " * after aio_pwrite\n"); + in_flight++; + + /* Now wait for commands to retire, or for deadlock to occur */ + while (in_flight > 0) { + if (nbd_aio_is_dead (conn) || nbd_aio_is_closed (conn)) { + fprintf (stderr, "connection is dead or closed\n"); + goto error; + } + + fds[0].fd = nbd_aio_get_fd (conn); + fds[0].events = 0; + fds[0].revents = 0; + dir = nbd_aio_get_direction (conn); + if ((dir & LIBNBD_AIO_DIRECTION_READ) != 0) + fds[0].events |= POLLIN; + if ((dir & LIBNBD_AIO_DIRECTION_WRITE) != 0) + fds[0].events |= POLLOUT; + + if (poll (fds, 1, -1) == -1) { + perror ("poll"); + goto error; + } + + if ((dir & LIBNBD_AIO_DIRECTION_READ) != 0 && + (fds[0].revents & POLLIN) != 0) + nbd_aio_notify_read (conn); + else if ((dir & LIBNBD_AIO_DIRECTION_WRITE) != 0 && + (fds[0].revents & POLLOUT) != 0) + nbd_aio_notify_write (conn); + + /* If a command is ready to retire, retire it. */ + for (j = 0; j < in_flight; ++j) { + r = nbd_aio_command_completed (conn, handles[j]); + if (r == -1) { + fprintf (stderr, "%s\n", nbd_get_error ()); + goto error; + } + if (r) { + memmove (&handles[j], &handles[j+1], + sizeof (handles[0]) * (in_flight - j - 1)); + j--; + in_flight--; + } + } + } + + printf ("finished OK\n"); + + return 0; + + error: + fprintf (stderr, "failed\n"); + return -1; +} + +int +main (int argc, char *argv[]) +{ + int err; + int64_t exportsize; + + if (argc < 2 || argc > 3) { + fprintf (stderr, "%s socket | hostname port\n", argv[0]); + exit (EXIT_FAILURE); + } + + nbd = nbd_create (); + if (nbd == NULL) { + fprintf (stderr, "%s\n", nbd_get_error ()); + exit (EXIT_FAILURE); + } + + /* Connect synchronously as this is simpler. */ + if (argc == 2) { + if (nbd_connect_unix (nbd, argv[1]) == -1) { + fprintf (stderr, "%s\n", nbd_get_error ()); + exit (EXIT_FAILURE); + } + } + else { + if (nbd_connect_tcp (nbd, argv[1], argv[2]) == -1) { + fprintf (stderr, "%s\n", nbd_get_error ()); + exit (EXIT_FAILURE); + } + } + + if (nbd_read_only (nbd) == 1) { + fprintf (stderr, "%s: error: this NBD export is read-only\n", argv[0]); + exit (EXIT_FAILURE); + } + + exportsize = nbd_get_size (nbd); + if (exportsize == -1) { + fprintf (stderr, "%s\n", nbd_get_error ()); + exit (EXIT_FAILURE); + } + packetsize = exportsize / 2; + if (packetsize > 2 * 1024 * 1024) + packetsize = 2 * 1024 * 1024; + + in = malloc (packetsize); + out = malloc (packetsize); + if (!in || !out) { + fprintf (stderr, "insufficient memory\n"); + exit (EXIT_FAILURE); + } + + /* Attempt to be non-destructive, by writing what file already contains */ + if (nbd_pread (nbd, out, packetsize, packetsize) == -1) { + fprintf (stderr, "sync read failed: %s\n", nbd_get_error ()); + exit (EXIT_FAILURE); + } + + if (try_deadlock (NULL) == -1) + exit (EXIT_FAILURE); + + if (nbd_shutdown (nbd) == -1) { + fprintf (stderr, "%s\n", nbd_get_error ()); + exit (EXIT_FAILURE); + } + + nbd_close (nbd); + + return EXIT_SUCCESS; +} -- 2.20.1
Eric Blake
2019-May-21 15:25 UTC
Re: [Libguestfs] [libnbd PATCH 1/3] commands: Preserve FIFO ordering
On 5/21/19 10:09 AM, Eric Blake wrote:> A generic client exploiting multiple in-flight commands should be > prepared for out-of-order responses (and should probably ensure that > there are no overlaps between parallel in-flight commands to avoid > unspecified disk contents if the server acts on commands in an > arbitrary order or even exposing non-atomic splicing effects). But a > specific client aware of a specific server's behavior of being fully > serialized may depend on commands being processed in strict FIFO > order, and we should not get in the way of that. When adding commands > to be issued, and when moving a server's reply into commands to inform > the client about, we need to insert at the end rather than the head of > the appropriate list. Only the cmds_in_flight list does not have to > care about maintaining FIFO ordering. > --- > generator/states-reply.c | 13 ++++++++++--- > lib/rw.c | 13 ++++++++++--- > 2 files changed, 20 insertions(+), 6 deletions(-)If O(n) traversal through the list is painful, we could instead tweak our storage to also store an end pointer (more bookkeeping to keep head and tail pointers up-to-date, but then we always have O(1) insertion at tail and removal at head). But typically a client won't have huge amounts of in-flight messages (qemu-nbd defaults to 16 coroutines, and nbdkit defaults to 16 threads, at which point any further attempts to send more requests batch up until existing in-flight commands are drained), so I'm not sure if the algorithmic complexity reaches the point where it will matter. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Eric Blake
2019-May-21 15:28 UTC
Re: [Libguestfs] [libnbd PATCH 3/3] states: Allow in-flight read while writing next command
On 5/21/19 10:09 AM, Eric Blake wrote:> As already noted in our state machine, a client that batches up a > large read followed by large writes, coupled with a server that only > processes commands in order, can result in deadlock (the server won't > read more until we unblock its ability to write out its reply to our > first command; but we aren't willing to read until we are done writing > out our second command). Break the deadlock by teaching the generator > that while we are in the middle of writing a command, we must remain > responsive to read_notify events; if the server has data for us to > read, we should consume that before jumping back into the middle of > our command issue (and consuming a reply can invalidate sbuf, so we > have to drop an assertion in PREPARE_WRITE_PAYLOAD). > --- > generator/generator | 20 ++++++++++++++++++-- > generator/states-issue-command.c | 25 ++++++++++++++++++++++++- > generator/states-reply.c | 5 ++++- > lib/internal.h | 1 + > 4 files changed, 47 insertions(+), 4 deletions(-)Squash this in, if we think we solved the problem: diff --git i/generator/generator w/generator/generator index 23b3cbf..5c84a5d 100755 --- i/generator/generator +++ w/generator/generator @@ -620,12 +620,6 @@ and issue_command_state_machine = [ State { default_state with name = "START"; - (* XXX There's a possible deadlock here if a server cannot - * handle multiple requests pipelined on a single connection. - * We could try to issue a command and block, but reads might - * be available. It should be possible to break this with - * another state. - *) comment = "Begin issuing a command to the remote server"; external_events = []; }; -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Richard W.M. Jones
2019-May-21 17:02 UTC
Re: [Libguestfs] [libnbd PATCH 2/3] states: Split ISSUE_COMMAND.SEND_REQUEST
On Tue, May 21, 2019 at 10:09:29AM -0500, Eric Blake wrote:> In order to handle reading an in-flight response while in the middle > of sending a second command, we'll need a way to jump back into the > middle of a command being sent. This is easier if the state that sets > wbuf is distinct from the state that reads into wbuf, and if we don't > move the command to the in-flight queue until after the writes finish. > --- > generator/generator | 14 ++++++++++ > generator/states-issue-command.c | 45 ++++++++++++++++++++------------ > 2 files changed, 42 insertions(+), 17 deletions(-) > > diff --git a/generator/generator b/generator/generator > index a1bf41d..a4ad362 100755 > --- a/generator/generator > +++ b/generator/generator > @@ -637,12 +637,26 @@ and issue_command_state_machine = [ > external_events = [ NotifyWrite, "" ]; > }; > > + State { > + default_state with > + name = "PREPARE_WRITE_PAYLOAD"; > + comment = "Prepare the write payload to send to the remote server"; > + external_events = []; > + }; > + > State { > default_state with > name = "SEND_WRITE_PAYLOAD"; > comment = "Sending the write payload to the remote server"; > external_events = [ NotifyWrite, "" ]; > }; > + > +State { > + default_state with > + name = "FINISH"; > + comment = "Finish issuing a command"; > + external_events = []; > + }; > ] > > (* Receiving a reply from the server. *) > diff --git a/generator/states-issue-command.c b/generator/states-issue-command.c > index a57f40f..e24ea34 100644 > --- a/generator/states-issue-command.c > +++ b/generator/states-issue-command.c > @@ -24,9 +24,6 @@ > > assert (conn->cmds_to_issue != NULL); > cmd = conn->cmds_to_issue; > - conn->cmds_to_issue = cmd->next; > - cmd->next = conn->cmds_in_flight; > - conn->cmds_in_flight = cmd; > > conn->sbuf.request.magic = htobe32 (NBD_REQUEST_MAGIC); > conn->sbuf.request.flags = htobe16 (cmd->flags); > @@ -40,29 +37,43 @@ > return 0; > > ISSUE_COMMAND.SEND_REQUEST: > - struct command_in_flight *cmd; > - > switch (send_from_wbuf (conn)) { > case -1: SET_NEXT_STATE (%.DEAD); return -1; > - case 0: > - assert (conn->cmds_in_flight != NULL); > - cmd = conn->cmds_in_flight; > - assert (cmd->handle == be64toh (conn->sbuf.request.handle)); > - if (cmd->type == NBD_CMD_WRITE) { > - conn->wbuf = cmd->data; > - conn->wlen = cmd->count; > - SET_NEXT_STATE (%SEND_WRITE_PAYLOAD); > - } > - else > - SET_NEXT_STATE (%.READY); > + case 0: SET_NEXT_STATE (%PREPARE_WRITE_PAYLOAD); > } > return 0; > > + ISSUE_COMMAND.PREPARE_WRITE_PAYLOAD: > + struct command_in_flight *cmd; > + > + assert (conn->cmds_to_issue != NULL); > + cmd = conn->cmds_to_issue; > + assert (cmd->handle == be64toh (conn->sbuf.request.handle)); > + if (cmd->type == NBD_CMD_WRITE) { > + conn->wbuf = cmd->data; > + conn->wlen = cmd->count; > + SET_NEXT_STATE (%SEND_WRITE_PAYLOAD); > + } > + else > + SET_NEXT_STATE (%FINISH); > + return 0; > + > ISSUE_COMMAND.SEND_WRITE_PAYLOAD: > switch (send_from_wbuf (conn)) { > case -1: SET_NEXT_STATE (%.DEAD); return -1; > - case 0: SET_NEXT_STATE (%.READY); > + case 0: SET_NEXT_STATE (%FINISH); > } > return 0; > > + ISSUE_COMMAND.FINISH: > + struct command_in_flight *cmd; > + > + assert (conn->cmds_to_issue != NULL); > + cmd = conn->cmds_to_issue; > + conn->cmds_to_issue = cmd->next; > + cmd->next = conn->cmds_in_flight; > + conn->cmds_in_flight = cmd; > + SET_NEXT_STATE (%.READY); > + return 0; > + > } /* END STATE MACHINE */This is a simple state splitting, along with a later move of the command to the in-flight list. Since no more commands could be added to the cmds_to_issue queue while we're not in the READY state, actually I think cmds_to_issue might just be a single command pointer after all. The patch however is generally fine. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-p2v converts physical machines to virtual machines. Boot with a live CD or over the network (PXE) and turn machines into KVM guests. http://libguestfs.org/virt-v2v
Richard W.M. Jones
2019-May-21 17:06 UTC
Re: [Libguestfs] [libnbd PATCH 3/3] states: Allow in-flight read while writing next command
On Tue, May 21, 2019 at 10:09:30AM -0500, Eric Blake wrote:> As already noted in our state machine, a client that batches up a > large read followed by large writes, coupled with a server that only > processes commands in order, can result in deadlock (the server won't > read more until we unblock its ability to write out its reply to our > first command; but we aren't willing to read until we are done writing > out our second command). Break the deadlock by teaching the generator > that while we are in the middle of writing a command, we must remain > responsive to read_notify events; if the server has data for us to > read, we should consume that before jumping back into the middle of > our command issue (and consuming a reply can invalidate sbuf, so we > have to drop an assertion in PREPARE_WRITE_PAYLOAD). > --- > generator/generator | 20 ++++++++++++++++++-- > generator/states-issue-command.c | 25 ++++++++++++++++++++++++- > generator/states-reply.c | 5 ++++- > lib/internal.h | 1 + > 4 files changed, 47 insertions(+), 4 deletions(-) > > diff --git a/generator/generator b/generator/generator > index a4ad362..23b3cbf 100755 > --- a/generator/generator > +++ b/generator/generator > @@ -634,7 +634,15 @@ and issue_command_state_machine = [ > default_state with > name = "SEND_REQUEST"; > comment = "Sending a request to the remote server"; > - external_events = [ NotifyWrite, "" ]; > + external_events = [ NotifyWrite, ""; > + NotifyRead, "PAUSE_SEND_REQUEST" ]; > + }; > + > + State { > + default_state with > + name = "PAUSE_SEND_REQUEST"; > + comment = "Interrupt send request to receive an earlier command's reply"; > + external_events = []; > }; > > State { > @@ -648,7 +656,15 @@ and issue_command_state_machine = [ > default_state with > name = "SEND_WRITE_PAYLOAD"; > comment = "Sending the write payload to the remote server"; > - external_events = [ NotifyWrite, "" ]; > + external_events = [ NotifyWrite, ""; > + NotifyRead, "PAUSE_WRITE_PAYLOAD" ]; > + }; > + > +State { > + default_state with > + name = "PAUSE_WRITE_PAYLOAD"; > + comment = "Interrupt write payload to receive an earlier command's reply"; > + external_events = []; > }; > > State { > diff --git a/generator/states-issue-command.c b/generator/states-issue-command.c > index e24ea34..3a5980d 100644 > --- a/generator/states-issue-command.c > +++ b/generator/states-issue-command.c > @@ -25,6 +25,15 @@ > assert (conn->cmds_to_issue != NULL); > cmd = conn->cmds_to_issue; > > + /* Were we interrupted by reading a reply to an earlier command? */ > + if (conn->wlen) { > + if (conn->in_write_payload) > + SET_NEXT_STATE(%SEND_WRITE_PAYLOAD); > + else > + SET_NEXT_STATE(%SEND_REQUEST); > + return 0; > + } > + > conn->sbuf.request.magic = htobe32 (NBD_REQUEST_MAGIC); > conn->sbuf.request.flags = htobe16 (cmd->flags); > conn->sbuf.request.type = htobe16 (cmd->type); > @@ -43,12 +52,18 @@ > } > return 0; > > + ISSUE_COMMAND.PAUSE_SEND_REQUEST: > + assert (conn->wlen); > + assert (conn->cmds_to_issue != NULL); > + conn->in_write_payload = false; > + SET_NEXT_STATE (%^REPLY.START); > + return 0; > + > ISSUE_COMMAND.PREPARE_WRITE_PAYLOAD: > struct command_in_flight *cmd; > > assert (conn->cmds_to_issue != NULL); > cmd = conn->cmds_to_issue; > - assert (cmd->handle == be64toh (conn->sbuf.request.handle)); > if (cmd->type == NBD_CMD_WRITE) { > conn->wbuf = cmd->data; > conn->wlen = cmd->count; > @@ -65,9 +80,17 @@ > } > return 0; > > + ISSUE_COMMAND.PAUSE_WRITE_PAYLOAD: > + assert (conn->wlen); > + assert (conn->cmds_to_issue != NULL); > + conn->in_write_payload = true; > + SET_NEXT_STATE (%^REPLY.START); > + return 0; > + > ISSUE_COMMAND.FINISH: > struct command_in_flight *cmd; > > + assert (!conn->wlen); > assert (conn->cmds_to_issue != NULL); > cmd = conn->cmds_to_issue; > conn->cmds_to_issue = cmd->next; > diff --git a/generator/states-reply.c b/generator/states-reply.c > index 45362d4..6bb503a 100644 > --- a/generator/states-reply.c > +++ b/generator/states-reply.c > @@ -118,7 +118,10 @@ > else > conn->cmds_done = cmd; > > - SET_NEXT_STATE (%.READY); > + if (conn->cmds_to_issue) > + SET_NEXT_STATE (%^ISSUE_COMMAND.START); > + else > + SET_NEXT_STATE (%.READY); > return 0; > > } /* END STATE MACHINE */ > diff --git a/lib/internal.h b/lib/internal.h > index 3f2b729..466af9d 100644 > --- a/lib/internal.h > +++ b/lib/internal.h > @@ -182,6 +182,7 @@ struct nbd_connection { > * acknowledge them. > */ > struct command_in_flight *cmds_to_issue, *cmds_in_flight, *cmds_done; > + bool in_write_payload; > }; > > struct meta_context {Yes, I think this is correct. However, I also think that cmds_to_issue might be replaced by a single command pointer. (If I'm wrong in my reasoning, we'll very quickly see an assert fail if something tries to overwrite a non-NULL conn->cmd_to_issue pointer). Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-df lists disk usage of guests without needing to install any software inside the virtual machine. Supports Linux and Windows. http://people.redhat.com/~rjones/virt-df/
Richard W.M. Jones
2019-May-21 17:08 UTC
Re: [Libguestfs] [libnbd] tmp patch adding deadlock test
On Tue, May 21, 2019 at 10:13:48AM -0500, Eric Blake wrote:> --- > > This is what I used to provoke the deadlocks; before my patch series, > it was succeeding for a fully-parallel server: > nbdkit -U - memory 2M --run './deadlock $unixsocket' > as well as for a serialized server that didn't trip up Linux kernel > buffering limits: > nbdkit -U - --filter=noparallel memory 256k --run './deadlock $unixsocket' > but was reliably hanging with both client and server on larger buffers: > nbdkit -U - --filter=noparallel memory 512k --run './deadlock $unixsocket' > > Post-patch, you can see the state machine now shift through > ISSUE_COMMAND.PAUSE_WRITE_PAYLOAD through an entire > REPLY.START..REPLY.FINISH sequence, and then resume in the middle of > the ISSUE_COMMAND.SEND_WRITE_PAYLOAD.This is fine, but will require us to either bump up the minimum version of nbdkit in configure.ac, or have some kind of test to see if the filter is present (see: https://github.com/libguestfs/nbdkit/blob/master/docs/nbdkit-filter.pod#pkg-configpkgconf ) ACK Rich.> .gitignore | 1 + > examples/Makefile.am | 10 +++ > examples/deadlock.c | 200 +++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 211 insertions(+) > create mode 100644 examples/deadlock.c > > diff --git a/.gitignore b/.gitignore > index 66ff811..c135c26 100644 > --- a/.gitignore > +++ b/.gitignore > @@ -30,6 +30,7 @@ Makefile.in > /docs/libnbd.3 > /docs/libnbd-api.3 > /docs/libnbd-api.pod > +/examples/deadlock > /examples/threaded-reads-and-writes > /examples/simple-fetch-first-sector > /examples/simple-reads-and-writes > diff --git a/examples/Makefile.am b/examples/Makefile.am > index be3f21d..16b3804 100644 > --- a/examples/Makefile.am > +++ b/examples/Makefile.am > @@ -18,6 +18,7 @@ > include $(top_srcdir)/subdir-rules.mk > > noinst_PROGRAMS = \ > + deadlock \ > simple-fetch-first-sector \ > simple-reads-and-writes \ > threaded-reads-and-writes > @@ -50,3 +51,12 @@ threaded_reads_and_writes_CFLAGS = \ > threaded_reads_and_writes_LDADD = \ > $(top_builddir)/lib/libnbd.la \ > $(PTHREAD_LIBS) > + > +deadlock_SOURCES = \ > + deadlock.c > +deadlock_CPPFLAGS = \ > + -I$(top_srcdir)/include > +deadlock_CFLAGS = \ > + $(WARNINGS_CFLAGS) > +deadlock_LDADD = \ > + $(top_builddir)/lib/libnbd.la > diff --git a/examples/deadlock.c b/examples/deadlock.c > new file mode 100644 > index 0000000..1c9be8d > --- /dev/null > +++ b/examples/deadlock.c > @@ -0,0 +1,200 @@ > +/* This example can be copied, used and modified for any purpose > + * without restrictions. > + * > + * Example usage with nbdkit: > + * > + * nbdkit -U - --filter=noparallel memory 2M --run './deadlock $unixsocket' > + * > + * This will attempt to create a deadlock by sending a large read > + * request, immediately followed by a large write request, prior to > + * waiting for any command replies from the server. If the server does > + * not support reading a second command until after the response to > + * the first one has been sent, then this could deadlock with the > + * server waiting for libnbd to finish reading the read response, > + * while libnbd is waiting for the server to finish reading the write > + * request. Fixing the deadlock requires that the client prioritize > + * reads over writes when more than one command is in flight. > + * > + * To run it against a remote server over TCP: > + * > + * ./deadlock hostname port > + */ > + > +#include <stdio.h> > +#include <stdlib.h> > +#include <stdbool.h> > +#include <stdint.h> > +#include <string.h> > +#include <unistd.h> > +#include <errno.h> > +#include <poll.h> > +#include <time.h> > +#include <assert.h> > + > +#include <libnbd.h> > + > +/* The single NBD handle. */ > +static struct nbd_handle *nbd; > + > +/* Buffers used for the test. */ > +static char *in, *out; > +static int64_t packetsize; > + > +static int > +try_deadlock (void *arg) > +{ > + struct pollfd fds[1]; > + struct nbd_connection *conn; > + char buf[512]; > + size_t i, j; > + int64_t handles[2]; > + size_t in_flight; /* counts number of requests in flight */ > + int dir, r, cmd; > + bool want_to_send; > + > + /* The single thread "owns" the connection. */ > + nbd_set_debug (nbd, true); > + conn = nbd_get_connection (nbd, 0); > + > + /* Issue commands. */ > + in_flight = 0; > + fprintf (stderr, " * before aio_pread\n"); > + handles[0] = nbd_aio_pread (conn, in, packetsize, 0); > + if (handles[0] == -1) { > + fprintf (stderr, "%s\n", nbd_get_error ()); > + goto error; > + } > + fprintf (stderr, " * after aio_pread\n"); > + in_flight++; > + fprintf (stderr, " * before aio_pwrite\n"); > + handles[1] = nbd_aio_pwrite (conn, out, packetsize, packetsize, 0); > + if (handles[1] == -1) { > + fprintf (stderr, "%s\n", nbd_get_error ()); > + goto error; > + } > + fprintf (stderr, " * after aio_pwrite\n"); > + in_flight++; > + > + /* Now wait for commands to retire, or for deadlock to occur */ > + while (in_flight > 0) { > + if (nbd_aio_is_dead (conn) || nbd_aio_is_closed (conn)) { > + fprintf (stderr, "connection is dead or closed\n"); > + goto error; > + } > + > + fds[0].fd = nbd_aio_get_fd (conn); > + fds[0].events = 0; > + fds[0].revents = 0; > + dir = nbd_aio_get_direction (conn); > + if ((dir & LIBNBD_AIO_DIRECTION_READ) != 0) > + fds[0].events |= POLLIN; > + if ((dir & LIBNBD_AIO_DIRECTION_WRITE) != 0) > + fds[0].events |= POLLOUT; > + > + if (poll (fds, 1, -1) == -1) { > + perror ("poll"); > + goto error; > + } > + > + if ((dir & LIBNBD_AIO_DIRECTION_READ) != 0 && > + (fds[0].revents & POLLIN) != 0) > + nbd_aio_notify_read (conn); > + else if ((dir & LIBNBD_AIO_DIRECTION_WRITE) != 0 && > + (fds[0].revents & POLLOUT) != 0) > + nbd_aio_notify_write (conn); > + > + /* If a command is ready to retire, retire it. */ > + for (j = 0; j < in_flight; ++j) { > + r = nbd_aio_command_completed (conn, handles[j]); > + if (r == -1) { > + fprintf (stderr, "%s\n", nbd_get_error ()); > + goto error; > + } > + if (r) { > + memmove (&handles[j], &handles[j+1], > + sizeof (handles[0]) * (in_flight - j - 1)); > + j--; > + in_flight--; > + } > + } > + } > + > + printf ("finished OK\n"); > + > + return 0; > + > + error: > + fprintf (stderr, "failed\n"); > + return -1; > +} > + > +int > +main (int argc, char *argv[]) > +{ > + int err; > + int64_t exportsize; > + > + if (argc < 2 || argc > 3) { > + fprintf (stderr, "%s socket | hostname port\n", argv[0]); > + exit (EXIT_FAILURE); > + } > + > + nbd = nbd_create (); > + if (nbd == NULL) { > + fprintf (stderr, "%s\n", nbd_get_error ()); > + exit (EXIT_FAILURE); > + } > + > + /* Connect synchronously as this is simpler. */ > + if (argc == 2) { > + if (nbd_connect_unix (nbd, argv[1]) == -1) { > + fprintf (stderr, "%s\n", nbd_get_error ()); > + exit (EXIT_FAILURE); > + } > + } > + else { > + if (nbd_connect_tcp (nbd, argv[1], argv[2]) == -1) { > + fprintf (stderr, "%s\n", nbd_get_error ()); > + exit (EXIT_FAILURE); > + } > + } > + > + if (nbd_read_only (nbd) == 1) { > + fprintf (stderr, "%s: error: this NBD export is read-only\n", argv[0]); > + exit (EXIT_FAILURE); > + } > + > + exportsize = nbd_get_size (nbd); > + if (exportsize == -1) { > + fprintf (stderr, "%s\n", nbd_get_error ()); > + exit (EXIT_FAILURE); > + } > + packetsize = exportsize / 2; > + if (packetsize > 2 * 1024 * 1024) > + packetsize = 2 * 1024 * 1024; > + > + in = malloc (packetsize); > + out = malloc (packetsize); > + if (!in || !out) { > + fprintf (stderr, "insufficient memory\n"); > + exit (EXIT_FAILURE); > + } > + > + /* Attempt to be non-destructive, by writing what file already contains */ > + if (nbd_pread (nbd, out, packetsize, packetsize) == -1) { > + fprintf (stderr, "sync read failed: %s\n", nbd_get_error ()); > + exit (EXIT_FAILURE); > + } > + > + if (try_deadlock (NULL) == -1) > + exit (EXIT_FAILURE); > + > + if (nbd_shutdown (nbd) == -1) { > + fprintf (stderr, "%s\n", nbd_get_error ()); > + exit (EXIT_FAILURE); > + } > + > + nbd_close (nbd); > + > + return EXIT_SUCCESS; > +} > -- > 2.20.1 > > _______________________________________________ > Libguestfs mailing list > Libguestfs@redhat.com > https://www.redhat.com/mailman/listinfo/libguestfs-- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com libguestfs lets you edit virtual machines. Supports shell scripting, bindings from many languages. http://libguestfs.org
Apparently Analagous Threads
- [libnbd PATCH 3/3] states: Allow in-flight read while writing next command
- [libnbd PATCH v2 4/5] states: Allow in-flight read while writing next command
- [libnbd PATCH] states: Never block state machine inside REPLY
- [libnbd PATCH 0/3] Avoid deadlock with in-flight commands
- [libnbd PATCH v2 0/5] Avoid deadlock with in-flight commands