Eric Blake
2020-Mar-28 20:57 UTC
[Libguestfs] [nbdkit PATCH v2] nbd: Avoid stuck poll() in nbdplug_close_handle()
We have been seeing sporadic hangs on test-nbd-tls-psk.sh and test-nbd-tls.sh (on my machine, running those two in a loop with commits 0a76cae4 and 09e34ba2 reverted would fail within 100 attempts), where even though the client to the 'nbdkit nbd' process has cleanly exited, things are stalled in .close where nbd is trying to pthread_join() the reader thread, while the reader thread is itself blocked on a poll() that will never make additional progress. In hindsight, the problem is obvious: we used the wrong libnbd API (synchronous instead of async), which often results in: nbd .close nbd reader server nbd_shutdown poll - send NBD_CMD_DISC receive NBD_CMD_DISC shutdown(SHUT_WR) wake up on EOF nbd_aio_notify_read - move state to CLOSED nbd_aio_is_dead -> true break loop - poll pthread_join close fds nbd_close see EOF from client but sometimes things happen differently: nbd .close nbd reader server nbd_shutdown poll - send NBD_CMD_DISC - poll receive NBD_CMD_DISC shutdown(SHUT_WR) - wake up on EOF - move state to CLOSED pthread_join stuck As can be seen, having two threads compete on poll() on the same fd is unwise. Worse, if the server uses a means that will cause us to see EOF even though the socket is still open (such as shutdown(SHUT_WR) on plaintext, or gnutls_bye() with TLS), but waits for us to close the socket first, we end up with two processes deadlocked on each other. (nbdkit as server has used gnutls_bye since commit bd9a52e0; qemu however does not use it.) Other commits such as 14ba4154, c70616f8, and 430f8141 show that getting the shutdown sequence race-free has not been trivial. Fixes: ab7760fc Signed-off-by: Eric Blake <eblake@redhat.com> --- plugins/nbd/nbd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/plugins/nbd/nbd.c b/plugins/nbd/nbd.c index ba1e7188..0ea2a4a2 100644 --- a/plugins/nbd/nbd.c +++ b/plugins/nbd/nbd.c @@ -532,7 +532,7 @@ nbdplug_open (int readonly) static void nbdplug_close_handle (struct handle *h) { - if (nbd_shutdown (h->nbd, 0) == -1) + if (nbd_aio_disconnect (h->nbd, 0) == -1) nbdkit_debug ("failed to clean up handle: %s", nbd_get_error ()); if ((errno = pthread_join (h->reader, NULL))) nbdkit_debug ("failed to join reader thread: %m"); -- 2.26.0.rc2
Apparently Analagous Threads
- [nbdkit PATCH 2/2] nbd: Reorder cleanup to avoid getting stuck in poll()
- Re: [nbdkit PATCH 2/2] nbd: Reorder cleanup to avoid getting stuck in poll()
- [libnbd PATCH 0/2] fix hangs against nbdkit 1.2
- [nbdkit PATCH 0/2] Improve shutdown race in nbd plugin
- [nbdkit PATCH] nbd: Fix race during close