Richard W.M. Jones
2018-Apr-12 17:43 UTC
[Libguestfs] [PATCH nbdkit 0/2] connections: Protect open and close callbacks with the request lock.
I'm fairly sure that these bugs which appear in the Python plugin:

https://bugzilla.redhat.com/show_bug.cgi?id=1566516
https://bugzilla.redhat.com/show_bug.cgi?id=1566522

are really bugs in the SERIALIZE_ALL_REQUESTS thread model.  See the
first patch for the full explanation.

The second patch is a fix for a race condition which is probably
nudged into being by the first patch.

Now this all works, EXCEPT it quite reliably breaks either the NBD
plugin or ‘tests/test-parallel-nbd.sh’.  I have no idea why because as
far as I know the change should not affect the NBD plugin or that test
in any way.  Unfortunately both are rather complex to debug.

Rich.
Richard W.M. Jones
2018-Apr-12 17:43 UTC
[Libguestfs] [PATCH nbdkit 1/2] connections: Protect open and close callbacks with the request lock.
The thread model SERIALIZE_ALL_REQUESTS claims that although multiple
handles may be open at the same time, "[...] data requests are
serialized so that for the plugin as a whole only one read/write/etc
request will be in progress at any time."

The text seems to apply to only datapath requests but goes on to say:
"This is a useful setting if the library you are using is not
thread-safe.  However performance may not be good."

If the plugin calls into a non-thread-safe library during open() or
close() (as the Python plugin does) then you are not protected.

We could change every plugin, but it seems better to extend the
protection of SERIALIZE_ALL_REQUESTS so it also applies to open() and
close() callbacks, to reduce confusion.

This fixes two problems in the Python plugin if the user-defined
close() callback takes a long time to run (for example it is doing
cleanups).  See:

https://bugzilla.redhat.com/show_bug.cgi?id=1566516
https://bugzilla.redhat.com/show_bug.cgi?id=1566522
---
 docs/nbdkit-plugin.pod |  6 +++---
 src/connections.c      | 28 +++++++++++++++++-----------
 2 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/docs/nbdkit-plugin.pod b/docs/nbdkit-plugin.pod
index 6e89315..672f02a 100644
--- a/docs/nbdkit-plugin.pod
+++ b/docs/nbdkit-plugin.pod
@@ -663,9 +663,9 @@ first client to close the connection.
 
 I<This is a safe default for most plugins>.
 
-Multiple handles can be open at the same time, but data requests are
-serialized so that for the plugin as a whole only one read/write/etc
-request will be in progress at any time.
+Multiple handles can be open at the same time, but requests are
+serialized so that for the plugin as a whole only one
+open/read/write/close (etc) request will be in progress at any time.
 
 This is a useful setting if the library you are using is not
 thread-safe.  However performance may not be good.
diff --git a/src/connections.c b/src/connections.c
index fbcf5e5..5024e2f 100644
--- a/src/connections.c
+++ b/src/connections.c
@@ -233,7 +233,7 @@ connection_worker (void *data)
 static int
 _handle_single_connection (int sockin, int sockout)
 {
-  int r = -1;
+  int ret = -1, r;
   struct connection *conn;
   int nworkers = threads ? threads : DEFAULT_PARALLEL_REQUESTS;
   pthread_t *workers = NULL;
@@ -245,7 +245,10 @@ _handle_single_connection (int sockin, int sockout)
   if (!conn)
     goto done;
 
-  if (backend->open (backend, conn, readonly) == -1)
+  lock_request (conn);
+  r = backend->open (backend, conn, readonly);
+  unlock_request (conn);
+  if (r == -1)
     goto done;
 
   threadlocal_set_name (backend->plugin_name (backend));
@@ -312,11 +315,11 @@ _handle_single_connection (int sockin, int sockout)
   if (finalize (conn) == -1)
     goto done;
 
-  r = get_status (conn);
+  ret = get_status (conn);
 done:
-  debug ("connection cleanup with final status %d", r);
+  debug ("connection cleanup with final status %d", ret);
   free_connection (conn);
-  return r;
+  return ret;
 }
 
 int
@@ -366,20 +369,23 @@ free_connection (struct connection *conn)
 
   conn->close (conn);
 
-  pthread_mutex_destroy (&conn->request_lock);
-  pthread_mutex_destroy (&conn->read_lock);
-  pthread_mutex_destroy (&conn->write_lock);
-  pthread_mutex_destroy (&conn->status_lock);
-
   /* Don't call the plugin again if quit has been set because the main
    * thread will be in the process of unloading it.  The plugin.unload
    * callback should always be called.
    */
   if (!quit) {
-    if (conn->nr_handles > 0 && conn->handles[0])
+    if (conn->nr_handles > 0 && conn->handles[0]) {
+      lock_request (conn);
       backend->close (backend, conn);
+      unlock_request (conn);
+    }
   }
 
+  pthread_mutex_destroy (&conn->request_lock);
+  pthread_mutex_destroy (&conn->read_lock);
+  pthread_mutex_destroy (&conn->write_lock);
+  pthread_mutex_destroy (&conn->status_lock);
+
   free (conn->handles);
   free (conn);
 }
--
2.16.2
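[Editorial note: to make the failure mode concrete, here is a minimal,
hypothetical plugin sketch, not the Python plugin and not part of this
patch.  The example_* callbacks and unsafe_lib_* stand-ins are invented
for illustration; struct nbdkit_plugin, THREAD_MODEL and
NBDKIT_REGISTER_PLUGIN are the real nbdkit C plugin entry points.  A
plugin like this drives a non-thread-safe library from open() and
close(), so it only benefits from SERIALIZE_ALL_REQUESTS if that model
also serializes those callbacks, which is what this patch changes.]

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#include <nbdkit-plugin.h>

#define THREAD_MODEL NBDKIT_THREAD_MODEL_SERIALIZE_ALL_REQUESTS

/* Stand-ins for a non-thread-safe library (e.g. an embedded interpreter). */
static void unsafe_lib_init (void)    { /* ... */ }
static void unsafe_lib_cleanup (void) { /* ... */ }

static void *
example_open (int readonly)
{
  void *h = malloc (1);
  if (h == NULL) {
    nbdkit_error ("malloc: %m");
    return NULL;
  }
  unsafe_lib_init ();      /* with this patch, serialized with all callbacks */
  return h;
}

static void
example_close (void *handle)
{
  unsafe_lib_cleanup ();   /* may take a long time; still serialized */
  free (handle);
}

static int64_t
example_get_size (void *handle)
{
  return 1024 * 1024;
}

static int
example_pread (void *handle, void *buf, uint32_t count, uint64_t offset)
{
  memset (buf, 0, count);
  return 0;
}

static struct nbdkit_plugin plugin = {
  .name     = "example",
  .open     = example_open,
  .close    = example_close,
  .get_size = example_get_size,
  .pread    = example_pread,
};

NBDKIT_REGISTER_PLUGIN(plugin)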
Richard W.M. Jones
2018-Apr-12 17:43 UTC
[Libguestfs] [PATCH nbdkit 2/2] tests: test-single: Fix a race between socat creating the socket and running qemu-img.
If socat is slow and qemu-img is fast then qemu-img opens the socket
before socat creates it, resulting in a test failure.
---
 tests/test-single.sh | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/tests/test-single.sh b/tests/test-single.sh
index 9dc462b..fdcfc62 100755
--- a/tests/test-single.sh
+++ b/tests/test-single.sh
@@ -65,4 +65,16 @@ cleanup ()
 }
 trap cleanup INT QUIT TERM EXIT ERR
 
+# Wait for socat to start up and create the socket.
+for i in `seq 1 10`; do
+    if test -S single.sock; then
+        break
+    fi
+    sleep 1
+done
+if ! test -S single.sock; then
+    echo "$0: socket was not created"
+    exit 1
+fi
+
 qemu-img info nbd:unix:single.sock
--
2.16.2
Eric Blake
2018-Apr-18 20:19 UTC
Re: [Libguestfs] [PATCH nbdkit 0/2] connections: Protect open and close callbacks with the request lock.
On 04/12/2018 12:43 PM, Richard W.M. Jones wrote:
> I'm fairly sure that these bugs which appear in the Python plugin:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1566516
> https://bugzilla.redhat.com/show_bug.cgi?id=1566522
>
> are really bugs in the SERIALIZE_ALL_REQUESTS thread model.  See
> the first patch for the full explanation.
>
> The second patch is a fix for a race condition which is probably
> nudged into being by the first patch.
>
> Now this all works, EXCEPT it quite reliably breaks either the NBD
> plugin or ‘tests/test-parallel-nbd.sh’.  I have no idea why because as
> far as I know the change should not affect the NBD plugin or that test
> in any way.  Unfortunately both are rather complex to debug.

I'm debugging the testsuite breakage: the problem is a hang in parallel
thread cleanup.

For example, in test-parallel-nbd.sh, we have two nbdkit processes (one
running plugin 'nbd', the other running 'file').  The hang is in the
nbdkit running 'file'; it is spawning multiple threads to process work
requests, but then does pthread_join() on each thread in order.  But
each worker thread is competing to obtain read_lock prior to read()ing
from the client.  So, unless we get lucky that the scheduler wakes up
the threads in the right order, the pthread_join is waiting for the
wrong thread (one that is blocked on obtaining read_lock), while the
thread that DOES have read_lock is blocked waiting for recv() from the
client to pass in another command or to indicate EOF.

Why did your patch expose it?  It's because pre-patch, the nbdkit
running 'nbd' closed the socket early without the locks (the nbdkit
running 'file' then gets a return of -1 from recv(), and from there,
each worker thread quickly exits, so that the while loop joining
threads gets to complete); while with your patch, the nbdkit running
'nbd' does not close the socket until the unload_prevention_lock gives
the all clear (the nbdkit running 'file' is thus hung waiting on
recv(), even though it already knows that the client will be shutting
down soon).  But there's also a problem with the nbdkit running 'nbd':
it is trying to pthread_join() the reader thread (to see if the server
has any remaining replies); pre-patch, there was no problem joining the
thread, but post-patch, both nbdkit processes are stuck in a read()
waiting on the other process.

I'm still trying to come up with an appropriate patch (or maybe even
two patches).  I think it is desirable to continue read()ing from the
client even after the client has requested shutdown (NBD_CMD_DISC), if
only to gracefully reply with an error message to all subsequent
commands erroneously sent by an ill-behaved client.  But it also seems
like it is fairly obvious that once we are in shutdown mode, parallel
processing is no longer important, and that it is pointless to have
threads compete on the read lock if a well-behaved client will not be
sending any more data, rather than blocking the worker threads on a
read() that won't complete until the client actually hangs up.  So the
nbdkit running 'file' needs to be more robust (perhaps once we have
detected that the client has requested shutdown, then limit all further
read()s to worker thread 0, so that all other threads eventually get
reaped quickly); but also the nbdkit running 'nbd' needs to be fixed to
not keep the socket around forever.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
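[Editorial note: the following is a small, self-contained sketch
(hypothetical code, not nbdkit's actual connections.c) of the hang
pattern Eric describes: every worker must take a shared read_lock
before recv(), and the main thread joins the workers in order, so
nothing finishes until the peer closes its end of the socket.]

/* Compile with: cc -pthread join-order.c */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/socket.h>

#define NWORKERS 4

static pthread_mutex_t read_lock = PTHREAD_MUTEX_INITIALIZER;
static int sockfd;                        /* "server" end of the socketpair */

static void *
worker (void *arg)
{
  char buf[512];

  for (;;) {
    pthread_mutex_lock (&read_lock);      /* only one reader at a time */
    ssize_t r = recv (sockfd, buf, sizeof buf, 0);
    pthread_mutex_unlock (&read_lock);
    if (r <= 0)                           /* EOF or error: client went away */
      break;
    /* ... a real server would process one request here ... */
  }
  return NULL;
}

int
main (void)
{
  int sv[2];
  pthread_t workers[NWORKERS];
  size_t i;

  if (socketpair (AF_UNIX, SOCK_STREAM, 0, sv) == -1) {
    perror ("socketpair");
    return EXIT_FAILURE;
  }
  sockfd = sv[0];

  for (i = 0; i < NWORKERS; i++)
    pthread_create (&workers[i], NULL, worker, NULL);

  /* Comment out this close() and the join loop below never completes:
   * one worker blocks inside recv() holding read_lock, the others block
   * in pthread_mutex_lock().  That is the shape of the
   * test-parallel-nbd.sh hang when the peer keeps the socket open.
   */
  close (sv[1]);                          /* "client" hangs up */

  /* In-order join: it only finishes once every worker has seen recv()
   * fail.
   */
  for (i = 0; i < NWORKERS; i++)
    pthread_join (workers[i], NULL);

  printf ("all workers joined\n");
  return EXIT_SUCCESS;
}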