We noticed while writing various libnbd tests that when the delay filter is in use, there are scenarios where we had to resort to SIGKILL to get rid of nbdkit, because it was non-responsive to SIGINT. I'm still trying to figure out the best way to add testsuite coverage of this, but have already proved to myself that it works from the command line, under two scenarios that both used to cause long delays:

1. Early client death:

  nbdkit -U - -vf --filter=delay null size=1m rdelay=10 --run \
    'timeout 1s nbdsh --connect "nbd+unix:///?socket=$unixsocket" -c "h.pread(1,0)"'

Pre-patch, the server detects the death of the client, but the worker thread stays busy for the remaining 9 seconds before nbdkit can finally exit. Post-patch, the server exits right after the client.

2. Early server death:

  timeout 1s nbdkit -U - -vf --filter=delay null size=1m rdelay=10 --run \
    'nbdsh --connect "nbd+unix:///?socket=$unixsocket" -c "h.pread(1,0)"'

Pre-patch, the server reacts to the signal and kills the client, but the worker thread stays busy for the remaining 9 seconds before nbdkit can finally exit. Post-patch, the server is able to finalize right after the signal.

Use of --run in the above tests lets you test things in one command line, but to some extent hides the longevity of the nbdkit process (you get the shell prompt back when the main thread exits, even though the detached threads are still around); if you avoid --run and instead keep nbdkit in the foreground in one terminal and use nbdsh in a different terminal, choosing which terminal gets ^C, the effects are more apparent.

Patch 3 needs porting to any platform lacking ppoll. I have ideas for that port, but don't want to spend time on it before knowing for sure we need the port.

And the fact that the pre-patch tests show output after the shell prompt returns means we still have cases where our detached threads are executing past the point where the main thread has tried to unload the plugin, which is never a nice thing.
We may still have more work ahead of us to ensure that we don't unload an in-use plugin.

Eric Blake (3):
  server: Add threadlocal_get_conn
  server: Add pipe for tracking disconnects
  server: Add and use nbdkit_nanosleep

 docs/nbdkit-plugin.pod  | 28 +++++++++++++++++
 configure.ac            |  1 +
 include/nbdkit-common.h |  1 +
 server/internal.h       |  3 ++
 filters/delay/delay.c   | 14 ++-------
 filters/rate/rate.c     | 10 +++----
 server/connections.c    | 66 ++++++++++++++++++++++++++++++++++++++++-
 server/public.c         | 61 +++++++++++++++++++++++++++++++++++++
 server/threadlocal.c    | 22 +++++++++++++-
 server/nbdkit.syms      |  1 +
 10 files changed, 188 insertions(+), 19 deletions(-)

-- 
2.20.1
Eric Blake
2019-Aug-03 16:01 UTC
[Libguestfs] [nbdkit PATCH 1/3] server: Add threadlocal_get_conn
We'd like to interrupt a sleep during a filter/plugin transaction if it is obvious that finishing the transaction is pointless (because the client has delivered NBD_CMD_DISC or hit EOF). But to do that, we need to get back at the connection object from a function called from the plugin thread. Passing the connection as an opaque object would be a major API change, so it's easier to just add a thread-local tracking the current connection.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 server/internal.h    |  2 ++
 server/connections.c |  4 ++++
 server/threadlocal.c | 22 +++++++++++++++++++++-
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/server/internal.h b/server/internal.h
index 76f2139f..df4df0f2 100644
--- a/server/internal.h
+++ b/server/internal.h
@@ -342,6 +342,8 @@ extern void threadlocal_set_sockaddr (const struct sockaddr *addr,
 extern void threadlocal_set_error (int err);
 extern int threadlocal_get_error (void);
 extern void *threadlocal_buffer (size_t size);
+extern void threadlocal_set_conn (struct connection *conn);
+extern struct connection *threadlocal_get_conn (void);

 /* Declare program_name. */
 #if HAVE_DECL_PROGRAM_INVOCATION_SHORT_NAME == 1
diff --git a/server/connections.c b/server/connections.c
index f56590c2..f49a74ee 100644
--- a/server/connections.c
+++ b/server/connections.c
@@ -137,6 +137,7 @@ connection_worker (void *data)
   debug ("starting worker thread %s", name);
   threadlocal_new_server_thread ();
   threadlocal_set_name (name);
+  threadlocal_set_conn (conn);
   free (worker);

   while (!quit && connection_get_status (conn) > 0)
@@ -297,6 +298,8 @@ new_connection (int sockin, int sockout, int nworkers)
   conn->send = raw_send_other;
   conn->close = raw_close;

+  threadlocal_set_conn (conn);
+
   return conn;
 }

@@ -306,6 +309,7 @@ free_connection (struct connection *conn)
   if (!conn)
     return;

+  threadlocal_set_conn (NULL);
   conn->close (conn);

   /* Don't call the plugin again if quit has been set because the main
diff --git a/server/threadlocal.c b/server/threadlocal.c
index 49ae1ac2..a796fce8 100644
--- a/server/threadlocal.c
+++ b/server/threadlocal.c
@@ -1,5 +1,5 @@
 /* nbdkit
- * Copyright (C) 2013-2018 Red Hat Inc.
+ * Copyright (C) 2013-2019 Red Hat Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are
@@ -60,6 +60,7 @@ struct threadlocal {
   int err;
   void *buffer;
   size_t buffer_size;
+  struct connection *conn;
 };

 static pthread_key_t threadlocal_key;
@@ -226,3 +227,22 @@ threadlocal_buffer (size_t size)

   return threadlocal->buffer;
 }
+
+/* Set (or clear) the connection that is using the current thread */
+void
+threadlocal_set_conn (struct connection *conn)
+{
+  struct threadlocal *threadlocal = pthread_getspecific (threadlocal_key);
+
+  if (threadlocal)
+    threadlocal->conn = conn;
+}
+
+/* Get the connection associated with this thread, if available */
+struct connection *
+threadlocal_get_conn (void)
+{
+  struct threadlocal *threadlocal = pthread_getspecific (threadlocal_key);
+
+  return threadlocal ? threadlocal->conn : NULL;
+}
-- 
2.20.1
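[Editor's note: the patch above is the standard pthread_key_t thread-local idiom. As a rough, self-contained sketch of that idiom (all names here are hypothetical stand-ins, not nbdkit internals): each worker thread records its own connection pointer, and any helper called later on that same thread can retrieve it without the pointer being threaded through every function signature.]

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for nbdkit's struct connection. */
struct connection { int id; };

static pthread_key_t conn_key;
static pthread_once_t conn_once = PTHREAD_ONCE_INIT;

static void make_key (void) { pthread_key_create (&conn_key, NULL); }

/* Analogous to threadlocal_set_conn: remember which connection
 * owns the current thread (NULL clears it). */
static void set_conn (struct connection *conn)
{
  pthread_once (&conn_once, make_key);
  pthread_setspecific (conn_key, conn);
}

/* Analogous to threadlocal_get_conn: NULL if no connection set. */
static struct connection *get_conn (void)
{
  pthread_once (&conn_once, make_key);
  return pthread_getspecific (conn_key);
}

static void *worker (void *arg)
{
  struct connection *c = arg;
  set_conn (c);
  /* Deep inside a plugin callback we can now recover the connection
   * without it being an explicit parameter. */
  assert (get_conn () == c);
  set_conn (NULL);
  return NULL;
}

int demo (void)
{
  struct connection a = { 1 }, b = { 2 };
  pthread_t t1, t2;

  pthread_create (&t1, NULL, worker, &a);
  pthread_create (&t2, NULL, worker, &b);
  pthread_join (t1, NULL);
  pthread_join (t2, NULL);
  /* The main thread never called set_conn, so it sees NULL. */
  return get_conn () == NULL ? 0 : -1;
}
```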
Eric Blake
2019-Aug-03 16:01 UTC
[Libguestfs] [nbdkit PATCH 2/3] server: Add pipe for tracking disconnects
The next patch wants to add a utility function which will sleep until an appropriate amount of time has elapsed, except that it will short-circuit and report failure if the client is going away. The most obvious kernel interface for waiting for an action or a timeout is pselect or ppoll, but it requires a file descriptor as the witness of an action. We already have a pipe-to-self for when nbdkit itself is quitting because of a signal, but we also want a pipe-to-self for each multi-threaded connection so that detection of NBD_CMD_DISC or EOF in one thread can serve as the short circuit for another thread waiting for a timeout.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 server/internal.h    |  1 +
 server/connections.c | 62 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/server/internal.h b/server/internal.h
index df4df0f2..8c229816 100644
--- a/server/internal.h
+++ b/server/internal.h
@@ -152,6 +152,7 @@ struct connection {
   pthread_mutex_t write_lock;
   pthread_mutex_t status_lock;
   int status; /* 1 for more I/O with client, 0 for shutdown, -1 on error */
+  int status_pipe[2]; /* track status changes via poll when nworkers > 1 */
   void *crypto_session;
   int nworkers;

diff --git a/server/connections.c b/server/connections.c
index f49a74ee..c173df8d 100644
--- a/server/connections.c
+++ b/server/connections.c
@@ -39,8 +39,11 @@
 #include <string.h>
 #include <unistd.h>
 #include <sys/socket.h>
+#include <fcntl.h>
+#include <assert.h>

 #include "internal.h"
+#include "utils.h"

 /* Default number of parallel requests. */
 #define DEFAULT_PARALLEL_REQUESTS 16
@@ -114,8 +117,16 @@ connection_set_status (struct connection *conn, int value)
   if (conn->nworkers &&
       pthread_mutex_lock (&conn->status_lock))
     abort ();
-  if (value < conn->status)
+  if (value < conn->status) {
+    if (conn->nworkers && conn->status > 0) {
+      char c = 0;
+
+      assert (conn->status_pipe[1] >= 0);
+      if (write (conn->status_pipe[1], &c, 1) != 1 && errno != EAGAIN)
+        nbdkit_debug ("failed to notify pipe-to-self: %m");
+    }
     conn->status = value;
+  }
   if (conn->nworkers &&
       pthread_mutex_unlock (&conn->status_lock))
     abort ();
@@ -284,6 +295,50 @@ new_connection (int sockin, int sockout, int nworkers)
   conn->status = 1;
   conn->nworkers = nworkers;
+  if (nworkers) {
+#ifdef HAVE_PIPE2
+    if (pipe2 (conn->status_pipe, O_NONBLOCK | O_CLOEXEC)) {
+      perror ("pipe2");
+      free (conn);
+      return NULL;
+    }
+#else
+    /* If we were fully parallel, then this function could be
+     * accepting connections in one thread while another thread could
+     * be in a plugin trying to fork.  But plugins.c forced
+     * thread_model to serialize_all_requests when it detects a lack
+     * of atomic CLOEXEC, at which point, we can use a mutex to ensure
+     * we aren't accepting until the plugin is not running, making
+     * non-atomicity okay.
+     */
+    assert (backend->thread_model (backend) <=
+            NBDKIT_THREAD_MODEL_SERIALIZE_ALL_REQUESTS);
+    lock_request (NULL);
+    if (pipe (conn->status_pipe)) {
+      perror ("pipe");
+      free (conn);
+      unlock_request (NULL);
+      return NULL;
+    }
+    if (set_nonblock (set_cloexec (conn->status_pipe[0])) == -1) {
+      perror ("fcntl");
+      close (conn->status_pipe[1]);
+      free (conn);
+      unlock_request (NULL);
+      return NULL;
+    }
+    if (set_nonblock (set_cloexec (conn->status_pipe[1])) == -1) {
+      perror ("fcntl");
+      close (conn->status_pipe[0]);
+      free (conn);
+      unlock_request (NULL);
+      return NULL;
+    }
+    unlock_request (NULL);
+#endif
+  }
+  else
+    conn->status_pipe[0] = conn->status_pipe[1] = -1;

   conn->sockin = sockin;
   conn->sockout = sockout;
   pthread_mutex_init (&conn->request_lock, NULL);
@@ -324,6 +379,11 @@ free_connection (struct connection *conn)
     }
   }

+  if (conn->status_pipe[0] >= 0) {
+    close (conn->status_pipe[0]);
+    close (conn->status_pipe[1]);
+  }
+
   pthread_mutex_destroy (&conn->request_lock);
   pthread_mutex_destroy (&conn->read_lock);
   pthread_mutex_destroy (&conn->write_lock);
-- 
2.20.1
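[Editor's note: the pipe-to-self idiom this patch relies on can be sketched in isolation. This is an illustrative toy, not nbdkit code: one side writes a single byte to a non-blocking pipe to wake another thread that is blocked in poll on the read end.]

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Create a non-blocking, close-on-exec pipe-to-self; 0 on success.
 * Without pipe2, the flags are set after the fact, which is not
 * atomic -- that is exactly why the real patch takes a lock in the
 * !HAVE_PIPE2 branch. */
static int make_self_pipe (int fds[2])
{
  if (pipe (fds) == -1)
    return -1;
  for (int i = 0; i < 2; i++) {
    if (fcntl (fds[i], F_SETFL, O_NONBLOCK) == -1 ||
        fcntl (fds[i], F_SETFD, FD_CLOEXEC) == -1)
      return -1;
  }
  return 0;
}

/* Notifier side: one byte is enough; EAGAIN just means the pipe is
 * already full of unread notifications, which is fine. */
static void notify (int wfd)
{
  char c = 0;
  (void) write (wfd, &c, 1);
}

/* Waiter side: 1 if woken by a notification, 0 on timeout. */
static int wait_for_event (int rfd, int timeout_ms)
{
  struct pollfd pfd = { .fd = rfd, .events = POLLIN };
  return poll (&pfd, 1, timeout_ms) > 0;
}

int demo (void)
{
  int fds[2];

  if (make_self_pipe (fds) == -1)
    return -1;
  notify (fds[1]);                   /* simulate connection_set_status */
  int woke = wait_for_event (fds[0], 5000);
  close (fds[0]);
  close (fds[1]);
  return woke;                       /* 1: the sleeper saw the event */
}
```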
Eric Blake
2019-Aug-03 16:01 UTC
[Libguestfs] [nbdkit PATCH 3/3] server: Add and use nbdkit_nanosleep
There are a couple of problems with filters trying to sleep. First, when it is time to shut down nbdkit, we wait until all pending transactions have had a chance to wind down. But consider what happens if one or more of those pending transactions are blocked in a sleep. POSIX says nanosleep is interrupted with EINTR if that thread handles a signal, but wiring up signal masks just to ensure a specific thread will get the signal is not pretty, and that means the thread processing SIGINT is NOT the thread blocked in nanosleep. Couple that with the fact that if multiple threads are sleeping, only one thread needs to process the signal, so the rest continue to sleep. Thus, even though we know the user wants nbdkit to quit, we are blocked waiting for a potentially long sleep to complete. This problem can be solved by realizing we already have a pipe-to-self to learn about a quit request or the end of a particular connection, and check for activities on those pipes in parallel with a timeout through pselect or ppoll to break our wait as soon as we know there is no reason to continue on with the transaction.

Second, if we ever add support for a signal that is not fatal to nbdkit (such as SIGHUP triggering a config reload), but don't mask which thread can process that signal, then if a thread servicing nanosleep gets the signal the sleep will terminate early. But in that case, we want execution to continue on. This problem can be solved by either setting a signal mask so as to avoid processing our non-fatal signal in the same thread as any filter/plugin processing, or by paying attention to EINTR results and restarting a sleep. And rather than making each sleeping client reimplement that logic, we might as well stick it in our common helper.

Thus, this patch introduces nbdkit_nanosleep, and implements it under the hood as a ppoll that redirects signals but wakes up as needed.

Including <time.h> for struct timespec in nbdkit-common.h is difficult (C++ is not required to provide it), so we just take the two arguments in pieces; we can rely on int being 32-bits, so don't need nsec to be a long, and in turn can avoid EINVAL for nsec outside the [0,1000000000) range by updating sec as needed. And sleeping longer than INT_MAX seconds is indistinguishable from having no timeout, at least when compared to human lifetimes, so we reject larger values.

Although it is more likely that filters will be the ones actually trying to slow I/O (and all that uses the new API in this commit), it is feasible that a third-party plugin may also want access to this behavior, so I added it in common and documented it for plugins.

Note that ppoll is not yet POSIX [1], so we may have to provide fallbacks to older interfaces. But rather than write those now, I just punted on the compilation, and we'll deal with it when we actually encounter platforms needing it.

[1] http://austingroupbugs.net/view.php?id=1263

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 docs/nbdkit-plugin.pod  | 28 +++++++++++++++++++
 configure.ac            |  1 +
 include/nbdkit-common.h |  1 +
 filters/delay/delay.c   | 14 ++--------
 filters/rate/rate.c     | 10 +++----
 server/public.c         | 61 +++++++++++++++++++++++++++++++++++++++
 server/nbdkit.syms      |  1 +
 7 files changed, 99 insertions(+), 17 deletions(-)

diff --git a/docs/nbdkit-plugin.pod b/docs/nbdkit-plugin.pod
index 9510253f..f36778cf 100644
--- a/docs/nbdkit-plugin.pod
+++ b/docs/nbdkit-plugin.pod
@@ -342,6 +342,34 @@ and returns C<NULL>.

 The returned string must be freed by the caller.

+=head2 C<nbdkit_nanosleep>
+
+ int nbdkit_nanosleep (unsigned sec, unsigned nsec);
+
+The utility function C<nbdkit_nanosleep> suspends the current thread,
+and returns 0 if it slept at least as many seconds and nanoseconds as
+requested, or -1 after calling C<nbdkit_error> if there is no point in
+continuing the current command.  Attempts to sleep more than
+C<INT_MAX> seconds are treated as an error.
+
+This is similar to L<nanosleep(3)>, although you specify the
+components as separate parameters rather than as a C<struct timespec>.
+This wrapper provides two benefits over the system library: in one
+direction, the system library has no easy way to abort a sleep early
+if other information determines that there is no point in finishing
+the sleep (handling a signal in the same thread as the sleep will do
+that, but you don't have full control over the signal masks of other
+threads to ensure that your thread will get the intended interrupting
+signal).  In the other direction, the system library has no easy way
+to avoid aborting a sleep early (you can restart the sleep with any
+remaining unslept time, but calculating this gets tedious; or you can
+use signal masks to avoid handling a signal, but risk making your
+thread non-responsive to signals that were important after all,
+stalling a timely shutdown of nbdkit).  The system call L<ppoll(2)>
+can solve these issues, but requires access to internal file
+descriptors that the plugin does not need access to, hence this
+function exists to do the work on your behalf.
+
 =head2 umask

 All plugins will see a L<umask(2)> of C<0022>.
diff --git a/configure.ac b/configure.ac
index 054ad64a..2a34bf8a 100644
--- a/configure.ac
+++ b/configure.ac
@@ -198,6 +198,7 @@ AC_CHECK_FUNCS([\
 	get_current_dir_name \
 	mkostemp \
 	pipe2 \
+	ppoll \
 	posix_fadvise])

 dnl Check whether printf("%m") works
diff --git a/include/nbdkit-common.h b/include/nbdkit-common.h
index 5257d992..e004aa18 100644
--- a/include/nbdkit-common.h
+++ b/include/nbdkit-common.h
@@ -82,6 +82,7 @@ extern int64_t nbdkit_parse_size (const char *str);
 extern int nbdkit_parse_bool (const char *str);
 extern int nbdkit_read_password (const char *value, char **password);
 extern char *nbdkit_realpath (const char *path);
+extern int nbdkit_nanosleep (unsigned sec, unsigned nsec);

 struct nbdkit_extents;
 extern int nbdkit_add_extent (struct nbdkit_extents *,
diff --git a/filters/delay/delay.c b/filters/delay/delay.c
index 498ab14d..c92a12d7 100644
--- a/filters/delay/delay.c
+++ b/filters/delay/delay.c
@@ -37,7 +37,6 @@
 #include <stdlib.h>
 #include <stdint.h>
 #include <string.h>
-#include <time.h>

 #include <nbdkit-filter.h>

@@ -83,16 +82,9 @@ parse_delay (const char *key, const char *value)
 static int
 delay (int ms, int *err)
 {
-  if (ms > 0) {
-    const struct timespec ts = {
-      .tv_sec = ms / 1000,
-      .tv_nsec = (ms % 1000) * 1000000,
-    };
-    if (nanosleep (&ts, NULL) == -1) {
-      nbdkit_error ("nanosleep: %m");
-      *err = errno;
-      return -1;
-    }
+  if (ms > 0 && nbdkit_nanosleep (ms / 1000, (ms % 1000) * 1000000) == -1) {
+    *err = errno;
+    return -1;
   }
   return 0;
 }
diff --git a/filters/rate/rate.c b/filters/rate/rate.c
index dbd92ad6..dca5e9fc 100644
--- a/filters/rate/rate.c
+++ b/filters/rate/rate.c
@@ -262,12 +262,10 @@ maybe_sleep (struct bucket *bucket, pthread_mutex_t *lock, uint32_t count,
       bits = bucket_run (bucket, bits, &ts);
     }

-    if (bits > 0)
-      if (nanosleep (&ts, NULL) == -1) {
-        nbdkit_error ("nanosleep: %m");
-        *err = errno;
-        return -1;
-      }
+    if (bits > 0 && nbdkit_nanosleep (ts.tv_sec, ts.tv_nsec) == -1) {
+      *err = errno;
+      return -1;
+    }
   }

   return 0;
 }
diff --git a/server/public.c b/server/public.c
index 523a31f9..8f860d73 100644
--- a/server/public.c
+++ b/server/public.c
@@ -47,6 +47,8 @@
 #include <limits.h>
 #include <termios.h>
 #include <errno.h>
+#include <poll.h>
+#include <signal.h>

 #include "get-current-dir-name.h"

@@ -297,3 +299,62 @@ nbdkit_realpath (const char *path)

   return ret;
 }
+
+
+int
+nbdkit_nanosleep (unsigned sec, unsigned nsec)
+{
+#ifndef HAVE_PPOLL
+# error "Please port this to your platform"
+  /* Porting ideas, in order of preference:
+   * - POSIX requires pselect; it's a bit clunkier to set up the poll,
+   *   but the same ability to atomically mask all signals and operate
+   *   on struct timespec makes it similar to the preferred ppoll
+   *   interface
+   * - calculate an end time target, then use poll in a loop on EINTR
+   *   with a recalculation of the timeout to still reach the end time
+   */
+  nbdkit_error ("nbdkit_nanosleep not yet ported to systems without ppoll");
+  errno = ENOSYS;
+  return -1;
+#else
+  struct timespec ts;
+  struct connection *conn = threadlocal_get_conn ();
+  struct pollfd fds[2] = {
+    [0].fd = quit_fd,
+    [0].events = POLLIN,
+    [1].fd = conn ? conn->status_pipe[0] : -1,
+    [1].events = POLLIN,
+  };
+  sigset_t all;
+
+  if (sec >= INT_MAX - nsec / 1000000000) {
+    nbdkit_error ("sleep request is too long");
+    errno = EINVAL;
+    return -1;
+  }
+  ts.tv_sec = sec + nsec / 1000000000;
+  ts.tv_nsec = nsec % 1000000000;
+
+  /* Block all signals to this thread during the poll, so we don't
+   * have to worry about EINTR
+   */
+  if (sigfillset(&all))
+    abort ();
+  switch (ppoll (fds, 2, &ts, &all)) {
+  case -1:
+    assert (errno != EINTR);
+    nbdkit_error ("poll: %m");
+    return -1;
+  case 0:
+    return 0;
+  }
+
+  /* We don't have to read the pipe-to-self; if poll returned an
+   * event, we know the connection should be shutting down.
+   */
+  assert (quit || (conn && connection_get_status (conn) < 1));
+  nbdkit_error ("aborting sleep to shut down");
+  errno = ESHUTDOWN;
+  return -1;
+#endif
+}
diff --git a/server/nbdkit.syms b/server/nbdkit.syms
index f8eef214..2a024ed5 100644
--- a/server/nbdkit.syms
+++ b/server/nbdkit.syms
@@ -46,6 +46,7 @@
     nbdkit_extents_free;
     nbdkit_extents_new;
     nbdkit_get_extent;
+    nbdkit_nanosleep;
     nbdkit_parse_bool;
     nbdkit_parse_size;
     nbdkit_read_password;
-- 
2.20.1
Richard W.M. Jones
2019-Aug-05 13:50 UTC
Re: [Libguestfs] [nbdkit PATCH 3/3] server: Add and use nbdkit_nanosleep
On Sat, Aug 03, 2019 at 11:01:44AM -0500, Eric Blake wrote:

[...]

ACK series, but ...

> +=head2 C<nbdkit_nanosleep>
> +

[...]

> +This is similar to L<nanosleep(3)>, although you specify the
> +components as separate parameters rather than as a C<struct timespec>.
> +This wrapper provides two benefits over the system library: in one
> +direction, the system library has no easy way to abort a sleep early
> +if other information determines that there is no point in finishing
> +the sleep (handling a signal in the same thread as the sleep will do
> +that, but you don't have full control over the signal masks of other
> +threads to ensure that your thread will get the intended interrupting
> +signal).  In the other direction, the system library has no easy way
> +to avoid aborting a sleep early (you can restart the sleep with any
> +remaining unslept time, but calculating this gets tedious; or you can
> +use signal masks to avoid handling a signal, but risk making your
> +thread non-responsive to signals that were important after all,
> +stalling a timely shutdown of nbdkit).  The system call L<ppoll(2)>
> +can solve these issues, but requires access to internal file
> +descriptors that the plugin does not need access to, hence this
> +function exists to do the work on your behalf.

Do we really need this paragraph?  It's explaining how nbdkit_nanosleep works internally, which might change in future and is largely irrelevant to plugin designers.  I'd say something along the lines of:

  Plugins should prefer this function instead of sleeping using system
  calls like L<sleep(2)> or L<nanosleep(3)>, since it allows nbdkit to
  shut down cleanly without delay.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
Eric Blake
2019-Aug-05 22:16 UTC
Re: [Libguestfs] [nbdkit PATCH 3/3] server: Add and use nbdkit_nanosleep
On 8/3/19 11:01 AM, Eric Blake wrote:
> There are a couple of problems with filters trying to sleep.  First,
> when it is time to shut down nbdkit, we wait until all pending
> transactions have had a chance to wind down.  But consider what
> happens if one or more of those pending transactions are blocked in a
> sleep.  POSIX says nanosleep is interrupted with EINTR if that thread
> handles a signal, but wiring up signal masks just to ensure a specific
> thread will get the signal is not pretty, and that means the thread
> processing SIGINT is NOT the thread blocked in nanosleep.  Couple that
> with the fact that if multiple threads are sleeping, only one thread
> needs to process the signal, so the rest continue to sleep.  Thus,
> even though we know the user wants nbdkit to quit, we are blocked
> waiting for a potentially long sleep to complete.  This problem can be
> solved by realizing we already have a pipe-to-self to learn about a
> quit request or the end of a particular connection, and check for
> activities on those pipes in parallel with a timeout through pselect
> or ppoll to break our wait as soon as we know there is no reason to
> continue on with the transaction.
>

> +++ b/server/public.c

> +#else
> +  struct timespec ts;
> +  struct connection *conn = threadlocal_get_conn ();
> +  struct pollfd fds[2] = {
> +    [0].fd = quit_fd,
> +    [0].events = POLLIN,
> +    [1].fd = conn ? conn->status_pipe[0] : -1,
> +    [1].events = POLLIN,

In testing this, the code is responsive to a multi-threaded connection detecting client death on any other thread, but not responsive to a single-threaded connection detecting client death on the lone thread.  But even with a multi-threaded connection, if every single thread is tied up in a sleep, we lose responsiveness.

So I'm currently testing an amendment to this patch that uses fds[3], with conn->sockout being polled with .events = 0 for POLLHUP/POLLERR.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
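[Editor's note: the fds[3] amendment mentioned above leans on the fact that poll always reports POLLHUP, POLLERR, and POLLNVAL even when .events is 0, so a thread can watch a socket for peer death without consuming any of its data. A minimal, hypothetical illustration with a socketpair (the exact revents bits on peer close are platform-dependent; the behavior below matches Linux):]

```c
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Watch a socket for peer death without reading from it: POLLHUP and
 * POLLERR are reported by poll regardless of .events.  Returns 1 if
 * the peer looks gone, 0 otherwise (after waiting up to timeout_ms). */
int peer_is_dead (int sock, int timeout_ms)
{
  struct pollfd pfd = { .fd = sock, .events = 0 };

  if (poll (&pfd, 1, timeout_ms) > 0 &&
      (pfd.revents & (POLLHUP | POLLERR)))
    return 1;
  return 0;
}

int demo (void)
{
  int sv[2];

  if (socketpair (AF_UNIX, SOCK_STREAM, 0, sv) == -1)
    return -1;
  int early = peer_is_dead (sv[0], 0);   /* peer still open: expect 0 */
  close (sv[1]);                         /* simulate client death */
  int late = peer_is_dead (sv[0], 0);    /* peer closed: expect 1 */
  close (sv[0]);
  return early == 0 && late == 1;
}
```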