Eric Blake
2019-Sep-26 20:56 UTC
Re: [Libguestfs] [PATCH libnbd 2/2] api: Implement local command with systemd socket activation.
On 9/26/19 11:40 AM, Richard W.M. Jones wrote:> This adds new APIs for running a local NBD server and connecting to it > using systemd socket activation (instead of stdin/stdout). > > This includes interop tests against nbdkit and qemu-nbd which I > believe are the only NBD servers supporting socket activation. (If we > find others then we can add more interop tests in future.) > > The upstream spec for systemd socket activation is here: > http://0pointer.de/blog/projects/socket-activation.html > ---> +++ b/generator/states-connect.c > @@ -37,6 +37,9 @@ > #include <sys/socket.h> > #include <sys/un.h> > > +/* This is baked into the systemd socket activation API. */ > +#define FIRST_SOCKET_ACTIVATION_FD 3 > + > /* Disable Nagle's algorithm on the socket, but don't fail. */ > static void > disable_nagle (int sock) > @@ -292,4 +295,110 @@ disable_nagle (int sock) > SET_NEXT_STATE (%^MAGIC.START); > return 0; > > + CONNECT_SA.START: > + /* This is in some ways similar to nbd_aio_connect_command above, > + * but we must pass a listening socket to the child process and > + * we must set LISTEN_PID and LISTEN_FDS environment variables. > + */ > + size_t len; > + int s; > + struct sockaddr_un addr; > + pid_t pid; > + int flags; > + char listen_pid[16]; > + > + assert (!h->sock); > + assert (h->argv); > + assert (h->argv[0]); > + > + /* Use /tmp instead of TMPDIR because we must ensure the path is > + * short enough to store in the sockaddr_un. On some platforms this > + * may cause problems so we may need to revisit it. XXX > + */Is the use of socketpair() any better than creating a socket under /tmp?> + h->sa_tmpdir = strdup ("/tmp/libnbdXXXXXX"); > + h->sa_sockpath = strdup ("/tmp/libnbdXXXXXX/sock"); > + if (h->sa_tmpdir == NULL || h->sa_sockpath == NULL) { > + SET_NEXT_STATE (%.DEAD); > + set_error (errno, "strdup"); > + return 0; > + } > + > + if (mkdtemp (h->sa_tmpdir) == NULL) { > + SET_NEXT_STATE (%.DEAD); > + set_error (errno, "mkdtemp"); > + return 0; > + } > + len = strlen (h->sa_tmpdir); > + memcpy (h->sa_sockpath, h->sa_tmpdir, len); > + > + s = socket (AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0); > + if (s == -1) { > + SET_NEXT_STATE (%.DEAD); > + set_error (errno, "socket"); > + return 0; > + }If we fail here or later, do/should we try to clean up the /tmp/libnbdXXX directory created earlier? /me reads ahead - nbd_close tries to address it Still, if we fail at this point, h->sa_sockpath is set but not yet created [1]> + > + addr.sun_family = AF_UNIX; > + memcpy (addr.sun_path, h->sa_sockpath, strlen (h->sa_sockpath) + 1); > + if (bind (s, (struct sockaddr *) &addr, sizeof addr) == -1) { > + SET_NEXT_STATE (%.DEAD); > + set_error (errno, "bind: %s", h->sa_sockpath); > + return 0; > + } > + > + if (listen (s, 1) == -1) { > + SET_NEXT_STATE (%.DEAD); > + set_error (errno, "listen"); > + return 0; > + } > + > + pid = fork (); > + if (pid == -1) { > + SET_NEXT_STATE (%.DEAD); > + set_error (errno, "fork"); > + close (s); > + return 0; > + } > + if (pid == 0) { /* child - run command */ > + if (s != FIRST_SOCKET_ACTIVATION_FD) { > + dup2 (s, FIRST_SOCKET_ACTIVATION_FD);No check for errors? But that's probably okay in this instance (we know that we won't fail with EBADF, for example).> + close (s); > + } > + else { > + /* We must unset CLOEXEC on the fd. (dup2 above does this > + * implicitly because CLOEXEC is set on the fd, not on the > + * socket). > + */ > + flags = fcntl (s, F_GETFD, 0); > + if (flags == -1) { > + perror ("fcntl: F_GETFD");perror is not async-signal-safe. Calling it in a child of a potentially-multi-threaded parent is therefore prone to deadlock, if perror() triggers a request to grab any resource that was left locked by a different thread holding the lock but stranded by the fork.> + _exit (EXIT_FAILURE); > + } > + if (fcntl (s, F_SETFD, flags & ~FD_CLOEXEC) == -1) { > + perror ("fcntl: F_SETFD"); > + _exit (EXIT_FAILURE); > + }About all we can _safely_ do is let our _exit() status inform the parent process that something failed, and let the parent process print the error message. But even passing errno as the _exit() value is not portable (GNU Hurd errno values are intentionally larger than what fit in 8-bit returns that the parent would see). It's also possible to set up a cloexec pipe (in addition to everything else) so that the child can write() error details into the pipe, but that's a lot of effort compared to just blindly stating that the child failed but declaring full details of why as lost.> + } > + > + snprintf (listen_pid, sizeof listen_pid, "%ld", (long) getpid ()); > + setenv ("LISTEN_PID", listen_pid, 1); > + setenv ("LISTEN_FDS", "1", 1);snprintf() and setenv() are also not async-signal-safe. Which is a bummer, since you really don't know the child pid until in the child. You could open-code the translation of a decimal number into a buffer without snprintf, but you also have to figure out how to safely put it into the environ seen by the child. How does systemd do it? We _could_ create a helper executable, where we fork() into the child process, exec() the helper to overwrite the child pid with something that is no longer constrained by the limits of async-safety, so that the helper app can then use snprintf/setenv at will before re-execing again with the real target executable. The next question is how worried are we about actually being hit by an non-async-signal-safe hang. What you have works 99.9% of the time, but if we are trying to be a library that is usable by any application, we have to decide if that is good enough.> + > + /* Restore SIGPIPE back to SIG_DFL. */ > + signal (SIGPIPE, SIG_DFL); > + > + execvp (h->argv[0], h->argv); > + perror (h->argv[0]);and another unsafe perror.> + _exit (EXIT_FAILURE);Is EXIT_FAILURE the best here, or should we consider exiting with status 126/127 to match what other programs (sh, env, nice, nohup, ...) do when execvp() fails?> + } > + > + /* Parent. */ > + close (s); > + h->pid = pid; > + > + h->connaddrlen = sizeof addr; > + memcpy (&h->connaddr, &addr, h->connaddrlen); > + SET_NEXT_STATE (%^CONNECT.START); > + return 0; > + > } /* END STATE MACHINE */> +++ b/interop/socket-activation.c> +#define SIZE 1048576 > + > +int > +main (int argc, char *argv[]) > +{ > + struct nbd_handle *nbd; > + char tmpfile[] = "/tmp/nbdXXXXXX"; > + int fd; > + int64_t actual_size; > + char buf[512]; > + int r = -1; > + > + /* Create a large sparse temporary file. */ > + fd = mkstemp (tmpfile); > + if (fd == -1 || > + ftruncate (fd, SIZE) == -1 || > + close (fd) == -1) { > + perror (tmpfile); > + goto out; > + }Is it worth populating anything in the temp file...> + > + nbd = nbd_create (); > + if (nbd == NULL) { > + fprintf (stderr, "%s\n", nbd_get_error ()); > + goto out; > + } > + > + char *args[] = { SERVER, SERVER_PARAMS, NULL }; > + if (nbd_connect_socket_activation (nbd, args) == -1) { > + fprintf (stderr, "%s\n", nbd_get_error ()); > + goto out; > + } > + > + actual_size = nbd_get_size (nbd); > + if (actual_size == -1) { > + fprintf (stderr, "%s\n", nbd_get_error ()); > + goto out; > + } > + if (actual_size != SIZE) { > + fprintf (stderr, "%s: actual size %" PRIi64 " <> expected size %d", > + argv[0], actual_size, SIZE); > + goto out; > + } > + > + if (nbd_pread (nbd, buf, sizeof buf, 0, 0) == -1) { > + fprintf (stderr, "%s\n", nbd_get_error ()); > + goto out; > + }...and checking that we read it back out? But even a successful read without checking contents is pretty good evidence that the socket activation worked.> + > + if (nbd_shutdown (nbd, 0) == -1) { > + fprintf (stderr, "%s\n", nbd_get_error ()); > + goto out; > + } > + > + nbd_close (nbd); > + > + r = 0; > + out: > + unlink (tmpfile); > + > + exit (r == 0 ? EXIT_SUCCESS : EXIT_FAILURE); > +} > diff --git a/lib/connect.c b/lib/connect.c > index f98bcdb..c1cbef7 100644 > --- a/lib/connect.c> +++ b/lib/handle.c > @@ -129,6 +129,16 @@ nbd_close (struct nbd_handle *h) > free_cmd_list (h->cmds_in_flight); > free_cmd_list (h->cmds_done); > nbd_internal_free_string_list (h->argv); > + if (h->sa_sockpath) { > + if (h->pid > 0) > + kill (h->pid, SIGTERM);Are we sure that SIGTERM Is always going to be sufficient? Or do we need a tiered approach where we try SIGTERM, but followup with SIGKILL if too much time elapses? I guess it is the same situation for nbd_connect_command, does that mean we should have a common helper function for doing appropriate cleanup? Do we document that nbd_close() may block?> + unlink (h->sa_sockpath);[1] we can end up calling unlink() on a path that does not exist. But since we aren't checking error status here, we are effectively ignoring the ENOENT in that case.> + free (h->sa_sockpath); > + } > + if (h->sa_tmpdir) { > + rmdir (h->sa_tmpdir); > + free (h->sa_tmpdir); > + } > free (h->unixsocket); > free (h->hostname); > free (h->port); > diff --git a/lib/internal.h b/lib/internal.h > index bdb0e83..bf8b9aa 100644 > --- a/lib/internal.h > +++ b/lib/internal.h > @@ -188,6 +188,12 @@ struct nbd_handle { > char **argv; > pid_t pid; > > + /* When using systemd socket activation, this directory and socket > + * must be deleted, and the pid above must be killed. > + */ > + char *sa_tmpdir; > + char *sa_sockpath; > + > /* When connecting to Unix domain socket. */ > char *unixsocket; > >-- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Richard W.M. Jones
2019-Sep-27 12:28 UTC
Re: [Libguestfs] [PATCH libnbd 2/2] api: Implement local command with systemd socket activation.
On Thu, Sep 26, 2019 at 03:56:27PM -0500, Eric Blake wrote:> >+ /* Use /tmp instead of TMPDIR because we must ensure the path is > >+ * short enough to store in the sockaddr_un. On some platforms this > >+ * may cause problems so we may need to revisit it. XXX > >+ */ > > Is the use of socketpair() any better than creating a socket under /tmp?AIUI socketpair() can't be used to create a listening socket. Systemd socket activation has two modes -- one where you pass an already accepted socket, and one where you pass a listening socket and the server accepts connections until it is killed. In nbdkit and qemu-nbd we are using the latter one. ...> If we fail here or later, do/should we try to clean up the > /tmp/libnbdXXX directory created earlier? > > /me reads ahead - nbd_close tries to address it > > Still, if we fail at this point, h->sa_sockpath is set but not yet > created [1]I modified the code so that it doesn't try to delete literal /tmp/libnbdXXXXXX on some error paths and should be more robust.> >+ close (s); > >+ } > >+ else { > >+ /* We must unset CLOEXEC on the fd. (dup2 above does this > >+ * implicitly because CLOEXEC is set on the fd, not on the > >+ * socket). > >+ */ > >+ flags = fcntl (s, F_GETFD, 0); > >+ if (flags == -1) { > >+ perror ("fcntl: F_GETFD"); > > perror is not async-signal-safe. Calling it in a child of a > potentially-multi-threaded parent is therefore prone to deadlock, if > perror() triggers a request to grab any resource that was left > locked by a different thread holding the lock but stranded by the > fork.I suspect these sort of problems are unsolvable. I've studied enough customer bug reports where we've had to reverse engineer exactly how something failed given only a slim error log to know that error messages are always useful and I wouldn't want to remove these. However because of the chance of deadlock, how about a deadlock-free function that write(2)'s the message + errno to stderr?> >+ _exit (EXIT_FAILURE); > >+ } > >+ if (fcntl (s, F_SETFD, flags & ~FD_CLOEXEC) == -1) { > >+ perror ("fcntl: F_SETFD"); > >+ _exit (EXIT_FAILURE); > >+ } > > About all we can _safely_ do is let our _exit() status inform the > parent process that something failed, and let the parent process > print the error message. But even passing errno as the _exit() > value is not portable (GNU Hurd errno values are intentionally > larger than what fit in 8-bit returns that the parent would see). > It's also possible to set up a cloexec pipe (in addition to > everything else) so that the child can write() error details into > the pipe, but that's a lot of effort compared to just blindly > stating that the child failed but declaring full details of why as > lost. > > >+ } > >+ > >+ snprintf (listen_pid, sizeof listen_pid, "%ld", (long) getpid ()); > >+ setenv ("LISTEN_PID", listen_pid, 1); > >+ setenv ("LISTEN_FDS", "1", 1);And we could use similar code here.> snprintf() and setenv() are also not async-signal-safe. Which is a > bummer, since you really don't know the child pid until in the > child. You could open-code the translation of a decimal number into > a buffer without snprintf, but you also have to figure out how to > safely put it into the environ seen by the child. How does systemd > do it?For setenv(), systemd does this by building a char **environ which is passed to execve. The code is complicated to say the least: https://github.com/systemd/systemd/blob/5ac1530eca652cd16819853fe06e76e156f5cf5e/src/core/dbus-execute.c> >+ _exit (EXIT_FAILURE); > > Is EXIT_FAILURE the best here, or should we consider exiting with > status 126/127 to match what other programs (sh, env, nice, nohup, > ...) do when execvp() fails?Yes (this is also a bug elsewhere). Not sure if 126 or 127 is right: https://www.gnu.org/software/bash/manual/html_node/Exit-Status.html ...> Do we document that nbd_close() may block?We should do. I've added it in a separate patch. I'll have a rethink about this patch and come back with something better in a while. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com Fedora Windows cross-compiler. Compile Windows programs, test, and build Windows installers. Over 100 libraries supported. http://fedoraproject.org/wiki/MinGW
Eric Blake
2019-Sep-27 13:16 UTC
Re: [Libguestfs] [PATCH libnbd 2/2] api: Implement local command with systemd socket activation.
On 9/27/19 7:28 AM, Richard W.M. Jones wrote:>>> + /* We must unset CLOEXEC on the fd. (dup2 above does this >>> + * implicitly because CLOEXEC is set on the fd, not on the >>> + * socket). >>> + */ >>> + flags = fcntl (s, F_GETFD, 0); >>> + if (flags == -1) { >>> + perror ("fcntl: F_GETFD"); >> >> perror is not async-signal-safe. Calling it in a child of a >> potentially-multi-threaded parent is therefore prone to deadlock, if >> perror() triggers a request to grab any resource that was left >> locked by a different thread holding the lock but stranded by the >> fork. > > I suspect these sort of problems are unsolvable. I've studied enough > customer bug reports where we've had to reverse engineer exactly how > something failed given only a slim error log to know that error > messages are always useful and I wouldn't want to remove these. > However because of the chance of deadlock, how about a deadlock-free > function that write(2)'s the message + errno to stderr?Yes, sticking to write() is better than nothing. Open-coding an int-to-string for displying errno is doable, or bypassing strerror() and going straight for the non-portable sys_errlist[errno] lookup (if configure says sys_errlist is available) can also be used - it's a question of how nice do you want to make things, while still trying to remain safe.>>> + >>> + snprintf (listen_pid, sizeof listen_pid, "%ld", (long) getpid ()); >>> + setenv ("LISTEN_PID", listen_pid, 1); >>> + setenv ("LISTEN_FDS", "1", 1); > > And we could use similar code here.Yeah, if we're going to open-code a signal-safe int-to-string, we have several uses for it :)> >> snprintf() and setenv() are also not async-signal-safe. Which is a >> bummer, since you really don't know the child pid until in the >> child. You could open-code the translation of a decimal number into >> a buffer without snprintf, but you also have to figure out how to >> safely put it into the environ seen by the child. How does systemd >> do it? > > For setenv(), systemd does this by building a char **environ which is > passed to execve. The code is complicated to say the least: > > https://github.com/systemd/systemd/blob/5ac1530eca652cd16819853fe06e76e156f5cf5e/src/core/dbus-execute.cCan we pre-build the bulk of the replacement environ in the parent, leaving just the one entry to munge in the child? At least that way we can malloc without worries.> >>> + _exit (EXIT_FAILURE); >> >> Is EXIT_FAILURE the best here, or should we consider exiting with >> status 126/127 to match what other programs (sh, env, nice, nohup, >> ...) do when execvp() fails? > > Yes (this is also a bug elsewhere). Not sure if 126 or 127 is > right: > > https://www.gnu.org/software/bash/manual/html_node/Exit-Status.htmlDepends on the value of errno after the fact. ENOENT => 127, all others 126.> > ... >> Do we document that nbd_close() may block? > > We should do. I've added it in a separate patch. > > I'll have a rethink about this patch and come back with something > better in a while. > > Rich. >-- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Reasonably Related Threads
- Re: [PATCH libnbd 2/2] api: Implement local command with systemd socket activation.
- [PATCH libnbd 2/2] api: Implement local command with systemd socket activation.
- [PATCH libnbd v2 2/2] api: Implement local command with systemd socket activation.
- Re: [PATCH libnbd v2 2/2] api: Implement local command with systemd socket activation.
- Re: [PATCH libnbd v2 2/2] api: Implement local command with systemd socket activation.