thr3ads.net - Libguestfs - Re: [Libguestfs] [PATCH libnbd 2/2] api: Implement local command with systemd socket activation. [Sep 2019]

If this information is useful, please help other people find it:
Share via:

Eric Blake

2019-Sep-26 20:56 UTC

Re: [Libguestfs] [PATCH libnbd 2/2] api: Implement local command with systemd socket activation.

On 9/26/19 11:40 AM, Richard W.M. Jones wrote:> This adds new APIs for running a local NBD server and connecting to it
> using systemd socket activation (instead of stdin/stdout).
> 
> This includes interop tests against nbdkit and qemu-nbd which I
> believe are the only NBD servers supporting socket activation.  (If we
> find others then we can add more interop tests in future.)
> 
> The upstream spec for systemd socket activation is here:
> http://0pointer.de/blog/projects/socket-activation.html
> ---
> +++ b/generator/states-connect.c
> @@ -37,6 +37,9 @@
>   #include <sys/socket.h>
>   #include <sys/un.h>
>   
> +/* This is baked into the systemd socket activation API. */
> +#define FIRST_SOCKET_ACTIVATION_FD 3
> +
>   /* Disable Nagle's algorithm on the socket, but don't fail. */
>   static void
>   disable_nagle (int sock)
> @@ -292,4 +295,110 @@ disable_nagle (int sock)
>     SET_NEXT_STATE (%^MAGIC.START);
>     return 0;
>   
> + CONNECT_SA.START:
> +  /* This is in some ways similar to nbd_aio_connect_command above,
> +   * but we must pass a listening socket to the child process and
> +   * we must set LISTEN_PID and LISTEN_FDS environment variables.
> +   */
> +  size_t len;
> +  int s;
> +  struct sockaddr_un addr;
> +  pid_t pid;
> +  int flags;
> +  char listen_pid[16];
> +
> +  assert (!h->sock);
> +  assert (h->argv);
> +  assert (h->argv[0]);
> +
> +  /* Use /tmp instead of TMPDIR because we must ensure the path is
> +   * short enough to store in the sockaddr_un.  On some platforms this
> +   * may cause problems so we may need to revisit it.  XXX
> +   */
Is the use of socketpair() any better than creating a socket under /tmp?
> +  h->sa_tmpdir = strdup ("/tmp/libnbdXXXXXX");
> +  h->sa_sockpath = strdup ("/tmp/libnbdXXXXXX/sock");
> +  if (h->sa_tmpdir == NULL || h->sa_sockpath == NULL) {
> +    SET_NEXT_STATE (%.DEAD);
> +    set_error (errno, "strdup");
> +    return 0;
> +  }
> +
> +  if (mkdtemp (h->sa_tmpdir) == NULL) {
> +    SET_NEXT_STATE (%.DEAD);
> +    set_error (errno, "mkdtemp");
> +    return 0;
> +  }
> +  len = strlen (h->sa_tmpdir);
> +  memcpy (h->sa_sockpath, h->sa_tmpdir, len);
> +
> +  s = socket (AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0);
> +  if (s == -1) {
> +    SET_NEXT_STATE (%.DEAD);
> +    set_error (errno, "socket");
> +    return 0;
> +  }
If we fail here or later, do/should we try to clean up the 
/tmp/libnbdXXX directory created earlier?

/me reads ahead - nbd_close tries to address it

Still, if we fail at this point, h->sa_sockpath is set but not yet 
created [1]
> +
> +  addr.sun_family = AF_UNIX;
> +  memcpy (addr.sun_path, h->sa_sockpath, strlen (h->sa_sockpath) +
1);
> +  if (bind (s, (struct sockaddr *) &addr, sizeof addr) == -1) {
> +    SET_NEXT_STATE (%.DEAD);
> +    set_error (errno, "bind: %s", h->sa_sockpath);
> +    return 0;
> +  }
> +
> +  if (listen (s, 1) == -1) {
> +    SET_NEXT_STATE (%.DEAD);
> +    set_error (errno, "listen");
> +    return 0;
> +  }
> +
> +  pid = fork ();
> +  if (pid == -1) {
> +    SET_NEXT_STATE (%.DEAD);
> +    set_error (errno, "fork");
> +    close (s);
> +    return 0;
> +  }
> +  if (pid == 0) {         /* child - run command */
> +    if (s != FIRST_SOCKET_ACTIVATION_FD) {
> +      dup2 (s, FIRST_SOCKET_ACTIVATION_FD);
No check for errors?  But that's probably okay in this instance (we know 
that we won't fail with EBADF, for example).
> +      close (s);
> +    }
> +    else {
> +      /* We must unset CLOEXEC on the fd.  (dup2 above does this
> +       * implicitly because CLOEXEC is set on the fd, not on the
> +       * socket).
> +       */
> +      flags = fcntl (s, F_GETFD, 0);
> +      if (flags == -1) {
> +        perror ("fcntl: F_GETFD");
perror is not async-signal-safe.  Calling it in a child of a 
potentially-multi-threaded parent is therefore prone to deadlock, if 
perror() triggers a request to grab any resource that was left locked by 
a different thread holding the lock but stranded by the fork.
> +        _exit (EXIT_FAILURE);
> +      }
> +      if (fcntl (s, F_SETFD, flags & ~FD_CLOEXEC) == -1) {
> +        perror ("fcntl: F_SETFD");
> +        _exit (EXIT_FAILURE);
> +      }
About all we can _safely_ do is let our _exit() status inform the parent 
process that something failed, and let the parent process print the 
error message.  But even passing errno as the _exit() value is not 
portable (GNU Hurd errno values are intentionally larger than what fit 
in 8-bit returns that the parent would see).  It's also possible to set 
up a cloexec pipe (in addition to everything else) so that the child can 
write() error details into the pipe, but that's a lot of effort compared 
to just blindly stating that the child failed but declaring full details 
of why as lost.
> +    }
> +
> +    snprintf (listen_pid, sizeof listen_pid, "%ld", (long)
getpid ());
> +    setenv ("LISTEN_PID", listen_pid, 1);
> +    setenv ("LISTEN_FDS", "1", 1);
snprintf() and setenv() are also not async-signal-safe.  Which is a 
bummer, since you really don't know the child pid until in the child. 
You could open-code the translation of a decimal number into a buffer 
without snprintf, but you also have to figure out how to safely put it 
into the environ seen by the child.  How does systemd do it?

We _could_ create a helper executable, where we fork() into the child 
process, exec() the helper to overwrite the child pid with something 
that is no longer constrained by the limits of async-safety, so that the 
helper app can then use snprintf/setenv at will before re-execing again 
with the real target executable.  The next question is how worried are 
we about actually being hit by an non-async-signal-safe hang.  What you 
have works 99.9% of the time, but if we are trying to be a library that 
is usable by any application, we have to decide if that is good enough.
> +
> +    /* Restore SIGPIPE back to SIG_DFL. */
> +    signal (SIGPIPE, SIG_DFL);
> +
> +    execvp (h->argv[0], h->argv);
> +    perror (h->argv[0]);
and another unsafe perror.
> +    _exit (EXIT_FAILURE);
Is EXIT_FAILURE the best here, or should we consider exiting with status 
126/127 to match what other programs (sh, env, nice, nohup, ...) do when 
execvp() fails?
> +  }
> +
> +  /* Parent. */
> +  close (s);
> +  h->pid = pid;
> +
> +  h->connaddrlen = sizeof addr;
> +  memcpy (&h->connaddr, &addr, h->connaddrlen);
> +  SET_NEXT_STATE (%^CONNECT.START);
> +  return 0;
> +
>   } /* END STATE MACHINE */
> +++ b/interop/socket-activation.c
> +#define SIZE 1048576
> +
> +int
> +main (int argc, char *argv[])
> +{
> +  struct nbd_handle *nbd;
> +  char tmpfile[] = "/tmp/nbdXXXXXX";
> +  int fd;
> +  int64_t actual_size;
> +  char buf[512];
> +  int r = -1;
> +
> +  /* Create a large sparse temporary file. */
> +  fd = mkstemp (tmpfile);
> +  if (fd == -1 ||
> +      ftruncate (fd, SIZE) == -1 ||
> +      close (fd) == -1) {
> +    perror (tmpfile);
> +    goto out;
> +  }
Is it worth populating anything in the temp file...
> +
> +  nbd = nbd_create ();
> +  if (nbd == NULL) {
> +    fprintf (stderr, "%s\n", nbd_get_error ());
> +    goto out;
> +  }
> +
> +  char *args[] = { SERVER, SERVER_PARAMS, NULL };
> +  if (nbd_connect_socket_activation (nbd, args) == -1) {
> +    fprintf (stderr, "%s\n", nbd_get_error ());
> +    goto out;
> +  }
> +
> +  actual_size = nbd_get_size (nbd);
> +  if (actual_size == -1) {
> +    fprintf (stderr, "%s\n", nbd_get_error ());
> +    goto out;
> +  }
> +  if (actual_size != SIZE) {
> +    fprintf (stderr, "%s: actual size %" PRIi64 " <>
expected size %d",
> +             argv[0], actual_size, SIZE);
> +    goto out;
> +  }
> +
> +  if (nbd_pread (nbd, buf, sizeof buf, 0, 0) == -1) {
> +    fprintf (stderr, "%s\n", nbd_get_error ());
> +    goto out;
> +  }
...and checking that we read it back out?  But even a successful read 
without checking contents is pretty good evidence that the socket 
activation worked.
> +
> +  if (nbd_shutdown (nbd, 0) == -1) {
> +    fprintf (stderr, "%s\n", nbd_get_error ());
> +    goto out;
> +  }
> +
> +  nbd_close (nbd);
> +
> +  r = 0;
> + out:
> +  unlink (tmpfile);
> +
> +  exit (r == 0 ? EXIT_SUCCESS : EXIT_FAILURE);
> +}
> diff --git a/lib/connect.c b/lib/connect.c
> index f98bcdb..c1cbef7 100644
> --- a/lib/connect.c
> +++ b/lib/handle.c
> @@ -129,6 +129,16 @@ nbd_close (struct nbd_handle *h)
>     free_cmd_list (h->cmds_in_flight);
>     free_cmd_list (h->cmds_done);
>     nbd_internal_free_string_list (h->argv);
> +  if (h->sa_sockpath) {
> +    if (h->pid > 0)
> +      kill (h->pid, SIGTERM);
Are we sure that SIGTERM Is always going to be sufficient?  Or do we 
need a tiered approach where we try SIGTERM, but followup with SIGKILL 
if too much time elapses?  I guess it is the same situation for 
nbd_connect_command, does that mean we should have a common helper 
function for doing appropriate cleanup?

Do we document that nbd_close() may block?
> +    unlink (h->sa_sockpath);
[1] we can end up calling unlink() on a path that does not exist.  But 
since we aren't checking error status here, we are effectively ignoring 
the ENOENT in that case.
> +    free (h->sa_sockpath);
> +  }
> +  if (h->sa_tmpdir) {
> +    rmdir (h->sa_tmpdir);
> +    free (h->sa_tmpdir);
> +  }
>     free (h->unixsocket);
>     free (h->hostname);
>     free (h->port);
> diff --git a/lib/internal.h b/lib/internal.h
> index bdb0e83..bf8b9aa 100644
> --- a/lib/internal.h
> +++ b/lib/internal.h
> @@ -188,6 +188,12 @@ struct nbd_handle {
>     char **argv;
>     pid_t pid;
>   
> +  /* When using systemd socket activation, this directory and socket
> +   * must be deleted, and the pid above must be killed.
> +   */
> +  char *sa_tmpdir;
> +  char *sa_sockpath;
> +
>     /* When connecting to Unix domain socket. */
>     char *unixsocket;
>   
> 
-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Richard W.M. Jones

2019-Sep-27 12:28 UTC

head link

Re: [Libguestfs] [PATCH libnbd 2/2] api: Implement local command with systemd socket activation.

On Thu, Sep 26, 2019 at 03:56:27PM -0500, Eric Blake
wrote:> >+  /* Use /tmp instead of TMPDIR because we must ensure the path is
> >+   * short enough to store in the sockaddr_un.  On some platforms this
> >+   * may cause problems so we may need to revisit it.  XXX
> >+   */
> 
> Is the use of socketpair() any better than creating a socket under /tmp?
AIUI socketpair() can't be used to create a listening socket.  Systemd
socket activation has two modes -- one where you pass an already
accepted socket, and one where you pass a listening socket and the
server accepts connections until it is killed.  In nbdkit and qemu-nbd
we are using the latter one.

...> If we fail here or later, do/should we try to clean up the
> /tmp/libnbdXXX directory created earlier?
> 
> /me reads ahead - nbd_close tries to address it
> 
> Still, if we fail at this point, h->sa_sockpath is set but not yet
> created [1]
I modified the code so that it doesn't try to delete literal
/tmp/libnbdXXXXXX on some error paths and should be more robust.
> >+      close (s);
> >+    }
> >+    else {
> >+      /* We must unset CLOEXEC on the fd.  (dup2 above does this
> >+       * implicitly because CLOEXEC is set on the fd, not on the
> >+       * socket).
> >+       */
> >+      flags = fcntl (s, F_GETFD, 0);
> >+      if (flags == -1) {
> >+        perror ("fcntl: F_GETFD");
> 
> perror is not async-signal-safe.  Calling it in a child of a
> potentially-multi-threaded parent is therefore prone to deadlock, if
> perror() triggers a request to grab any resource that was left
> locked by a different thread holding the lock but stranded by the
> fork.
I suspect these sort of problems are unsolvable.  I've studied enough
customer bug reports where we've had to reverse engineer exactly how
something failed given only a slim error log to know that error
messages are always useful and I wouldn't want to remove these.
However because of the chance of deadlock, how about a deadlock-free
function that write(2)'s the message + errno to stderr?

> >+        _exit (EXIT_FAILURE);
> >+      }
> >+      if (fcntl (s, F_SETFD, flags & ~FD_CLOEXEC) == -1) {
> >+        perror ("fcntl: F_SETFD");
> >+        _exit (EXIT_FAILURE);
> >+      }
> 
> About all we can _safely_ do is let our _exit() status inform the
> parent process that something failed, and let the parent process
> print the error message.  But even passing errno as the _exit()
> value is not portable (GNU Hurd errno values are intentionally
> larger than what fit in 8-bit returns that the parent would see).
> It's also possible to set up a cloexec pipe (in addition to
> everything else) so that the child can write() error details into
> the pipe, but that's a lot of effort compared to just blindly
> stating that the child failed but declaring full details of why as
> lost.
> 
> >+    }
> >+
> >+    snprintf (listen_pid, sizeof listen_pid, "%ld", (long)
getpid ());
> >+    setenv ("LISTEN_PID", listen_pid, 1);
> >+    setenv ("LISTEN_FDS", "1", 1);
And we could use similar code here.
> snprintf() and setenv() are also not async-signal-safe.  Which is a
> bummer, since you really don't know the child pid until in the
> child. You could open-code the translation of a decimal number into
> a buffer without snprintf, but you also have to figure out how to
> safely put it into the environ seen by the child.  How does systemd
> do it?
For setenv(), systemd does this by building a char **environ which is
passed to execve.  The code is complicated to say the least:

https://github.com/systemd/systemd/blob/5ac1530eca652cd16819853fe06e76e156f5cf5e/src/core/dbus-execute.c
> >+    _exit (EXIT_FAILURE);
> 
> Is EXIT_FAILURE the best here, or should we consider exiting with
> status 126/127 to match what other programs (sh, env, nice, nohup,
> ...) do when execvp() fails?
Yes (this is also a bug elsewhere).  Not sure if 126 or 127 is
right:

https://www.gnu.org/software/bash/manual/html_node/Exit-Status.html

...> Do we document that nbd_close() may block?
We should do.  I've added it in a separate patch.

I'll have a rethink about this patch and come back with something
better in a while.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

Eric Blake

2019-Sep-27 13:16 UTC

head link

Re: [Libguestfs] [PATCH libnbd 2/2] api: Implement local command with systemd socket activation.

On 9/27/19 7:28 AM, Richard W.M. Jones wrote:
>>> +      /* We must unset CLOEXEC on the fd.  (dup2 above does this
>>> +       * implicitly because CLOEXEC is set on the fd, not on the
>>> +       * socket).
>>> +       */
>>> +      flags = fcntl (s, F_GETFD, 0);
>>> +      if (flags == -1) {
>>> +        perror ("fcntl: F_GETFD");
>>
>> perror is not async-signal-safe.  Calling it in a child of a
>> potentially-multi-threaded parent is therefore prone to deadlock, if
>> perror() triggers a request to grab any resource that was left
>> locked by a different thread holding the lock but stranded by the
>> fork.
> 
> I suspect these sort of problems are unsolvable.  I've studied enough
> customer bug reports where we've had to reverse engineer exactly how
> something failed given only a slim error log to know that error
> messages are always useful and I wouldn't want to remove these.
> However because of the chance of deadlock, how about a deadlock-free
> function that write(2)'s the message + errno to stderr?
Yes, sticking to write() is better than nothing.  Open-coding an 
int-to-string for displying errno is doable, or bypassing strerror() and 
going straight for the non-portable sys_errlist[errno] lookup (if 
configure says sys_errlist is available) can also be used - it's a 
question of how nice do you want to make things, while still trying to 
remain safe.

>>> +
>>> +    snprintf (listen_pid, sizeof listen_pid, "%ld",
(long) getpid ());
>>> +    setenv ("LISTEN_PID", listen_pid, 1);
>>> +    setenv ("LISTEN_FDS", "1", 1);
> 
> And we could use similar code here.
Yeah, if we're going to open-code a signal-safe int-to-string, we have 
several uses for it :)
> 
>> snprintf() and setenv() are also not async-signal-safe.  Which is a
>> bummer, since you really don't know the child pid until in the
>> child. You could open-code the translation of a decimal number into
>> a buffer without snprintf, but you also have to figure out how to
>> safely put it into the environ seen by the child.  How does systemd
>> do it?
> 
> For setenv(), systemd does this by building a char **environ which is
> passed to execve.  The code is complicated to say the least:
> 
>
https://github.com/systemd/systemd/blob/5ac1530eca652cd16819853fe06e76e156f5cf5e/src/core/dbus-execute.c
Can we pre-build the bulk of the replacement environ in the parent, 
leaving just the one entry to munge in the child?  At least that way we 
can malloc without worries.
> 
>>> +    _exit (EXIT_FAILURE);
>>
>> Is EXIT_FAILURE the best here, or should we consider exiting with
>> status 126/127 to match what other programs (sh, env, nice, nohup,
>> ...) do when execvp() fails?
> 
> Yes (this is also a bug elsewhere).  Not sure if 126 or 127 is
> right:
> 
> https://www.gnu.org/software/bash/manual/html_node/Exit-Status.html
Depends on the value of errno after the fact.  ENOENT => 127, all others 
126.
> 
> ...
>> Do we document that nbd_close() may block?
> 
> We should do.  I've added it in a separate patch.
> 
> I'll have a rethink about this patch and come back with something
> better in a while.
> 
> Rich.
> 
-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Apparently Analagous Threads

Search for more maybe matching threads

Libguestfs - Sep 2019 - Re: [PATCH libnbd 2/2] api: Implement local command with systemd socket activation.

Re: [Libguestfs] [PATCH libnbd 2/2] api: Implement local command with systemd socket activation.

Re: [Libguestfs] [PATCH libnbd 2/2] api: Implement local command with systemd socket activation.

Re: [Libguestfs] [PATCH libnbd 2/2] api: Implement local command with systemd socket activation.

Apparently Analagous Threads