thr3ads.net - Libguestfs - [Libguestfs] [nbdkit PATCH 0/2] Fix nbdkit --run when nbdkit hits assertion [Sep 2019]

If this information is useful, please help other people find it:
Share via:

Eric Blake

2019-Sep-30 17:15 UTC

[Libguestfs] [nbdkit PATCH 0/2] Fix nbdkit --run when nbdkit hits assertion

Found while working on the retry filter.  Swap the order of the two
patches to see nbdkit ignore assertion failures with status 0.

Eric Blake (2):
  server: Propagate unexpected nbdkit failure with --run
  tests: Enhance captive test

 server/captive.c      | 43 ++++++++++++++++++++++++++++++----------
 tests/test-captive.sh | 46 +++++++++++++++++++++++++++++++++++++++----
 2 files changed, 75 insertions(+), 14 deletions(-)

-- 
2.21.0

Eric Blake

2019-Sep-30 17:15 UTC

head link

[Libguestfs] [nbdkit PATCH 1/2] server: Propagate unexpected nbdkit failure with --run

When running captive, we were blindly calling kill(pid) of the captive
nbdkit child process, even if that process has already exited
unexpectedly (most likely, from an assertion failure) and another
opened in its place (pid recycling is rare, but not impossible).  We
need to check that the child process still exists, and if it
unexpectedly died, ensure that our exit status reflects that fact.
Note that nbdkit normally should exit with status 0 (even when it is
sent a signal like SIGHUP or SIGINT) or die from a signal.

While at it, fix the fact that system() can fail with a value that is
not appropriate to hand to WIFEXITED() if the child was not even
spawned, but cannot fail with WIFSTOPPED.  Also, reflect death from
signal to a status > 128 rather than 1 (we could be even fancier and
also re-raise the signal so that we die from the same thing, but it's
not obvious we need that much work...).

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 server/captive.c | 43 +++++++++++++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/server/captive.c b/server/captive.c
index 90e42050..1606eb1a 100644
--- a/server/captive.c
+++ b/server/captive.c
@@ -54,7 +54,7 @@ run_command (void)
   FILE *fp;
   char *cmd = NULL;
   size_t len = 0;
-  int r;
+  int r, status;
   pid_t pid;

   if (!run)
@@ -135,20 +135,43 @@ run_command (void)

   if (pid > 0) {              /* Parent process is the run command. */
     r = system (cmd);
-    if (WIFEXITED (r))
+    if (r == -1) {
+      nbdkit_error ("failure to execute external command: %m");
+      r = 1;
+    }
+    else if (WIFEXITED (r))
       r = WEXITSTATUS (r);
-    else if (WIFSIGNALED (r)) {
+    else {
+      assert (WIFSIGNALED (r));
       fprintf (stderr, "%s: external command was killed by signal
%d\n",
                program_name, WTERMSIG (r));
-      r = 1;
-    }
-    else if (WIFSTOPPED (r)) {
-      fprintf (stderr, "%s: external command was stopped by signal
%d\n",
-               program_name, WSTOPSIG (r));
-      r = 1;
+      r = WTERMSIG (r) + 128;
     }

-    kill (pid, SIGTERM);        /* Kill captive nbdkit. */
+    switch (waitpid (pid, &status, WNOHANG)) {
+    case -1:
+      nbdkit_error ("waitpid: %m");
+      r = 1;
+      break;
+    case 0:
+      /* Captive nbdkit still running; kill it, but no need to wait for it,
+       * as the captive program's exit status is good enough.
+       */
+      kill (pid, SIGTERM);
+      break;
+    default:
+      /* Captive nbdkit exited unexpectedly; update the exit status. */
+      if (WIFEXITED (status)) {
+        if (r == 0)
+          r = WEXITSTATUS (status);
+      }
+      else {
+        assert (WIFSIGNALED (status));
+        fprintf (stderr, "%s: nbdkit command was killed by signal
%d\n",
+                 program_name, WTERMSIG (status));
+        r = WTERMSIG (status) + 128;
+      }
+    }

     _exit (r);
   }
-- 
2.21.0

Eric Blake

2019-Sep-30 17:15 UTC

head link

[Libguestfs] [nbdkit PATCH 2/2] tests: Enhance captive test

Test the just-fixed bug in --run failing to detect an nbdkit assertion
failure.

While at it, sleep less when we don't actually need to wait for the
socket to be opened.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 tests/test-captive.sh | 46 +++++++++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/tests/test-captive.sh b/tests/test-captive.sh
index e89c387d..88c0d818 100755
--- a/tests/test-captive.sh
+++ b/tests/test-captive.sh
@@ -1,6 +1,6 @@
 #!/usr/bin/env bash
 # nbdkit
-# Copyright (C) 2014-2018 Red Hat Inc.
+# Copyright (C) 2014-2019 Red Hat Inc.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions are
@@ -36,13 +36,15 @@ set -x

 # Test nbdkit --run (captive nbdkit) option.

+fail=0
+
 sock=`mktemp -u`
-files="$sock captive.out"
+files="$sock captive.out captive.pid"
 rm -f $files
 cleanup_fn rm -f $files

 nbdkit -U $sock example1 --run '
-    sleep 5; echo nbd=$nbd; echo port=$port; echo socket=$unixsocket
+    sleep 1; echo nbd=$nbd; echo port=$port; echo socket=$unixsocket
   ' > captive.out

 # Check the output.
@@ -51,5 +53,41 @@ port socket=$sock" ]; then
     echo "$0: unexpected output"
     cat captive.out
-    exit 1
+    fail=1
 fi
+
+# Check that a failed --run process affects exit status
+status=0
+nbdkit -U - example1 --run 'exit 2' > captive.out || status=$?
+if test $status != 2; then
+    echo "$0: unexpected exit status $status"
+    fail=1
+fi
+if test -s captive.out; then
+    echo "$0: unexpected output"
+    cat captive.out
+    fail=1
+fi
+
+# Check that nbdkit death from unhandled signal affects exit status
+status=0
+nbdkit -U - -P captive.pid example1 --run '
+test ! -s captive.pid || sleep 1
+if test ! -s captive.pid; then
+  echo "no pidfile yet"
+  exit 10
+fi
+kill -s ABRT $(cat captive.pid) || exit 10
+sleep 1
+' > captive.out || status=$?
+if test $status != $(( 128 + $(kill -l ABRT) )); then
+    echo "$0: unexpected exit status $status"
+    fail=1
+fi
+if test -s captive.out; then
+    echo "$0: unexpected output"
+    cat captive.out
+    fail=1
+fi
+
+exit $fail
-- 
2.21.0

Richard W.M. Jones

2019-Sep-30 20:26 UTC

head link

Re: [Libguestfs] [nbdkit PATCH 1/2] server: Propagate unexpected nbdkit failure with --run

On Mon, Sep 30, 2019 at 12:15:41PM -0500, Eric Blake
wrote:> diff --git a/server/captive.c b/server/captive.c
> index 90e42050..1606eb1a 100644
> --- a/server/captive.c
> +++ b/server/captive.c
> @@ -54,7 +54,7 @@ run_command (void)
>    FILE *fp;
>    char *cmd = NULL;
>    size_t len = 0;
> -  int r;
> +  int r, status;
>    pid_t pid;
> 
>    if (!run)
> @@ -135,20 +135,43 @@ run_command (void)
> 
>    if (pid > 0) {              /* Parent process is the run command. */
>      r = system (cmd);
> -    if (WIFEXITED (r))
> +    if (r == -1) {
> +      nbdkit_error ("failure to execute external command: %m");
> +      r = 1;
Although this is copying what the old code did, I wonder if we should
use r = EXIT_FAILURE here?

In any case the series is fine and I tested it, so:

ACK series

I don't know if you've noticed but because our tests use
MALLOC_PERTURB_ and because we have a race along the shutdown path,
some tests cause coredumps, eg:

$ make check
$ coredumpctl list
Mon 2019-09-30 21:20:45 BST  310684  1000  1000   6 present  
/home/rjones/d/nbdkit/server/nbdkit
Mon 2019-09-30 21:21:07 BST  317022  1000  1000  11 present  
/home/rjones/d/nbdkit/server/nbdkit
Mon 2019-09-30 21:21:08 BST  317217  1000  1000  11 present  
/home/rjones/d/nbdkit/server/nbdkit
Mon 2019-09-30 21:21:16 BST  323530  1000  1000  11 present  
/home/rjones/d/nbdkit/server/nbdkit
Mon 2019-09-30 21:21:17 BST  322363  1000  1000  11 present  
/home/rjones/d/nbdkit/server/nbdkit
Mon 2019-09-30 21:21:19 BST  325833  1000  1000  11 present  
/home/rjones/d/nbdkit/server/nbdkit
Mon 2019-09-30 21:21:22 BST  327989  1000  1000  11 present  
/home/rjones/d/nbdkit/server/nbdkit
Mon 2019-09-30 21:21:22 BST  328290  1000  1000  11 present  
/home/rjones/d/nbdkit/server/nbdkit
Mon 2019-09-30 21:21:23 BST  328520  1000  1000  11 present  
/home/rjones/d/nbdkit/server/nbdkit
Mon 2019-09-30 21:21:24 BST  328914  1000  1000  11 present  
/home/rjones/d/nbdkit/server/nbdkit
$ coredumpctl gdb
[...]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000565098005718 in filter_finalize (b=0x5650a0693500, 
    conn=0x5650a0694940) at filters.c:440
440   return b->next->finalize (b->next, conn);
[Current thread is 1 (Thread 0x7fdacb08c700 (LWP 328918))]
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.30-4.fc31.x86_64
gmp-6.1.2-10.fc31.x86_64 gnutls-3.6.9-1.fc31.x86_64 libffi-3.1-23.fc31.x86_64
libgcc-9.1.1-2.fc31.1.x86_64 libidn2-2.2.0-2.fc31.x86_64
libselinux-2.9-5.fc31.x86_64 libtasn1-4.14-2.fc31.x86_64
libunistring-0.9.10-6.fc31.x86_64 nettle-3.5.1-3.fc31.x86_64
p11-kit-0.23.16.1-2.fc31.x86_64 pcre2-10.33-12.fc31.x86_64
(gdb) bt
#0  0x0000565098005718 in filter_finalize (b=0x5650a0693500, 
    conn=0x5650a0694940) at filters.c:440
#1  0x000056509800316c in _handle_single_connection (sockout=<optimized
out>,
    sockin=<optimized out>) at connections.c:220
#2  handle_single_connection (sockin=<optimized out>,
sockout=<optimized out>)
    at connections.c:239
#3  0x000056509800c7be in start_thread (datav=0x5650a06756f0) at sockets.c:276
#4  0x00007fdacb6dd4e2 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fdacb60c643 in clone () from /lib64/libc.so.6

(This has happened for a long time and has nothing to do with this
patch series.)  I thought that this series might expose some of these
crashes and cause test failures, but it seems as if it does not.  So I
guess that these shutdown failures are not happening with --run, but
in test cleanup functions instead.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/

Eric Blake

2019-Sep-30 21:05 UTC

head link

Re: [Libguestfs] [nbdkit PATCH 1/2] server: Propagate unexpected nbdkit failure with --run

On 9/30/19 12:15 PM, Eric Blake wrote:> When running captive, we were blindly calling kill(pid) of the captive
> nbdkit child process, even if that process has already exited
> unexpectedly (most likely, from an assertion failure) and another
> opened in its place (pid recycling is rare, but not impossible).
Actually, pid recycling is NOT possible in this case, because the nbdkit 
process is in zombie state until we either wait() on it (which we 
didn't) or exit (in which case init(1) reaps it instead).  Still, the 
fact that we did not check whether kill() failed due to ESRCH was not 
helping the fact that we need the wait to detect early failure.

(Pid recycling is more of an issue when dealing with pids that are not 
your direct children, or if you ignore SIGCHLD)

I'm tempted to send a v2 of this that unconditionally tries kill() 
(which is safe after all), and which waits until the nbdkit process is 
gone (possibly sending SIGKILL if too much time elapses) rather than 
hoping that SIGTERM was sufficient (it's not nice to strand a wedged 
nbdkit child process), and which improves the commit message.
>  We
> need to check that the child process still exists, and if it
> unexpectedly died, ensure that our exit status reflects that fact.
This is true regardless of how we handle a zombie process.
> Note that nbdkit normally should exit with status 0 (even when it is
> sent a signal like SIGHUP or SIGINT) or die from a signal.
> 
> While at it, fix the fact that system() can fail with a value that is
> not appropriate to hand to WIFEXITED() if the child was not even
> spawned, but cannot fail with WIFSTOPPED.  Also, reflect death from
> signal to a status > 128 rather than 1 (we could be even fancier and
> also re-raise the signal so that we die from the same thing, but it's
> not obvious we need that much work...).
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>   server/captive.c | 43 +++++++++++++++++++++++++++++++++----------
>   1 file changed, 33 insertions(+), 10 deletions(-)
> 
-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Richard W.M. Jones

2019-Oct-19 17:33 UTC

head link

Re: [Libguestfs] [nbdkit PATCH 2/2] tests: Enhance captive test

On Mon, Sep 30, 2019 at 12:15:42PM -0500, Eric Blake
wrote:> Test the just-fixed bug in --run failing to detect an nbdkit assertion
> failure.
> 
> While at it, sleep less when we don't actually need to wait for the
> socket to be opened.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  tests/test-captive.sh | 46 +++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 42 insertions(+), 4 deletions(-)
> 
> diff --git a/tests/test-captive.sh b/tests/test-captive.sh
> index e89c387d..88c0d818 100755
> --- a/tests/test-captive.sh
> +++ b/tests/test-captive.sh
> @@ -1,6 +1,6 @@
>  #!/usr/bin/env bash
>  # nbdkit
> -# Copyright (C) 2014-2018 Red Hat Inc.
> +# Copyright (C) 2014-2019 Red Hat Inc.
>  #
>  # Redistribution and use in source and binary forms, with or without
>  # modification, are permitted provided that the following conditions are
> @@ -36,13 +36,15 @@ set -x
> 
>  # Test nbdkit --run (captive nbdkit) option.
> 
> +fail=0
> +
>  sock=`mktemp -u`
> -files="$sock captive.out"
> +files="$sock captive.out captive.pid"
>  rm -f $files
>  cleanup_fn rm -f $files
> 
>  nbdkit -U $sock example1 --run '
> -    sleep 5; echo nbd=$nbd; echo port=$port; echo socket=$unixsocket
> +    sleep 1; echo nbd=$nbd; echo port=$port; echo socket=$unixsocket
>    ' > captive.out
For some reason this test is now failing much more frequently under
load.  I can pretty easily reproduce the failure on my local 24 core
machine, especially if I use ‘make check-valgrind’, and it happens in
Koji too on the slower builders, eg:

  https://kojipkgs.fedoraproject.org//work/tasks/6571/38396571/build.log

I am planning to revert this single line from the commit for now, but
really I don't understand why it's failing because the sleep ought to
be completely unnecessary (ie. no sleep ought to work just as well).
I can't think right now why I put a sleep there in the first place ...

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v

Reasonably Related Threads

Search for more maybe matching threads

Libguestfs - Sep 2019 - [nbdkit PATCH 0/2] Fix nbdkit --run when nbdkit hits assertion

[Libguestfs] [nbdkit PATCH 0/2] Fix nbdkit --run when nbdkit hits assertion

[Libguestfs] [nbdkit PATCH 1/2] server: Propagate unexpected nbdkit failure with --run

[Libguestfs] [nbdkit PATCH 2/2] tests: Enhance captive test

Re: [Libguestfs] [nbdkit PATCH 1/2] server: Propagate unexpected nbdkit failure with --run

Re: [Libguestfs] [nbdkit PATCH 1/2] server: Propagate unexpected nbdkit failure with --run

Re: [Libguestfs] [nbdkit PATCH 2/2] tests: Enhance captive test

Reasonably Related Threads