Eric Blake
2019-Sep-30 17:15 UTC
[Libguestfs] [nbdkit PATCH 0/2] Fix nbdkit --run when nbdkit hits assertion
Found while working on the retry filter. Swap the order of the two patches to see nbdkit ignore assertion failures with status 0. Eric Blake (2): server: Propagate unexpected nbdkit failure with --run tests: Enhance captive test server/captive.c | 43 ++++++++++++++++++++++++++++++---------- tests/test-captive.sh | 46 +++++++++++++++++++++++++++++++++++++++---- 2 files changed, 75 insertions(+), 14 deletions(-) -- 2.21.0
Eric Blake
2019-Sep-30 17:15 UTC
[Libguestfs] [nbdkit PATCH 1/2] server: Propagate unexpected nbdkit failure with --run
When running captive, we were blindly calling kill(pid) of the captive nbdkit child process, even if that process has already exited unexpectedly (most likely, from an assertion failure) and another opened in its place (pid recycling is rare, but not impossible). We need to check that the child process still exists, and if it unexpectedly died, ensure that our exit status reflects that fact. Note that nbdkit normally should exit with status 0 (even when it is sent a signal like SIGHUP or SIGINT) or die from a signal. While at it, fix the fact that system() can fail with a value that is not appropriate to hand to WIFEXITED() if the child was not even spawned, but cannot fail with WIFSTOPPED. Also, reflect death from signal to a status > 128 rather than 1 (we could be even fancier and also re-raise the signal so that we die from the same thing, but it's not obvious we need that much work...). Signed-off-by: Eric Blake <eblake@redhat.com> --- server/captive.c | 43 +++++++++++++++++++++++++++++++++---------- 1 file changed, 33 insertions(+), 10 deletions(-) diff --git a/server/captive.c b/server/captive.c index 90e42050..1606eb1a 100644 --- a/server/captive.c +++ b/server/captive.c @@ -54,7 +54,7 @@ run_command (void) FILE *fp; char *cmd = NULL; size_t len = 0; - int r; + int r, status; pid_t pid; if (!run) @@ -135,20 +135,43 @@ run_command (void) if (pid > 0) { /* Parent process is the run command. */ r = system (cmd); - if (WIFEXITED (r)) + if (r == -1) { + nbdkit_error ("failure to execute external command: %m"); + r = 1; + } + else if (WIFEXITED (r)) r = WEXITSTATUS (r); - else if (WIFSIGNALED (r)) { + else { + assert (WIFSIGNALED (r)); fprintf (stderr, "%s: external command was killed by signal %d\n", program_name, WTERMSIG (r)); - r = 1; - } - else if (WIFSTOPPED (r)) { - fprintf (stderr, "%s: external command was stopped by signal %d\n", - program_name, WSTOPSIG (r)); - r = 1; + r = WTERMSIG (r) + 128; } - kill (pid, SIGTERM); /* Kill captive nbdkit. */ + switch (waitpid (pid, &status, WNOHANG)) { + case -1: + nbdkit_error ("waitpid: %m"); + r = 1; + break; + case 0: + /* Captive nbdkit still running; kill it, but no need to wait for it, + * as the captive program's exit status is good enough. + */ + kill (pid, SIGTERM); + break; + default: + /* Captive nbdkit exited unexpectedly; update the exit status. */ + if (WIFEXITED (status)) { + if (r == 0) + r = WEXITSTATUS (status); + } + else { + assert (WIFSIGNALED (status)); + fprintf (stderr, "%s: nbdkit command was killed by signal %d\n", + program_name, WTERMSIG (status)); + r = WTERMSIG (status) + 128; + } + } _exit (r); } -- 2.21.0
Eric Blake
2019-Sep-30 17:15 UTC
[Libguestfs] [nbdkit PATCH 2/2] tests: Enhance captive test
Test the just-fixed bug in --run failing to detect an nbdkit assertion failure. While at it, sleep less when we don't actually need to wait for the socket to be opened. Signed-off-by: Eric Blake <eblake@redhat.com> --- tests/test-captive.sh | 46 +++++++++++++++++++++++++++++++++++++++---- 1 file changed, 42 insertions(+), 4 deletions(-) diff --git a/tests/test-captive.sh b/tests/test-captive.sh index e89c387d..88c0d818 100755 --- a/tests/test-captive.sh +++ b/tests/test-captive.sh @@ -1,6 +1,6 @@ #!/usr/bin/env bash # nbdkit -# Copyright (C) 2014-2018 Red Hat Inc. +# Copyright (C) 2014-2019 Red Hat Inc. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions are @@ -36,13 +36,15 @@ set -x # Test nbdkit --run (captive nbdkit) option. +fail=0 + sock=`mktemp -u` -files="$sock captive.out" +files="$sock captive.out captive.pid" rm -f $files cleanup_fn rm -f $files nbdkit -U $sock example1 --run ' - sleep 5; echo nbd=$nbd; echo port=$port; echo socket=$unixsocket + sleep 1; echo nbd=$nbd; echo port=$port; echo socket=$unixsocket ' > captive.out # Check the output. @@ -51,5 +53,41 @@ port socket=$sock" ]; then echo "$0: unexpected output" cat captive.out - exit 1 + fail=1 fi + +# Check that a failed --run process affects exit status +status=0 +nbdkit -U - example1 --run 'exit 2' > captive.out || status=$? +if test $status != 2; then + echo "$0: unexpected exit status $status" + fail=1 +fi +if test -s captive.out; then + echo "$0: unexpected output" + cat captive.out + fail=1 +fi + +# Check that nbdkit death from unhandled signal affects exit status +status=0 +nbdkit -U - -P captive.pid example1 --run ' +test ! -s captive.pid || sleep 1 +if test ! -s captive.pid; then + echo "no pidfile yet" + exit 10 +fi +kill -s ABRT $(cat captive.pid) || exit 10 +sleep 1 +' > captive.out || status=$? +if test $status != $(( 128 + $(kill -l ABRT) )); then + echo "$0: unexpected exit status $status" + fail=1 +fi +if test -s captive.out; then + echo "$0: unexpected output" + cat captive.out + fail=1 +fi + +exit $fail -- 2.21.0
Richard W.M. Jones
2019-Sep-30 20:26 UTC
Re: [Libguestfs] [nbdkit PATCH 1/2] server: Propagate unexpected nbdkit failure with --run
On Mon, Sep 30, 2019 at 12:15:41PM -0500, Eric Blake wrote:> diff --git a/server/captive.c b/server/captive.c > index 90e42050..1606eb1a 100644 > --- a/server/captive.c > +++ b/server/captive.c > @@ -54,7 +54,7 @@ run_command (void) > FILE *fp; > char *cmd = NULL; > size_t len = 0; > - int r; > + int r, status; > pid_t pid; > > if (!run) > @@ -135,20 +135,43 @@ run_command (void) > > if (pid > 0) { /* Parent process is the run command. */ > r = system (cmd); > - if (WIFEXITED (r)) > + if (r == -1) { > + nbdkit_error ("failure to execute external command: %m"); > + r = 1;Although this is copying what the old code did, I wonder if we should use r = EXIT_FAILURE here? In any case the series is fine and I tested it, so: ACK series I don't know if you've noticed but because our tests use MALLOC_PERTURB_ and because we have a race along the shutdown path, some tests cause coredumps, eg: $ make check $ coredumpctl list Mon 2019-09-30 21:20:45 BST 310684 1000 1000 6 present /home/rjones/d/nbdkit/server/nbdkit Mon 2019-09-30 21:21:07 BST 317022 1000 1000 11 present /home/rjones/d/nbdkit/server/nbdkit Mon 2019-09-30 21:21:08 BST 317217 1000 1000 11 present /home/rjones/d/nbdkit/server/nbdkit Mon 2019-09-30 21:21:16 BST 323530 1000 1000 11 present /home/rjones/d/nbdkit/server/nbdkit Mon 2019-09-30 21:21:17 BST 322363 1000 1000 11 present /home/rjones/d/nbdkit/server/nbdkit Mon 2019-09-30 21:21:19 BST 325833 1000 1000 11 present /home/rjones/d/nbdkit/server/nbdkit Mon 2019-09-30 21:21:22 BST 327989 1000 1000 11 present /home/rjones/d/nbdkit/server/nbdkit Mon 2019-09-30 21:21:22 BST 328290 1000 1000 11 present /home/rjones/d/nbdkit/server/nbdkit Mon 2019-09-30 21:21:23 BST 328520 1000 1000 11 present /home/rjones/d/nbdkit/server/nbdkit Mon 2019-09-30 21:21:24 BST 328914 1000 1000 11 present /home/rjones/d/nbdkit/server/nbdkit $ coredumpctl gdb [...] Program terminated with signal SIGSEGV, Segmentation fault. #0 0x0000565098005718 in filter_finalize (b=0x5650a0693500, conn=0x5650a0694940) at filters.c:440 440 return b->next->finalize (b->next, conn); [Current thread is 1 (Thread 0x7fdacb08c700 (LWP 328918))] Missing separate debuginfos, use: dnf debuginfo-install glibc-2.30-4.fc31.x86_64 gmp-6.1.2-10.fc31.x86_64 gnutls-3.6.9-1.fc31.x86_64 libffi-3.1-23.fc31.x86_64 libgcc-9.1.1-2.fc31.1.x86_64 libidn2-2.2.0-2.fc31.x86_64 libselinux-2.9-5.fc31.x86_64 libtasn1-4.14-2.fc31.x86_64 libunistring-0.9.10-6.fc31.x86_64 nettle-3.5.1-3.fc31.x86_64 p11-kit-0.23.16.1-2.fc31.x86_64 pcre2-10.33-12.fc31.x86_64 (gdb) bt #0 0x0000565098005718 in filter_finalize (b=0x5650a0693500, conn=0x5650a0694940) at filters.c:440 #1 0x000056509800316c in _handle_single_connection (sockout=<optimized out>, sockin=<optimized out>) at connections.c:220 #2 handle_single_connection (sockin=<optimized out>, sockout=<optimized out>) at connections.c:239 #3 0x000056509800c7be in start_thread (datav=0x5650a06756f0) at sockets.c:276 #4 0x00007fdacb6dd4e2 in start_thread () from /lib64/libpthread.so.0 #5 0x00007fdacb60c643 in clone () from /lib64/libc.so.6 (This has happened for a long time and has nothing to do with this patch series.) I thought that this series might expose some of these crashes and cause test failures, but it seems as if it does not. So I guess that these shutdown failures are not happening with --run, but in test cleanup functions instead. Rich. -- Richard Jones, Virtualization Group, Red Hat people.redhat.com/~rjones Read my programming and virtualization blog: rwmj.wordpress.com virt-df lists disk usage of guests without needing to install any software inside the virtual machine. Supports Linux and Windows. people.redhat.com/~rjones/virt-df
Eric Blake
2019-Sep-30 21:05 UTC
Re: [Libguestfs] [nbdkit PATCH 1/2] server: Propagate unexpected nbdkit failure with --run
On 9/30/19 12:15 PM, Eric Blake wrote:> When running captive, we were blindly calling kill(pid) of the captive > nbdkit child process, even if that process has already exited > unexpectedly (most likely, from an assertion failure) and another > opened in its place (pid recycling is rare, but not impossible).Actually, pid recycling is NOT possible in this case, because the nbdkit process is in zombie state until we either wait() on it (which we didn't) or exit (in which case init(1) reaps it instead). Still, the fact that we did not check whether kill() failed due to ESRCH was not helping the fact that we need the wait to detect early failure. (Pid recycling is more of an issue when dealing with pids that are not your direct children, or if you ignore SIGCHLD) I'm tempted to send a v2 of this that unconditionally tries kill() (which is safe after all), and which waits until the nbdkit process is gone (possibly sending SIGKILL if too much time elapses) rather than hoping that SIGTERM was sufficient (it's not nice to strand a wedged nbdkit child process), and which improves the commit message.> We > need to check that the child process still exists, and if it > unexpectedly died, ensure that our exit status reflects that fact.This is true regardless of how we handle a zombie process.> Note that nbdkit normally should exit with status 0 (even when it is > sent a signal like SIGHUP or SIGINT) or die from a signal. > > While at it, fix the fact that system() can fail with a value that is > not appropriate to hand to WIFEXITED() if the child was not even > spawned, but cannot fail with WIFSTOPPED. Also, reflect death from > signal to a status > 128 rather than 1 (we could be even fancier and > also re-raise the signal so that we die from the same thing, but it's > not obvious we need that much work...). > > Signed-off-by: Eric Blake <eblake@redhat.com> > --- > server/captive.c | 43 +++++++++++++++++++++++++++++++++---------- > 1 file changed, 33 insertions(+), 10 deletions(-) >-- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Richard W.M. Jones
2019-Oct-19 17:33 UTC
Re: [Libguestfs] [nbdkit PATCH 2/2] tests: Enhance captive test
On Mon, Sep 30, 2019 at 12:15:42PM -0500, Eric Blake wrote:> Test the just-fixed bug in --run failing to detect an nbdkit assertion > failure. > > While at it, sleep less when we don't actually need to wait for the > socket to be opened. > > Signed-off-by: Eric Blake <eblake@redhat.com> > --- > tests/test-captive.sh | 46 +++++++++++++++++++++++++++++++++++++++---- > 1 file changed, 42 insertions(+), 4 deletions(-) > > diff --git a/tests/test-captive.sh b/tests/test-captive.sh > index e89c387d..88c0d818 100755 > --- a/tests/test-captive.sh > +++ b/tests/test-captive.sh > @@ -1,6 +1,6 @@ > #!/usr/bin/env bash > # nbdkit > -# Copyright (C) 2014-2018 Red Hat Inc. > +# Copyright (C) 2014-2019 Red Hat Inc. > # > # Redistribution and use in source and binary forms, with or without > # modification, are permitted provided that the following conditions are > @@ -36,13 +36,15 @@ set -x > > # Test nbdkit --run (captive nbdkit) option. > > +fail=0 > + > sock=`mktemp -u` > -files="$sock captive.out" > +files="$sock captive.out captive.pid" > rm -f $files > cleanup_fn rm -f $files > > nbdkit -U $sock example1 --run ' > - sleep 5; echo nbd=$nbd; echo port=$port; echo socket=$unixsocket > + sleep 1; echo nbd=$nbd; echo port=$port; echo socket=$unixsocket > ' > captive.outFor some reason this test is now failing much more frequently under load. I can pretty easily reproduce the failure on my local 24 core machine, especially if I use ‘make check-valgrind’, and it happens in Koji too on the slower builders, eg: kojipkgs.fedoraproject.org//work/tasks/6571/38396571/build.log I am planning to revert this single line from the commit for now, but really I don't understand why it's failing because the sleep ought to be completely unnecessary (ie. no sleep ought to work just as well). I can't think right now why I put a sleep there in the first place ... Rich. -- Richard Jones, Virtualization Group, Red Hat people.redhat.com/~rjones Read my programming and virtualization blog: rwmj.wordpress.com virt-p2v converts physical machines to virtual machines. Boot with a live CD or over the network (PXE) and turn machines into KVM guests. libguestfs.org/virt-v2v
Possibly Parallel Threads
- [nbdkit PATCH v2 0/6] Improve retry filter
- [nbdkit PATCH 0/2] Add nbd forwarder test coverage
- [nbdkit PATCH 1/2] server: Propagate unexpected nbdkit failure with --run
- "Unexplained error code xxx" in rsync-2.5.5
- Re: [PATCH nbdkit v3 4/4] Add linuxdisk plugin.