Eric Blake
2020-Mar-19 15:56 UTC
Re: [Libguestfs] Anyone seen build hangs (esp armv7, s390x) in Fedora?
[replying here, as I seem to have been dropped from cc on the subthread at
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/ELUEHAA7X7YKU5DFIOBS3UQ5AXQYJWLY/
- maybe I should subscribe to devel@ instead of seeing this second-hand... hmm - I can't even post to devel@ without subscribing, so now just sending this to libguestfs]

[adding libguestfs - now that devel@ has helped point to a bug in nbdkit itself]

On 3/18/20 4:49 AM, Richard W.M. Jones wrote:
> On Wed, Mar 18, 2020 at 09:38:52AM +0000, Peter Robinson wrote:
>>> This might be a bug in the package itself, but has anyone seen builds
>>> hanging in weird places, in Rawhide, especially on armv7 and s390x?
>>>
>>> This package build has hung 3 times in the same place, once on armv7
>>> and twice on s390x:
>>>
>>> https://koji.fedoraproject.org/koji/taskinfo?taskID=42570766
>>>
>>> It's hard to explain how it could hang at that place in the build
>>> unless something fundamental is broken like make.
>>
>> Well make 4.3 did land recently (March 12th) in rawhide so that's
>> entirely possible.
>
> Yes, Eric Blake pointed this out to me too. However I don't really
> want to blame make unless others have seen similar hangs. It could
> easily be a new bug in the package itself.
>
> If anyone has access to that builder, it might be interesting to get a
> process listing, or strace of whatever process is hanging.

Dan Horak added:

> it's a deadlock in the tests, not in make. Reproduced with "fedpkg local" in a cycle.
>
> sharkcz 1649225 0.0 0.0 222288 3904 pts/5 S+ 06:24 0:00 /bin/sh -e /var/tmp/rpm-tmp.RXcMRr
> sharkcz 1649230 0.0 0.0 10372 3248 pts/5 S+ 06:24 0:00 make -j4 check
> sharkcz 1658088 0.0 0.0 251236 3400 pts/5 Sl+ 06:25 0:00 /home/sharkcz/nbdkit/nbdkit-1.19.3/server/nbdkit -v -P test-nbd-tls-psk.pid1 -U /tmp/tmp.7e7Gv5MPmZ --tls=require --tls-psk=keys.psk -- /home/sharkcz/nbdkit/nbdkit-1.19.3/plugins/example1/.libs/nbdkit-example1-plugin.so
> sharkcz 1658091 0.0 0.1 192944 4464 pts/5 Sl+ 06:25 0:00 /home/sharkcz/nbdkit/nbdkit-1.19.3/server/nbdkit -v -P test-nbd-tls-psk.pid2 -U /tmp/tmp.yp61yXx09y --tls=off -- /home/sharkcz/nbdkit/nbdkit-1.19.3/plugins/nbd/.libs/nbdkit-nbd-plugin.so tls=require tls-psk=keys.psk tls-username=qemu socket=/tmp/tmp.7e7Gv5MPmZ
>
> the 2 nbdkit processes are stuck in the futex() syscall

Reconstructing state from those command lines - we have a TLS test that operates 3 processes:

client <=> nbdkit nbd <=> nbdkit example1

it looks like this particular test was checking a plain-text client connecting to nbdkit nbd, which in turn was connecting as a TLS client to nbdkit example1.

I also know that 'nbdkit nbd' uses libnbd to support TLS, and that we have not fully implemented clean TLS teardown in libnbd - so it could be that the nbd side has told the example1 side that it will be shutting down soon, but due to unclean TLS library usage, is missing a poll() wakeup to realize that there will be no further response coming from the example1 side; while the example1 side is doing blocking I/O waiting for the nbd side to close the socket.

The overall test that spawned both nbdkit processes in the background (tests/test-nbd-tls-psk.sh) has completed, though, stranding those two hung child processes without their original parent but letting 'make check' report testsuite success.

As to why make is hanging, that is beyond me. Maybe something new in make 4.3 is detecting that we have stranded indirect processes, and is waiting for them to complete?
Ideally, we need to fix libnbd TLS support to do cleaner shutdown.

Pragmatically, nbdkit's tests/functions.sh start_nbdkit() function right now tries only a single:

    cleanup_fn kill "$(cat "$pidfile")"

without waiting to see if it actually worked. We could probably turn that into a more robust kill_nbdkit() function that first tries the graceful SIGTERM, waits a few seconds to confirm whether the process actually died, and follows up with a harder SIGKILL as needed (preferably failing a test whenever SIGTERM was insufficient). It may not solve the bug in libnbd TLS shutdown, but would at least prevent stuck processes.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org
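[A minimal sketch of the kill_nbdkit() idea described in the message above. It is not the patch that was posted for review. The optional timeout argument, the one-second polling interval, and the return-code convention (0 when SIGTERM sufficed, 1 when SIGKILL was needed) are assumptions made for illustration.]

```shell
#!/bin/sh
# Hypothetical helper, not the actual nbdkit code.
# Usage: kill_nbdkit PIDFILE [TIMEOUT_SECONDS]
# Returns 0 if the process is gone after SIGTERM (or was never running),
# 1 if it had to be removed with SIGKILL.
# Note: plain POSIX sh has no 'local', so these variables are global.
kill_nbdkit ()
{
    pidfile=$1
    timeout=${2:-5}     # assumed default; "a few seconds" in the proposal

    # A missing or empty pidfile means there is nothing to clean up.
    pid=$(cat "$pidfile" 2>/dev/null) || return 0
    [ -n "$pid" ] || return 0

    # Graceful attempt first. If the signal cannot be delivered,
    # treat the process as already gone.
    kill "$pid" 2>/dev/null || return 0

    # Poll with signal 0, which only checks that the pid still exists.
    i=0
    while [ "$i" -lt "$timeout" ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    kill -0 "$pid" 2>/dev/null || return 0

    # SIGTERM was insufficient: force termination and report failure so
    # that the calling test can be marked as failed.
    echo "kill_nbdkit: pid $pid ignored SIGTERM, sending SIGKILL" >&2
    kill -KILL "$pid" 2>/dev/null || :
    return 1
}
```

[With such a helper, the existing cleanup registration could presumably become something like `cleanup_fn kill_nbdkit "$pidfile"`. Whether a failure status from a cleanup hook can actually fail the test depends on how cleanup_fn is implemented, which the message does not describe.]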
Eric Blake
2020-Mar-20 19:54 UTC
Re: [Libguestfs] Anyone seen build hangs (esp armv7, s390x) in Fedora?
On 3/19/20 10:56 AM, Eric Blake wrote:

> Ideally, we need to fix libnbd TLS support to do cleaner shutdown.

I'm still working on that larger task.

>
> Pragmatically, nbdkit's tests/functions.sh start_nbdkit() function right
> now tries only a single:
>
> cleanup_fn kill "$(cat "$pidfile")"
>
> without waiting to see if it actually worked. We could probably turn
> that into a more robust kill_nbdkit() function that first tries the
> graceful SIGTERM, waits a few seconds to confirm whether the process
> actually died, and follows up with a harder SIGKILL as needed
> (preferably failing a test whenever SIGTERM was insufficient). It may
> not solve the bug in libnbd TLS shutdown, but would at least prevent
> stuck processes.

But this one is now posted for review.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org