Richard W.M. Jones
2020-Mar-19 12:13 UTC
Re: [Libguestfs] Anyone seen build hangs (esp armv7, s390x) in Fedora?
[Dropping devel, adding libguestfs]

This can be reproduced on x86-64 so I can reproduce it locally.  It
only appears to happen when the tests are run under rpmbuild, not when
I run them as ‘make check’, but I'm unclear why this is.

As Eric described earlier, the test runs two copies of nbdkit and a
client, connected like this:

  qemu-img info ===> nbdkit nbd ===> nbdkit example1
       [3]              [2]              [1]

These are started in order [1], [2] then [3].  When the client
(process [3]) completes it exits and then the test harness kills
processes [1] and [2] in that order.

The stack trace of [2] at the hang is:

Thread 3 (Thread 0x7fabbf4f7700 (LWP 3955842)):
#0  0x00007fabc05c0f0f in poll () from /lib64/libc.so.6
#1  0x00007fabc090abba in poll (__timeout=-1, __nfds=2, __fds=0x7fabbf4f6bb0) at /usr/include/bits/poll2.h:46
#2  nbdplug_reader (handle=0x5584020e09b0) at nbd.c:323
#3  0x00007fabc069d472 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fabc05cc063 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fabbfcf8700 (LWP 3955793)):
#0  0x00007fabc069eab7 in __pthread_clockjoin_ex () from /lib64/libpthread.so.0
#1  0x00007fabc090af2b in nbdplug_close_handle (h=0x5584020e09b0) at nbd.c:538
#2  0x00005583f90caee0 in backend_close (b=<optimized out>) at backend.c:247
#3  0x00005583f90cdbf1 in free_connection (conn=0x5584020df890) at connections.c:359
#4  handle_single_connection (sockin=<optimized out>, sockout=<optimized out>) at connections.c:230
#5  0x00005583f90d63e8 in start_thread (datav=0x5584020bf1b0) at sockets.c:356
#6  0x00007fabc069d472 in start_thread () from /lib64/libpthread.so.0
#7  0x00007fabc05cc063 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fabc002ca40 (LWP 3955770)):
#0  0x00007fabc06a3b02 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00005583f90d6dbf in accept_incoming_connections (socks=0x5584020bf010, nr_socks=1) at sockets.c:501
#2  0x00005583f90ca441 in start_serving () at main.c:936
#3  main (argc=<optimized out>, argv=<optimized out>) at main.c:702

The stack trace of [1] at the hang is:

Thread 2 (Thread 0x7f30ac822700 (LWP 3955798)):
#0  0x00007f30b32a384c in recv () from /lib64/libpthread.so.0
#1  0x00007f30b3342c8f in _gnutls_stream_read (ms=0x7f30ac821a2c, pull_func=0x7f30b336a570 <system_read>, size=5, bufel=<synthetic pointer>, session=0x55c5cf561620) at buffers.c:346
#2  _gnutls_read (ms=0x7f30ac821a2c, pull_func=0x7f30b336a570 <system_read>, size=5, bufel=<synthetic pointer>, session=0x55c5cf561620) at buffers.c:426
#3  _gnutls_io_read_buffered (session=session@entry=0x55c5cf561620, total=5, recv_type=recv_type@entry=4294967295, ms=0x7f30ac821a2c) at buffers.c:582
#4  0x00007f30b3338f3f in recv_headers (ms=<optimized out>, record=0x7f30ac821a80, htype=4294967295, type=GNUTLS_ALERT, record_params=0x55c5cf565f60, session=0x55c5cf561620) at record.c:1172
#5  _gnutls_recv_in_buffers (session=session@entry=0x55c5cf561620, type=type@entry=GNUTLS_ALERT, htype=htype@entry=4294967295, ms=<optimized out>, ms@entry=0) at record.c:1307
#6  0x00007f30b333b555 in _gnutls_recv_int (session=session@entry=0x55c5cf561620, type=type@entry=GNUTLS_ALERT, data=data@entry=0x0, data_size=data_size@entry=0, seq=seq@entry=0x0, ms=0) at record.c:1773
#7  0x00007f30b333b703 in gnutls_bye (session=session@entry=0x55c5cf561620, how=how@entry=GNUTLS_SHUT_RDWR) at record.c:312
#8  0x000055c5c57af171 in crypto_close () at crypto.c:407
#9  0x000055c5c57aea58 in free_connection (conn=0x55c5cf560500) at connections.c:339
#10 handle_single_connection (sockin=<optimized out>, sockout=<optimized out>) at connections.c:230
#11 0x000055c5c57b73e8 in start_thread (datav=0x55c5cf541550) at sockets.c:356
#12 0x00007f30b3299472 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f30b31c8063 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f30b2c28a40 (LWP 3955740)):
#0  0x00007f30b329fb02 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000055c5c57b7dbf in accept_incoming_connections (socks=0x55c5cf5413b0, nr_socks=1) at sockets.c:501
#2  0x000055c5c57ab441 in start_serving () at main.c:936
#3  main (argc=<optimized out>, argv=<optimized out>) at main.c:702

It seems like process [2] is hanging in pthread_join, waiting on
thread 3 (the reader loop) to finish.  The reader loop is stuck on
poll.

Meanwhile process [1] is trying to do a clean TLS disconnect.  This
involves waiting on process [2] to write something, which it is never
going to do.

I don't think this is actually anything to do with libnbd not cleanly
shutting down TLS connections.  If libnbd had closed the socket
abruptly then process [1] wouldn't have to wait.  I think this might
be a bug in the nbd plugin.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
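For context on nbdplug_reader in the traces above: the nbd plugin
dedicates a thread to driving libnbd's state machine through poll().
What follows is a minimal sketch of that pattern using libnbd's public
aio API; the struct layout, names and the pipe-to-self wakeup are
illustrative assumptions, not the plugin's actual nbd.c code.

  /* Sketch of the reader-loop pattern: poll the NBD socket in
   * whichever direction libnbd's state machine wants, plus a
   * pipe-to-self so other threads can wake us.  Illustrative only.  */
  #include <poll.h>
  #include <unistd.h>
  #include <pthread.h>
  #include <libnbd.h>

  struct handle {
    struct nbd_handle *nbd;
    int fds[2];                /* self-pipe: fds[0] read, fds[1] write */
    pthread_t reader;          /* thread running reader_loop */
  };

  static void *
  reader_loop (void *opaque)
  {
    struct handle *h = opaque;

    while (!nbd_aio_is_dead (h->nbd) && !nbd_aio_is_closed (h->nbd)) {
      struct pollfd fds[2] = {
        { .fd = nbd_aio_get_fd (h->nbd), .events = 0 },
        { .fd = h->fds[0], .events = POLLIN },
      };
      unsigned dir = nbd_aio_get_direction (h->nbd);

      if (dir & LIBNBD_AIO_DIRECTION_READ)
        fds[0].events |= POLLIN;
      if (dir & LIBNBD_AIO_DIRECTION_WRITE)
        fds[0].events |= POLLOUT;

      /* Timeout -1: block until the server sends something or another
       * thread pokes the pipe; this is the poll() in frame #0 above.  */
      if (poll (fds, 2, -1) == -1)
        break;

      if (fds[0].revents & POLLIN)
        nbd_aio_notify_read (h->nbd);
      else if (fds[0].revents & POLLOUT)
        nbd_aio_notify_write (h->nbd);

      /* POLLHUP fires if the close path closes the write end.  */
      if (fds[1].revents & (POLLIN | POLLHUP)) {
        char c;
        if (read (h->fds[0], &c, 1) <= 0)
          break;
      }
    }
    return NULL;
  }

The salient property for this hang is the infinite poll() timeout:
unless the server sends data or something touches the pipe, the thread
never wakes, so the pthread_join in nbdplug_close_handle never returns.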
Eric Blake
2020-Mar-19 16:09 UTC
Re: [Libguestfs] Anyone seen build hangs (esp armv7, s390x) in Fedora?
On 3/19/20 7:13 AM, Richard W.M. Jones wrote:
> [Dropping devel, adding libguestfs]
>
> This can be reproduced on x86-64 so I can reproduce it locally.  It
> only appears to happen when the tests are run under rpmbuild, not when
> I run them as ‘make check’, but I'm unclear why this is.
>
> As Eric described earlier, the test runs two copies of nbdkit and a
> client, connected like this:
>
>   qemu-img info ===> nbdkit nbd ===> nbdkit example1
>        [3]              [2]              [1]
>
> These are started in order [1], [2] then [3].  When the client
> (process [3]) completes it exits and then the test harness kills
> processes [1] and [2] in that order.
>
> The stack trace of [2] at the hang is:
>
> Thread 3 (Thread 0x7fabbf4f7700 (LWP 3955842)):
> #0  0x00007fabc05c0f0f in poll () from /lib64/libc.so.6
> #1  0x00007fabc090abba in poll (__timeout=-1, __nfds=2, __fds=0x7fabbf4f6bb0) at /usr/include/bits/poll2.h:46
> #2  nbdplug_reader (handle=0x5584020e09b0) at nbd.c:323
> #3  0x00007fabc069d472 in start_thread () from /lib64/libpthread.so.0
> #4  0x00007fabc05cc063 in clone () from /lib64/libc.so.6

This is nbdkit trying to poll() in order to know when to wake up the
libnbd engine.

> Thread 2 (Thread 0x7fabbfcf8700 (LWP 3955793)):
> #0  0x00007fabc069eab7 in __pthread_clockjoin_ex () from /lib64/libpthread.so.0
> #1  0x00007fabc090af2b in nbdplug_close_handle (h=0x5584020e09b0) at nbd.c:538
> #2  0x00005583f90caee0 in backend_close (b=<optimized out>) at backend.c:247
> #3  0x00005583f90cdbf1 in free_connection (conn=0x5584020df890) at connections.c:359
> #4  handle_single_connection (sockin=<optimized out>, sockout=<optimized out>) at connections.c:230
> #5  0x00005583f90d63e8 in start_thread (datav=0x5584020bf1b0) at sockets.c:356
> #6  0x00007fabc069d472 in start_thread () from /lib64/libpthread.so.0
> #7  0x00007fabc05cc063 in clone () from /lib64/libc.so.6

And this is nbdkit waiting for the poll thread to complete.

> Thread 1 (Thread 0x7fabc002ca40 (LWP 3955770)):
> #0  0x00007fabc06a3b02 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x00005583f90d6dbf in accept_incoming_connections (socks=0x5584020bf010, nr_socks=1) at sockets.c:501
> #2  0x00005583f90ca441 in start_serving () at main.c:936
> #3  main (argc=<optimized out>, argv=<optimized out>) at main.c:702

nbdkit itself listening for a second client.

> The stack trace of [1] at the hang is:
>
> Thread 2 (Thread 0x7f30ac822700 (LWP 3955798)):
> #0  0x00007f30b32a384c in recv () from /lib64/libpthread.so.0
> #1  0x00007f30b3342c8f in _gnutls_stream_read (ms=0x7f30ac821a2c, pull_func=0x7f30b336a570 <system_read>, size=5, bufel=<synthetic pointer>, session=0x55c5cf561620) at buffers.c:346
> #2  _gnutls_read (ms=0x7f30ac821a2c, pull_func=0x7f30b336a570 <system_read>, size=5, bufel=<synthetic pointer>, session=0x55c5cf561620) at buffers.c:426
> #3  _gnutls_io_read_buffered (session=session@entry=0x55c5cf561620, total=5, recv_type=recv_type@entry=4294967295, ms=0x7f30ac821a2c) at buffers.c:582
> #4  0x00007f30b3338f3f in recv_headers (ms=<optimized out>, record=0x7f30ac821a80, htype=4294967295, type=GNUTLS_ALERT, record_params=0x55c5cf565f60, session=0x55c5cf561620) at record.c:1172
> #5  _gnutls_recv_in_buffers (session=session@entry=0x55c5cf561620, type=type@entry=GNUTLS_ALERT, htype=htype@entry=4294967295, ms=<optimized out>, ms@entry=0) at record.c:1307
> #6  0x00007f30b333b555 in _gnutls_recv_int (session=session@entry=0x55c5cf561620, type=type@entry=GNUTLS_ALERT, data=data@entry=0x0, data_size=data_size@entry=0, seq=seq@entry=0x0, ms=0) at record.c:1773
> #7  0x00007f30b333b703 in gnutls_bye (session=session@entry=0x55c5cf561620, how=how@entry=GNUTLS_SHUT_RDWR) at record.c:312
> #8  0x000055c5c57af171 in crypto_close () at crypto.c:407
> #9  0x000055c5c57aea58 in free_connection (conn=0x55c5cf560500) at connections.c:339
> #10 handle_single_connection (sockin=<optimized out>, sockout=<optimized out>) at connections.c:230

nbdkit trying to use TLS to perform a clean shutdown.

> #11 0x000055c5c57b73e8 in start_thread (datav=0x55c5cf541550) at sockets.c:356
> #12 0x00007f30b3299472 in start_thread () from /lib64/libpthread.so.0
> #13 0x00007f30b31c8063 in clone () from /lib64/libc.so.6
>
> Thread 1 (Thread 0x7f30b2c28a40 (LWP 3955740)):
> #0  0x00007f30b329fb02 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x000055c5c57b7dbf in accept_incoming_connections (socks=0x55c5cf5413b0, nr_socks=1) at sockets.c:501
> #2  0x000055c5c57ab441 in start_serving () at main.c:936
> #3  main (argc=<optimized out>, argv=<optimized out>) at main.c:702

nbdkit waiting for another thread.

> It seems like process [2] is hanging in pthread_join, waiting on
> thread 3 (the reader loop) to finish.  The reader loop is stuck on
> poll.
>
> Meanwhile process [1] is trying to do a clean TLS disconnect.  This
> involves waiting on process [2] to write something, which it is never
> going to do.

Mostly correct, except that [2] _could_ write something if we were to
improve libnbd's TLS shutdown to not be so awkward.

> I don't think this is actually anything to do with libnbd not cleanly
> shutting down TLS connections.  If libnbd had closed the socket
> abruptly then process [1] wouldn't have to wait.  I think this might
> be a bug in the nbd plugin.

Possible, but I'm still leaning towards it being a problem that libnbd
is not handling TLS very well, which makes it hard for the nbd plugin
to know that it can quit poll()ing.  Determining which of the two (or
both) needs patching is trickier, though.  I'll spend more cycles on
it today, at any rate.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.  +1-919-301-3226
Virtualization: qemu.org | libvirt.org
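For reference, the clean shutdown that crypto_close() is blocked in
above (frame #7, gnutls_bye) follows this pattern.  This is a minimal
hypothetical sketch, not nbdkit's actual crypto.c:

  /* GNUTLS_SHUT_RDWR: gnutls_bye() sends our close_notify alert and
   * then blocks reading until the peer's close_notify arrives.  That
   * read is the recv() at frame #0; it returns only if the peer
   * (process [2]) answers.  */
  #include <unistd.h>
  #include <gnutls/gnutls.h>

  static void
  clean_tls_close (gnutls_session_t session, int sockfd)
  {
    int r;

    do
      r = gnutls_bye (session, GNUTLS_SHUT_RDWR);
    while (r == GNUTLS_E_AGAIN || r == GNUTLS_E_INTERRUPTED);

    /* Only after the two-way close_notify exchange (or a hard error)
     * is it safe to close the socket without risking truncation.  */
    close (sockfd);
  }

Switching to GNUTLS_SHUT_WR would make gnutls_bye() return after
sending our own close_notify without waiting for the peer's, at the
cost of losing the guarantee that the peer saw all of our data.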
Eric Blake
2020-Mar-28 20:44 UTC
Re: [Libguestfs] Anyone seen build hangs (esp armv7, s390x) in Fedora?
On 3/19/20 7:13 AM, Richard W.M. Jones wrote:
> [Dropping devel, adding libguestfs]
>
> This can be reproduced on x86-64 so I can reproduce it locally.  It
> only appears to happen when the tests are run under rpmbuild, not when
> I run them as ‘make check’, but I'm unclear why this is.
>
> As Eric described earlier, the test runs two copies of nbdkit and a
> client, connected like this:
>
>   qemu-img info ===> nbdkit nbd ===> nbdkit example1
>        [3]              [2]              [1]
>
> These are started in order [1], [2] then [3].  When the client
> (process [3]) completes it exits and then the test harness kills
> processes [1] and [2] in that order.

I just hit a breakthrough in understanding the deadlock.

> The stack trace of [2] at the hang is:
>
> Thread 3 (Thread 0x7fabbf4f7700 (LWP 3955842)):
> #0  0x00007fabc05c0f0f in poll () from /lib64/libc.so.6

This thread is calling poll() at the same time as:

> #1  0x00007fabc090abba in poll (__timeout=-1, __nfds=2, __fds=0x7fabbf4f6bb0) at /usr/include/bits/poll2.h:46
> #2  nbdplug_reader (handle=0x5584020e09b0) at nbd.c:323
> #3  0x00007fabc069d472 in start_thread () from /lib64/libpthread.so.0
> #4  0x00007fabc05cc063 in clone () from /lib64/libc.so.6
>
> Thread 2 (Thread 0x7fabbfcf8700 (LWP 3955793)):
> #0  0x00007fabc069eab7 in __pthread_clockjoin_ex () from /lib64/libpthread.so.0
> #1  0x00007fabc090af2b in nbdplug_close_handle (h=0x5584020e09b0) at nbd.c:538

this one just finished a poll(), because I used the blocking
nbd_shutdown instead of the non-blocking nbd_aio_disconnect.
Depending on which of the two threads wakes up first to service the
server's reaction, the other one can be stranded.  Closing the
pipe-to-self is a band-aid that ensures the reader thread eventually
wakes up, but using the right API to begin with is even better.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.  +1-919-301-3226
Virtualization: qemu.org | libvirt.org
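As an illustration of "using the right API to begin with", here is a
hedged sketch of what a non-blocking close path could look like.  It
reuses the hypothetical struct handle and reader loop sketched earlier
in this thread and is not the actual nbdkit patch:

  /* Close path sketch: queue the disconnect instead of waiting for it,
   * so the reader thread remains the handle's only poller.  */
  static void
  close_handle (struct handle *h)
  {
    /* Non-blocking: just queues NBD_CMD_DISC and returns; the reader
     * loop's poll() services the server's reply.  */
    nbd_aio_disconnect (h->nbd, 0);

    /* The band-aid mentioned above: closing the write end of the
     * self-pipe raises POLLHUP in the reader's poll(), guaranteeing
     * that it wakes even if the server never answers.  */
    close (h->fds[1]);

    pthread_join (h->reader, NULL);
    nbd_close (h->nbd);
  }

The design point is that nbd_aio_disconnect merely queues the request,
leaving the reader thread as the handle's only poller, whereas the
blocking nbd_shutdown waits for the server's reaction itself and so
races the reader's poll(), which is the stranding described above.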