Richard W.M. Jones
2020-Mar-19 12:13 UTC
Re: [Libguestfs] Anyone seen build hangs (esp armv7, s390x) in Fedora?
[Dropping devel, adding libguestfs]

This can be reproduced on x86-64 so I can reproduce it locally.  It
only appears to happen when the tests are run under rpmbuild, not when
I run them as ‘make check’, but I'm unclear why this is.

As Eric described earlier, the test runs two copies of nbdkit and a
client, connected like this:

  qemu-img info ===> nbdkit nbd ===> nbdkit example1
       [3]              [2]              [1]

These are started in order [1], [2] then [3].  When the client
(process [3]) completes it exits and then the test harness kills
processes [1] and [2] in that order.

The stack trace of [2] at the hang is:

Thread 3 (Thread 0x7fabbf4f7700 (LWP 3955842)):
#0  0x00007fabc05c0f0f in poll () from /lib64/libc.so.6
#1  0x00007fabc090abba in poll (__timeout=-1, __nfds=2, __fds=0x7fabbf4f6bb0) at /usr/include/bits/poll2.h:46
#2  nbdplug_reader (handle=0x5584020e09b0) at nbd.c:323
#3  0x00007fabc069d472 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fabc05cc063 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fabbfcf8700 (LWP 3955793)):
#0  0x00007fabc069eab7 in __pthread_clockjoin_ex () from /lib64/libpthread.so.0
#1  0x00007fabc090af2b in nbdplug_close_handle (h=0x5584020e09b0) at nbd.c:538
#2  0x00005583f90caee0 in backend_close (b=<optimized out>) at backend.c:247
#3  0x00005583f90cdbf1 in free_connection (conn=0x5584020df890) at connections.c:359
#4  handle_single_connection (sockin=<optimized out>, sockout=<optimized out>) at connections.c:230
#5  0x00005583f90d63e8 in start_thread (datav=0x5584020bf1b0) at sockets.c:356
#6  0x00007fabc069d472 in start_thread () from /lib64/libpthread.so.0
#7  0x00007fabc05cc063 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fabc002ca40 (LWP 3955770)):
#0  0x00007fabc06a3b02 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00005583f90d6dbf in accept_incoming_connections (socks=0x5584020bf010, nr_socks=1) at sockets.c:501
#2  0x00005583f90ca441 in start_serving () at main.c:936
#3  main (argc=<optimized out>, argv=<optimized out>) at main.c:702

The stack trace of [1] at the hang is:

Thread 2 (Thread 0x7f30ac822700 (LWP 3955798)):
#0  0x00007f30b32a384c in recv () from /lib64/libpthread.so.0
#1  0x00007f30b3342c8f in _gnutls_stream_read (ms=0x7f30ac821a2c, pull_func=0x7f30b336a570 <system_read>, size=5, bufel=<synthetic pointer>, session=0x55c5cf561620) at buffers.c:346
#2  _gnutls_read (ms=0x7f30ac821a2c, pull_func=0x7f30b336a570 <system_read>, size=5, bufel=<synthetic pointer>, session=0x55c5cf561620) at buffers.c:426
#3  _gnutls_io_read_buffered (session=session@entry=0x55c5cf561620, total=5, recv_type=recv_type@entry=4294967295, ms=0x7f30ac821a2c) at buffers.c:582
#4  0x00007f30b3338f3f in recv_headers (ms=<optimized out>, record=0x7f30ac821a80, htype=4294967295, type=GNUTLS_ALERT, record_params=0x55c5cf565f60, session=0x55c5cf561620) at record.c:1172
#5  _gnutls_recv_in_buffers (session=session@entry=0x55c5cf561620, type=type@entry=GNUTLS_ALERT, htype=htype@entry=4294967295, ms=<optimized out>, ms@entry=0) at record.c:1307
#6  0x00007f30b333b555 in _gnutls_recv_int (session=session@entry=0x55c5cf561620, type=type@entry=GNUTLS_ALERT, data=data@entry=0x0, data_size=data_size@entry=0, seq=seq@entry=0x0, ms=0) at record.c:1773
#7  0x00007f30b333b703 in gnutls_bye (session=session@entry=0x55c5cf561620, how=how@entry=GNUTLS_SHUT_RDWR) at record.c:312
#8  0x000055c5c57af171 in crypto_close () at crypto.c:407
#9  0x000055c5c57aea58 in free_connection (conn=0x55c5cf560500) at connections.c:339
#10 handle_single_connection (sockin=<optimized out>, sockout=<optimized out>) at connections.c:230
#11 0x000055c5c57b73e8 in start_thread (datav=0x55c5cf541550) at sockets.c:356
#12 0x00007f30b3299472 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f30b31c8063 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f30b2c28a40 (LWP 3955740)):
#0  0x00007f30b329fb02 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000055c5c57b7dbf in accept_incoming_connections (socks=0x55c5cf5413b0, nr_socks=1) at sockets.c:501
#2  0x000055c5c57ab441 in start_serving () at main.c:936
#3  main (argc=<optimized out>, argv=<optimized out>) at main.c:702

It seems like process [2] is hanging in pthread_join, waiting on
thread 3 (the reader loop) to finish.  The reader loop is stuck on
poll.

Meanwhile process [1] is trying to do a clean TLS disconnect.  This
involves waiting on process [2] to write something, which it is never
going to do.

I don't think this is actually anything to do with libnbd not cleanly
shutting down TLS connections.  If libnbd had closed the socket
abruptly then process [1] wouldn't have to wait.  I think this might
be a bug in the nbd plugin.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
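For context on nbdplug_reader in the traces above: the nbd plugin
dedicates a thread to driving libnbd's state machine through poll().
What follows is a minimal sketch of that pattern using libnbd's public
aio API; the struct layout, names and the pipe-to-self wakeup are
illustrative assumptions, not the plugin's actual nbd.c code.

  /* Sketch of the reader-loop pattern: poll the NBD socket in
   * whichever direction libnbd's state machine wants, plus a
   * pipe-to-self so other threads can wake us.  Illustrative only.  */
  #include <poll.h>
  #include <unistd.h>
  #include <pthread.h>
  #include <libnbd.h>

  struct handle {
    struct nbd_handle *nbd;
    int fds[2];                /* self-pipe: fds[0] read, fds[1] write */
    pthread_t reader;          /* thread running reader_loop */
  };

  static void *
  reader_loop (void *opaque)
  {
    struct handle *h = opaque;

    while (!nbd_aio_is_dead (h->nbd) && !nbd_aio_is_closed (h->nbd)) {
      struct pollfd fds[2] = {
        { .fd = nbd_aio_get_fd (h->nbd), .events = 0 },
        { .fd = h->fds[0], .events = POLLIN },
      };
      unsigned dir = nbd_aio_get_direction (h->nbd);

      if (dir & LIBNBD_AIO_DIRECTION_READ)
        fds[0].events |= POLLIN;
      if (dir & LIBNBD_AIO_DIRECTION_WRITE)
        fds[0].events |= POLLOUT;

      /* Timeout -1: block until the server sends something or another
       * thread pokes the pipe; this is the poll() in frame #0 above.  */
      if (poll (fds, 2, -1) == -1)
        break;

      if (fds[0].revents & POLLIN)
        nbd_aio_notify_read (h->nbd);
      else if (fds[0].revents & POLLOUT)
        nbd_aio_notify_write (h->nbd);

      /* POLLHUP fires if the close path closes the write end.  */
      if (fds[1].revents & (POLLIN | POLLHUP)) {
        char c;
        if (read (h->fds[0], &c, 1) <= 0)
          break;
      }
    }
    return NULL;
  }

The salient property for this hang is the infinite poll() timeout:
unless the server sends data or something touches the pipe, the thread
never wakes, so the pthread_join in nbdplug_close_handle never returns.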
Eric Blake
2020-Mar-19 16:09 UTC
Re: [Libguestfs] Anyone seen build hangs (esp armv7, s390x) in Fedora?
On 3/19/20 7:13 AM, Richard W.M. Jones wrote:
> [Dropping devel, adding libguestfs]
>
> This can be reproduced on x86-64 so I can reproduce it locally.  It
> only appears to happen when the tests are run under rpmbuild, not when
> I run them as ‘make check’, but I'm unclear why this is.
>
> As Eric described earlier, the test runs two copies of nbdkit and a
> client, connected like this:
>
>   qemu-img info ===> nbdkit nbd ===> nbdkit example1
>        [3]              [2]              [1]
>
> These are started in order [1], [2] then [3].  When the client
> (process [3]) completes it exits and then the test harness kills
> processes [1] and [2] in that order.
>
> The stack trace of [2] at the hang is:
>
> Thread 3 (Thread 0x7fabbf4f7700 (LWP 3955842)):
> #0  0x00007fabc05c0f0f in poll () from /lib64/libc.so.6
> #1  0x00007fabc090abba in poll (__timeout=-1, __nfds=2, __fds=0x7fabbf4f6bb0) at /usr/include/bits/poll2.h:46
> #2  nbdplug_reader (handle=0x5584020e09b0) at nbd.c:323
> #3  0x00007fabc069d472 in start_thread () from /lib64/libpthread.so.0
> #4  0x00007fabc05cc063 in clone () from /lib64/libc.so.6

This is nbdkit trying to poll() in order to know when to wake up the
libnbd engine.

> Thread 2 (Thread 0x7fabbfcf8700 (LWP 3955793)):
> #0  0x00007fabc069eab7 in __pthread_clockjoin_ex () from /lib64/libpthread.so.0
> #1  0x00007fabc090af2b in nbdplug_close_handle (h=0x5584020e09b0) at nbd.c:538
> #2  0x00005583f90caee0 in backend_close (b=<optimized out>) at backend.c:247
> #3  0x00005583f90cdbf1 in free_connection (conn=0x5584020df890) at connections.c:359
> #4  handle_single_connection (sockin=<optimized out>, sockout=<optimized out>) at connections.c:230
> #5  0x00005583f90d63e8 in start_thread (datav=0x5584020bf1b0) at sockets.c:356
> #6  0x00007fabc069d472 in start_thread () from /lib64/libpthread.so.0
> #7  0x00007fabc05cc063 in clone () from /lib64/libc.so.6

And this is nbdkit waiting for the poll thread to complete.

> Thread 1 (Thread 0x7fabc002ca40 (LWP 3955770)):
> #0  0x00007fabc06a3b02 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x00005583f90d6dbf in accept_incoming_connections (socks=0x5584020bf010, nr_socks=1) at sockets.c:501
> #2  0x00005583f90ca441 in start_serving () at main.c:936
> #3  main (argc=<optimized out>, argv=<optimized out>) at main.c:702

nbdkit itself listening for a second client.

> The stack trace of [1] at the hang is:
>
> Thread 2 (Thread 0x7f30ac822700 (LWP 3955798)):
> #0  0x00007f30b32a384c in recv () from /lib64/libpthread.so.0
> #1  0x00007f30b3342c8f in _gnutls_stream_read (ms=0x7f30ac821a2c, pull_func=0x7f30b336a570 <system_read>, size=5, bufel=<synthetic pointer>, session=0x55c5cf561620) at buffers.c:346
> #2  _gnutls_read (ms=0x7f30ac821a2c, pull_func=0x7f30b336a570 <system_read>, size=5, bufel=<synthetic pointer>, session=0x55c5cf561620) at buffers.c:426
> #3  _gnutls_io_read_buffered (session=session@entry=0x55c5cf561620, total=5, recv_type=recv_type@entry=4294967295, ms=0x7f30ac821a2c) at buffers.c:582
> #4  0x00007f30b3338f3f in recv_headers (ms=<optimized out>, record=0x7f30ac821a80, htype=4294967295, type=GNUTLS_ALERT, record_params=0x55c5cf565f60, session=0x55c5cf561620) at record.c:1172
> #5  _gnutls_recv_in_buffers (session=session@entry=0x55c5cf561620, type=type@entry=GNUTLS_ALERT, htype=htype@entry=4294967295, ms=<optimized out>, ms@entry=0) at record.c:1307
> #6  0x00007f30b333b555 in _gnutls_recv_int (session=session@entry=0x55c5cf561620, type=type@entry=GNUTLS_ALERT, data=data@entry=0x0, data_size=data_size@entry=0, seq=seq@entry=0x0, ms=0) at record.c:1773
> #7  0x00007f30b333b703 in gnutls_bye (session=session@entry=0x55c5cf561620, how=how@entry=GNUTLS_SHUT_RDWR) at record.c:312
> #8  0x000055c5c57af171 in crypto_close () at crypto.c:407
> #9  0x000055c5c57aea58 in free_connection (conn=0x55c5cf560500) at connections.c:339
> #10 handle_single_connection (sockin=<optimized out>, sockout=<optimized out>) at connections.c:230

nbdkit trying to use TLS to perform a clean shutdown.

> #11 0x000055c5c57b73e8 in start_thread (datav=0x55c5cf541550) at sockets.c:356
> #12 0x00007f30b3299472 in start_thread () from /lib64/libpthread.so.0
> #13 0x00007f30b31c8063 in clone () from /lib64/libc.so.6
>
> Thread 1 (Thread 0x7f30b2c28a40 (LWP 3955740)):
> #0  0x00007f30b329fb02 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x000055c5c57b7dbf in accept_incoming_connections (socks=0x55c5cf5413b0, nr_socks=1) at sockets.c:501
> #2  0x000055c5c57ab441 in start_serving () at main.c:936
> #3  main (argc=<optimized out>, argv=<optimized out>) at main.c:702

nbdkit waiting for another thread.

> It seems like process [2] is hanging in pthread_join, waiting on
> thread 3 (the reader loop) to finish.  The reader loop is stuck on
> poll.
>
> Meanwhile process [1] is trying to do a clean TLS disconnect.  This
> involves waiting on process [2] to write something, which it is never
> going to do.

Mostly correct, except that [2] _could_ write something if we were to
improve libnbd's TLS shutdown to not be so awkward.

> I don't think this is actually anything to do with libnbd not cleanly
> shutting down TLS connections.  If libnbd had closed the socket
> abruptly then process [1] wouldn't have to wait.  I think this might
> be a bug in the nbd plugin.

Possible, but I'm still leaning towards it being a problem that libnbd
is not handling TLS very well, which makes it hard for the nbd plugin
to know that it can quit poll()ing.  Determining which of the two (or
both) needs patching is trickier, though.  I'll spend more cycles on
it today, at any rate.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.  +1-919-301-3226
Virtualization: qemu.org | libvirt.org
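For reference, the clean shutdown that crypto_close() is blocked in
above (frame #7, gnutls_bye) follows this pattern.  This is a minimal
hypothetical sketch, not nbdkit's actual crypto.c:

  /* GNUTLS_SHUT_RDWR: gnutls_bye() sends our close_notify alert and
   * then blocks reading until the peer's close_notify arrives.  That
   * read is the recv() at frame #0; it returns only if the peer
   * (process [2]) answers.  */
  #include <unistd.h>
  #include <gnutls/gnutls.h>

  static void
  clean_tls_close (gnutls_session_t session, int sockfd)
  {
    int r;

    do
      r = gnutls_bye (session, GNUTLS_SHUT_RDWR);
    while (r == GNUTLS_E_AGAIN || r == GNUTLS_E_INTERRUPTED);

    /* Only after the two-way close_notify exchange (or a hard error)
     * is it safe to close the socket without risking truncation.  */
    close (sockfd);
  }

Switching to GNUTLS_SHUT_WR would make gnutls_bye() return after
sending our own close_notify without waiting for the peer's, at the
cost of losing the guarantee that the peer saw all of our data.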
Eric Blake
2020-Mar-28 20:44 UTC
Re: [Libguestfs] Anyone seen build hangs (esp armv7, s390x) in Fedora?
On 3/19/20 7:13 AM, Richard W.M. Jones wrote:
> [Dropping devel, adding libguestfs]
>
> This can be reproduced on x86-64 so I can reproduce it locally.  It
> only appears to happen when the tests are run under rpmbuild, not when
> I run them as ‘make check’, but I'm unclear why this is.
>
> As Eric described earlier, the test runs two copies of nbdkit and a
> client, connected like this:
>
>   qemu-img info ===> nbdkit nbd ===> nbdkit example1
>        [3]              [2]              [1]
>
> These are started in order [1], [2] then [3].  When the client
> (process [3]) completes it exits and then the test harness kills
> processes [1] and [2] in that order.

I just hit a breakthrough in understanding the deadlock.

> The stack trace of [2] at the hang is:
>
> Thread 3 (Thread 0x7fabbf4f7700 (LWP 3955842)):
> #0  0x00007fabc05c0f0f in poll () from /lib64/libc.so.6

This thread is calling poll() at the same time as:

> #1  0x00007fabc090abba in poll (__timeout=-1, __nfds=2, __fds=0x7fabbf4f6bb0) at /usr/include/bits/poll2.h:46
> #2  nbdplug_reader (handle=0x5584020e09b0) at nbd.c:323
> #3  0x00007fabc069d472 in start_thread () from /lib64/libpthread.so.0
> #4  0x00007fabc05cc063 in clone () from /lib64/libc.so.6
>
> Thread 2 (Thread 0x7fabbfcf8700 (LWP 3955793)):
> #0  0x00007fabc069eab7 in __pthread_clockjoin_ex () from /lib64/libpthread.so.0
> #1  0x00007fabc090af2b in nbdplug_close_handle (h=0x5584020e09b0) at nbd.c:538

this one just finished a poll(), because I used the blocking
nbd_shutdown instead of the non-blocking nbd_aio_disconnect.
Depending on which of the two threads wakes up first to service the
server's reaction, the other one can be stranded.  Closing the
pipe-to-self is a band-aid that ensures the reader thread eventually
wakes up, but using the right API to begin with is even better.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.  +1-919-301-3226
Virtualization: qemu.org | libvirt.org
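As an illustration of "using the right API to begin with", here is a
hedged sketch of what a non-blocking close path could look like.  It
reuses the hypothetical struct handle and reader loop sketched earlier
in this thread and is not the actual nbdkit patch:

  /* Close path sketch: queue the disconnect instead of waiting for it,
   * so the reader thread remains the handle's only poller.  */
  static void
  close_handle (struct handle *h)
  {
    /* Non-blocking: just queues NBD_CMD_DISC and returns; the reader
     * loop's poll() services the server's reply.  */
    nbd_aio_disconnect (h->nbd, 0);

    /* The band-aid mentioned above: closing the write end of the
     * self-pipe raises POLLHUP in the reader's poll(), guaranteeing
     * that it wakes even if the server never answers.  */
    close (h->fds[1]);

    pthread_join (h->reader, NULL);
    nbd_close (h->nbd);
  }

The design point is that nbd_aio_disconnect merely queues the request,
leaving the reader thread as the handle's only poller, whereas the
blocking nbd_shutdown waits for the server's reaction itself and so
races the reader's poll(), which is the stranding described above.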