I'm observing a difference in timing on nbdkit shutdown that is
dependent on client behavior, and I wonder if that's a bug, but I can't
figure out where to patch it.

When there is no connection present, sending SIGINT to nbdkit terminates
immediately.

If, on the other hand, there is a single client currently connected, the
termination behavior on SIGINT depends on what the client has done: if
the client is currently silent with regards to issuing
transmission-phase transactions, nbdkit hangs for 5 seconds, then
forcibly tears down the connection:

term1$ ./nbdkit -f -v -n file file=TODO
...
nbdkit: debug: bound to IP address <any>:10809 (2 socket(s))

term2$ qemu-io -f raw nbd://localhost:10809/ -r
qemu-io>

term1$ ^C
nbdkit: debug: waiting for running threads to complete
nbdkit: debug: waited 5s for running threads to complete
nbdkit: debug: /home/eblake/nbdkit/plugins/file/.libs/nbdkit-file-plugin.so: unload

term2$
qemu-io> r 0 1
read failed: Input/output error
qemu-io> q

But, if the client makes ANY transaction during that window, then
nbdkit lets that transaction succeed, then tears down the connection
without any further wait:

term1$ ./nbdkit -f -v -n file file=TODO
...
nbdkit: debug: bound to IP address <any>:10809 (2 socket(s))

term2$ qemu-io -f raw nbd://localhost:10809/ -r
qemu-io> r 0 1     # wait to hit Enter until right after ^C in term1

term1$ ^C
nbdkit: debug: waiting for running threads to complete

term2$ # hit enter

term1$
nbdkit: file[1]: debug: acquire per-connection request lock
nbdkit: file[1]: debug: pread count=1 offset=0
nbdkit: file[1]: debug: release per-connection request lock
nbdkit: file[1]: debug: close
nbdkit: debug: waited 2s for running threads to complete
nbdkit: debug: /home/eblake/nbdkit/plugins/file/.libs/nbdkit-file-plugin.so: unload

term2$
qemu-io> r 0 1
read 1/1 bytes at offset 0
1 bytes, 1 ops; 0.0002 sec (4.138 KiB/sec and 4237.2881 ops/sec)

Why does current traffic from the client cause the plugin to be torn
down faster?  Does it matter?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
Richard W.M. Jones
2017-Nov-14 10:25 UTC
Re: [Libguestfs] question on nbdkit shutdown behavior
On Mon, Nov 13, 2017 at 12:42:48PM -0600, Eric Blake wrote:
> I'm observing a difference in timing on nbdkit shutdown that is
> dependent on client behavior, and I wonder if that's a bug, but I can't
> figure out where to patch it.
>
> When there is no connection present, sending SIGINT to nbdkit terminates
> immediately.
>
> If, on the other hand, there is a single client currently connected, the
> termination behavior on SIGINT depends on what the client has done: if
> the client is currently silent with regards to issuing
> transmission-phase transactions, nbdkit hangs for 5 seconds, then
> forcibly tears down the connection:

I'm guessing it's because of commit 63f0eb0889c8f8a82ba06a02a8a92d695902baad
which I added to fix a race in plugin_cleanup().  See also:

https://www.redhat.com/archives/libguestfs/2017-September/msg00226.html

[...]
> Why does current traffic from the client cause the plugin to be torn
> down faster?  Does it matter?

I believe it's because the main loop only checks the !quit flag when
there is traffic on the connection.

There is most likely a better fix for the race than 63f0eb0889.  I
added that as a quick workaround for the segfault we saw in the tests.
Perhaps we should actively cancel the threads on shutdown instead of
waiting?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
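(For context: judging from the "waiting for running threads to complete"
debug messages in the transcripts above, the workaround in commit
63f0eb0889 amounts to a bounded wait for still-running connection
threads before the plugin is unloaded.  The following is only an
illustrative sketch of that idea, not the actual nbdkit source;
count_running_threads() and the hard-coded 5-second cap are
placeholders invented for the example.)

/* Illustrative sketch only, not nbdkit code: wait a bounded time for
 * in-flight connection threads before dlclose()ing the plugin,
 * mirroring the "waited Ns for running threads to complete" debug
 * output seen above.  count_running_threads() is a hypothetical
 * helper that reports how many connection threads are still alive.
 */
#include <stdio.h>
#include <unistd.h>

extern int count_running_threads (void);

static void
wait_for_running_threads (void)
{
  int waited = 0;

  fprintf (stderr,
           "nbdkit: debug: waiting for running threads to complete\n");
  while (count_running_threads () > 0 && waited < 5) {
    sleep (1);
    waited++;
  }
  fprintf (stderr,
           "nbdkit: debug: waited %ds for running threads to complete\n",
           waited);

  /* Only after this is it reasonably safe to unload the plugin.  If
   * the wait timed out, the connection is torn down forcibly, which
   * is what an idle client observes as the 5 second hang.
   */
}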
On 11/14/2017 04:25 AM, Richard W.M. Jones wrote:
> On Mon, Nov 13, 2017 at 12:42:48PM -0600, Eric Blake wrote:
>> I'm observing a difference in timing on nbdkit shutdown that is
>> dependent on client behavior, and I wonder if that's a bug, but I can't
>> figure out where to patch it.
>>
>> When there is no connection present, sending SIGINT to nbdkit terminates
>> immediately.
>>
>> If, on the other hand, there is a single client currently connected, the
>> termination behavior on SIGINT depends on what the client has done: if
>> the client is currently silent with regards to issuing
>> transmission-phase transactions, nbdkit hangs for 5 seconds, then
>> forcibly tears down the connection:
>
> I'm guessing it's because of commit 63f0eb0889c8f8a82ba06a02a8a92d695902baad
> which I added to fix a race in plugin_cleanup().  See also:
> https://www.redhat.com/archives/libguestfs/2017-September/msg00226.html
>
> [...]
>> Why does current traffic from the client cause the plugin to be torn
>> down faster?  Does it matter?
>
> I believe it's because the main loop only checks the !quit flag when
> there is traffic on the connection.
>
> There is most likely a better fix for the race than 63f0eb0889.
> I added that as a quick workaround for the segfault we saw in the
> tests.  Perhaps we should actively cancel the threads on shutdown
> instead of waiting?

If the plugin is still responding to an active request, and it is not
expecting thread cancellation, that could be bad (writing code that is
safe in the presence of pthread_cancel is not always trivial).  But the
scenario I described is one where there are no active requests at the
moment, so the threads are under our control.

A better approach might be to have a read-from-self pipe that we write
to on SIGINT, and a select() that checks for activity both on the
read-from-self pipe and on the client socket, so that we can react to
either situation immediately.  But then we still have to be careful to
(try to) let pending plugin requests complete (and merely reject any
new requests from the client during that time) before giving up
completely.

In other words, I don't have a problem with waiting 5 seconds for
SIGINT to take effect in some cases, so much as a problem explaining
why we still have to wait even when there is no reason for it.

At any rate, for now I'm just documenting the issues, rather than
planning on tackling the code associated with the problem; it will be a
task (for someone else?) down the road.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
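(Below is a minimal, self-contained sketch of the self-pipe-plus-select()
idea described in the message above, for illustration only.  It is not
nbdkit code; the names quit_pipe, wait_for_request_or_quit and client_fd
are invented for the example, and error handling is kept to the bare
minimum.)

/* A SIGINT handler writes one byte to a pipe, and the per-connection
 * loop waits on both the client socket and the pipe, so a quit request
 * is noticed immediately even when the client is idle.
 */
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

static int quit_pipe[2];        /* [0] = read end, [1] = write end */

static void
handle_sigint (int sig)
{
  char c = 1;
  (void) sig;
  /* write() is async-signal-safe; short writes/errors are harmless here. */
  (void) write (quit_pipe[1], &c, 1);
}

/* Returns nonzero if the client sent something, 0 if we should quit. */
static int
wait_for_request_or_quit (int client_fd)
{
  fd_set rfds;
  int maxfd = client_fd > quit_pipe[0] ? client_fd : quit_pipe[0];

  FD_ZERO (&rfds);
  FD_SET (client_fd, &rfds);
  FD_SET (quit_pipe[0], &rfds);

  if (select (maxfd + 1, &rfds, NULL, NULL, NULL) == -1)
    return 0;                   /* treat EINTR/other errors as "quit" */
  if (FD_ISSET (quit_pipe[0], &rfds))
    return 0;                   /* SIGINT arrived */
  return FD_ISSET (client_fd, &rfds);
}

int
main (void)
{
  struct sigaction sa;

  if (pipe (quit_pipe) == -1)
    return 1;

  memset (&sa, 0, sizeof sa);
  sa.sa_handler = handle_sigint;
  sigaction (SIGINT, &sa, NULL);

  /* Per-connection loop (client_fd obtained from accept()):
   *
   *   while (wait_for_request_or_quit (client_fd))
   *     handle exactly one request, letting any in-flight plugin
   *     callback finish before actually closing the connection;
   */
  return 0;
}

With something along these lines, the idle-client case in the first
transcript could react to ^C right away instead of timing out after
5 seconds.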