On 11/14/2017 04:25 AM, Richard W.M. Jones wrote:> On Mon, Nov 13, 2017 at 12:42:48PM -0600, Eric Blake wrote: >> I'm observing a difference in timing on nbdkit shutdown that is >> dependent on client behavior, and I wonder if that's a bug, but I can't >> figure out where to patch it. >> >> When there is no connection present, sending SIGINT to nbdkit terminates >> immediately. >> >> If, on the other hand, there is a single client currently connected, the >> termination behavior on SIGINT depends on what the client has done: if >> the client is currently silent with regards to issuing >> transmission-phase transactions, nbdkit hangs for 5 seconds, then >> forcibly tears down the connection: > > I'm guessing it's because of commit 63f0eb0889c8f8a82ba06a02a8a92d695902baad > which I added to fix a race in plugin_cleanup(). See also: > https://www.redhat.com/archives/libguestfs/2017-September/msg00226.html > > [...] >> Why does current traffic from the client cause the plugin to be torn >> down faster? Does it matter? > > I believe because the main loop checks the !quit flag if there > is traffic on the connection. > > There is most likely to be a better fix for the race than 63f0eb0889. > I added that as a quick workaround for the segfault we saw in the > tests. Perhaps we should actively cancel the threads on shutdown > instead of waiting?If the plugin is still responding to an active request, and it is not expecting thread cancellation, that could be bad (writing code that is safe in the presence of pthread_cancel is not always trivial). But the scenario I described is when there are no active requests at the moment, so the threads are under our control. A better way might be to have a read-from-self pipe that we write on SIGINT, and a select() that checks for activity both on the read-from-self pipe and from the client, so that we can react to either situation immediately. But then we still have to be careful to (try) and let the pending plugin active requests complete (and merely reject any new requests from the client during that time), before giving up completely. In other words, I don't have a problem with waiting 5 seconds for SIGINT to have an effect in some cases, so much as a problem of explaining why we still have to wait even when there is no reason for it. At any rate, for now I'm just documenting the issues, rather than planning on tackling the code associated with the problem; it will be a task (for someone else?) down the road. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Richard W.M. Jones
2017-Nov-14 15:56 UTC
Re: [Libguestfs] question on nbdkit shutdown behavior
A deeper problem I think is that it's ill-specified what is supposed to happen when nbdkit receives a termination signal. We don't specify (for example) that plugin callbacks could or could not be killed in the middle. Killing them in the middle of a callback could be bad for some plugins. Going back to the original problem, perhaps we could have fixed it instead by acquiring a lock around all plugin callbacks, and ensuring that plugin_cleanup also acquires the same lock. The lock would normally be uncontended so it shouldn't cause any performance issues. That would be a fairly simple revert of 63f0eb088 + alternate fix. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-p2v converts physical machines to virtual machines. Boot with a live CD or over the network (PXE) and turn machines into KVM guests. http://libguestfs.org/virt-v2v
Richard W.M. Jones
2017-Nov-14 16:02 UTC
Re: [Libguestfs] question on nbdkit shutdown behavior
On Tue, Nov 14, 2017 at 03:56:59PM +0000, Richard W.M. Jones wrote:> Going back to the original problem, perhaps we could have fixed > it instead by acquiring a lock around all plugin callbacks, and > ensuring that plugin_cleanup also acquires the same lock. The > lock would normally be uncontended so it shouldn't cause any > performance issues. That would be a fairly simple revert of > 63f0eb088 + alternate fix.No actually that would serialize everything. Some sort of mutex primitive, I think a shared/exclusive lock with the exclusive lock being taken in plugin_cleanup. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-p2v converts physical machines to virtual machines. Boot with a live CD or over the network (PXE) and turn machines into KVM guests. http://libguestfs.org/virt-v2v