Nikolaus Rath
2022-May-15 19:12 UTC
[Libguestfs] Communication issues between NBD driver and NBDKit server
On May 15 2022, "Richard W.M. Jones" <rjones at redhat.com> wrote:> On Sun, May 15, 2022 at 04:45:11PM +0100, Nikolaus Rath wrote: >> Hi, >> >> I am observing some strange errors when using the Kernel's NBD driver with >> NBDkit. >> >> On the kernel side, I see: >> >> May 15 16:16:11 vostro.rath.org kernel: nbd0: detected capacity change from 0 >> to 104857600 >> May 15 16:16:11 vostro.rath.org kernel: nbd1: detected capacity change from 0 >> to 104857600 >> May 15 16:18:23 vostro.rath.org kernel: block nbd0: Possible stuck request >> 00000000ae5feee7: control (write at 4836316160,32768B). Runtime 30 seconds >> May 15 16:18:25 vostro.rath.org kernel: block nbd0: Possible stuck request >> 000000007094eddc: control (write at 5372947456,10240B). Runtime 30 seconds >> May 15 16:18:27 vostro.rath.org kernel: block nbd0: Suspicious reply 89 (status >> 0 flags 0) >> May 15 16:18:31 vostro.rath.org kernel: block nbd0: Possible stuck request >> 0000000075f8b9bc: control (write at 8057764864,32768B). Runtime 30 seconds >> May 15 16:18:41 vostro.rath.org kernel: block nbd0: Possible stuck request >> 000000002d1b3e8b: control (write at 14499979264,32768B). Runtime 30 seconds >> [...] > > Does it really take over 30 seconds for nbdkit to respond? You might > want to insert some debugging into the S3 plugin to see what stage of > the request cycle is taking so long, although I'm going to guess it's > the remote Amazon server itself.It seems unlikely, but it is possible - especially since I'm serializing requests for debugging.> It seems like you can adjust this timeout using the nbd-client -t flag > (it calls ioctl(NBD_SET_TIMEOUT) in the kernel). If I understand the > logic correctly, the nbd timeout is currently set to 0, which causes > the default socket timeout to be used. Using the -t flag overrides this. > So I guess try setting it larger and see if the problem goes away.Well, my concern is more about the "suspicious reply" message which - according to Josef - means that NBDkit replied twice to the same request. If that is the case, that might explain why another request seemingly remained unanswered. Do you see any way for this to happen? Best, Nikolaus -- GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F ?Time flies like an arrow, fruit flies like a Banana.?
Richard W.M. Jones
2022-May-15 19:25 UTC
[Libguestfs] Communication issues between NBD driver and NBDKit server
On Sun, May 15, 2022 at 08:12:59PM +0100, Nikolaus Rath wrote:> Do you see any way for this to happen?I think it's impossible. A more likely explanation follows. If you look at the kernel code, the NBD_CMD_INFLIGHT command flag is cleared when a command times out: https://github.com/torvalds/linux/blob/0cdd776ec92c0fec768c7079331804d3e52d4b27/drivers/block/nbd.c#L407 That's the place where it would have printed the "Possible stuck request" message. Some time later, nbdkit actually replies to the message (for the first and only time) and in that code the flag is checked and found to be clear already, causing the "Suspicious reply" message to be printed: https://github.com/torvalds/linux/blob/0cdd776ec92c0fec768c7079331804d3e52d4b27/drivers/block/nbd.c#L749 I'd say you need to increase the timeout and/or work out why the S3 plugin is taking so long to respond. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com libguestfs lets you edit virtual machines. Supports shell scripting, bindings from many languages. http://libguestfs.org